Today we're going to talk about parallel sort-merge join algorithms. Again, it's me here in my home office, although I upped my lighting game quite a bit, and I have the TA down here who's going to be asking questions as we go along. So let's get right into this. Before we get into the material for today, I want to talk about some logistics about what's expected from you going forward, starting next week. Next week we're going to be doing the Project 3 status updates. We'll do the same setup we did last time: we'll do it over Zoom, and everybody will go for five minutes and give an update to the class on where they're at with their project. Prior to that, though, I want every group to reach out to me and schedule a time later this week to meet one-on-one to discuss the current progress of your project, to see where you're having trouble, what you need help with, and to give you a sense of what you'll present to the class next week. Those in-class project presentations will be on Wednesday, April 8th. In addition to doing the presentation, you're also going to need to provide a design document that discusses in a bit more detail what your implementation is going to look like. You're also going to need to submit what you've written so far for your project as a pull request to us on GitHub, so that we can do the first round of code reviews. I'll explain what's expected in a code review next week, but the idea is that you will be assigned another group's project to look at, and they'll look at yours, and you'll go through and try to understand what their implementation is trying to do, and then give them feedback on the quality and the ideas they're pursuing. Again, that'll be part of the participation grade for this project. Every team member is expected to participate, and when we do the second round of code reviews, you'll be looking at the same group's code again, so that you're not starting from scratch: you've looked at the code the first time, and now you're looking at it a second time as they're getting ready to submit the final implementation for a grade, and you'll provide them with additional feedback on what they can do to fix up their code. I'll post on Piazza about reaching out and setting up the meeting time, I'll put sign-up slots on the administration spreadsheet for the class, and then next Wednesday we'll go in the reverse order from when we did the project proposals. So if you went first last time, you get to go last this time, okay? So this lecture and the previous lecture have been about parallel join algorithms. Last class we focused on hash joins because, as I said, that's the most commonly used join algorithm in an OLAP system, and it's the one where we're going to get the biggest performance benefit by parallelizing and vectorizing it, since in most OLAP systems, most of the time we'll be executing hash joins. The other major approach for joins is the sort-merge join, and that's what I'll focus on today. We'll first start off with a little background information about what a sort-merge join looks like. Again, this will just be a refresher from what we discussed in the introduction class.
Then we'll talk about different sorting algorithms you can use, or really one particular approach we'll focus on from Intel, which is how to vectorize and parallelize the sorting algorithm as much as possible. Then we'll talk about how to combine everything to do a parallel sort-merge join, and then we'll finish up with the evaluation that was in the assigned reading from ETH, okay? So the sort-merge join basically has two phases. It's the same idea as the hash join: you prepare the data in the first phase in such a way that in the second phase, when you want to go find matches, the data is laid out (or indexed, in the case of the hash join) so that you can easily find matches without having to do the brute-force sequential scan that you would have to do in a nested-loop join. In the first phase, we're just going to sort the tuples of R and S based on the join key, or the join keys if there are multiple keys. Then in the merge phase, we'll scan through the two tables in sorted order with two iterators that move in lockstep, doing comparisons between them. The idea is that because we sorted things ahead of time, as the iterators walk through, we never have to backtrack to look for a matching tuple or scan farther down than we actually need to: if we come across a key that is different than the last key we saw, then we know that the old key can't exist anywhere else. So by sorting, we avoid having to do a brute-force search to find the matching key; we know that as the iterator moves down, we have everything that we need. At a high level, it conceptually looks like this. For table R and table S, the first phase is that I'm going to sort them by the join key. I'm not going to describe what that sorting algorithm is just yet; that's what we'll cover today. Then, once things are sorted, we do the merge phase: the two iterators go down and scan through the tables, usually once, but depending on whether it's an equi-join or what kind of join you're doing, you may have to backtrack and rescan some tuples. We can ignore that for now. Every time I look at one tuple, when I'm done with it, I move on to the next tuple. And just like before in the hash join, if I have matching tuples on the inner and the outer table, I combine their attributes together and write the result into my output buffer. Now, the confusing thing we're going to hit today is that we're going to be doing a sort-merge join, and the algorithm we're going to use to do the sort will be a merge sort. So I'll try to be careful when I use the term merge to make sure I'm telling you whether it's the merge phase of the merge sort or the merge phase of the sort-merge join. As we go along, I'll try to make it clear which kind of merge we're talking about, but at a high level they're essentially doing the same thing. In the introduction class, we discussed how to do this with a basic, high-level approach where we didn't worry about workers, threads, or NUMA regions. But now, in a modern system, we need to be aware of these things, so we need to talk about how we're actually going to speed things up by parallelizing everything.
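To make that baseline concrete before we start parallelizing anything, here is a minimal, single-threaded sketch of the merge phase of a sort-merge equi-join. This is just an illustration under simplifying assumptions (both inputs already sorted on the join key, duplicate inner keys handled by rewinding the inner cursor), not any particular system's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple   { uint64_t key; uint64_t payload; };
struct OutTuple { uint64_t key; uint64_t outer_payload; uint64_t inner_payload; };

// Merge phase of a sort-merge equi-join: both inputs must already be
// sorted on key. Two cursors move forward in lockstep; the inner cursor
// only backtracks locally when the outer side repeats a key.
std::vector<OutTuple> merge_join(const std::vector<Tuple>& outer,
                                 const std::vector<Tuple>& inner) {
    std::vector<OutTuple> out;
    size_t o = 0, i = 0;
    while (o < outer.size() && i < inner.size()) {
        if (outer[o].key < inner[i].key) {
            ++o;
        } else if (outer[o].key > inner[i].key) {
            ++i;
        } else {
            // Emit every inner tuple with the same key for this outer tuple.
            size_t mark = i;
            while (i < inner.size() && inner[i].key == outer[o].key) {
                out.push_back({outer[o].key, outer[o].payload, inner[i].payload});
                ++i;
            }
            ++o;
            // If the next outer tuple repeats the key, rewind the inner cursor.
            if (o < outer.size() && outer[o].key == outer[o - 1].key) {
                i = mark;
            }
        }
    }
    return out;
}
```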
Of course, we're going to want to parallelize the sorting, because that's going to be the most expensive part. And again, just like we saw when we talked about vectorization, we can parallelize this step of the algorithm both in terms of having it execute on multiple threads at the same time, and then, on each thread, trying to use vectorized operations so that we're operating on multiple keys or multiple elements of data in parallel with a single instruction or a single set of instructions. There are some basic rules that we're going to try to apply in our algorithm to get the best performance possible. This is also relevant to the hash join, but I want to bring it up here again. We obviously want to use as many CPU cores as possible that our database system has allotted to us for this execution. Some systems only execute one query at a time, so you'll get all the cores that are available on your machine as you're doing the join, and when your query is done, the system switches over to the next query. Other systems will want to run multiple queries at the same time, so the system scheduler has to be aware of how many threads it's allowed to allot to a sort and make sure that it only runs on those cores. We also need to be mindful of NUMA boundaries, because we don't want to pay the big penalty of having reads and writes go across NUMA regions; that's much more expensive than reading data that's local to our threads. We'll see in the case of HyPer that they actually go against this second rule: they argue that because they're doing sequential reads when they do the merge phase of the join, the hardware is actually going to help hide the latencies of going across NUMA regions. But when we look at the results from the ETH paper, we'll see that that's actually not the case. And the last rule, as we said before, is that we're trying to use SIMD instructions as much as possible so that the amount of data we're processing per instruction or per cycle is maximized. Compare this with scalar (SISD) instructions that can only operate on one key at a time. So the parallel sort-merge join is like the parallel hash join in that you now have three phases, because the first phase is a partitioning step where you're going to divide the data up across the workers or cores. But unlike in the hash join case where you would partition both sides, in this case we could partition just one side; some algorithms will partition both, some will partition one. And for this one, as we'll see in a second, you can do the same kind of partitioning we saw last time. The next phase is the sorting phase. This is where we take both the inner table and the outer table and sort them on our join key. Then in the last phase we do the merge, where we scan through the sorted relations and compare the tuples, and if we have a match, we write it into our output buffer. For today, we're going to spend most of our time on these two; there's not much more we can say about the merge itself, but we'll see that the choices we make in the sort will depend on what kind of merge we want to do later on. I didn't talk about partitioning at a conceptual level last class, but what I'll say here is also still relevant for hash joins. There's a notion of implicit and explicit partitioning when we want to do a join. Implicit partitioning is how the data was already partitioned when it got loaded into the database.
And then if we know that the data was already partitioned on our join key for the query we're trying to run right now, we don't actually need to do an extra partitioning step, because the data is already partitioned on our join key and we're good to go. This would be like if, as I load or create the table, I declare in the DDL that I want to partition my table over some set of attributes or a join key. It might be the warehouse ID in the case of TPC-C, or country in TPC-H, or whatever you have in the lineitem table. This is something that the application or the database administrator has to do for us: the database system has to be told, here's how I want to partition things. And in our catalog, we keep track of what that partitioning key was, since we had to write the data out across different NUMA regions. So now the query optimizer can say, oh, I see that you're trying to do a join on the key that I've already partitioned my table on, therefore I don't need to do an extra step to partition things, because it's already partitioned. But this won't always work for us in an OLAP environment, because people are going to want to join their tables on all sorts of columns and attributes that you may not know ahead of time, and you may not actually be able to declare the right partitioning key for a particular query, because it may change from one query to the next. You don't see this so much in OLTP workloads, because typically the partitioning scheme of the tables follows a hierarchy: you can say, here's our customer, and for a given customer ID, here are all the orders for that customer, and here are all the order items for that order ID. So you have this nice hierarchy where you can take a slice of the data across tables and put it on a single partition, and most of the time you're joining across those foreign-key dependencies. In OLAP queries, like I said, people join on all sorts of crazy things, so it's hard to get this right, and covering all possible queries is essentially impossible. So what we're talking about instead, when we say the partitioning phase of a join algorithm, is explicit partitioning, where we're going to divide the relations based on the join key and then redistribute them across different cores. You could use the radix partitioning that we talked about last class. In practice, though, for the sort-merge join algorithm that's not going to be a good idea; typically you want to do range partitioning, because then you know how to divide up the data, you know what the boundaries are from one partition to the next, and you know that no key can end up in a partition where you don't expect it. All right, so that's the partitioning phase: we could do radix partitioning, we could do range partitioning, it doesn't really matter, it's the same stuff that we talked about last time. There's nothing fundamentally different that we would do here just because we're doing a sort-merge join versus a hash join.
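Before we get to the sorting itself, here is a rough sketch of what that explicit range-partitioning step could look like. The split points and the Tuple layout here are made up for illustration; a real system would pick the splits from a sample of the keys and write into per-thread buffers rather than a shared vector of vectors.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Explicit range partitioning: given (numPartitions - 1) ascending split
// points, each tuple goes to the partition whose key range it falls into.
std::vector<std::vector<Tuple>> range_partition(const std::vector<Tuple>& input,
                                                const std::vector<uint64_t>& splits) {
    std::vector<std::vector<Tuple>> parts(splits.size() + 1);
    for (const Tuple& t : input) {
        // Index of the first split point greater than the key is the partition id,
        // so keys equal to a split point land in the partition above it.
        size_t p = std::upper_bound(splits.begin(), splits.end(), t.key) - splits.begin();
        parts[p].push_back(t);
    }
    return parts;
}
```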
All right, so now we get into the sort phase. Again, this is going to be the most expensive part for us. The key thing to understand about sorting here is that in the introduction class, when we talked about sorting, the main bottleneck we had to deal with was writing pages in and out from disk. So we wanted to use an algorithm like external merge sort that was designed to do as much sequential reading and writing to disk as possible, and that made sure that when it brought a page or a chunk of data from disk into memory, we did all the operations we needed to do on that chunk before we moved on to the next one, so we didn't have to read and write it back multiple times. But when we brought a block from disk into memory and then needed to sort the data that was in memory, we said that something like quicksort was good enough for what we needed, because in the introduction class we weren't getting into the details of cache locality and parallelization for those algorithms; it was all about minimizing disk I/O. Now, this semester, we're talking about in-memory databases, and since our database is already in memory, we need a sorting algorithm that is aware of where the data is located and what our hardware looks like. Meaning it needs to know, if it's reading or writing a certain memory location, whether that memory location is in the same NUMA region as where my thread is running. It also means we need to be aware of the size of the data we're dealing with and what our hardware looks like in terms of cache sizes. So we may end up choosing a different algorithm for chunks of data of a certain size: if things can fit in my L1 cache, I may want to do something differently than if I'm spilling out to DRAM, because going from the cache to DRAM is an order of magnitude slower. I want to be very clear that what we're going to talk about today is a better way, in some cases, to do sorting for a join if your data looks a certain way, in particular if your keys fit into 64-bit values. Quicksort is still very, very good. It can be parallelized, and it can be implemented to avoid branch mispredictions, which is one of the things we're going to try to overcome today. It's almost like the pickup truck of sorting algorithms: it can handle almost everything pretty well. But you can certainly do better, and what we'll talk about today is a way to actually do better than quicksort. The thing we need to understand going forward is that we're going to do divide and conquer, similar to quicksort, where we split the data up into smaller and smaller chunks called runs. A run is a disjoint segment of the total table with some number of elements that we're going to sort. Then over time, as we accumulate more and more of these sorted runs, we start combining them together into larger sorted runs, and then combine those with other larger sorted runs, getting progressively larger until at some point we have the entire table, or the entire key space for our join, completely sorted. So this notion of runs is the atomic unit that we're dealing with, and these runs are going to grow in size as we get further along in the execution of the sort.
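To make the "runs keep doubling" idea concrete, here is a plain scalar sketch of a bottom-up run-merging driver. This is only the shape of the process, under the assumption that the array already consists of sorted runs of a known initial length; the cache-conscious approach we're about to cover swaps in different merge kernels depending on how large the runs have grown.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// keys must already consist of sorted runs of length initial_run_len (>= 1).
// Each pass merges pairs of adjacent runs into runs twice as large, until
// the whole array is a single sorted run.
void merge_runs_bottom_up(std::vector<uint64_t>& keys, size_t initial_run_len) {
    std::vector<uint64_t> buf(keys.size());
    for (size_t run = initial_run_len; run < keys.size(); run *= 2) {
        for (size_t lo = 0; lo < keys.size(); lo += 2 * run) {
            size_t mid = std::min(lo + run, keys.size());
            size_t hi  = std::min(lo + 2 * run, keys.size());
            std::merge(keys.begin() + lo, keys.begin() + mid,
                       keys.begin() + mid, keys.begin() + hi,
                       buf.begin() + lo);
        }
        std::swap(keys, buf);  // merged output becomes the input for the next pass
    }
}
```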
The approach we're going to focus on is this cache-conscious sorting from Intel. So remember last class, I talked about this series of six papers going back to 2009 discussing the various progressions of doing modern sorting and joins on today's hardware. The first paper was a collaboration between Intel and Oracle on doing hash joins versus sort-merge joins with SIMD, and that's where this technique comes from. The reason it's called cache-conscious is that the algorithm is aware of how big the runs are as it goes along, and it uses a different algorithm to sort those runs that targets the fastest storage level available that can store the run in its entirety. When a run gets too big, we move to the next level in our storage hierarchy, going from registers to caches to memory, and a different algorithm is designed or optimized for that level. At the first level, we do in-register sorting, because SIMD registers are the fastest memory you can have, but of course they're very limited in size. So we deal with runs that can fit into our SIMD registers and sort the entire key space into these small chunks. Then we spill into in-cache sorting at the next level, where we take all the runs we generated from the first level and merge them into runs that fit into our CPU caches. In this case we're going to target the last-level cache, L3, and we keep using this level-two approach until our runs are half the size of our L3 cache, because for this particular algorithm you need to store the input data plus an output buffer of the same size that will contain the sorted run. For this we're targeting the L3 cache, which on modern Xeons is maybe up to 40 megabytes. We're not going to worry about targeting L1 versus L2 versus L3; we'll let the hardware take care of that for us. We'll just say that as long as we don't spill into DRAM and we're still hanging out in L3, we're good to go. Once we go beyond L3, we use out-of-cache sorting, where we do a completely parallel execution of the sorting, we're aware of what is and is not in cache, and we have our threads jump around and operate on different parts of the run or key space depending on what's in cache or not. We'll go through each of these one by one. I'll also say that this idea of level one, level two, level three is not actually in the original Intel paper; I think Intel calls them phases or stages, and that obviously conflicts with the term I'm using to describe our join algorithm phases, so I'm using the term level here. So let's look at what happens at a high level. Here's our unsorted key space. In level one, we sort these into four-element runs. Once that exceeds our SIMD register sizes, we switch into level two, where we combine together two sorted runs at a time into larger sorted runs until we have ones that don't fit in our CPU caches, at which point we enter level three, where we want to use a different sorting approach to combine them together until we get to the point where we have our complete key space in sorted order. Right, so at a high level, we're going to focus on these first two.
Level three is more of an architectural approach to doing sorting across multiple cores rather than an actual sorting algorithm itself. So, like I said, we'll go through these one by one. In the first level, what we're going to do is sort runs containing four elements, and we're going to use what is called a sorting network. A sorting network is an old idea; it goes back to the 1940s and some of the first computers, but back then, when they described sorting networks in terms of wires that carry values, they literally meant physical copper wires in hardware. In our case we're obviously not doing that, everything is transistors, so this is just a conceptual model to describe the approach. But the high-level idea is the same; we're just doing it in software now. So we have our input sequence of four keys, and for each key element there's a wire coming out of it that carries its value forward until we get to an output buffer or one of these comparator operations. In this case here, we have nine, five, three, six. At each comparator, we see which value is the min and which is the max: the min value goes out on the top wire and the max value goes out on the bottom wire. So in this first comparator, five is less than nine, so we swap their locations: five will be carried on the top wire and nine on the wire below it. Same thing down below for three and six: three is less than six, so three is carried along on its wire and six on its wire. Then we repeat this, but now we do comparisons across different pairs of wires. So now we compare five and three: three is less than five, so three comes out here and five comes out here. And for this wire there's no other comparator to go through, so we can write three to our output buffer. The next comparator is nine and six: same thing, six and nine, and nine has no other comparators to go through, so we write it to the output buffer. Then we compare five and six and produce the output. So through the sorting network we were able to take an arbitrarily ordered set of keys and produce a sorted output buffer, right? What's really interesting and cool about this is that no matter what our input sequence looks like, what the keys are, and what order they start off in, we always do the same set of comparisons in the same order every single time, because the sorting network is set up to work that way. So if you take this conceptual model and actually write it up as code, a really simple implementation, you're going to do the exact same steps no matter what the input sequence actually is: here's the first set of comparisons, then the second, then the third. So what does this mean? It means that it's super fast, because there are no branches, there are no if clauses like you have in quicksort, where you have conditionals deciding which side of the pivot each element goes on.
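For example, a branch-free four-key sorting network might look roughly like this in code (a sketch, not the exact code on the slide): every compare-exchange is just a min and a max, executed in the same fixed order for every input.

```cpp
#include <algorithm>
#include <cstdint>

// One compare-exchange: the smaller key ends up in a, the larger in b.
static inline void exchange(uint64_t& a, uint64_t& b) {
    uint64_t lo = std::min(a, b);
    uint64_t hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Sorting network for four keys: the same five compare-exchanges run in the
// same order no matter what the input is, so there are no data-dependent
// branches for the CPU to mispredict.
void sort4(uint64_t& k0, uint64_t& k1, uint64_t& k2, uint64_t& k3) {
    exchange(k0, k1);  // first round: compare adjacent pairs
    exchange(k2, k3);
    exchange(k0, k2);  // second round: compare across the pairs
    exchange(k1, k3);
    exchange(k1, k2);  // final round: fix up the middle two
}
```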
You're just always going to execute this in the exact same order every single time, right? You can think of it like loop unrolling: instead of iterating from one level to the next, I just unroll it, and here are the exact instructions I want to execute. So why does this matter? Well, it means we can actually vectorize this, because as I said before, SIMD instructions don't support conditional branches. And here we have a sequence of code that doesn't require conditional branches; it always executes the same instructions in the exact same order every single time, just moving values around. So we can easily vectorize it. Here's how we do it. Instead of sorting one SIMD register of four keys, we're going to sort four registers containing four keys each, in parallel. For this we're going to assume that we have 512-bit SIMD registers with 128-bit lanes; every lane is 128 bits, so we can store four keys in one 512-bit register. In this example I'm only showing the key, which is our join key attribute, but implicitly there's also a 64-bit pointer back to the tuple that the key corresponds to. You have to carry that pointer along when you do this sorting in SIMD, because if I sort these keys and end up with a different order, I otherwise have no way to map a key back to the tuple it belongs to. So for illustration purposes I'm not showing the tuple pointer here, but assume each lane actually holds both. The join key is in the upper bits, because when I do a comparison between, say, five and one to figure out which is the min and which is the max, I essentially ignore the pointer part and only look at the upper key part. This is also another reason why, as I said last class, the Intel paper talked about how you could implement the vectorized sort-merge join algorithm I'm describing here, but they couldn't do it for real at the time, because you needed 512-bit SIMD registers, which we've only actually had since 2017. In the case of the Columbia paper, which came out in 2015, 2016, AVX-512 wasn't around yet, so they only operated on 32-bit keys and 32-bit pointers, which in a real system wouldn't actually work. But on today's hardware we can actually do this. All right, so the way we're going to do this sorting is that we want to produce four-element sorted runs, where each sorted run contains four elements, but we want to sort four runs simultaneously. The first thing we need to do is load the data into our SIMD registers. Assuming the data is contiguous in memory, we can do that with four load instructions into the registers. But the way we're going to sort this is in a columnar fashion. What that means is I can't sort within a single register, because the SIMD instructions available to us don't work that way, but I can sort across the registers. So I'm going to sort in a columnar fashion: in this case, 9, 8, 6, 7, I'm not going to sort that row; I'm going to sort instead this column of 21, 8, 14, and 11. Now I can do the same min and max that I showed on the last slide, but with SIMD.
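As a rough illustration, here is what those column-wise compare-exchanges can look like with AVX-512 intrinsics. This is a simplified sketch that packs eight 64-bit keys per register and leaves out the 128-bit key-plus-pointer lanes described above; it is not the Intel paper's actual code.

```cpp
#include <immintrin.h>
#include <cstdint>

// One column-wise compare-exchange across two SIMD registers: every lane of
// 'lo' ends up holding the smaller key and every lane of 'hi' the larger
// one, with no branches. Requires AVX-512F.
static inline void simd_compare_exchange(__m512i& lo, __m512i& hi) {
    __m512i mins = _mm512_min_epu64(lo, hi);
    __m512i maxs = _mm512_max_epu64(lo, hi);
    lo = mins;
    hi = maxs;
}

// Sort each column of a 4-row by 8-column matrix of keys (row-major) by
// applying the same four-element sorting network as before to all columns
// at once: each register is one "wire" carrying 8 lanes. A transpose step
// (not shown) is still needed afterward to turn columns into sorted runs.
void sort_columns(uint64_t* keys) {
    __m512i r0 = _mm512_loadu_si512(keys + 0);
    __m512i r1 = _mm512_loadu_si512(keys + 8);
    __m512i r2 = _mm512_loadu_si512(keys + 16);
    __m512i r3 = _mm512_loadu_si512(keys + 24);
    simd_compare_exchange(r0, r1);
    simd_compare_exchange(r2, r3);
    simd_compare_exchange(r0, r2);
    simd_compare_exchange(r1, r3);
    simd_compare_exchange(r1, r2);
    _mm512_storeu_si512(keys + 0,  r0);
    _mm512_storeu_si512(keys + 8,  r1);
    _mm512_storeu_si512(keys + 16, r2);
    _mm512_storeu_si512(keys + 24, r3);
}
```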
In this case I need to do 10 vectorized min and max instructions to produce output that gives me the sorted order in a columnar fashion. Again, think of each of these elements or lanes within a register as one of those wires going into my sorting network, and I just invoke the min and max instructions to compare these two, and these two, and so forth, the same way I did on the last slide. So that only requires 10 min/max instructions. But the problem is that I want to produce a sorted run of four elements in memory, so I need to do a little magic to get this column into row form, because if I write this out to memory as-is, it's not a sorted run, right? Four is less than 11, but 11 comes first. So I do a transpose to take this column and convert it into a row. There are shuffle operations to do this in SIMD, and then I end up with my four sorted runs: now, across one register, it's sorted. It takes me eight shuffle instructions to do that transpose and then four store instructions to write the registers out to memory. In quicksort, the number of instructions we'd have to execute to do this would be way more than what we're doing here. This is only 26 instructions, assuming the load takes four instructions: we take 26 instructions to sort 16 keys and end up with four four-element sorted runs. And we can do this because it's deterministic. So this is pretty phenomenal, right? This is a big win. But now what do we have? We have a bunch of four-element sorted runs across the entire key space, which could be a billion keys, and we need to start putting these things together. So at this point, once we've sorted every single key in the table, we enter level two, where we want to start combining these into larger sorted runs. To do this at level two, we're going to use what's called a bitonic merge network. At a high level it looks like a sorting network, but now we're merging two locally sorted runs into a single sorted run that's a bit larger. And we keep doing this and expanding the network, which just means more steps of shuffles and min/max instructions, until we hit half the size of our last-level cache, because then we fall down into level three. Again, on a 2020 Xeon, the L3 cache size is around 36 to 40 megabytes, I think; I think AMD would be less than that. The technique we're going to talk about here is also from Intel. It came out in 2008, one year before the hash join paper from Intel. The reason I like this paper and this technique is that it was actually a big deal: they were able to show that by using SIMD instructions in the bitonic merge networks, you can get almost up to a 3.5x improvement over a SISD, non-vectorized implementation. And a 3.5x constant-factor speedup for an old algorithm, purely from using the hardware correctly, is a big deal. I mean, think of quicksort: it's from the 1960s, and there's no magic wand from theory that's going to make that fundamental, core algorithm that we use all the time in computer science go faster.
The speedup comes purely from making sure we use the hardware correctly. I'll also say that this paper was published in VLDB, one of the major database conferences. Intel is obviously not a database company; they're not in the business of selling databases, they're in the business of selling hardware. (No, I do not want to restart. Thanks, Windows.) So Intel is not in the business of selling databases, they're selling hardware, and the way they stay competitive is they add new features, like SIMD instructions or the non-volatile memory stuff that we'll talk about later in the semester, additional things in the hardware that you, as the application programmer or the database system developer, can take advantage of to actually justify buying new Intel hardware. The issue is that if these things are super complicated and nobody knows how to use them, then Intel isn't going to sell more chips. So they actually spend a good amount of time doing good research and putting out papers that are easy to follow and easy to read, showing how you can apply the latest enhancements in Intel's hardware to databases. The papers are always a good read and I look forward to them. So again, at a high level, the bitonic merge network is just going to look like the sorting network, except now we do multiple steps because we're dealing with larger sorted runs. Say this is our input side, and we're going to take two four-element sorted runs from level one and combine them into an eight-element output buffer. Of the two sorted runs, the first one will be in the same order that was generated in level one. The second one, though, we're going to put in reverse order, so its smallest element is the last element, while for the first run the largest element is the last element. The reason you do this is that concatenating an ascending run with a descending run gives you a bitonic sequence, which is exactly what the merge network needs in order to produce the correct result. Then we have a bunch of these min and max operations, the same as before, which we can vectorize, and we keep doing shuffles and comparisons between them until we produce our final output: a completely sorted run. And the key thing is that this shuffling in particular keeps everything in SIMD registers for as long as possible without having to write intermediate results back out to the CPU cache, because that would slow us down.
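To make that concrete, here is a scalar sketch of one level-two bitonic merge step: two sorted four-key runs go in, one sorted eight-key run comes out, using only fixed rounds of min/max compare-exchanges. The real version does the same thing on SIMD registers with shuffles; the helper here is just for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Merge two sorted 4-key runs (both ascending) into one sorted 8-key run.
// The second run is reversed first, which makes the combined sequence
// bitonic; then three fixed rounds of compare-exchanges at strides 4, 2, 1
// finish the merge, with no data-dependent branches.
void bitonic_merge_8(const uint64_t a[4], const uint64_t b[4], uint64_t out[8]) {
    uint64_t v[8] = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
    auto exchange = [&](int i, int j) {
        uint64_t lo = std::min(v[i], v[j]);
        uint64_t hi = std::max(v[i], v[j]);
        v[i] = lo;
        v[j] = hi;
    };
    for (int i = 0; i < 4; ++i) exchange(i, i + 4);                              // stride 4
    for (int i = 0; i < 8; i += 4) { exchange(i, i + 2); exchange(i + 1, i + 3); }  // stride 2
    for (int i = 0; i < 8; i += 2) exchange(i, i + 1);                            // stride 1
    std::copy(v, v + 8, out);
}
```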
All right, so once we run out of cache space, we fall down into level three. This uses the same steps and the same procedures we did before, except now there's a higher-level orchestration keeping track of what data is available in our CPU caches and having threads operate on that data first, before they jump to another region of memory that still has to be fetched into the caches. The idea is that we're doing some extra bookkeeping to keep track of what's in the CPU caches and where we left off in our pipeline, and all those extra instructions could potentially slow us down, but the cost of doing that extra work is going to be much less than having stalls in our threads while we wait to fetch things from memory. So we're not burning cycles: every thread, when it runs, always has data in its CPU caches and can execute very efficiently. We're not ping-ponging a thread between being CPU bound and memory-bandwidth bound; every time a thread runs, there's data in the CPU cache and it's running as fast as possible, so the instructions per cycle will be much higher. Again, we run this in parallel on multiple cores and keep track of what all the different cores and threads are doing. Within the pipeline, as data comes up from level two and progresses through our merge network, we can have our threads jump around and work on different parts and just keep track of where everything is. There's no synchronization between threads: every thread knows what it needs to operate on, and we don't need any global coordination; every thread can figure out on its own what it needs to do. Now, this is actually very convoluted and complicated, and to the best of my knowledge, no database system actually implements this, because it makes the big assumption that all the threads or all the cores are only being used for sorting, meaning there are no other queries running and no other background tasks like networking or garbage collection or indexing. If you could have different cores doing different things that aren't involved in the sorting, it's hard to have the precision you would need to recognize whether the data I need is in my CPU cache or not. All of that becomes, like I said, more complicated when there are things happening outside of the sorting process. Conceptually it looks like this: these are all the sorted runs we've produced from level two, and we just want to start merging them. What happens is that as a thread starts executing, say in this case it does a merge and starts writing data out into this queue, and before we can start the next merge, we have to wait for all of this data to be done. Instead of having the thread sit there and spin, the thread can pick up and jump to another part of this multi-way merge network and process the data that's there. Then when this data is finally all available in our caches, the thread can come back and pick up where it left off, right? So basically there's a flag at every single stage that we set to say there's nothing to do and the thread shouldn't check for work, and then when work actually is available, it's sort of like a pub/sub notification to tell somebody, hey, come start processing the data that's in my queue. And again, it seems like this would be bad for the CPU caches, because now we have our thread jumping around to different parts of the network, so to speak, and processing different things. But the penalty of waiting for things to be available in the CPU cache before I start running on them would be much higher in terms of the number of cycles you'd have to spend. Again, this makes the big assumption that you have complete control over all the sockets and all the threads on each socket, and I think in a real system that's not the case. Okay, so as I said, as far as I know, nobody does this.
I also think that among the various in-memory database vendors out there, everybody's doing something slightly different. For the disk-based vendors, you'll often see a fancy version of quicksort if everything's in memory; otherwise they do an external merge sort when they have to spill to disk. But I briefly want to talk about what we do in our system, and at some point we should go look at what the other in-memory database vendors are actually doing. We use something called in-place superscalar samplesort. This was a paper that came out in 2017 by a different group of Germans, in Karlsruhe, and the main thing I want to point out is that it's an open-source library on GitHub that we just link into our system; it's not like we re-implemented this, we're just using their implementation. The basic way to think about it is that it's using samplesort, which is a generalization of quicksort. Quicksort has only one pivot point and divides the data into two parts at each step; in samplesort, you sample some keys and use them to decide how many partitions or buckets you want to generate, and then you recurse on those. The key thing that makes this one work really well is that it does all of this in place. It's not truly in place, but it means that the amount of extra storage space you need as you move data around is constant relative to the total input size. What happens is that as you split the data up in the partitioning phase and start writing into an output buffer, when that buffer gets full, rather than allocating a new output buffer, you write it back into the key space, overwriting a part that's already been partitioned, so you're not wasting space or allocating more memory than you actually need. It's also optimized for superscalar architectures, which means it avoids conditional branches in the same way we talked about before, by doing the key comparisons with conditionally executed instructions that the compiler can generate for us. So again, we use this in our implementation, and the researchers, at least in this paper, show that it clearly outperforms some of the more optimized versions of quicksort.
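For what it's worth, using that library really is close to a drop-in replacement for std::sort. A minimal sketch, assuming the IPS⁴o headers are on your include path (the exact call names may differ between versions of the library):

```cpp
#include <cstdint>
#include <vector>
#include "ips4o.hpp"  // in-place superscalar samplesort (IPS4o) library

void sort_join_keys(std::vector<uint64_t>& keys) {
    // Single-threaded sort, same interface shape as std::sort.
    ips4o::sort(keys.begin(), keys.end());

    // Or the parallel variant, which spreads the work across threads:
    // ips4o::parallel::sort(keys.begin(), keys.end());
}
```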
All right, so that's the sorting phase. We've taken our relations and sorted them on the join key, and now we want to have our iterators walk through the two tables and compare tuples from the outer and the inner side, and if there's a match, we make a copy and put it into our output buffer. For what we'll talk about today, we're not going to assume we have to backtrack, but recognize that if I have multiple tuples with the same key, I may need to backtrack on the inner table; we can ignore that for now. At a high level, all the algorithms we'll talk about today work the same way. Just like in the sorting phase, where we split things up and ran them on multiple threads, we want to do the same thing here: we want multiple threads scanning through the outer and inner tables in parallel so that we can do this more quickly. And of course that also means we want them to not require any synchronization during this process, so that they can basically run at bare-metal speed; there's no global coordination to decide who's reading what piece of data. We can do this if everyone writes to separate output buffers; if they have to write to the same output buffer, then you have to do something like a compare-and-swap to claim a slot you want to write into. As far as I know, everyone always does this merge in parallel. I'll also say that if you just want to do an ORDER BY in your query, not a join, just an ORDER BY, or sorting for an aggregation or a DISTINCT, you stop at the sort phase we just talked about; you don't have to do this merge step. The merge step is only for the sort-merge join. So I want to talk about three different approaches for doing this merge phase and putting it all together. The first two are from the ETH paper that you guys read, from the folks in Switzerland, and the last one is from the HyPer guys; it's the paper they wrote in 2012 that said this is the best way to do joins, even better than hash joins. And then the next year they abandoned that and switched over entirely to hash joins. The paper you read basically shows that, for as much as the HyPer Germans touted this approach as amazing, it gets crushed by the first one here, the multi-way sort-merge. We'll go through each of these one by one, see how they set things up and how they differ, and then we'll look at the evaluation and the performance numbers. What's that? All right, so the TA asks whether the HyPer Germans are the same ones as the in-place superscalar samplesort Germans. These are different Germans: the HyPer Germans are in Munich, the sorting algorithm Germans are in Karlsruhe. Different people, different Germans. Okay, so the first one is the multi-way sort-merge. The idea here is that for the outer table, we're going to have all the cores sort their data in parallel using the level-one, level-two approach we talked about before. Then they redistribute the data across the cores, essentially doing another round of range partitioning, and do the multi-way merge from level three that we talked about before. We'll do the same thing on the inner table. That means that at the end of the sorting phase, every core has a partition of data where we know that, for any tuple in the outer table, a match has to either exist or not exist in the corresponding partition of the inner table, and that tuple cannot exist in any other partition. It's like the Grace hash join: by breaking things up into buckets or partitions, I know that the data I'm looking for has to be in this one other partition, and if it's not there, it doesn't exist, and I don't need to check anything else. This is actually going to turn out to be the best approach. The important thing I'll also say is that, earlier, I said the multi-way merge is complicated and no real system actually implements it. For this paper, this is a test-bed system similar to the Columbia paper: as far as I know, it only does the join, so they're not worried about interference from other threads in the system running at the same time. So let's see what it looks like at a high level. Say this is our outer table. At the very beginning, we just have that local NUMA partitioning, the same way we talked about with morsels in HyPer.
These are just unsorted chunks of the table that are local to each of our cores, to each of our threads. Now each thread takes its local partition and does the local sort; that's the level-one, level-two sort we talked about before. Then in the next step we do the multi-way merge, where we combine together the values within a given range from each core and move them to a single core. So for this first range, and we know what the ranges are because we've already scanned the data once, all the data that falls in that range gets written to this core, which then does the multi-way, level-three merging we talked about before, where we're jumping around different parts of the execution flow based on what's in our CPU caches. We do the same thing for all the other data, and this happens in parallel; it's not like we're doing it one at a time. Every core gets all the data it needs from the other cores and does that multi-way merge. So now what do we have? For our outer table, we have a globally sorted table. On the inner table, we do the exact same thing as the outer table; for the sake of space, because the screen's only so big, I'm just going to show this little sort box, but it's doing the same local sort and multi-way merge that we saw here. So now what do we have? At every single core, we have a range-partitioned chunk of data where we know that a key either exists in this partition or it doesn't exist at all: key five should map here, and we're not going to find it in any other partition when we do the join. Now we do the local merge join, where each thread just rips through the data that's local to it, with two iterators running at the same time, doing the comparisons, and every thread writes out matching tuples to its own output buffer. A way to think about this algorithm is that we're paying a penalty in the beginning to do remote writes when we do this merging across different cores, but that means that when we do the merge phase of the sort-merge join, we don't have to do any remote reads across NUMA regions: we're only reading data that's local to us, and that's going to be super fast. The next one from ETH is the multi-pass sort-merge join. For the outer table, we do the same thing we did in the last one for level one and level two: we sort our data locally. But now, instead of redistributing the data across the different cores with that multi-way merge, we're just going to do comparisons across the entire inner table to see whether we have a match, and that may require us to do multiple passes over the table, hence the name multi-pass, for every single tuple, to find the data we're looking for. In this case, the merge phase is just looking for matching pairs between chunks of the outer table and the inner table, and those chunks may be in different NUMA regions. Conceptually it looks like this: same as before, we have that same local NUMA partitioning, and we use that same local sorting of every partition using the level-one and level-two methods.
Then we do a global merge join, where for every single thread, or every single chunk of data on the outer table, we've got to scan through every single chunk of data on the inner table. And all the threads are doing this in parallel at the same time, so you have these n-way connections going around and everybody's reading data from everyone else. We'll see on the next slide that the HyPer guys claim this isn't going to be a big deal because the hardware prefetcher will help us, but that doesn't turn out to be the case at all. All right, the last one is from the HyPer guys: the massively parallel sort-merge join. Here we range partition the outer table and redistribute it across the cores, and then each core sorts its local data, its partition, in parallel. The inner table we don't redistribute at all; each core just sorts whatever data it has locally. Then when we do the merge phase, we still have to scan across the different threads: for every single partition on the outer table, we have to go look at the inner data on the other threads. But since we know our data is sorted, we know what boundary of the inner table we actually need to deal with, and we don't have to scan the entire partition of the inner table every single time we want to do a lookup. So it looks like this: the cross-NUMA partitioning, which, sorry, is different than before. Every thread writes out the data that falls within a given range to a different partition, so you do remote writes in the beginning, and then locally you sort these, and the outer table ends up globally sorted. The inner table, sorry, just gets sorted locally. Then when I do my join, I have to go across partitions: I scan through the entire partition on the outer table, but only scan a portion of each partition on the inner table, and I've got to do that for every single inner partition to find all the data I'm looking for. And of course I can do this in parallel across all my threads at the same time, but everybody's got to do the exact same thing. Right? So what the HyPer guys argue is that in all of these cases you're doing sequential reads. We haven't really talked about hardware prefetching, but we talked about software prefetching when we covered relaxed operator fusion, where there are special instructions that tell the CPU, hey, I'm going to need this data pretty soon, go bring it into my CPU caches. At the same time, the hardware itself is trying to figure out what your access patterns look like, and if it recognizes that you're doing sequential scans over some stride or region of memory, it's going to try to prefetch that data for you because it thinks you're going to keep scanning along. So the HyPer guys argue that in this step, when you start doing remote reads to different NUMA regions, the hardware is going to recognize that you're doing sequential scans over those remote regions and start prefetching the data over the interconnect into the local CPU cache in your NUMA region, and that this will hide any penalties you would pay for doing the remote reads. The truth is, though, this doesn't actually work out to be the case, and the Swiss guys show this in the results.
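Just as a point of reference, since software prefetching came up a moment ago, this is roughly what an explicit prefetch hint looks like inside a scan loop (a sketch with made-up constants and a stand-in comparison). The hardware prefetcher tries to achieve the same effect automatically when it detects a sequential access pattern, which is exactly what HyPer was counting on here.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scan a key array and issue a software prefetch a fixed distance ahead,
// so the cache line is (hopefully) already resident by the time we reach it.
uint64_t scan_with_prefetch(const uint64_t* keys, size_t n) {
    constexpr size_t kPrefetchDistance = 64;  // tuples ahead; tuned empirically
    uint64_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + kPrefetchDistance < n) {
            _mm_prefetch(reinterpret_cast<const char*>(keys + i + kPrefetchDistance),
                         _MM_HINT_T0);
        }
        matches += (keys[i] == 42);  // stand-in for the real comparison work
    }
    return matches;
}
```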
One additional thing that came out of the HyPer paper, which I think is interesting, is that they laid out some rules you should try to follow if you want to implement an efficient sort-merge join algorithm that is aware of the hardware and that is parallelizable. The first rule is that you don't want to do any random writes to memory that isn't local to your thread (contrast this with the multi-pass sort-merge); instead, you want to break things up, redistribute them, and have each core operate on its local data. The second rule is that any time you've got to read data that's not local to your thread, you should do a sequential scan, so that the hardware prefetcher can bring things into the CPU caches and hide the penalty; as we'll see, that doesn't always work out for these joins. And the last one, which relates to the multi-way merge in level three, is in general always good advice for parallel systems anyway: you don't want any core to have to wait for another thread to do something. That means you want to avoid fine-grained latches or synchronization barriers and just have every thread able to operate on data immediately, as much as possible, without coordinating with anybody else. All right, so to finish up, we'll discuss the results from the paper you were assigned to read from ETH, where they compare the three different complete join algorithms I just talked about on a machine that was pretty beefy at the time: four sockets with half a terabyte of DRAM, so everything fits in memory. They compare the three sort-merge joins, the two that they developed and the one from HyPer, and they also compare against the radix partitioned hash join that we talked about last class; I think in this paper they refer to it as the radix join, which just means a radix partitioned hash join. The first thing they evaluate is how the SIMD sorting algorithm from Intel that we talked about before compares against a non-vectorized, non-SIMD implementation when running on a single thread. They compare against the C++ STL's std::sort, which is a hybrid sort that uses quicksort at the beginning and then switches over to heapsort as it gets further along. The graph shows, along the x-axis, how the throughput varies as you increase the number of tuples you want to sort. This is actually a great result, because it corroborates the Intel paper that did the same comparison, so it's further evidence that the SIMD sort is preferable to a scalar sorting implementation, and it almost exactly matches the speedup that Intel reported, which was about 3x faster. This is good science; it's really cool that they were able to reproduce this. So now we want to compare the complete join algorithms, including the sort. For this experiment, the outer table is 1.6 billion tuples, the inner table is 128 million tuples, and they're just using eight-byte tuples.
The way this is divided up is that we have the multi-way, the multi-pass, and the massively parallel one from HyPer, and each is broken up into the number of cycles you're spending per tuple in the different phases. The partitioning phase is about the same across all of them, but the big difference you see is in the sorting and the merging. There's the sort phase of the actual sorting algorithm, the merge phase of the sorting algorithm, and then the merge phase of the join algorithm. This is obviously confusing because we say merge twice, but one is for the sorting and one is for the join. The main takeaway is that the multi-way approach performs the best. Partitioning is the same, its sorting is a little bit slower than the multi-pass, but then its merge phase for the sort is basically nothing, and the merge for the join is super fast, because it's operating on data that's local to my NUMA region, so the number of cycles I spend doing those comparisons is very small. Whereas in the case of HyPer, they're spending a lot more cycles accessing remote memory, and they pay a big penalty. This shows you that the hardware prefetcher that HyPer claims is going to help them here doesn't. Another way to plot this is to include a throughput graph. When you're looking at cycles per output tuple, lower is obviously better; for the throughput line, higher is better because you're processing more data quickly, and as expected, the fewer instructions it takes to compute the join, the higher the throughput. So again, this is a great result that breaks down which phase of the join you're spending your time in, and it shows that if you design your algorithm to be aware and conscious of the cache sizes and the NUMA regions, you get better performance by minimizing the number of remote reads you have to do. You pay a bigger penalty to do more remote writes in the beginning, but then you don't have any remote reads at the end. All right, the next comparison goes into a little more detail by scaling the number of threads for the multi-way and the HyPer joins. The previous experiment used all the threads on the machine, but this one scales up the number of threads. When you're down at one thread, there's no big difference between the two of them, although the multi-way is still faster, but as you scale up the number of threads you see a larger gap in performance. This is on a log-2 scale, so you're doubling the number of threads as you go along the x-axis. If you're able to achieve linear scalability, which is the gold standard you want in a parallel system, then when I double the number of threads or cores, I double the throughput or the amount of work I'm able to accomplish. In the case of the multi-way sort-merge, at 16 threads I'm doing 130 million tuples per second, and if I double the number of threads to 32, I'm doing 259 million, so slightly less than 260, but almost exactly double. And of course, once you get into hyper-threading, those aren't real, true cores; you just have two program counters that you're switching between on the same core.
And so if you're CPU bound, hyper-threading doesn't really help, and in this case we're probably memory bound, so it still gets a little bit of improvement, but it's not going to scale linearly. In the case of the Hyper approach, they're not able to scale linearly at all. Going from 16 to 32 threads they go from 54 to 90: that should be 108, but we're only at 90, so even though we're adding more threads we're getting relatively worse performance, and then when we go into hyper-threading it all falls apart. So the main takeaway is that the extra instructions we spend for the multi-way sort merge in level three pay off, because we're not doing remote reads when we do the merge phase of our join algorithm, and it's the overhead of reading across NUMA regions that is hurting Hyper's performance. What's happening with Hyper is that as I split my data across more and more threads, the likelihood that the tuple a thread is trying to access is not in its own NUMA region increases, because the data is spread across more threads, and therefore I'm doing more remote reads.

The next graph compares the multi-way sort-merge join with the radix hash join, varying the sizes of the inner and the outer table. What it shows is that for smaller tables the performance gap between the two is quite large, but as you go to larger and larger tables, the partitioning gets more and more expensive for the hash join, and that negates some of the performance benefit it has over the sort-merge join, although it still beats it. For this graph they combine the build and probe phases into a single number, but it shows you that the cost of building a hash table and then probing it to find matches can in some cases be less than the cost of sorting the outer and inner tables in the sort-merge join, because after the sorting I still have to do the merge of the sort and then the merge for the join; sometimes the cost of the sorting phase alone is greater than building the whole hash table. So this is showing you why an efficient hash table implementation, parallelized across multiple threads, even though we can't vectorize that entire process with SIMD, is still going to be much faster than sorting.

Where the sort-merge join actually would beat the hash join is if my query has an ORDER BY clause that sorts the output on the same key that I'm joining on. If I do a sort-merge join, the output of the join is already sorted the way I need it for my ORDER BY, so I don't have to do an extra sort. So in this example, if the ORDER BY is on the same key as the join key and I'm using the sort-merge join, I don't need to do anything extra after the join; whereas if I have to sort the output after my hash join, then I have to add that same chunk of sorting time back on top.
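To make that last point concrete, here's a minimal sketch of the merge phase of the join over the two sorted inputs, reusing the assumed eight-byte tuple layout from before and handling runs of duplicate keys on both sides. The thing to notice is that the output naturally comes out ordered on the join key, which is exactly why a matching ORDER BY comes for free with a sort-merge join.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple    { uint32_t key; uint32_t payload; };
struct OutTuple { uint32_t key; uint32_t r_payload; uint32_t s_payload; };

// Merge phase of the sort-merge join: two cursors walk the sorted inputs in
// lockstep; for a run of equal keys on both sides we emit the cross product.
// The emitted tuples are already ordered on the join key.
std::vector<OutTuple> merge_join(const std::vector<Tuple>& R,
                                 const std::vector<Tuple>& S) {
    std::vector<OutTuple> out;
    size_t r = 0, s = 0;
    while (r < R.size() && s < S.size()) {
        if (R[r].key < S[s].key) {
            ++r;
        } else if (R[r].key > S[s].key) {
            ++s;
        } else {
            // Matching key: pair every R tuple in the run with every S tuple.
            uint32_t k = R[r].key;
            size_t s_run_start = s;
            for (; r < R.size() && R[r].key == k; ++r) {
                for (size_t j = s_run_start; j < S.size() && S[j].key == k; ++j) {
                    out.push_back({k, R[r].payload, S[j].payload});
                }
            }
            // Advance the S cursor past the run of matching keys.
            while (s < S.size() && S[s].key == k) { ++s; }
        }
    }
    return out;
}
```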
That extra sort after the hash join will be a bit cheaper, depending on the size of the join output, but it's still an extra step I have to do, and it's something I need to account for in my query optimizer. So this graph is just expanding on the last slide: as you vary the size of the tables you're joining, how does the performance gap between the two approaches close? Again, it shows that with the radix hash join, when you have larger tables you have to do more passes for the partitioning, so the performance benefit you get is reduced, but in the end it's still preferable to the sort-merge join. I forget exactly why this plateaus; I think at some point you're just paying the penalty of reading things in and out of memory the same way you would in a disk-based system.

Okay, so what are the main takeaways? As I said, a modern commercial, enterprise, or high-end OLAP database system that wants to be competitive needs both the hash join and the sort-merge join algorithms. They implement both, and the optimizer figures out which one you actually want to use. But if you're building a new system from scratch and you're targeting OLAP workloads, the first join implementation you want to build is the hash join, because the research shows it's clearly preferable and faster than the sort-merge join. And as I said at the end here, we also did not consider the case where the data needs to be sorted on the join key anyway because of an ORDER BY clause; in that case you wouldn't have to pay that extra sorting cost, and the sort-merge join may actually end up being better than the hash join, but again, this is something the query optimizer can figure out for you.

So that's it for today. On Wednesday we'll get into the part of database systems that I admittedly know the least about but find the most fascinating, and we'll do an expanded lecture series this semester with additional lectures on query optimizers. The idea is to understand how we can take a SQL query and convert it into the best query plan we'd want to execute on our system, so we can make use of all the various techniques we've talked about so far: the join algorithms, vectorized execution, the scheduling methods, all the indexes, all these different things we can now consider when we build our query plan. The paper assigned for Wednesday's reading is an overview of the various problems we have to deal with in a query optimizer, and then next week we'll talk about actual implementations, where the main difference will be the dynamic programming approach from IBM versus the Cascades approach from the Volcano guys, okay? All right, that's it for now, and I'll post on Piazza about getting prepared for next week's project presentations.