Today's the second lecture we're going to do on parallel join algorithms. As I said, the two main algorithms are sort-merge join versus hash join. Hash join was last class; today we're going to do sorting.

Just real quick for project two: there's a bunch of stuff happening on Monday next week. This week has been the status meetings — I've met with some of you already, and I'll meet with people on Tuesday and Wednesday — and this is a precursor to the first major checkpoint on Monday next week. There'll be three things you have to do. The first is that in class you'll do an update presentation, basically describing what the status of your project is, what major challenges or issues you've faced, and what has changed in what you're proposing to do. Again, the website will lay out exactly what's expected. Then you'll have to turn in two things. You'll do a code review submission, which basically means you send a pull request to the master branch of our repo on GitHub, and then I will assign another group in the class to review your code, and everyone has to provide feedback and comments about the quality of the code — what makes sense, what doesn't make sense. But then this year we're doing something different where you also have to turn in a design document. There's a template file we've provided to you, and it's you writing an English description of what your PR looks like and what your implementation actually does. Again, there'll be documentation on the website that explains this.

So for the code review submissions, I'll go over what's expected on Wednesday in class, but the basic idea is that you'll list your PR on the spreadsheet in the Google doc for the class, then I'll assign another group to look at it, and then you'll look at theirs. You're going to do this twice: you do it now as a checkpoint, and then you'll do it again, with the same team pairings, at the end of the semester. So you're not looking at brand new code each time — you're looking at the same project from the other group. And again, the idea is that you write feedback to say, you know, here's what you're doing that seems wrong, or here are some things you maybe haven't thought about, and you take those suggestions and actually make your code better. This is super useful because it lets you look at other people's code, which in classes at a university you almost never get to do — because you're not supposed to, because that's cheating. So this is a way to see how other people do things. Okay, so any questions about any of this? Okay, cool.

All right, so as I said, the focus today is on parallel join algorithms. Again, the big distinction between the introduction course and this class is that in the introduction course we never talked about how we're actually going to execute the join in terms of the low-level operations we're doing on the CPUs, or where the data is actually being stored — it was all about reading and writing blocks from disk. The two main approaches that you care about in a fast data system are the hash join and the sort-merge join, and as I said, we don't really care about nested loop joins.
We need to have them for some workloads, but they're primarily for OLTP things.

So for today's agenda: we're going to do a little bit of background on SIMD, because you need that to understand how we're going to do the parallel sort-merge quickly, then we'll talk about the different ways to actually implement the merge phase, which is the main part that differs, and then we'll look at the evaluation from the paper that you guys read.

All right, so who here has taken 15-418/618? All right — actually fewer than in previous years. So who here knows what SIMD is — or actually, who here doesn't know what SIMD is? It's okay to raise your hand. Okay, perfect. So SIMD stands for Single Instruction, Multiple Data, and this comes from Flynn's taxonomy. We'll cover SIMD in more detail next week when we talk about vectorized execution; what I'm giving you here is the basics you need to understand how we're going to do the sorting quickly on modern hardware. We'll cover how you apply SIMD to vectorize a bunch of other operators in a database system next week. So this is sort of a crash course. SIMD is a category, or class, of CPU instructions that allows the CPU to perform the same operation on multiple data items at the same time — within the same instruction. I'll show an example of this on the next slide.

This is not a new idea. It goes back to Flynn's taxonomy, which laid out the different types of parallel architectures you can have. There's SISD — single instruction, single data — and we'll see that on the next slide; SIMD is another one; and you can also have MIMD, multiple instructions on multiple data. We're focusing on a single socket, a single core, doing this.

Pretty much every modern CPU architecture now has SIMD operations, but they come under a bunch of different names. On x86 from Intel it was originally called MMX, then SSE, and AVX is the modern variant. MMX actually doesn't stand for anything — it's something Intel made up in the 1990s, and they explicitly say it doesn't mean anything, because they didn't want to get sued for stealing somebody else's trademark. SSE stands for Streaming SIMD Extensions. On PowerPC it's called AltiVec, and ARM now has a thing called NEON. At a high level they all basically do the same thing, although the actual instructions and commands you would use vary per architecture.

So let's look at a really simple example here.
Say we want to take two vectors, X and Y, add them together, and write the result out to a new vector Z. The way we would write this in simple C++ code is just a for loop: assuming X and Y are the same size, we iterate over every index of X, add it to the value at the same offset in Y, and write the result to Z. Pretty straightforward.

If you execute this with SISD — single instruction, single data — you just run this for loop: you go through, grab two numbers, add them together, and write the result to your output buffer. So every addition operation in our for loop is one instruction.

The way to do this in SIMD, though, is that instead of looking at every single offset one by one, we're going to collect a bunch of data items together and put them into a special SIMD register on the CPU. In this case we're using 128-bit SIMD registers, and each of these values is 32 bits, so we can store four 32-bit values in one register. Now it's a single SIMD instruction to perform the addition across the two registers. It's just adding based on offsets: the first offset in the top register is added to the first offset in the bottom register, and that writes out to the first offset of the result register. We do the same thing for the other half of the input vector and produce our final output. So where we had eight items and were doing eight addition instructions, we're now down to two SIMD instructions.

This is the basic idea of SIMD: take a bunch of data, put it into these special registers, and invoke special instructions that perform whatever operation you want across the registers and write the result out to a new register. Again, we'll cover this in more detail when we talk about vectorized operations. But then, in order to get that data out of the SIMD registers, you have to store it back through your CPU caches into memory, and only then can you do whatever else you want with it.

Not all instructions can be SIMD-ized, or vectorized, like this. It's really for basic, primitive operations — which is exactly the kind of thing we care about in a database system. The first implementations of SIMD instructions in the 90s were actually really crappy. With MMX from Intel, you couldn't execute regular instructions at the same time as SIMD instructions: you'd put things in the SIMD registers, do whatever SIMD operation you wanted on them, and that stopped the regular CPU pipeline from executing regular instructions. But in modern systems you actually can do them in parallel: in a superscalar architecture, the CPU can stage a bunch of stuff in its regular instruction pipeline and execute SIMD instructions at the same time it's executing regular instructions. Things have gotten way better, and as we'll see today, and again next week, the performance gain you can get from vectorizing things with SIMD in a data system can be quite significant.
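To make that concrete, here's a minimal sketch of the X + Y example in C++ with x86 SSE2 intrinsics. This isn't from the slides — the function names are mine — but it shows the scalar loop next to the vectorized version:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Scalar (SISD) version: one addition instruction per element.
void add_scalar(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
  for (size_t i = 0; i < n; i++) {
    z[i] = x[i] + y[i];
  }
}

// SIMD version: four 32-bit additions per instruction in 128-bit registers.
void add_simd(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&x[i]));
    __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i *>(&y[i]));
    __m128i vz = _mm_add_epi32(vx, vy);  // four additions, one instruction
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&z[i]), vz);
  }
  for (; i < n; i++) {  // scalar tail for leftover elements
    z[i] = x[i] + y[i];
  }
}
```

With the wider AVX registers you'd do the same thing with eight or sixteen lanes per instruction instead of four.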
In that example it was a 4x speedup — a 4x reduction in the number of addition instructions I executed — because I was using the SIMD registers.

The downside is that, as great as these are, implementing an algorithm using these vectorized instructions is still a manual process. Intel has a compiler you can give hints to, to help it try to auto-vectorize certain things, but for general-purpose database systems it's us, the database developers, who have to figure out how to actually do this. And not all operators can be vectorized — or they can be, but you may not actually get the benefit you'd expect, because you have to stage things in registers and write them back out. You also have to worry about data alignment issues. In my example back here, these are all 32-bit integers and I had a 128-bit register, so I could stick four of them in there. If I had, say, 16-bit values, that's not going to work, because the instruction doing this vectorized addition is expecting 32-bit values. There are different lane widths for the different value sizes within the registers, and we have to make sure the values line up with what the instruction expects. And then getting the data in and out, as I said, can be tricky — we'll see this again later with the vectorized instructions. It's basically: how do you take a bunch of disparate regions of memory and write them into one SIMD register, and then go in reverse and put the results back in different locations? That can be expensive.

All right, so again, we're not going to go deep on the low-level SIMD stuff today, but we'll see it in a second when we do the low-level sorting networks, because that's how we're going to get this to run fast — and everything has to fit in our CPU caches.

All right, so the simple sort-merge join has just two phases, sort of like a hash join. In the first phase you do the sort: you sort the tuples of the outer relation R and the inner relation S on the join key. I'm not saying what sorting algorithm we're using here — at a high level it doesn't matter. It can be quicksort, it can be heapsort, it can be bubble sort if you're crazy. We just have to get everything in R and S into sorted order based on the join key. Then in the merge phase we have two iterators that walk through the two relations and do comparisons as they step along, to see whether you have a match. The idea is that by pre-sorting, you avoid having to do a complete scan over the data for every single tuple, as you would in a nested loop join.

Visually it looks like this: we have our inner relation and our outer relation, they both get sorted on the join key — again, I'm not saying what the sorting algorithm is; it doesn't matter — and then in the merge phase we have these iterators that scan down in lock-step and do our comparisons. So that's all sort-merge is. Now, where things are going to get confusing is that what we're talking about today is using merge sort to do our sort-merge join. So I'll try to be very careful when I say "merge" about whether I mean the merge phase of the merge sort or the merge phase of the sort-merge join, okay? It is what it is — we'll do the best we can.
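Here's a minimal sketch of that merge phase of the join — two iterators walking two pre-sorted inputs. This is my own illustration: I'm assuming unique keys in the outer relation R (so no backtracking), and the Tuple struct is just a stand-in.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { int64_t key; int64_t payload; };

// Merge phase of a sort-merge join: both inputs are already sorted on key.
// Assumes keys in r are unique, so we never need to rewind the s iterator.
std::vector<std::pair<Tuple, Tuple>> merge_join(const std::vector<Tuple> &r,
                                                const std::vector<Tuple> &s) {
  std::vector<std::pair<Tuple, Tuple>> out;
  size_t i = 0, j = 0;
  while (i < r.size() && j < s.size()) {
    if (r[i].key < s[j].key) {
      i++;                       // advance whichever side has the smaller key
    } else if (r[i].key > s[j].key) {
      j++;
    } else {
      // Match: emit every s tuple with this key against the current r tuple.
      int64_t k = r[i].key;
      while (j < s.size() && s[j].key == k) {
        out.emplace_back(r[i], s[j]);
        j++;
      }
      i++;
    }
  }
  return out;
}
```

Note how each side only ever moves forward — that's why the merge is just sequential scans once the sort is done.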
Okay, so what's the most expensive part here? It should be obvious: sorting. The merge is easy, because it's just a sequential scan — and again, depending on whether there are duplicate keys, I may have to backtrack, but in general you don't. So the sorting is what we want to speed up.

To do that, we need to figure out a way to parallelize it, and this boils down to designing the database in such a way that it uses the hardware — meaning the CPU and memory — correctly, in the most efficient manner. In general this means three things. First, we're going to parallelize the process as much as possible using all the cores available to us. Second, we'll be aware of our NUMA boundaries, so that we reduce the amount of cross-interconnect traffic from one core or one socket reading data from another socket — we try to only operate on local data. And third — this is why I talked about SIMD at the beginning — to speed up the actual processing on a single core, we want to use SIMD instructions as much as possible.

Now, we'll see that HyPer's sort-merge algorithm violates this last one: in their merge phase they're just doing sequential reads on local data, so they claim the hardware prefetcher will hide the latencies of those reads, and therefore SIMD won't help. But when you look at the ETH paper results that you guys read — they lose. If you're doing sorting, I think all of these principles make sense; these are all good things.

The other thing to say here is that we're discussing parallel sorting in the context of a sort-merge join, but we still need sorting anyway to do ORDER BYs. So everything we'll talk about in the sorting phase and the partitioning phase of the parallel sort-merge join, we can still use for ORDER BY — whether you then merge just depends on what the join needs.

Now, in the parallel versions, just like in the hash join, we introduce a new optional first phase: partitioning. This is just splitting up our relations and assigning the pieces to different workers so they can operate on them in parallel. And the questions remain: where do they actually write out their data — locally or globally? All the same issues we talked about last time with hash joins still apply here. Everything else stays the same: we still sort based on the join key, and then we merge. And how we set things up in the earlier phases determines how we do the merge — are workers reading data on their local core, or reading data from another core?
How we organize those earlier parts determines what we do in the merge. For the partitioning phase, there's not much more to say beyond what we covered last time. The only thing I'd add — and I should have said this for the hash join stuff as well — is that there's a notion of implicit partitioning versus explicit partitioning.

Implicit partitioning is when, as the data got loaded into the database, it was already partitioned ahead of time in a way that's ideal for the join key our query wants to use. Say for these two tables I knew ahead of time what key I was going to join them on; then as I load the data, I can partition it based on that join key — whether it's range partitioning or hash partitioning doesn't matter. Now when it comes time to run my join, the data has already been split up nicely on the join key, so I know where everything is located, and the block of data for a given join key range only needs to look at the other table's block with the matching range. But again, this requires knowing the join key ahead of time. So in theory we can do this, but it almost never plays out in practice, because people always want to do joins on weird things, or things that were unexpected.

The radix join we talked about before, and the range partitioning stuff we'll talk about today, are examples of explicit partitioning. The data is stored however it happens to be stored, and when our query shows up, we generate partitions on the fly that are ideal for that one query at that moment in time. After we do the join, we throw them away, and when the next query comes along we do it all over again. In theory, if you see the same join query on the same join key over and over, you could reorganize your data to bake this in, so you're not paying the cost per query. In practice, for really large data sets that's quite hard, and as I said, for arbitrary queries it's impossible. You typically see implicit partitioning mostly in OLTP workloads, because the queries are the same thing over and over again and there's a nice natural hierarchy to how the schema is laid out.

So that's what we're talking about here — and again, the explicit approach can be the radix partitioning from last class, or the range partitioning we'll talk about today.
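As a rough illustration of that explicit, on-the-fly range partitioning, here's a sketch of my own — assuming we've already scanned or sampled the keys to know their min and max, which is an assumption, not something a real system gets for free:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assign each key to one of num_parts buckets by which contiguous key range
// it falls into. In a real system each bucket would be a per-worker buffer;
// here it's just a vector of vectors for illustration.
std::vector<std::vector<int64_t>> range_partition(
    const std::vector<int64_t> &keys,
    int64_t min_key, int64_t max_key, size_t num_parts) {
  std::vector<std::vector<int64_t>> parts(num_parts);
  // Width of each key range, rounded up so max_key still lands in-bounds.
  int64_t width = (max_key - min_key) / static_cast<int64_t>(num_parts) + 1;
  for (int64_t k : keys) {
    size_t p = static_cast<size_t>((k - min_key) / width);
    parts[p].push_back(k);
  }
  return parts;
}
```

The point is just that partition assignment is a cheap arithmetic computation once you know the key distribution — the expensive part is the data movement it triggers.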
All right, so as I said, the most important thing to worry about in a sort-merge join is the sorting phase. When we talked about this for a disk-based system, the bottleneck wasn't the sorting — the bottleneck was moving data in and out from disk to memory. So in that class we talked about external merge sort, where the idea was to break things up into chunks and operate on one block at a time in memory; when we're done, we write it back out and bring in the next block. So once we brought a block into memory, the sorting algorithm didn't actually matter — quicksort was good enough.

But now everything's in memory, and quicksort is actually not going to be good enough. We want to be smarter, because we need to be aware of where our data is actually being stored and what our architecture looks like. Quicksort — the textbook definition of it — has no notion of locality or of the data's physical location; it just swaps things wherever it needs to, and that means it could be reading across NUMA boundaries. So we want to pick a sorting algorithm that's aware of where data is being stored, and of the size of the data, so we know what level of the CPU cache we can operate in — and that changes which algorithm we want to use.

So what's going to happen is we'll do this in stages, or levels, and at each level we generate what are called sorted runs. A sorted run is just some segment of the table we want to join that is in sorted order locally. Say my table has a thousand tuples: I can grab four of them and sort just those four, and that's my run. Globally it's not sorted, but within the local piece I looked at, it is. The idea is that we build up the size of these sorted runs incrementally until we produce the total, globally sorted run for the entire table. And the way we do this is by being aware of how big our runs are, and using a different method or algorithm that fits the storage level we're at, so we can go as fast as possible. Once we exceed that size, we go to the next level and choose a different algorithm optimized for that storage level.

At the very top, in level one, we do in-register sorting, because CPU registers are the fastest memory we can get access to — but we can't store much, so we operate on tiny runs there. Once our sorted runs exceed the capacity of the CPU registers, we go to level two, where we do in-cache sorting. This reuses techniques from the first level, but now it's looking at larger runs and being mindful of where our cache boundaries are. Then, once we exceed the CPU caches, all bets are off, and we use an out-of-cache sorting technique. The approach I'm going to show you — from an Intel paper — seems like it should be slow, because you execute more instructions; but because they stage the sorting operations in a clever way, you avoid the stalls that come from operating on things that aren't in your CPU caches.

All this comes from a paper written by Intel in 2009. Remember I showed you that table of six papers before, and I said it started with the Intel/Oracle paper, which led to HyPer, which led to the paper you guys read from the Germans — this is that paper. Basically it says: here's how to do hash joins really fast on modern hardware, but if you had larger SIMD registers — which we can assume now, because we have them — this is how to do sort-merge very efficiently. I'll also say that I'm not going to use the exact terminology from this paper. This "level" terminology is something I came up with, because I think they call them either phases or stages, and that overlaps with the phases of our join algorithm. So for simplicity, we'll just call them levels.
Again, the high level looks like this. This is called cache-conscious sorting — cache-conscious meaning we're aware of what level of the CPU cache we're operating in, and therefore how much space we actually have. At the very beginning our table is unsorted. In level one we do the in-register sorting, which generates sorted runs of four elements. Once we get larger than that, we go to level two, and start combining these smaller sorted runs into larger sorted runs. And then down here we generate even bigger sorted runs, until we produce the final, globally sorted run. That's the high-level flow, and we'll go through each of these steps one by one. We'll focus most on the first one, because I think it's the easiest to understand, and then we can build on top of it to do more complicated things.

All right, so in the first level they use a technique that's actually really old — like 1940s old — called sorting networks. The basic idea back then was to build specialized hardware that could sort data sets — like, literally, wires that take data in, do swaps, and produce sorted output. The way it works is that each input is attached to a wire that carries its value along. So at this point here, along these wires, the values are nine, five, three, six. Then there are these stages where we do, basically, a min/max: whatever the min value is goes on the top wire, and the max goes on the bottom wire. So for nine and five: the min, five, goes on top, the max, nine, goes on the bottom, and those are the new values on those wires after the comparison. Is this clear? Same thing down here with three and six. And I keep doing this along the network — in this case the three doesn't need any more comparisons and becomes part of the final output; the other values keep going — and at the end I produce my final sorted values.

What's really cool about this is that no matter what input values come in, I always perform the exact same comparisons. That means you can bake this into hardware with wires, like in the 40s, and the same wire configuration always produces a correctly sorted output sequence for any input. So can anyone guess what the advantage of this is for modern CPUs? Exactly — no branch misprediction. Think about how quicksort works: if this is less than this, go here; otherwise go there. With a sorting network, I can write this without SIMD as really simple code: take the input sequence as an array and do three stages where I'm just computing mins and maxes and writing out the results. There are no branches, so there's no branch misprediction.
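Here's a minimal sketch of what that branch-free code looks like for four elements — three stages of fixed min/max compare-exchanges. This is my own version, not the slide's exact code:

```cpp
#include <algorithm>
#include <array>

// Branch-free sorting network for four elements: the same three stages of
// compare-exchanges run on every input, so there is nothing for the CPU's
// branch predictor to mispredict.
inline void sort4(std::array<int, 4> &v) {
  auto cmpxchg = [&](int i, int j) {   // one "wire crossing": min up, max down
    int lo = std::min(v[i], v[j]);
    int hi = std::max(v[i], v[j]);
    v[i] = lo;
    v[j] = hi;
  };
  cmpxchg(0, 1); cmpxchg(2, 3);  // stage 1: sort each adjacent pair
  cmpxchg(0, 2); cmpxchg(1, 3);  // stage 2: compare across the pairs
  cmpxchg(1, 2);                 // stage 3: fix up the middle two
}
```

The min/max calls typically compile down to conditional-move instructions, so the instruction stream is identical regardless of the data.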
I'm going to execute the exact same code every single time. Yes? So the comment is that you can parallelize all of these comparisons — yes: SIMD, next slide.

Okay, so let's see how to do sorting networks with SIMD — to parallelize them, or vectorize them, to use the term I prefer. As I said, we're going to operate on four values per register. I'm only showing the keys here, but when we do joins in a real database system, I need to store both the join key and the pointer to the tuple — because otherwise, if I just sort these values, I don't know what tuple each one belongs to. So I'm not showing it, but implicitly each value is the join key plus the pointer. Now, if you assume the pointer is 64 bits and the key is also 64 bits, each value here is actually 128 bits. That's why the Intel paper — again, from 2009 — said sort-merge couldn't win on the systems of that time: they didn't have 512-bit registers. If you have 512-bit SIMD registers, you can store four 128-bit values; we can do that now, as of 2017, but back then you couldn't. So again, assume the pointers are implicitly stored, and that when you do the comparison you're comparing only the key portion, not the pointer.

So, to vectorize this, we're actually going to generate four sorted runs all at the same time. This is because SIMD instructions can't do comparisons within the same register — they compare across registers. So for a register holding 9, 8, 6, 7, I'm not going to sort those four values against each other directly. They will end up sorted, but I need to flip things around so that I'm storing them almost like a column store. The first step is to load the data into our registers — either a contiguous read of a chunk of memory, or a bunch of single loads; in practice you want one load per register, because that's faster. Then we sort across the registers: we apply the same min/max instructions we used on the last slide, which puts things in sorted order column-wise, and then we transpose to put everything back in the correct row order. So even though my register started out as 9, 8, 6, 7, I don't care where those values actually end up — I just want to come out with sorted runs. Does that make sense? All I care about is that each local run itself is sorted.

So think about this: I went from quicksort — way more instructions, and way more cycles because of branch mispredictions — to a deterministic algorithm I can execute in a vectorized fashion that sorts 16 values with just twenty-some instructions. That's crazy. That's awesome. So in the first level we generate sorted runs of four elements, and we do this for everything.
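Here's a rough sketch of that column-wise trick with SSE intrinsics. It's my own minimal illustration — it sorts bare floats rather than key/pointer pairs — but it shows the idea: the min/max network sorts four runs at once across the registers, and the transpose turns them back into rows.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (floats keep the shuffles simple)

// Each __m128 holds one element from each of four runs-in-progress, i.e.
// the data is laid out column-wise across the registers. Running the
// sorting network with min/max across registers sorts every column at
// once; the transpose then turns the columns into four sorted runs of 4.
void simd_sort4x4(__m128 &a, __m128 &b, __m128 &c, __m128 &d) {
  auto cmpxchg = [](__m128 &x, __m128 &y) {
    __m128 lo = _mm_min_ps(x, y);
    __m128 hi = _mm_max_ps(x, y);
    x = lo;
    y = hi;
  };
  cmpxchg(a, b); cmpxchg(c, d);  // the same network as the scalar sort4,
  cmpxchg(a, c); cmpxchg(b, d);  // but doing four independent sorts at once
  cmpxchg(b, c);
  _MM_TRANSPOSE4_PS(a, b, c, d);  // rows now hold the four sorted runs
}
```

(`_MM_TRANSPOSE4_PS` is a standard SSE macro that expands into shuffle instructions, so everything stays in registers the whole time.)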
The instruction count works out to: 4 for the loads, 20 for the min/max operations — so we're at 24 — 8 for the shuffles, and 4 for the stores. So 36 instructions to execute all of this. That's really good. But of course nothing is in globally sorted order yet, so now we pass these sorted runs to level two and let it produce larger sorted runs by combining the smaller ones.

All right, so in the next level we're going to use what's called a bitonic merge network. This will look a lot like the sorting network from the last level, but instead of sorting things within a single CPU register, we're going to sort things that are much larger. We can still use SIMD — we just have to do extra work to move things around and get them into the right order so we can process them with vectorized instructions. The key thing to be careful about, because we want everything to fit in the CPU caches, is that we only generate runs that are half the size of the available cache — we need one half for input and the other half for output.

Again, this comes from a paper that predates the 2009 Intel sort-merge join paper, called "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" — another Intel paper, in a major database conference, where they show how to do sorting really fast with SIMD. They report about a 3.5x speedup over the scalar (SISD) implementation. This is crazy when you think about it: think about how old quicksort is. Asymptotically nothing here is faster, but on real hardware, in constant factors, we're getting a 3.5x speedup over what quicksort can do. For a core, fundamental algorithm that we use all the time in computer science, that kind of speedup doesn't come around very often.

Intel actually writes very good papers. Intel is not in the business of selling a database — they're in the business of selling hardware — and to stay competitive they keep putting out new features in their hardware, like SIMD. But if nobody knows how to program those new features, Intel isn't going to sell more hardware. So Intel writes really good papers that are easy to read and follow, which say: look, Intel just added this new feature; here's how to use it in databases, here's how to use it in Bitcoin mining, or whatever. They also put out a lot of open source libraries for taking advantage of Intel hardware. So I like their work, I like their papers, and this is one good example.
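Before looking at the network diagram, here's a minimal scalar sketch of a bitonic merge — two sorted four-element runs merged into one sorted eight-element run. Again, this is my own illustration: reversing the second run makes the concatenation a bitonic sequence, and then three fixed stages of compare-exchanges (at distances 4, 2, 1) finish the job with no data-dependent branches.

```cpp
#include <algorithm>
#include <array>

// Bitonic merge of two sorted 4-element runs into one sorted 8-element run.
// The comparison pattern is fixed, so like the sorting network there is
// nothing for the branch predictor to get wrong.
std::array<int, 8> bitonic_merge(const std::array<int, 4> &run1,
                                 const std::array<int, 4> &run2) {
  std::array<int, 8> v;
  for (int i = 0; i < 4; i++) {
    v[i] = run1[i];
    v[7 - i] = run2[i];  // second run goes in reverse: bitonic sequence
  }
  for (int dist = 4; dist >= 1; dist /= 2) {  // three stages: 4, 2, 1
    for (int i = 0; i < 8; i++) {
      if ((i & dist) == 0) {  // pair element i with element i + dist
        int lo = std::min(v[i], v[i + dist]);
        int hi = std::max(v[i], v[i + dist]);
        v[i] = lo;
        v[i + dist] = hi;
      }
    }
  }
  return v;
}
```

The SIMD version in the paper does the same thing, but with min/max and shuffle instructions operating on whole registers at a time.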
So I'm not going to walk through all the detailed steps like we did for the sorting network, but here's a high-level visualization of what a bitonic merge network looks like. In this case we're merging two of the four-element runs that we generated from level one. The first sorted run stays in the order it came out of the algorithm, but we have to put the second one in reverse order — because of the way the min and max operations are staged as you go along, you need this inverted order, so that the last element of one run is the smallest and the last element of the other is the largest. Then you do some swapping, and shuffles let you move things around while keeping everything in the SIMD registers, without going back out to the CPU caches. You do this in three stages and you end up with a sorted run over all the values. It's exactly the same idea as before, just with more stages, because we're looking at larger runs — and you can keep expanding the run size until you fill up half your CPU cache.

Now, once we exceed half the size of the L3 cache, we fall down to level three, where we do what's called multi-way merging. The basic idea is that we break up the merging of our sorted runs into tasks, which we then farm out across a bunch of different threads or cores. We do some internal bookkeeping to track whether the data a task is going to operate on is in the CPU cache, and when we think it's ready, we schedule a thread to do the actual merge. We keep staging this out to produce larger and larger sorted runs. This requires more bookkeeping on our part inside the database system, but the trade-off is that we pay extra instructions in exchange for fewer cache misses and memory stalls. It's a careful orchestration of what the CPU is doing, so that every time we execute something, all the data those instructions need is available in the CPU caches.

Now, this sounds amazing, right? I'll show a simple visualization of it, and it sounds like exactly what you'd want to do — but I think this is really hard, and I don't think anybody actually implements it. The challenge is that this assumes the data system has exclusive access to the hardware, so that when you prefetch things into your CPU caches and schedule a task to be executed, the data is guaranteed to still be there. I think the paper is a nice idea, but I feel like this would be hard to implement in a real system with a lot of things going on at the same time. But that's fine.

So it looks like this: these are all our sorted runs from level two, and we have these little merge operators. The merge operators get put into task queues, and the idea is that there's some background process keeping track of what data each task in the queue will process, and it goes ahead and tries to prefetch that into the CPU cache, so that when somebody actually operates on the task, the data is available.
So the idea is: say the piece of memory this task needs to merge is not in my CPU cache — I don't want to execute that task yet. I wait, and once it's in my CPU cache, I go ahead and schedule it. So basically the thread is jumping around, doing all these different merges in different places, which seems like a bad idea — it's not a context switch, because you're still in the same thread, but you're jumping the execution context all over the map, and there's extra work to figure out the scheduling. The trade-off, though, is extra instructions in exchange for better cache locality. I'm being a bit vague here, but the merging itself is simple — it's two nested for loops, or two iterators, asking "is this thing less than this thing" and merging runs together — and as runs get near the front of the queue, we make sure we prefetch them. I'm seeing blank faces — is this sort of clear? Okay.
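If you strip away the task queues and the prefetch scheduling, the core of that level-three step is just a k-way merge of sorted runs. Here's a minimal heap-based sketch of that core logic — my own simplification; the paper's version merges in cache-sized chunks with bitonic networks rather than a heap:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Combine many sorted runs into one globally sorted output in a single pass.
std::vector<int64_t> multiway_merge(
    const std::vector<std::vector<int64_t>> &runs) {
  // Heap entries: (current value, which run, index within that run).
  using Entry = std::tuple<int64_t, size_t, size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (size_t r = 0; r < runs.size(); r++) {
    if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
  }
  std::vector<int64_t> out;
  while (!heap.empty()) {
    auto [val, r, idx] = heap.top();   // smallest head among all runs
    heap.pop();
    out.push_back(val);
    if (idx + 1 < runs[r].size()) {
      heap.emplace(runs[r][idx + 1], r, idx + 1);  // pull next from that run
    }
  }
  return out;
}
```

What the paper adds on top of this logic is exactly the part I said is hard: only dispatching a merge step once its inputs have been prefetched into cache.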
All right, so that gets us the sorting. We could have just done a quicksort, but in this example we're doing a merge sort, because that's something where you can actually take advantage of SIMD.

So now we get to the merge phase of the join, because we have a bunch of data that's been sorted, and we just want to iterate over it and merge it together to produce our join output. We can do this in parallel if we organized the data correctly in the previous step, because this is essentially a read-only operation: during the merge phase we're only generating output. We may need to synchronize on the output, depending on whether we write to one buffer or it's partitioned, but during the actual merge process of the join, nothing gets written or modified — we've already sorted everything, so there's no coordination needed while we're doing it. And as we already said, depending on duplicates, and on whether we're doing a left outer join, we may have to backtrack — but for our purposes here we won't worry about that.

There are three ways we're going to do our sort-merge join. There's the multi-way and the multi-pass sort-merge from the ETH paper you guys read, and then there's the massively parallel sort-merge from the HyPer guys, which I think the paper covers as well — they include it in the benchmark, though I don't know if they describe it in detail. We'll go through these one by one, but the spoiler: the HyPer guys said this is the way to do joins — don't worry about hash joins, do sort-merge, here's how to do it quickly. That was 2012, 2013. Then they came back and said it's a bad idea, and the ETH paper you read basically shows this thing gets crushed — it's not even close. So I don't know why they were so gung-ho about it, but that's fine.

Okay, so the first one is the multi-way sort-merge, and this is the best approach. It's basically everything we've talked about so far: the cache-conscious sorting with the three levels. One caveat: this isn't a full system — it's just a little test-bed. So interference in the third level, from other threads getting scheduled at different times — they're not worrying about that, because they're running alone on the machine.

For the outer table, each core sorts all the data in its local partition, with the sorting networks at level one and the bitonic merge networks at level two, and then does the multi-way merge sort — and the inner table does the exact same thing. What you end up with is that, for each partition of the inner and outer tables, everything is localized and sorted together, so each core only needs to operate on the data within its own partition. You can do the join entirely within a single core, because all the data that could ever possibly join together is located together.

Let's look at it stage by stage. In the first stage, we're assuming explicit partitioning, so we do the same thing we did with HyPer: split the data into chunks, or blocks, or morsels — whatever you want to call them — and every core generates a locally sorted segment — a run, if you will — for the data on that core. Then we do the multi-way merge: all the cores write out the data within a given key range to a buffer in the memory space of one core. This is range partitioning: I know the distribution of values because I've already made a pass over the data, so I can divide it into equal-sized ranges and say, everything in this range goes here, everything in the next range goes to the next core. Then within each core we do the multi-way merge sort. We do this for everyone, so now each core holds one chunk of a globally sorted run.

Now for the inner table, we're going to do the exact same thing.
I'm not going to draw it all out — I'll just write the word "sort," because I ran out of space — but it's all of this: the local sort (level one, level two), then the redistribution to put each key range on one core, then the multi-way merge sort to combine it, so that this side is now globally sorted too. So now when I do my join, it's all local: the data at this partition, on this core, only needs to look at the corresponding partition on the other side — it doesn't need to talk to anybody else — and I just rip through it in parallel.

There's a bunch of communication up front, because we had to shuffle things around in the range-partitioning step, but once we're past that, everything happens locally. It may not all fit in the CPU caches, because these chunks are actually quite big, but there's no coordination across the cores — each core just operates on the data local to it. The only part that needs synchronization is the redistribution: everyone has to wait for the sorting to finish, you figure out what the ranges are, and then everyone blasts their data out to everyone else. Essentially, another way to think about this is that we pay the penalty of remote writes and reads only once, in that one phase — everything else is local. Again, the spoiler: this turns out to be the best.

The multi-pass sort-merge is just like the multi-way sort-merge, except you don't do that redistribution. You do the local sort (level one, level two), but then every core just keeps its locally sorted data, and during the join it compares against the entire table on the other side. So here you're not doing the random remote writes the way the multi-way one does, but you're doing more remote reads when you actually do the merge phase of the join.

It looks like this: we do the exact same local sort on both sides, but now we do a global merge. Every thread, on every core, iterates over its local data, and it knows there are a bunch of sorted runs on the other side — the inner table — so it has to scan over every single one of them to find its matches. These are remote reads, because those cores could be in different NUMA regions. Now, the argument you can make — and the HyPer guys sort of make it — is that these are sequential scans: I'm scanning these regions of memory from top to bottom, and obviously, depending on what I'm joining on, I can truncate and say I've read far enough and stop — but it's a lot of sequential reads. So the HyPer guys claim the hardware prefetcher will speed this up and mask the latency of reaching over to a different NUMA region, because it will guess that I'll need the next chunk of data after reading the current segment, and bring it over for me. That's hardware prefetching — something the low-level hardware does on its own.
We'll see next class, or next week, that there's also software prefetching, where you provide hints to the hardware and say "prefetch this for me" — but in this case we're just letting the hardware figure it out, and it doesn't always get it right. Is this clear? The multi-way does the redistribution, so you have locally sorted runs at each core that together form a globally sorted table, and all the data you need is at your core. In the multi-pass, you don't do that redistribution, so you have to do the global merge.

The last approach is the massively parallel sort-merge from HyPer, and they do something different from the other two. On the outer table, they do range partitioning and then sort it, so you have a globally sorted outer table. But you don't do the same thing for the inner table — you just do a local sort at each core of the inner table. Then in the merge phase, you merge a sorted run of the outer table with just the segment of each inner run that you actually need to care about, because the inner runs are locally sorted.

Let's look at the visualization. On the outer table side, we do the partitioning ahead of time. This is not sorting — it's just chunking — and I think it has to be range partitioning; if you did radix, I think it would jumble things — actually, I forget exactly what they do, but it doesn't matter much. Somehow we're chunking things up so that data that's close in key order ends up on the same core. Actually, yes, it is range partitioning, because you want the outer side to end up globally sorted. So after sorting, the whole outer table is globally sorted, and each partition — each core — holds one chunk of that globally sorted range.

On the inner table, we sort, but only locally. This could be the level one/two/three stuff, or quicksort — it doesn't matter — but we're not redistributing anything, so the inner side is not globally sorted. Now when we do the join, the idea is that for any sorted run of the outer table, we'll scan the whole run, but on each core we only scan the subset of that core's locally sorted inner run that overlaps our key range. So I look at a little bit of data from each partition, but I'm not scanning the whole thing the way I had to in the multi-pass. Same thing for the next run — the same kind of scan going down, and so forth. Is this clear?

So again, this is not the sorting algorithm — this is how we organize the data as it's produced in the sorting phase, and that determines what kind of implementation we use for the merge.
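To make that "scan only a subset" idea concrete, here's a sketch of my own of one worker's merge step in this massively parallel scheme — it's an interpretation, not the paper's code, and it assumes the worker knows the key range [lo, hi) its globally sorted outer chunk covers:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One worker's probe: its outer chunk covers keys in [lo, hi), so within
// every locally sorted inner run it can binary-search to the start of that
// range and stop once it passes hi, instead of scanning the whole run.
void mpsm_probe(const std::vector<int64_t> &outer_run,   // globally sorted chunk
                const std::vector<std::vector<int64_t>> &inner_runs,
                int64_t lo, int64_t hi,
                std::vector<std::pair<int64_t, int64_t>> &out) {
  for (const auto &run : inner_runs) {
    // Jump to the first inner key that could fall in our range.
    auto it = std::lower_bound(run.begin(), run.end(), lo);
    size_t i = 0;
    for (; it != run.end() && *it < hi; ++it) {
      while (i < outer_run.size() && outer_run[i] < *it) i++;
      if (i < outer_run.size() && outer_run[i] == *it) {
        out.emplace_back(outer_run[i], *it);  // emit the matching key pair
      }
    }
  }
}
```

The reads of each inner run are still remote and sequential — which is exactly where the hardware-prefetcher argument comes in.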
Now, the HyPer guys came up with a bunch of rules in their paper that say: if you're going to do parallelized execution of a sort-merge join, here are some general principles. And even though we don't want to do sort-merge — even beyond databases — I think these are still good rules to live by. First: never do random writes to non-local memory. All the writes they do are to their local partition — I mean, the partitioning phase sort of counts as random writes, but once you're running the join algorithm itself, there are none. Second: if you ever have to read non-local memory, make sure you're doing sequential reads — sequential scans — because that lets the hardware prefetcher hide the latency of reading remote data. Third: avoid any synchronization primitives that require one core to wait for another to finish before it can proceed. This last one is sort of obvious, but once everything has been sorted, the whole operation is read-only, so they don't need any synchronization across the cores.

All right, so let's get to the evaluation — this is from the ETH guys, in the paper you read. They compare the three sort-merge implementations we just talked about, as well as the radix-based hash join from last class — I think they call it the radix join, but it's radix partitioning for a hash join. They run on a much beefier machine than the one from last class: a four-socket Intel Xeon with eight cores per socket plus hyper-threading — so 64 total hardware threads — and half a terabyte of DRAM, so everything always fits in memory.

The first experiment compares the speed of the SIMD sort — the level one/two/three stuff we covered — against the C++ Standard Template Library's sort, which uses a hybrid approach: it starts with quicksort and then, once the data is partially sorted into enough runs, switches over to heapsort. Along the x-axis they scale up the number of tuples to sort, and the y-axis is the number of tuples per second processed. This is nice: it shows roughly a 3x improvement for the SIMD sort in the best case, which corroborates the Intel paper's findings from 2008. So this is another example of reproducing other people's work and showing that the science actually matches up. It shows that SIMD sort is clearly ideal — it's something we'd want to do.

So now let's get into the actual join numbers. This part's confusing.
So this part's confusing so There's the partition phase the sorting phase and then there's the break they were doing a breakdown between the The merge of the sorting phase and then the merge of the join phase so s merge and m join that's what those are So that the main takeaway here is that other than hyper approaches actually the worst Is again the multi-way one even though it's doing more work like it seems like to do that redistribution Would slow us down, but because we get better cash locality when we actually do the merge phase Right, that's why that they're getting the best performance here Right another way to plot this is that you know along this access in the bar charts We're measuring cycles the number of CPU cycles per output tuple We also just measure it in terms of throughput and a line graph like here and as expected right as you Actually, what is that saying sorry yeah as you execute fewer instructions you get higher throughput Right, so we're doing less work to execute this Operating the same amount of data, so that's why this one gets better throughput and executing lower cycles Again, this is expected because again We're actually more instructions, but the instructions we do execute take fewer cycles That's the sort of way to read that I for the next graph they want to compare the They're multi-way join versus the the hyper join just sort of getting a better understanding what's actually going wrong In the hyper approach, so for here as we're doubling on our threads We want to see double the throughput that's called linear scaling That's the ideal thing we want to have in a parallel database system So you see roughly is that when we're at 16 Harvard threads real threads. We're doing 130 million tuples a second And then when we double the 32 threads we're roughly almost you know 260 Almost exactly double. That's nice obviously when you get to hyper threading things fall off because these aren't real cores And you have cast contention and other issues in the case of hyper and once you get to these larger counts They're not scaling and I think the reason if I remember correctly was the just the The overhead of reading data across different numeric regions that just made everything slower So as you increase the number of threads you increase the likelihood that the data you need is on another numeric region So therefore they get slow performance and then the paper talks about how like the The you know the the claim that the harbor prefetch or it's going to save us didn't actually turn out to be correct at all all right, all right This one now they're comparing the the radix hash join versus the the the sort merge join for the multi-way sort merge and I Don't think that last class I said like you know Nobody actually does the partitioning stuff and here all these gray bars are the partitioning stuff and this sort of shows you like I Don't have no I don't have numbers to say if you didn't do the hash partitioning You know how much lower would it actually be but in this case here? They're they're partitioning 16 1.6 billion tuples You know you're paying a lot of overhead to do that in terms of the total cost I do in the join so this just shows you that the hash join is always faster I think the non-radix version of this would be even better so Every day decent every major day decent will still support both these algorithms But nine times out of ten or even probably even higher probability. 
it'll always want to do the hash join. This is the best version of what the sort-merge join can do, and it still loses to what I consider an inferior hash join implementation.

All right, the last one is the same experiment again: they're showing how performance falls off for the radix hash join as you increase the number of tuples — and the sort-merge join never does better. What's curious here is that no matter how many tuples you run, the sort-merge side is essentially doing the same amount of work per tuple, and the overhead of all that sorting is quite a lot, whereas the hashing part is quite efficient. As you go down here, the bottleneck really becomes the partitioning phase, because there are a bunch of extra writes.

All right, so — any questions about this? I realize the room feels like it's falling asleep. Yes? The question is: do these sort-merge techniques only work on integers — primitive, fixed-length types? To do this on variable-length data, you're stuck, and — yeah, we should have talked about this. If you're sorting strings that are uncompressed, it may make sense to do a pass over them and apply dictionary compression first, because then you can take advantage of all of this. The more common case is that my inner table and outer table are compressed with different dictionaries, so I decompress one and recompress it against the other table's dictionary, and now I can do all of this. There are no SIMD instructions that work on variable-length data: if you have a fixed-length char and it fits in the SIMD lanes, you can do this, but if not, you have to re-encode it.

Okay, so again, you need both of these approaches, because there may be cases where the sort-merge actually is better — especially if you're doing a join on a join key and the ORDER BY clause is on that same key. Instead of doing the join and then sorting, you just do the sort-merge join, the output comes out already sorted, and you don't do anything extra for the ORDER BY. That's why some data systems keep both algorithms around. But getting a good hash join is the most important thing: if you're building a data system from scratch and you want to do joins, the first thing you should implement is a hash join. That's for OLAP — for OLTP, you do index nested loop joins. Okay?

All right, what is the current temperature in here? It went up — it was 77, now it's 78... okay, we're down to 77 now.

All right, so let's stop here. On Wednesday we're doing query compilation. This will teach you what query compilation is, it'll explain why some of the projects in this class are set up the way they are, and it'll show you the mistakes that we made in our old system, Peloton — which also explains why what we're doing now is the right way to go forward. I think it's pretty much us and MemSQL who are the only ones doing it the way we're doing it now, as far as I know. Okay? All right guys, let's get out of here.