So this is the second lecture we're going to have on join algorithms. Last class was hash joins, which I said is the dominant join algorithm that everyone uses. Today's class is the second major class of join algorithms, which is sort merge. Again, any real OLAP system that wants to get good performance never does nested loop joins; it's going to be doing either a hash join or a sort merge join. But hash joins are definitely more common than sort merge joins.

So real quick, before we get into the material, there are some dates coming up that you need to be aware of. As a reminder, everyone should be submitting their first code review this Wednesday. I sent the email out on Saturday. Every group's been teamed up with another group, except for one, which Lynn and I will take care of. You need to submit a PR to your partner group, and then they have a week to review it and give you feedback. The idea here is to go through the code review process and see what it's like to read other people's code and actually try to make sense of it on your own, because this is something you need to be able to do when you go out into the real world. We're also not going to have class on Wednesday. Instead, everyone will meet with me in my office; I sent an email about how to sign up. The idea is that you come for 20 minutes and talk about the current status of your project: what problems are you having, what do you need help with, what has changed since what you proposed originally? And then in Monday's class next week, a week from today, we'll have the five-minute status update presentations from every group. We'll go in the reverse order from before, so Gus will go first, and then we'll go through one by one. The idea again is to tell the group what has changed or what you have done so far, okay? Any questions about any of these? Okay, cool.

All right, so for today's class, we're going to start off with some high-level background on SIMD instructions. This is the basic material you need to understand the vectorized, efficient parallel sort merge join algorithm we're going to cover. We'll cover vectorized execution and SIMD in more detail next week when you read the paper on it; we have two lectures on vectorized execution. Then we'll go through the background of the three different parallel sort merge join algorithms. Then we'll have an evaluation, and if we have time I'll finish up with some high-level tips on what's expected when you do your code review.

Real quick before I begin: during this lecture, sometimes I will say sort merge and sometimes I will say merge sort. I mean the same thing, and I'll try to stick with sort merge. Database people say sort merge, algorithms people say merge sort, all right? And I'll be clear when something is the merge from the sort versus the merge from the join, okay?

Okay, so everyone here should have taken either 418 or 618 or already have an understanding of what SIMD is. But I want to go over this at a high level so that you can understand how we're going to do the vectorized bitonic sorting later on. SIMD is a class of CPU instructions that allow the processor to perform a single operation on multiple data items within a single instruction.
Contrast this with SISD instructions: single instruction, single data. This is what people normally think of as CPU instructions, where you want to do some simple operation like an addition: you take one register, add it together with another register, and write the result out to a third register. That's SISD because it's a single operation on a single piece of data. SIMD, on the other hand, allows us to perform the same kind of operation, but on multiple data items at the same time.

This is not a new idea. There's a famous piece of work from the 1960s called Flynn's taxonomy of parallel CPU architectures, and SIMD and SISD were among the categories people had already thought about back then. But it really didn't come into vogue, or come to the forefront of what we can get in modern CPUs, until the 1990s, when AMD and Intel put out CPUs with updated micro-architectures to support SIMD. For Intel, the first version of their SIMD instructions was called MMX. Then AMD came out with something a little bit later called 3DNow. In the case of Intel, MMX actually doesn't mean anything; Intel explicitly says it doesn't mean anything. People claim it means multimedia extensions, but in legal proceedings they explicitly said MMX does not mean anything. You'll also see this with other Intel release names or code names for their products: the latest Xeons have names like Coffee Lake, Kaby Lake, Skylake. Intel is paranoid about getting sued by people claiming that Intel took their names. In the case of those lake names, those are all physical lakes somewhere in the United States, so you can't claim that they stole your name, because it's a geographical region. Same thing for MMX: they just claim it's three letters put together, doesn't mean anything, and that was good enough.

The first versions of the SIMD instructions, though, were actually really, really bad. They only had 64-bit registers and they could only do operations on integers. And they had this weird thing where you couldn't do floating point operations and SIMD integer operations at the same time; you had to switch the CPU into a different mode. So they were difficult to program. Later on, Intel released the SSE instructions, and now we're up to the AVX instructions. These are much better and actually allow us to do more complex things, not just because we have larger registers, but also because there are more operations that these instruction sets support. I remember when I was in middle school we had a Pentium CPU back in the day, and Intel would put out these ads saying, now with MMX. We didn't know what it meant; we just knew we wanted it because it sounded cool. But most of the software at the time didn't actually support it. Now the compilers have gotten better, the software's gotten better, and with the new instructions there are some things that can really take advantage of it.

So let's go through a high-level example. Again, we'll cover SIMD in more detail in the next lecture, but for now this will be enough to understand what's going on. Let's say I want to add two vectors together, X plus Y, and I want to put the output in a third vector called Z. The way we would write our C code to do this would essentially just be a for loop, assuming that X and Y have the same length.
For every element at position i in X and the corresponding element at position i in Y, we add the two together and write the result into our output buffer. So that's how you would implement this in your C code. Most of the time, if you're not doing SIMD, the way this actually executes is through SISD instructions, which is literally just looping through these elements one by one, adding them together, and producing the output. There are no optimizations we can do to speed this up, other than maybe unrolling the loop; we still have to take two elements and run a single instruction to produce a single output.

Now with SIMD, the way this works is that we take a run of multiple elements together and copy them into a SIMD register. The way to think of a SIMD register is that it's a special location exposed to us as if it's a variable that we can write data into, and then we can write code that invokes a SIMD instruction on these registers. So in this case here, we'll take the first four elements of the X vector and put them into one register, and the first four elements of the Y vector and put them into another. In this example, assuming we have 32-bit numbers or 32-bit keys we want to add together, we can put four of these 32-bit keys into a 128-bit SIMD register. We'll see later on that the newer SIMD instruction sets have much wider registers and support floats and other data types. Now when we want to add these two vectors together, what's going to happen is we take the first element here and the first element here, add them together, do this for all the other ones, and write it out to another SIMD register, which we'll call Z here; this one also has to be 128 bits. Then we go down and do the next four, same thing: we load them into our SIMD registers, add them together, and it produces the output in our output SIMD register.

Yes? [Student: What's the difference between this and a GPU?] His question is, what is the difference between this and a GPU? The GPU has a lot of little cores with a more limited instruction set; they're not a full general-purpose CPU like a Xeon would be, so there's a limited number of operations you can do on them. Within those cores themselves, they can have SIMD instructions, and that would be what they would be doing here. So the GPU is going to be a lot of cores that can do a lot of different things at the same time; this is within a single core, where we use SIMD to speed things up. We'll cover GPUs and databases at the end of the semester.

Yes? [Student question about register width.] Say it again, sorry. So his question is whether the width of the register determines how many elements, how many SIMD lanes, we can put into it. Yes. So in this case here, again, I'm assuming I have 128-bit registers and I have 32-bit keys, so I put four of them in there. Everything has to be aligned nicely based on the type; you have to align the data based on the operation, the instruction you're going to perform on it.
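To make that example concrete, here is a minimal sketch of the X plus Y into Z example, first as the plain SISD loop and then with 128-bit SSE intrinsics that add four 32-bit values per instruction. The function names are made up for illustration, and the SIMD version assumes the length is a multiple of four just to keep it short:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  // SSE intrinsics

// Plain SISD version: one addition instruction per element.
void add_scalar(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

// SIMD sketch: load four 32-bit values into a 128-bit register and add them
// all with a single instruction. Assumes n is a multiple of 4 for brevity.
void add_simd(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i vz = _mm_add_epi32(vx, vy);   // four additions at once
        _mm_storeu_si128((__m128i *)(z + i), vz);
    }
}
```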
So in this case here, I'm doing a SIMD addition on 32-bit integers, so I have to align them up one after another, right? If I had, say, a 48-bit integer, that would cause problems, because the instruction is expecting everything to be 32 bits and you won't get the correct result. The latest version of SIMD is AVX-512, and I think it supports 8-bit, 16-bit, 32-bit, and 64-bit SIMD additions.

It should be obvious why this is a big win for us. Before, when I did this the SISD way, I was adding these numbers one by one: I had eight numbers here, so I had to do eight addition instructions. In the SIMD case, I only had to do two instructions. That's a 4x improvement. Now, you may be saying, don't you have to copy the data out of memory and put it into the SIMD registers? You have to do that anyway for SISD: you have to take it out of the CPU cache, put it into a register, and then do the addition; we're just not showing that extra step. But just in terms of straight addition instructions, we got a 4x reduction in the number of operations we do.

So the advantage of SIMD is that we get significant performance gains if our algorithm can be vectorized in such a way that we can express it in terms of SIMD instructions. Exactly how you write these vectorized or SIMD instructions in your algorithms, we'll cover a week from now. But the disadvantage is that, as we'll see, implementing an algorithm to take advantage of SIMD is still, by most accounts, a very manual process. Intel has the best compilers; they have a special compiler that can try to auto-vectorize code. For simple things, like iterating over two vectors and adding them together, the compiler can figure that out. But for the more complex things we want to do in our database system, like a join, it's not going to be able to figure that out automatically. Then we, as the highly paid and highly skilled database developers, have to write it ourselves. And again, we'll see this in the next paper you guys read, from Columbia: they basically show how to take every relational algebra operator and make a SIMD version of it. But the dirty secret is that none of it actually works well if the data doesn't fit in your cache. We'll get to that later.

All right, and as he asked before, SIMD also has restrictions on data alignment. If the data you're trying to operate on does not fit into the predefined widths that the SIMD instruction set specifies, then you can't use this very easily. And the last one is that gathering data into the SIMD registers, and then breaking it up or scattering it back into memory after we do our operations, can be tricky, and it may not be efficient depending on what CPU we're using. We'll see this again in the next class: I think it's either the gather or the scatter operation, but on the hardware they were running on, the CPU didn't actually support that instruction in SIMD, so they had to emulate it, and that ended up being less efficient. But again, the main thing here is that it's not like there's some magic flag you pass to GCC to go vectorize everything; we're going to have to write our code carefully to make this work.

Yes? [Student comment about conditionals and branches in the code.]
Yeah, so his comment is that if you have a conditional in your code, that's going to cause problems when the compiler tries to vectorize things. And we'll see this when we do our sorting: you try to get rid of the if clauses, and that allows us to do everything entirely in SIMD. Again, I don't know how much you guys already know about SIMD, or whether 418/618 has covered this yet. This is just the bare minimum you need to understand the parallel sort merge algorithm we'll talk about today; we'll cover it in more detail in the next lecture.

OK, so for sort merge, the basic algorithm looks like this. In the first phase, you sort both the inner table and the outer table on the join key. In the second phase, you do the merge, where you basically have an iterator that scans through both of the sorted relations and you compare the entries with each other. The nice thing about it is that you only have to take one pass over the outer table; you may have to do multiple passes over parts of the inner table if there are duplicates.

Visually, it looks like this. We have our relation R and relation S. The first phase of the sort merge is that you just sort. I'm just showing a box here that says sort; I'm not defining what algorithm we're using, whether it's quicksort, heapsort, or the bitonic sorting stuff we'll talk about later. It doesn't matter. Now we have our two relations sorted. Then in the merge phase, again, we just have an iterator on each side, and they scan down and do the comparisons with each other. And the idea here is that because you know everything's in sorted order, if you're scanning through the inner table with the iterator and the inner key exceeds the outer key, meaning the value of the outer key is less than the value of the inner key you've reached, then you know you can stop there, because there's nothing further down that could ever possibly join with it, all right?

So at a high level, this is pretty easy to understand. But when we actually want to make this fast in an in-memory system, things get more tricky. The key observation about how to speed this up is that the sorting is always going to be the most expensive part of the sort merge, so we're really going to focus on what we can do to speed that up. And just like before, when we talked about hash joins, where we had to identify what our hardware actually looks like and what its properties are in order to get the best performance, we need to do the same thing to make our join algorithm work efficiently here. That means we're going to try to utilize as many CPU cores as possible so everything runs in parallel. We want to be mindful of where the data we're accessing is actually being stored, so that we maximize local access within our NUMA region; we don't want a core to have to go read data from some other socket, because that has to go over the interconnect. And then, where possible, we want to take advantage of SIMD, because it's another way for us to maximize the parallelism within a single core and get better performance.

Yes? [Student question about whether you would also use SIMD for hash joins.] Yes, it's a good question; I was just about to say this. So his question is: I'm making a big deal about SIMD for the sort merge join,
but is it also the case that you can vectorize and use SIMD for the hash join? Yes, we will show that, although I would say it actually doesn't help much. But that's a good point and we'll cover it next class.

All right, the one thing I'll say, and we'll see this later on, is that the way HyPer used to do their joins was through sort merge, and their sort merge join algorithm actually ignored this point for the merge phase. They argue that in their case, as we'll see in a second, they do all the merging with sequential scans at each thread on the inner table side. So if you're doing sequential scans, the hardware prefetcher can prefetch the data for you, and you don't care about NUMA boundaries or NUMA regions, because the hardware prefetcher hides those additional latencies. That's what they claim. We'll see in the results that it doesn't actually work out, and they eventually gave up on the sort merge join algorithm; they use a hash join for everything now.

All right, so the parallel version of this looks a lot like the hash join, where we can have an additional, optional phase at the beginning in which we partition the outer table and assign the partitions to workers and cores. Then we have, as before, the sort and merge phases, and all of these phases we're going to be able to run in parallel.

For the partitioning phase, there isn't really anything to say that's different from the hash join partitioning. I'll just define partitioning a little further, and I should have mentioned this last class: there are two types of partitioning we can have. There's implicit partitioning, which is where, as the data is loaded in, we do hash partitioning or we split it up based on some attribute in the table, and we assign tuples to NUMA regions or blocks or morsels based on that. So if we now have a join query that joins on the same attribute we already partitioned the tables on when we loaded the data, then we don't need to do any explicit partitioning; the data is already divided up nicely for us. This can happen, but it's not common, because oftentimes queries show up wanting to join on something that was unexpected, and so whatever partitioning you had before doesn't help you. The radix stuff we talked about last time is an example of explicit partitioning, where we don't care how the data was divided up before; we take a pass over it and split it up and partition it in preparation for our join algorithm. So for this, you can use the same radix partitioning approach we talked about last time. But as I said, as far as I know, no major in-memory database system actually does this, because the overhead of taking this first pass to partition the data does not improve performance enough to warrant it. But again, we can make the same design choice in the sort merge join algorithm.

So what we actually want to focus on is the sorting phase, because again, this is always going to be the most expensive part of the process. In the sorting phase, the high-level goal is that we want to create runs that represent sorted chunks or segments of the inner table and the outer table.
The idea is that we're going to do this incrementally, where we start off with really small runs; in our example, we just have four elements, four keys. Then we progressively make larger and larger sorted runs until we eventually have a globally sorted list. So we start off with these locally sorted lists, then we combine two locally sorted lists together to make a larger locally sorted list, and once we've combined everything together, we have a globally sorted list.

The thing I want to stress in this lecture is that when we were talking about disk-based database systems, where the disk was always the most expensive thing, usually quicksort was good enough once everything made it into memory. But now, if you want to do this in parallel and you want to be aware of the underlying hardware, quicksort is not going to be good enough, and we need something more sophisticated. The technique we're going to use to sort, which the paper you read relies on, is this thing called cache-conscious sorting. This came from the paper I mentioned last class from Intel and Oracle in 2009, where they showed how you can take advantage of modern hardware to do some fundamental database operations more efficiently. A lot of times the Intel papers are really, really good, especially the ones in database conferences, because Intel is in the hardware business: they want to make money selling hardware, and they can't really keep increasing the clock speed anymore because the silicon melts and there are other problems. So they start throwing in a bunch of new features, like SIMD and other accelerators and larger numbers of cores, as a way to get better performance. But if these things are so complicated that nobody knows how to get better performance out of them, then no one's going to buy the hardware. So Intel puts out some really good papers showing how to actually use the hardware correctly, and this is an example of one of them, where they show how to do sorting efficiently using SIMD.

The terminology I'm going to use in describing this cache-conscious sorting algorithm doesn't actually come from the original Intel paper; they don't use the term levels. This is something I'm using because it helped me understand what the hell they're actually doing, and it's important to differentiate these levels from the phases of the join algorithm. Technically, right now we're in the sorting phase. The way to think about how this cache-conscious sorting works is that they break the sorting algorithm up into three different levels, and at each level you're dealing with runs of data of a certain size and doing your sorting in a certain way. As your runs get larger and larger, you move to another level where you do the sorting in another way, and at some point you exceed your CPU caches, and at the last level you do this out-of-cache sorting.

So the way we're going to start off is that in level one we do in-register sorting, where everything can fit into a single SIMD register. Once we've sorted our entire data set this way, we go down to level two, where we start combining the output of level one into sorted runs that can fit in our CPU caches.
We'll keep doing this incrementally until our sorted runs are one half of our cache size, meaning the L3 cache size. Take a guess why it has to be half the size. Input and output, right? We can't do in-place sorting: we take a sorted run, we merge it, and then we write it out to another run location. We have to make sure that both the input and the output fit in cache, so our input can only be half the size. And then at some point, when we exceed this limit, we come to the last level and do our out-of-cache sorting.

Pictorially, it looks like this. Again, we start off with our unsorted data. In level one, we generate these sorted runs that fit in our SIMD registers. Then in level two, we start combining these together into progressively larger and larger runs until they are one half the size of our L3 cache. Then at level three, we keep merging all these things together until we have a globally sorted run at the bottom. If we're doing joins on really large tables, this will exceed our CPU caches, but we can try to come up with a technique in level three that balances cache misses with instructions, okay? So we're going to go through each of these levels one by one.

The first level is my favorite level, because the idea is an old one but it's really cool to see how it works on modern hardware. In the first level they're going to use a technique called sorting networks. This comes from the 1940s, when people proposed how to build actual hardware to do sorting, all right? The way it works is that we build a network of wires that goes from the input to the output. Say in this case we have four keys we want to sort. Each of these wires carries the value coming from its source: this wire here carries the value nine, and these carry five, three, and six. Then the wires get joined together with these comparators that essentially do a min and a max operation. So in this case, nine and five get fed into this comparator: the min value gets put on the top wire, the max value gets put on the bottom wire. So five goes to the top, nine goes to the bottom, and now the value on each wire is whatever came out of that comparator, all right? Same thing here for three and six: three goes to the top, six stays on the bottom. Then they feed into the next comparator; this one gets five and three, so three goes to the top and five goes to the bottom. Now in this case, three doesn't have any more comparators on its wire, so it writes its value out to the output buffer; three gets pushed out to the end. We keep doing this for all the other comparators on the wires until we end up with a sorted array of our original keys, right?

Yes? [Student: Is this done in the CPU cache?] So we're not even there yet; his question is whether we're running this in the CPU cache. This is just a high-level diagram of what we're doing; we haven't described how we're actually going to implement it yet. That's next. As he said earlier, SIMD only really works well if we have no conditionals or if-branches.
The key thing to understand about this is that no matter what set of inputs we give it, it's always going to execute the same instructions to sort these four keys. It's always going to do the same mins and maxes. It doesn't care what the values are; it just knows this instruction gets the min and that instruction gets the max. So we can implement this really efficiently on a modern CPU, because there won't be any if-branches like you would have in a quicksort algorithm. It doesn't say, if five is greater than nine do this, otherwise do that. It always says: take this value and this value, run min and max, and whatever the output is, I don't care what it is, write it out to this location. This is really efficient on a modern CPU like an Intel Xeon, because these are superscalar CPUs with really deep instruction pipelines. If you have a branch misprediction because of an if clause, the CPU is going to flush your instruction pipeline, and then it's a really expensive operation to go fetch the instructions you actually need and bring them into the CPU. Whereas in this case, because we have no conditionals, we know exactly what instructions we're going to execute every single time we do this.

Yes? [Student: Is each comparator done by a single instruction? Because to compare two numbers you would normally say, if five is less than nine, then do this.] You're asking whether we need an if-branch or whether it's done by a single instruction? We'll see in the next slide. But take this first comparator: I have two inputs, nine and five. One instruction says take the min and write it here; a second instruction takes the max and puts it there. So it's two instructions, min and max, per comparator. Let me walk you through the example in SIMD and see if it makes sense. I think your question was that to compute a min you would normally have to do a conditional comparison of the two numbers. In the case of SIMD, there are dedicated min and max instructions, so you can do that in a single operation.

All right, so let's see how we do this in SIMD. For this, we're not going to sort just a single run at a time; we're actually going to do a four-by-four sort. The first thing we always have to do is a load, taking the data that's in our CPU caches and putting it into our registers. But when we do our sorting, we're not going to sort within a single register, because that's not how SIMD works. With SIMD you can't say, take the first element in my register and the second element and do something with them; I can only do operations across registers. So in this case, I'm going to sort within each column across these four registers. To do this, I can do the ten min and max instructions I just talked about and end up with the registers sorted like this. But the problem is that the sorting is now down the columns, not within a single register. So if I wanted to put this back into CPU memory or DRAM, I would have to do a bunch of extra memory copies to grab the elements I need and put them into a contiguous, aligned data region.
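As a rough sketch of what this level-one building block looks like in code, here is one way to do the column-wise sort with ten min/max intrinsics, followed by the transpose (the shuffle trick described next) so that each register ends up holding one sorted run. This is not the code from the Intel paper; it's a minimal illustration that assumes SSE4.1 and plain 32-bit keys, whereas the real thing sorts wider key/tuple-ID pairs, as discussed in a moment:

```c
#include <immintrin.h>  // SSE4.1 intrinsics (_mm_min_epi32 / _mm_max_epi32)

// Sort the four "columns" of a 4x4 block of 32-bit ints held in four SIMD
// registers, then transpose so each register holds one sorted 4-key run.
static void sort_4x4_runs(__m128i r[4]) {
    __m128i t;
    // Column-wise sorting network for 4 elements, 5 comparators = 10 min/max:
    // compare-exchange (0,1), (2,3), then (0,2), (1,3), then (1,2).
    t = _mm_min_epi32(r[0], r[1]); r[1] = _mm_max_epi32(r[0], r[1]); r[0] = t;
    t = _mm_min_epi32(r[2], r[3]); r[3] = _mm_max_epi32(r[2], r[3]); r[2] = t;
    t = _mm_min_epi32(r[0], r[2]); r[2] = _mm_max_epi32(r[0], r[2]); r[0] = t;
    t = _mm_min_epi32(r[1], r[3]); r[3] = _mm_max_epi32(r[1], r[3]); r[1] = t;
    t = _mm_min_epi32(r[1], r[2]); r[2] = _mm_max_epi32(r[1], r[2]); r[1] = t;

    // Transpose the 4x4 block with shuffles so each sorted column becomes a
    // register (this is the transpose/shuffle step described below).
    __m128i a = _mm_unpacklo_epi32(r[0], r[1]);
    __m128i b = _mm_unpackhi_epi32(r[0], r[1]);
    __m128i c = _mm_unpacklo_epi32(r[2], r[3]);
    __m128i d = _mm_unpackhi_epi32(r[2], r[3]);
    r[0] = _mm_unpacklo_epi64(a, c);
    r[1] = _mm_unpackhi_epi64(a, c);
    r[2] = _mm_unpacklo_epi64(b, d);
    r[3] = _mm_unpackhi_epi64(b, d);
}
```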
But luckily there are transpose instructions for SIMD, so we can do some shuffles to take what's in a column and put it into a single SIMD register, and then I can copy that out to some location in memory as a sorted run. It's not globally sorted, right? Eight is greater than five, but five is over here. But again, at level one all we want to do is take the keys that fit into our registers and do our sorting on those.

Now, in the case of AVX-512, the latest version of the SIMD instructions on Xeon, we can still only sort four keys at a time, at least in a database system. Because our tuple IDs are always going to be 64 bits, and we assume we have 64-bit keys, the actual element we want to sort is 128 bits, so we can only fit four of these 128-bit elements at a time. Even though we have larger registers, we're still going to do four, and we're still going to use this sorting network approach. All right, so the main thing to show here is that with 26 instructions, I can produce four sorted runs. If you had to write this as a SISD program, like quicksort, it would be much more expensive, because again there are conditionals and a whole bunch of other things going on.

All right, so at this point we have all of our sorted runs that fit in our SIMD registers. Now we go down to level two and merge these together to generate larger sorted runs. They're still only locally sorted, not a global list, but we want to generate larger and larger ones. And as I said, we do this incrementally until our sorted runs are half our cache size. This technique comes from, again, another Intel paper; this is actually a precursor to the paper I just talked about, and it's purely about how to do sorting on a multi-core CPU. I think this paper is a big deal because it shows how you can take existing old ideas and speed them up on modern hardware. Intel shows how they can speed up their sorting algorithm by up to 3.5x over a SISD implementation. That's actually really impressive, because when you think about it, how old is quicksort? It's from before I was born, from the 1960s. And there's no magic algorithm, unless P equals NP, that's going to make these sorting algorithms go faster. So purely based on careful engineering and performance tuning, they get a 3.5x speedup over a well-written quicksort, and to me that's really impressive; it shows the advantage of what SIMD can do.

The way this level two stuff works is that they build what's called a bitonic merge network. It's going to look a lot like a sorting network, but we do some things slightly differently and add some extra steps in order to deal with larger and larger sorted runs. At a high level, it looks like this: we take two sorted runs, merge them together, and produce our new output. In this case, we're just taking two four-key sorted runs and merging them. What needs to happen is that the first sorted run is listed in the same order it came out of the level one sorting process, lowest to highest, but the second sorted run is in reverse order, from highest to lowest.
And the reason we do this is because when we do our comparisons, we want to compare the highest values in this run against the lowest values from the other one. As you can see, we're essentially doing a bunch of mins and maxes again, just like before, which we can implement entirely in SIMD, and then we do some shuffling to move the data into the right positions. We do this three times and we end up with our sorted run output here. I'm not going to walk through the exact details of how this works; if you understand how the sorting network works, you can see how to extrapolate it by doing a bit more work to get the output into the form we want.

Yes? [Student: Does the half-cache-size limit imply that the bitonic merge network is performed in the CPU cache?] So his question is: because we're requiring the size of the input to be half the cache size, do we assume all these instructions and operations are performed entirely in the CPU cache? Yes. [Student: Then where is level one performed?] What do you mean? [Student: If level two runs in the CPU cache, level one is working in something smaller.] Yes, the SIMD registers. If you could have your program run entirely in registers, it would be ridiculously fast, but you can't, right? We're dealing with larger and larger data sets, so we need to use larger and larger memory locations, and as they get larger they get slower, because faster memory is more expensive. [Student: So at the very beginning, you're just partitioning the data set into very small pieces?] I wouldn't use the word partition; you just take four keys, put them in a SIMD register, and do the sorting network sort. [Student: In any particular order?] It doesn't matter, yeah. And then you're basically just grabbing sorted runs from the previous level and using the bitonic merge network to make them larger and larger until your input size exceeds your CPU cache. [Student: So each block here represents a SIMD register?] Each block here is just a key. Together, I'm showing taking two sorted runs from level one and merging them. If you now want to merge an eight-element run with another eight-element run, you need an even larger bitonic merge network. Here it's just a couple of registers to merge the data; in this case you can do this part in SIMD, but at some point, as it gets larger and wider, you can't do everything in SIMD.

OK. So we can use the bitonic merge networks to get our sorted runs up to a certain size, but when we exceed the CPU cache, we hit level three, where we do what's called multi-way merging. The idea here is essentially that we keep using bitonic merge networks, but we split up the sorting process itself to combine everything into larger sorted runs. The way we do this is to break up our pipeline so that within a single core, instead of saying, all right, I'm going to sort everything for this bunch of inputs and then switch over to sort the next bunch of inputs, you can have the thread jump around and do different steps of the sorting process. That ends up getting better performance, because you're not just switching the CPU into this mode of: all right, let me be CPU-bound and crunch all this data,
and then, when I'm done, let me go fetch a bunch of data in from memory, bring it into my CPU caches, and crunch on that. Because you're jumping around, you get this balance where I can do a chunk of work while the memory controller fetches some other data, and when I'm done, the data I need is ready, so I go work on that while I fetch more things. We'll see in a second what I mean by this. But the idea is that instead of having the CPU go all CPU-bound, then all memory-bound, oscillating back and forth, we pay a small extra penalty of additional instructions to jump around our pipeline, but in the end it balances out and we use our hardware more uniformly. This requires more bookkeeping, because we need to know when a thread is allowed to do something at a particular step in the pipeline, but in the end this can be better for us. I'll also say that we're going to do this in parallel across all the CPU cores and worker threads. I'm just going to show an example of how to do it within a single thread, but everybody is essentially doing the same thing; they're only processing the data that's assigned to them, so we don't need to synchronize across threads and we can use lock-free queues and so on.

All right, yes? [Student: I'm trying to understand the bitonic merge network. Where exactly is it done, in normal CPU instructions or in SIMD?] So you can do the min-max stuff in SIMD registers, right? Because in this case here, I'm doing four comparators in this example: B1 needs a min-max with A4, B2 needs a min-max with A3, and so on. I can put these elements into SIMD registers and do the min-max instructions that produce the output. Where it gets tricky is if you have larger sorted runs, then you may not be able to do all of it in SIMD, and the shuffle step, again, as it gets larger, you can't use SIMD for that. So as it gets larger, some of it is SISD, but you're doing it in parallel across different threads.

Yes? [Student: Just a thought, it looks like this doesn't even have to run on SIMD; you could probably build specialized hardware just for this.] Okay, so the statement is that this doesn't look like you have to use SIMD; you could build specialized hardware to do this. Again, that was the original idea of sorting networks from the 1940s: when they say wires, they literally mean wires that someone would put together to do the sorting. It's super expensive to fab hardware, so no one's going to do that just for sorting. But yes, you're probably right: you could bake the sorting network into hardware, but again, for these larger and larger runs, it's not going to work.

I did have somebody tell me once, and I'm going to go on a tangent here, that what's going to be really interesting with fabbing hardware in the future is that it costs billions and billions of dollars to build a new plant. Intel is, I think, pushing 10 nanometers now, or nine nanometers, but there are still all these fabs in Asia that can do, like, 70 nanometers, and in some ways they're sitting idle. So it doesn't cost that much money, in quotes, like a couple of tens of thousands of dollars, to go fab a one-off chip in these 70 nanometer fabs that are idle anyway.
So maybe in the future, if you have some crazy idea that you want to build a hardware accelerator for, it won't cost that much to try it out. Again, I'll mention this next class: we're actually having a hardware-accelerated database seminar series in the fall, but it's mostly going to be GPU stuff.

Yes? [Student: You said the max run size is half the cache size, so at the very end you're going to merge two runs that are each about a quarter of the cache size. How does that relate to the SIMD stuff? If your cache is really big, a quarter of the cache would be much larger than a SIMD register.] So his question is: the sorted runs get up to a fraction of the cache size, so how does that relate to the SIMD stuff? Within the merge, you're again doing these comparators between some element from the first sorted run and some element from the second sorted run. You know which comparisons you're always going to have to do, because it's the same set of instructions every single time; you don't care what the actual values are. Then you just run the SIMD instructions for those and they produce the output. If my runs were larger here, I would have more comparators and more shuffle phases. And again, I think for the min-max stuff you can definitely use SIMD; for the shuffles that may not be the case.

[Student: But couldn't it be that everything in the first run is smaller than everything in the second run? Could you do a simple check for that?] So his question is whether everything in the first run could be smaller than everything in the second run, and whether you could do a simple check. This is not half a run, this is the full run, right? This would be the first time you invoke this bitonic merge network in level two: you're dealing with four-element sorted runs from level one, and then all of these elements will be sorted here.

Question in the back? [Student: Does the data for a SIMD register have to be loaded from contiguous memory?] His question is whether a SIMD register has to be loaded from contiguous memory. I don't think it has to be, no, but it would be expensive, because it's multiple load operations. Again, we'll cover this in the next class; I won't spend too much time on the SIMD stuff here. Making SIMD work requires you to problem-solve and figure out what SIMD primitives you can use and how to get your data into the right form so you can just load it in, without having to do a for loop to gather a bunch of stuff together. There are other cool tricks we'll talk about, like using lookup tables to figure out how to jump to the right offset to find the thing you need. There's a bunch of cool tricks you can do to make SIMD work efficiently. But you're right: if you do the slow thing, like random accesses to pack your register, that might just negate the benefit you get.

Okay. So this is the multi-way merging; this is level three, and this is a single thread. Here are all our sorted runs from level two, and basically we're just going to keep doing merges to produce larger and larger sorted runs, and we start writing them out to these queues.
As I said, there's a single thread, and it has this extra bookkeeping infrastructure in place so that it knows the size of each queue and whether there's work to be done at each of these merge operators. So in this case, when we first start, there's nothing to be done at this merge because its queues aren't full, and likewise there's nothing to be done over here. But then, as we do our merges, we start feeding into these queues, and at some point there's enough data in the queues that we can go ahead and start merging things further up. So the thread says, aha, there's actually a task for me to do here. It could have been doing a merge down here, and then all of a sudden it says, I now have work over there, so it jumps over and does that merge and produces more output.

This seems bad, right? Given everything I said before about wanting to maximize your use of the CPU cache, you're paying a penalty of cache misses by essentially having a context switch, jumping from some task down here to some task up here. But the idea is that rather than having the system take as much data as it can, sort it, produce the output, then go get more data and produce more output, which is essentially a ping-pong effect of pulling data from memory into your caches, processing it, and then pulling in more, you can instead recognize: all right, maybe I can start pulling the data over here into my CPU cache while I'm crunching something over here, and when it's ready I jump over and process that. Again, the idea is that we balance things out and make the access patterns more uniform, and we end up getting better performance: we may have more cache misses, but we spend less time stalled waiting on memory.

Yes? [Student: Is this a picture of what a single core is doing?] This is what a single core is doing, yes. [Student: So how does it make progress, in the sense that the memory system is pulling in data while the compute is doing comparisons?] Yes, so the question is how we keep making progress here. Because we can have the system know: all right, I'm doing something down here, but I want to go up there next, so I'll start bringing things into my CPU caches; I just need to bring in the first pieces. Then when I jump there, the data is ready for me, and I can keep bringing in the rest through hardware or software prefetching. In the end, and we'll see this in the experiments, we use fewer cycles even though we may have more cache misses.

Okay, so then we get to the merge phase, and as I said at the beginning, there isn't really any magic to make the merge phase go fast. Depending on how you organized the data in the sorting phase, you may or may not have to read memory from a remote NUMA region, and that will affect performance as well. For the merge, we just iterate over the outer table and the inner table and compare elements based on their join key. We never have to backtrack on the outer table; we may have to backtrack on the inner table if there are duplicates, right? And as we'll see in a second, we can do this merging process entirely in parallel, with no synchronization, as long as each thread writes to its own output buffer.
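As a rough illustration of that merge logic, here is a minimal single-threaded sketch: a forward-only cursor on the outer run and an inner cursor that backs up to re-scan a group of duplicate keys. The struct and function names are made up for the example; a real system would emit matches into a per-thread output buffer as just described.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sorted run: parallel arrays of join keys and tuple IDs.
typedef struct {
    const int64_t *keys;
    const int64_t *tids;
    size_t         n;
} sorted_run_t;

// Emit every matching (outer_tid, inner_tid) pair through a callback.
// One pass over the outer run; the inner cursor only moves backward to
// re-scan a group of duplicate keys.
static void merge_join(const sorted_run_t *outer, const sorted_run_t *inner,
                       void (*emit)(int64_t outer_tid, int64_t inner_tid)) {
    size_t o = 0, i = 0;
    while (o < outer->n && i < inner->n) {
        if (outer->keys[o] < inner->keys[i]) {
            o++;                       // outer key too small, advance outer
        } else if (outer->keys[o] > inner->keys[i]) {
            i++;                       // inner key too small, advance inner
        } else {
            // Match: emit the whole group of duplicate inner keys.
            size_t group_start = i;
            while (i < inner->n && inner->keys[i] == outer->keys[o]) {
                emit(outer->tids[o], inner->tids[i]);
                i++;
            }
            o++;
            i = group_start;           // backtrack in case the next outer key repeats
        }
    }
}
```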
But of course, as we said, if you need to produce a coalesced result for the next operator in your query plan, somebody's going to have to go through and combine all of these together. So either you have everyone write to a shared buffer and use latches or compare-and-swap to protect each other, or you let them write to their private buffers, but then you have to take another pass to combine them.

Yes? Sorry, a question back on the sorting from before. [Student: So every core will have a locally sorted list as its output, and then after all the cores have done their work, you have to combine them and go over it?] Depending on how you actually... no, we'll see in a second. This could be a globally sorted list where each core holds one piece of it. Let me go through the actual algorithms. You may shuffle things around so that, yes, each core is sorting locally, but then all the cores together are globally ordered. We'll see an example.

All right. So I want to talk about three different sort merge join algorithms. The first two are from the paper you read from ETH: the multi-way sort merge and the multi-pass sort merge. The last one is from the HyPer guys, the massively parallel sort merge, which they claimed in 2012 was the best way to do joins, but then they abandoned it, and the ETH paper shows how it gets crushed.

The multi-way sort merge is basically everything I've talked about so far: you use the cache-conscious sorting with the three different levels, and then you have a merge phase at the end that combines everything. What happens is that on the outer table, each core sorts its data in parallel for levels one and two, and then in level three you redistribute the data so that you end up with a globally sorted list. On the inner table you do essentially the same thing. Then, when you actually do the merge, each core only needs to look at the data that's local to it; it doesn't need to look at anybody else's.

Visually, it looks like this. Say this is our outer table. When we start off, it's completely unsorted. We do local NUMA partitioning, which is just the morsel stuff: we're not looking at the data, we just know things are divided up into blocks or morsels, and every core knows it's responsible for the data that's local to it. In the first step, we do our level one and level two sorting, and we end up with locally sorted data at each NUMA region or core. Then we do the multi-way merge, where every core writes the data within a particular key range out to one designated core. So for some range of data, everybody says, all right, I'm going to write my data for that range to this buffer here. Then you do the multi-way merge to make sure that all the data you've combined there is sorted, and you do this for every range, written out to every partition. So now what you have is a globally sorted list of keys for the entire data set. On the inner table side, you do exactly the same steps, but I'm not going to show them because we'd run out of space.
So I'll just say this sort box here represents all of that part we just did. It ends up with the same kind of globally sorted list, divided up into partitions per core. Now all that needs to happen is a local merge join, where each core only compares the data that's at that core: your two iterators scan down and only do comparisons horizontally within a single core. It doesn't need to look at anybody else.

Yes? [Student: This is basically a range partitioning?] Yeah, exactly right, this is range partitioning. [Student: Then how do you ensure that it's balanced?] Yeah, so his question is: back here, before we got to this step, I have these locally sorted lists; there's an extra step where you need to go look at what the data actually looks like so you can divide it up evenly, so that the number of tuples per core is about the same. Now, in the worst-case scenario you could have a billion tuples that all have the same join key, and then you're essentially screwed; in that case you just throw your hands up and let it go wherever, right?

Okay, so the next one, again from the ETH paper, is the multi-pass sort merge. This is the same thing I showed you in the multi-way sort merge, except that you don't do the redistribution step before level three. You do the same level one and two sorting, and then you just do what's called a naive merge on the sorted runs, where you don't really care where the data is actually being stored, and you do the same thing on the outer table. So the idea is that when you do the merge, you basically know: I have a chunk of data in my inner table, and I know where the corresponding chunk of data is in my outer table, I know where to go find it, but I don't try to localize my access. Think of it as doing the same thing we did here, except without the extra phase where you redistribute, so when this core wants to do a lookup to find a matching tuple, it may have to go scan data on any possible core. I probably should have made a visualization for that, but that's fine.

Then we get to the last one, from the HyPer guys. What they do on the outer table is range partitioning, the same way we did for the first sort merge join algorithm: they redistribute all the data to the different cores, and then at each core they sort in parallel so that each core has a locally sorted partition. But on the inner table you don't do this at all: you just sort your local data, and then you know how to jump to the different locations and do your scans or lookups across all the sorted runs of the outer table. Here's the visualization of this. Again, we take our morsels or blocks of data and move the data around to do our partitioning. We're not doing any sorting yet; we're just doing the partitioning, like the range or radix partitioning we talked about, to divide things up. Then within these partitions we do sorting, so again this is not globally sorted; it's only sorted within that single NUMA region. On the inner table side, we don't do that repartitioning at all; we just do our local sorting on the data we have. So now, when we want to do our join, we have to do this cross-partition join, because we don't know which of these partitions the match for a tuple in the inner table can be found in. So you
essentially, as you do a sequential scan on the inner table, may have to do a sequential scan over an entire partition of the outer table, and you do this for all of them, one by one; the same thing for all the other partitions here. So again, you're doing all these sequential scans on the outer relation, but they argue that because it's sequential access, the hardware can recognize that, and the hardware prefetcher can kick in and start pulling the data you need from the remote core to your local core, and that hides all the latency you have from the non-local NUMA memory accesses.

Does everyone know what hardware prefetching is, or no? Hardware prefetching is where the CPU, if it recognizes that you're reading a bunch of contiguous memory, assumes you're going to keep going and starts pulling the things you haven't read yet into, say, your L3 cache. There's also a technique called software prefetching, which is essentially the same thing except you provide hints to the CPU, saying, I'm going to read this data, so go ahead and bring it in ahead of time. We'll see how we do this in Peloton. But here, they're explicitly relying on hardware prefetching.

So HyPer, in their paper, have some rules that they say are necessary in order to get good performance in a parallel sort merge join algorithm. The first one is that you should never have any random writes to non-local memory. In their case, they're doing sequential reads to local memory, and all the writes, meaning when you actually sort the data, are always done locally. So in this case here, except for this first phase, which will be random writes to non-local memory, everything that comes after is localized; these steps are doing sequential reads, not writes. You pay a penalty at the beginning, but you don't need to pay it later on. The next thing they claim is that you should only perform sequential reads any time you have to read local or non-local data, and that's what they're doing in the merge phase. And then, the same issue we had for the hash join: you never want any synchronization primitives that require one core to get blocked on another core. Everybody can always be running in parallel; they may stall because of a cache miss, but they're not blocking and waiting to acquire a latch from another thread.

So let's go through the evaluation real quick. For this, they compare the multi-way, the multi-pass, and the massively parallel algorithms, and then the ETH guys also throw in the radix-partitioned hash join from last class to see how that compares against these other approaches. They're running on a much beefier machine than we had last time, with half a terabyte of DRAM. The first thing they want to evaluate is how their SIMD sorting algorithm, the three-level approach we showed, compares against the STL sort you get in C++. STL sort, at least as of recently, uses a hybrid sort: it does quicksort at the beginning and then switches over to heapsort in certain cases, and they claim that gets the best performance. So they're comparing here, scaling up along the x-axis the number of tuples they're trying to sort, and this is running on a single thread. What you see is
So let's go through the evaluation real quick. For this, they're going to compare the multi-way, the multi-pass, and the massively parallel algorithms, and then the ETH guys also throw in the radix partitioning hash join from last class to see how that compares against these other approaches. They're running on a much beefier machine than we had last time, with half a terabyte of DRAM. The first thing they want to evaluate is how well their SIMD sorting algorithm, the three-level approach we showed, compares against the STL sorting algorithm you get in C++. STL sort, at least as of last year, uses a hybrid sort: it does quicksort at the beginning, and when quicksort stops making good progress it switches over to heapsort, and they claim that gets the best performance in those cases. So in this comparison they're scaling up, along the x-axis, the number of tuples they're trying to sort, and this is running on a single thread. What you see is that the SIMD sort gets the best performance for smaller sizes, but overall it's about 2.5 to 3x faster than STL sort. What I like about this experiment is that it essentially corroborates the earlier paper from Intel, which showed about a 3.5x improvement in their sorting algorithm when you use SIMD, so this matches very nicely. So now we can do a comparison of the join algorithms; well, actually, this is a comparison of the sort merge join algorithms, but broken down by the different phases: the partitioning, the sorting, the merge phase of the sort, and the merge join part, and then throughput is measured separately. The way to understand this is that one of these is the merge phase of the sorting algorithm and the other is the merge phase of the join algorithm, right? Along the y-axis here we show the number of cycles expended in order to produce one tuple of output, and then on the other axis it's just throughput. For cycles, lower is better, because you want to use fewer cycles, and for throughput, higher is better, because it means you're generating more output. The first thing to point out is that they're all essentially paying the same penalty to do partitioning; there's no magic there that makes one algorithm work better than another, it takes the same amount of time for all of them. But the multi-way join algorithm actually does the best, and the HyPer one actually does terribly, and if you plot throughput, that's another way to observe it: the multi-way approach is producing more output tuples using fewer cycles than what the HyPer guys can do, and the reason is that the ETH guys are being conscious of their caches as they do the sorting. And then in the sorting phase, the multi-way merge ends up being this part here: although you're executing more instructions, because you have your threads jumping around, it ends up using fewer cycles because you get a nice balance of your hardware resources, and it makes the join part really fast, because you're just comparing against data within a single core. All right, so next we compare the multi-way join versus the HyPer join; in this experiment they're showing how these things are affected by hyperthreading. This is a synthetic workload: the outer table has 1.6 billion tuples and the inner table has 128 million tuples. What you want to see here, with the x-axis shown in logarithmic scale, is the throughput doubling as you double the number of threads, and up until about here they're all achieving that. But then when you get to eight threads, the performance in the case of the HyPer example starts to fall off: they go from 54 million tuples per second to only 90 million tuples per second even though they doubled the number of threads, whereas the ETH multi-way sort merge basically does double. Of course, when hyperthreading kicks in, those aren't real physical cores, so now there's contention on the CPU caches and other resources, and it's no surprise that the performance doesn't scale the same way. So this is showing you that the extra instructions we have to execute to do the jumping around in level three of the multi-way sort merge end up being a benefit for us, because we have fewer stalls and we waste fewer cycles to process things.
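To tie the cycles-per-output-tuple metric in those plots to something you could measure yourself, here is one rough way to do it on x86 using the timestamp counter (`__rdtsc()` from `<x86intrin.h>` on GCC/Clang). The `RunJoin` function is just a placeholder for whatever phase you want to time, and TSC ticks are only an approximation of core cycles on modern CPUs, so treat this as a sketch rather than the methodology from the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>
#include <x86intrin.h>  // __rdtsc()

// Placeholder for the join (or join phase) being measured; assume it
// returns the number of output tuples it produced.
uint64_t RunJoin() {
  std::vector<uint64_t> dummy(1 << 20);
  std::iota(dummy.begin(), dummy.end(), 0);
  return dummy.size();  // pretend every element produced an output tuple
}

int main() {
  const uint64_t start = __rdtsc();
  const uint64_t output_tuples = RunJoin();
  const uint64_t end = __rdtsc();

  // Lower is better: fewer cycles spent per tuple of join output.
  const double cycles_per_tuple =
      static_cast<double>(end - start) / static_cast<double>(output_tuples);
  std::printf("cycles per output tuple: %.2f\n", cycles_per_tuple);
  return 0;
}
```

In practice you would also pin threads to cores, warm the caches, and average over several runs; a single measurement like this is noisy.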
All right, so in the next experiment we can actually throw the hash join into the mix and do a comparison. For this, I'm just combining the build and probe phases together. The main thing worth looking at here is that the hash join is always much faster; they didn't do a comparison without the partitioning phase. The one that's interesting is here, where the hash join is almost as slow as the sort merge join, and this is because they're joining two really large tables, right? 1.6 billion tuples is a lot, and you see that there's a huge penalty paid in the hash join case for doing the partitioning. So the main takeaway here is again that the hash join always does better, but in this case, because partitioning is an overhead, that's why it goes much slower. What would be nice to know is what this graph looks like when you don't do partitioning. And then the last one here just shows what happens as we vary the number of tuples we join: what is the cutoff point where these things actually meet? Again, the radix hash join is much faster when you have a smaller table, but as you increase the number of tuples you have to deal with, it gets slower, whereas in the case of the sort merge join the performance stays roughly the same. This is because we know we're not going to fit nicely in our CPU caches, but we have the most efficient way of handling that when you have really large tables. So, any questions about the sort merge at a high level? Yes? The question is about why the multi-pass sort merge join spends more time. So the partitioning time is always the same here for all of these. Sorry, the question was: why is the sort phase larger? Larger than multi-way? I think they look roughly the same; I'd have to check the numbers. Yeah, it's roughly the same, but multi-pass spends one less pass on the sorting, and then you pay that penalty back when you want to merge everything together. So your question is why is this red bar bigger than that red bar, or why is it roughly the same even though multi-pass sorts fewer runs? Multi-pass only does the level one and level two sorting, while multi-way has one more sort than multi-pass, right? I'm not following your question, sorry; I think maybe you're mixing up the merge and the sort. We can take this offline, but the sorting phase of the sort merge join algorithm as described here uses merge sort; you could use quicksort, but we're using merge sort. Okay, any other questions, or no? Okay, I can't jump to that slide in this view. Okay, so to finish up, since we're out of time: as I said, sometimes you want to use a sort merge join when you know the data needs to be sorted in the same way that you want to do your join. I actually don't know how often that happens, and I suspect it's probably not as common as maybe the textbook says, and this is why most systems always do hash join. But every single major database system, as far as I can tell, supports both: Teradata supports both of these, Exadata supports both of these, and it's up to the optimizer to figure out which of these algorithms it actually wants to use. But in practice the optimizer is almost always going to pick a hash join. Now, if you actually just want to sort the data, you can use the bitonic sorting networks and merge networks that we talked about before; that first phase of the sort merge join algorithm is still applicable if you have an ORDER BY clause. So even though we may not want to use a sort merge join instead of a hash join, if you have an ORDER BY clause you're probably going to want to use the SIMD vectorized sorting algorithm that we talked about today, okay?
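As a reminder of what those sorting networks look like, here is a scalar sketch of a four-element bitonic sort: a fixed sequence of compare-exchange steps that is the same regardless of the input. This is just the idea, not the vectorized version from the paper; the real implementations run the min/max steps on SIMD registers so there are no data-dependent branches, and they sort many groups of keys at once.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// Compare-exchange: after the call, *lo <= *hi. In the SIMD versions this
// pair of min/max operations is done on whole vector registers at a time.
inline void CompareExchange(int *lo, int *hi) {
  const int a = std::min(*lo, *hi);
  const int b = std::max(*lo, *hi);
  *lo = a;
  *hi = b;
}

// Bitonic sort of four keys: first build a bitonic sequence (one ascending
// pair, one descending pair), then run the bitonic merge network.
void BitonicSort4(std::array<int, 4> &v) {
  CompareExchange(&v[0], &v[1]);  // ascending pair
  CompareExchange(&v[3], &v[2]);  // descending pair -> v is now bitonic
  CompareExchange(&v[0], &v[2]);  // bitonic merge, step 1
  CompareExchange(&v[1], &v[3]);
  CompareExchange(&v[0], &v[1]);  // bitonic merge, step 2
  CompareExchange(&v[2], &v[3]);
}

int main() {
  std::array<int, 4> keys = {7, 1, 9, 3};
  BitonicSort4(keys);
  std::printf("%d %d %d %d\n", keys[0], keys[1], keys[2], keys[3]);  // 1 3 7 9
  return 0;
}
```

Because the comparison pattern is fixed, the same network sorts any permutation of four keys, which is what makes it possible to run many of these in parallel across SIMD lanes for the in-register level of the sort.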
All right, so we're out of time, but I'll see you next week on Monday. Actually, no, wait, let's go through this now, because Wednesday we have no class, Monday you have to give a presentation, and then your code reviews are due that Wednesday, so let me go through this now. So as I said, everyone for project 3 has to send a pull request. You can do it to the master branch, and that will kick off a build on Travis for you automatically; if you just want to do it between each other, that's fine, but you should set up Travis, which is free to do, so that it will actually build your PR and check that it passes all the tests. Also, if you send it to our master branch, we won't merge it; you can tag it as "do not merge". You'll also get the coverage calculation for free. The coverage should never go down; it should stay the same or go up, within some small number of percentage points, which has to do with non-determinism in how we do certain things. Then what you need to do is make sure you put the URL to your pull request on the Google spreadsheet so that I can see it, and then on GitHub the reviewing group can just go add comments as part of the code review; there's actually an option in GitHub to perform a code review, and it does nice bookkeeping for all of this. So here are some quick general tips about how you should do this. These are tips that I found on the internet, so take them for what you will; I don't always follow these myself, but I think it's reasonable to say there are some guidelines you can think about as you do this. When you submit your pull request, you don't want your comment on GitHub to just say "here it is". You should actually provide a summary that says, at a high level, what files you changed, what functions you changed, and what you want the other team to look at. If you made a bunch of changes that aren't ready to be reviewed, don't make the other team waste time looking at that; focus their attention and say, this is what we want feedback on. In general, as you're doing your code review, you want to only look at about 400 lines at a time, and don't look at it for more than 60 minutes, because otherwise your eyes start to bleed, you're not really absorbing anything, and you're not going to give really good feedback to the other team. I also think it's helpful to go into this with a checklist, decided ahead of time, of what you're actually going to be looking for. So some general things: obviously, does the code work? You would know this if you can actually build and run it on Travis or on your local machine. Is the code you're reviewing actually easy to understand? Make sure they're not duplicating code or copying the code it replaces, and then the standard software engineering things: we don't want any global variables, and we want to avoid singletons as much as possible, even though I know we have a bunch of them in our own code; we're going to get rid of them in the summer.
But try not to add any new ones. You should not have any large chunks of commented-out code, and you don't want to have any printf statements; if you run this on Travis, there will be a check to make sure you don't invoke printf or fprintf or std::cout, right? Everything should be using our built-in log debug methods. Is there documentation? Make sure that the code actually behaves the way the comments say it will, right? You want to have Javadoc-style comments for all your functions, and we have a write-up somewhere that says what these should look like. Then if you have anything that's super bizarre, like a tricky part of the code, it should always be clearly explained what the hell is actually going on. If you have any third-party libraries, try to document where they are, what they're doing, and why we need them; as far as I know, nobody in the class except for some people doing the self-driving stuff needs to bring in any third-party libraries, so hopefully that shouldn't be an issue. And if you have code that you know is not finished, it should have proper TODO flags; make sure it's clear that the thing is not finished and what needs to happen to actually finish it. Everyone should have test cases for all their code; as I said, the coverage should always be going up, it should never be going down. But your test cases should actually be meaningful, meaning not just "oh, it ran without crashing, so that's my test case and I'm good", right? In some previous years we've had students just print things to standard out, look at it when they run it, say "oh, that's the output I expect", and think that they're done. Of course, the problem with that is that no one's going to do that when it runs in, you know, the nightly tests. So you need to have real test cases that actually check whether the output is correct, not just that something got printed to the terminal. Try to also avoid having hard-coded answers in your tests. Maybe I'll send an email out and say, here are some good test cases that we usually try to emulate; there are some bad ones in there too, and you don't want to keep propagating bad code. But try to avoid having things that are hard-coded, so that if we go change logic in some part of the system, it doesn't all of a sudden break your test cases in a way we don't understand. One example of this: someone hard-coded that the tuple they inserted would be at a certain offset in the block, which depends on what was already inserted before and on how we're actually inserting things; so when we modified how we store our tile groups, all of those hard-coded tests failed. So, any questions about this? I'll send an email out with some additional guidelines, and maybe some examples of where to write the test cases, and also note that you can write C++ test cases and you can also write Java test cases. The Java test cases should be for high-level things, because you're going through a SQL interface; if you want to test low-level stuff, it should be written in C++. The question is: should you be doing a rebase before you do a pull request? Absolutely, yes. I think there are instructions on the wiki, and if you can't figure it out, we can help you, okay?
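As a hedged example of the kind of test that checks its output instead of printing it, here is a minimal GoogleTest-style sketch (it assumes the file is linked against gtest_main). The `Table` class here is a stand-in invented purely so the example compiles, not the project's actual storage API; the point is that it asserts on logical results rather than on printed text or a hard-coded physical offset.

```cpp
#include <unordered_map>

#include "gtest/gtest.h"

// Stand-in table used only to make this sketch self-contained; in the real
// project you would exercise the actual storage and executor classes.
class Table {
 public:
  void InsertTuple(int key, int value) { rows_[key] = value; }
  int GetTupleCount() const { return static_cast<int>(rows_.size()); }
  bool ContainsKey(int key) const { return rows_.count(key) > 0; }

 private:
  std::unordered_map<int, int> rows_;
};

TEST(InsertTest, TupleIsVisibleAfterInsert) {
  Table table;
  table.InsertTuple(/*key=*/42, /*value=*/7);

  // Assert on the logical outcome, not on a printed string or on a
  // hard-coded offset inside a block that can change when the storage
  // layout changes.
  EXPECT_EQ(1, table.GetTupleCount());
  EXPECT_TRUE(table.ContainsKey(42));
}
```

A test like this still passes if the physical layout changes, which is exactly the property the hard-coded-offset tests were missing.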
No class on Wednesday. I think everyone signed up, but if you haven't signed up yet for a meeting time on Wednesday, please do that, and then you have to submit your first code review on Wednesday night, doing all the steps that I said before, okay? Any questions? Thank you, bye.