All right, so I'm assuming everyone else is at the TensorFlow talk on campus. That's fine, we're here to talk about databases. Nobody needs TensorFlow, right? I'm kidding. So today is the first lecture we'll have on vectorized execution. On Tuesday next week I'll spend part of the time talking about another vectorized execution scheme, the one in Apache Quickstep, and then I'll spend the other half of the lecture on the trade-offs of query compilation and vectorization, how the two fit together, and how that relates to compressed data. I'll also talk a little bit about some of the research we're doing to solve some of the problems we'll see in today's class. For today, we'll cover the background of vectorization and what it means to have vectorized operators or a vectorized execution model inside a database system. Along the way we'll also talk about what the hardware looks like, because that's relevant to the evaluation at the end of the paper you read. Then I'll go through a subset of the algorithms from the paper by the group at Columbia. I realize the paper has a ton of different algorithms they try to vectorize, and there's code in the appendix. Hopefully you didn't read all of the code, but as I'll say later, I think of this paper as a good instructional guide: if you ever need to go back and implement vectorized algorithms, you can use it as a reference.

Before we begin the discussion of what vectorized execution looks like in a database system, I want to make some obvious observations to motivate today's topic. The first one everyone can figure out by now because you're working on project three: building a database system is hard. It's hard to build it and make it fast, and it's hard to build it and make it correct. Another obvious observation is that if you go to Taco Bell, you're probably going to have stomach problems, so try to avoid Taco Bell. I hadn't been there in a while; the last time I went, it was a rough ride. The last obvious observation is that the new CPUs that Intel and other companies are putting out are simply not getting faster. Call it the demise of Moore's Law: it's not like the 90s and 2000s where they would ratchet up the clock speed every generation to get better performance. They're hitting a limit on how fast you can clock silicon, or how much power you can push through silicon without it melting. So in ten years we're not magically going to have higher clock speeds where everything is just faster; we have to look at other ways to improve the performance of our database systems. There are essentially two types of CPU architectures available to us today. The first is what we'll call multi-core CPUs, and these are the ones you're probably most familiar with: the CPUs you get in a laptop, or the Xeons you get in a server machine. These CPUs are comprised of a small number of high-powered cores.
And by high-powered I mean they run at a higher clock speed and have a complex instruction set that lets a single core do a lot of different things. The other two key characteristics that matter for us are that they're massively superscalar and support aggressive out-of-order execution. What that means is that there's a long instruction pipeline, and although the program's instructions are laid out in sequential order, within a clock cycle the CPU tries to execute multiple instructions at once, preemptively going down the pipeline and picking out things it can execute early. It then has extra mechanisms to check dependencies and verify that the order it chose was valid. So the CPU is allowed to execute things in an order other than the one the program specified, as long as it does the extra checks to make sure everything still comes out correct. What this lets these heavyweight multi-core CPUs do is process more instructions per clock cycle than you would get if you executed them strictly one after another.

The other type of CPU architecture that's common today is the Many Integrated Core (MIC) architecture. Think of these as being like a GPU; the Xeon Phi is the canonical example. It's a CPU with a larger number of weaker, lower-power cores than a regular Intel Xeon. In the case of the Xeon Phi, I think the current version has 72 cores, and each of those cores is an Intel P54C. These are essentially the same Pentium cores that Intel developed in the 1990s, except back then you had a single socket with one of them, and now a single Xeon Phi has a couple dozen or more. The key difference from the cores I just talked about is that they're not the massively superscalar cores we just saw and they only support in-order execution. The original Pentium, when it came out in the 90s, was I believe the first Intel CPU with superscalar execution, but as these cores appear on the Xeon Phi they only support in-order execution. The way they compensate for not having out-of-order execution is by extending the instruction set on these cores with expanded SIMD capabilities, beyond what the regular Xeons have, that we can leverage in our algorithms. So you get wider registers and more instructions, more complex things than you can do on a regular Xeon. If you've never seen what a Xeon Phi looks like, this is a picture from Intel's marketing. It essentially looks like a GPU card, except there's no graphics output; it's meant purely for computation. With the newer versions of the Xeon Phi you can actually get them to sit in a socket, so you can buy a machine that has no regular Xeons, only Xeon Phis. And the naming is kind of weird, because the Xeon and the Xeon Phi are completely different architectures but they use the same branding, which I think is stupid. But that's their problem.
The newer ones, which I think came out last year, can actually sit in a socket. I don't know whether you can mix regular Xeons with Xeon Phis, but the idea is this: with the PCIe-card Xeon Phi, any data you need to crunch on has to be pushed down over the PCI Express bus and then sent back up to the CPU when you want to use it. GPUs have the same problem. If it sits in a socket, it can be cache-coherent with the regular CPUs and you don't have to do that marshaling back and forth. This will come up later when we look at the performance of these different algorithms: the fact that it only supports in-order execution but has wider SIMD registers leads to different trade-offs in the vectorized algorithms.

Another high-level way to look at these architectures is how they're organized with respect to memory and caches. This is the layout we saw before for the multi-core CPUs: a shared L3 cache and four cores on a single socket. In a NUMA architecture we could have a bunch of these sockets connected together over the interconnect. The MIC architecture in the Xeon Phi instead uses a ring: the cores don't have an L3 cache, only L1 and L2, and any time a core needs to read data from or communicate with another core, the data travels along this ring bus. I think it goes in one direction. This is also different from the paper we read earlier about the 1000-core study, which used a 2D mesh because that architecture was based on Tilera; but the ring is how Intel designs the Xeon Phi.

So why is vectorization important? Why does this actually matter? We've already spent a lot of time this semester talking about how to parallelize algorithms across multiple cores. We talked about parallel hash joins and parallel sort-merge joins; we talked about concurrency control, where multiple transactions run at the same time on different cores; we talked about how to do logging in parallel. We already know the techniques for parallelizing across multiple cores, but if we now throw in SIMD, we can get an even larger performance boost. Say my machine has 32 cores. If my algorithm is perfectly parallelizable, I get a 32x speedup when I run it across all the cores. But if each of those cores also has four-wide SIMD registers that I can use for vectorized execution, the speedup becomes multiplicative: a 32x speedup across cores, and then a 4x speedup within each core. So for a single algorithm, if I vectorize it, I can get a 128x speedup even though I only have 32 cores. And with wider registers that number goes higher. That's why vectorization is important: it's not enough to just parallelize across cores. To get the extra boost we need to support more complex queries and larger datasets, we want this multiplicative factor, and the system should be able to leverage it.
The high-level idea of vectorization is that we take a program written as a scalar implementation, where we process one data item at a time and apply one operator or instruction at a time, and instead we combine the items into a batch and apply a single instruction that processes them all in parallel and produces a single result for the batch. We saw earlier that the way to do this is through SIMD; we showed an example before, but now we can go into a bit more detail about what's actually going on. The SIMD instructions are extra CPU instructions that let us do these vectorized operations, processing multiple pieces of data in parallel. SIMD is not an Intel-specific term; it's a general term in computer science, and all the major instruction sets for the different microarchitectures out there have their own versions of it under different names. On x86 there was MMX first, then the SSE family, and then the later additions are the advanced vector extensions, AVX. The POWER architecture calls theirs AltiVec, and I think the newer ARMs have something called NEON. They're all essentially doing the same kinds of things you read about in the paper: vectorized addition, multiplication, comparisons, and gathers. Gathers are a bit more complicated, as we'll see as we go along, but at a high level these instruction sets all do the same thing. In the case of the Intel ones, the main differences across the later editions of SIMD are really how wide the registers are and what data types they support.

So this is the example we saw before, and it will be a recurring one when we talk about how to actually enable SIMD. We want to take the vectors x and y, add them together, and produce the result in the vector z. The way we'd implement this is with a simple loop: the naive implementation iterates one value at a time, takes the element at each offset in the two vectors, and adds them into z. That's single instruction, single data item: it literally executes one add at a time to produce each result. You could turn on loop unrolling in your compiler, but that only avoids the jump back to the top of the loop; you're still doing a single instruction per data item. With SIMD, the idea is that we combine a batch of these values into a single register. In this case we're using SSE with 128-bit registers, which means one register can store four 32-bit integers. We then apply the SIMD add instruction, and it produces a single output that is stored in another register. Depending on the algorithm, we can either keep reusing that register and apply some other operation to it, or extract it out and load it into the program's memory, into the L1 cache, and then process it any way we want.
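In code, that naive scalar loop looks roughly like this (a minimal sketch; the array names and the explicit length parameter are just placeholders):

    // Scalar addition: one instruction operates on one pair of values per iteration.
    void add_scalar(const int *x, const int *y, int *z, int n) {
        for (int i = 0; i < n; i++) {
            z[i] = x[i] + y[i];
        }
    }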
Actually, in the later versions of SSE, one of the big advantages you get versus other SIMD instruction sets is that you can have the result bypass the caches and go directly to memory. Say I'm doing a join and I'm scanning through, applying a SIMD operation; if I know I'm never going to read the result again and I want it out of my register, I can have the CPU store it directly to memory and avoid polluting my caches. And then we do the same thing for the rest of the vector.

So, the Streaming SIMD Extensions: when these came out in the late 1990s is when this stuff actually became useful. When MMX came out, which was Intel's first version of SIMD instructions, the registers weren't very wide, only 64 bits, and there wasn't an easy way to access them using intrinsics or compiler flags; you had to write your own assembly to move things between the registers and memory. MMX also prevented you from executing regular, non-SIMD instructions on the CPU while you ran the SIMD instructions, because it would block everything. The way these newer extensions work is that, with a 128-bit register, you can store four 32-bit scalar values, whether floats or integers, perform a single operation on all of them, and write the result out to another register.

As for the types of instructions you get with SSE: there's data movement, copying data in and out, which we'll see later with loads and stores. There are the basic arithmetic operations and the logical operations, which we'll need to do our comparisons when we evaluate predicates in parallel. Then the things that are most relevant to us are shuffling and the conversion instructions. Shuffling lets us take the contents of one SIMD register and write them into another register in a permuted order. The idea is that we want to do as much processing as we can on the data packed into our SIMD registers without having to go back to the CPU caches; so if we need to permute or transform values to align them a certain way for the output, the CPU can write from one register to the next without going back to the caches. The conversions transform data between how it's represented in the regular registers and how it's represented in the SIMD registers, so you can move between formats seamlessly. And the cache control I mentioned before lets us move data directly from the SIMD registers out to memory, avoiding cache pollution. That's useful for things like joins: as we scan through a table probing the hash table, when we find a match for the tuples in our SIMD registers, we don't want to put the result in our caches; we want to write it out to memory, because we know that for the rest of the scan we're never going to look at it again. It's only when the next operator starts that we may want to bring those results back into the CPU caches. The cache control is what lets us do that.
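As a rough illustration of that cache-control idea, here is a minimal sketch using the SSE2 non-temporal store intrinsic; the function and buffer names are made up, and a real system would wrap this inside its operators:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Copy 128-bit chunks to 'dst' with non-temporal stores, writing around the caches.
     * Assumes 'dst' is 16-byte aligned and 'n' is a multiple of 4 ints. */
    void copy_streaming(const int *src, int *dst, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_stream_si128((__m128i *)(dst + i), v);   /* bypass the cache hierarchy */
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }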
This is a table from a presentation by James Reinders, a director at Intel and one of the lead people working on SIMD there, and it shows, for the Intel architecture, how the different editions of SIMD have evolved. MMX came out in 1997, and then the SSE family came later with wider registers. As you can see, they supported integers from the very beginning, but only the later editions added single-precision and then double-precision floating point. At the bottom you see that AVX is the latest family of SIMD instructions, with 256-bit registers, and then AVX-512 is where you go to 512 bits. We talked before about how, for sort-merge joins, we need this in order to do the parallel sort-merge, because we have to be able to sort 128-bit values that pack together the tuple offset and the actual sort key. But this is currently not available in the regular Xeon CPUs; the Xeon Phi has it. Supposedly it's coming in the next Skylake Xeons from Intel later this year. I don't know whether that's true or not; I can't get a clear answer from anything I've read online. But if it is true, then we can actually start doing the parallel sort-merge using SIMD, which I think would be kind of cool.

The other thing to point out is that as you go down the table, the registers get wider, but I think it's unlikely they'll go to 1024-bit SIMD registers, because 512 bits is the current cache line size. If you go past that, you have to change other aspects of the system, and that may cause problems. So I suspect Intel will ship the AVX-512 instruction set in the regular Xeons pretty soon, but whether they ever go to 1024 is not clear.

Yes? [A student asks whether, as the SIMD registers get wider, more cycles are wasted loading data into them.] So the question is: as the registers get wider, do you waste more cycles copying data into the registers? I think that's independent of the width of the register itself; it depends on how many separate things you want to store in it. We'll actually talk about this in a second: in the paper they talk about the vectorized load and store, and you can't do those truly in parallel because the CPU only allows one or two accesses to L1 per cycle. So yes, if you want to fill a 512-bit register with a lot of lanes, because maybe you're doing SIMD on 16-bit integers, then it takes more time to fill it, because of how the L1 works. If we're using 128-bit lanes, then you pay four accesses. So it depends on how many lanes you want to use, not on the size of the values. Yes? [A student points out that 512 bits is the cache line size, so such a load or store can be as fast as a normal one.] Right, so the statement is, and I said this before, 512 bits is the cache line size. So you're saying that if it's larger than this, it's just the same as doing a regular load and store to memory? Is that what you're saying? Sorry.
[The student clarifies: if the values come from the same cache line, you don't need a gather at all.] Right, so he's saying: if all the values you need to pack into your register are in the same cache line, that is, they're contiguous, then it's just one load to pull in that cache line and fill the register. He's right. But if it's a scatter or gather where you're reading from different parts of memory, then you pay roughly a cycle per load, assuming everything is in L1.

Okay, so this sounds awesome, right? SIMD is clearly something we want to be using in our database system. This is one of the ways Intel is going to deliver better performance in its CPUs: in addition to adding more cores, they're going to add better SIMD instructions. So now we have to figure out how to actually enable this, and we basically have three choices. The first is that we don't do anything in our code. We're the database developers, we get paid a lot of money to work on our system, and we don't bother changing any of our algorithms; we just enable some flags in the compiler, in GCC or the Intel compiler ICC, and we hope the compiler is smart enough to find opportunities in our code to vectorize. The problem is that this only works for really simple loops, and the loop has to be constructed in exactly the right way. In the context of a database system, such simple loops are rare, because we're doing complex things. This is part of the reason I showed you the VectorWise primitives at the end of last class: they create these really simple loops that each do one simple operation, and then they rely on the compiler to find those loops and vectorize them. But in general, if you have a more complex code base, the compiler is not going to find these opportunities. And if you're doing query compilation with LLVM, LLVM is not going to do as aggressive auto-vectorization as GCC or ICC will; you essentially have to program these things yourself. Obviously, relying on the compiler to turn on vectorization also requires hardware support, otherwise the compiler just falls back to scalar code; but nowadays, as I said, most CPUs have this.

So let's look at the example from before and talk about why automatic vectorization may not work. This is the same loop where I take x and y and write their sum into z. I want you to guess whether this can be automatically vectorized, and why or why not. Who here has taken a compilers course? Okay. So let's start with the simple question: can this be automatically vectorized, yes or no? Raise your hand if you say yes. A few. Raise your hand if you say no. Nobody. Okay, the answer is no. The compiler is actually not allowed to vectorize this, because it doesn't know what these pointers are actually pointing to; they're passed in at runtime. x and z could point to the same, or overlapping, locations in memory. That means that when you write into z, you may also be writing into x. So say at iteration 0 you write into z.
That write could overwrite the value that iteration 1 is about to read from x. If you vectorized this, you could end up with an incorrect result. So the way this code is written, and this is just a byproduct of C and C++, the program is telling the compiler that it has to be done sequentially, because there are potential dependencies inside the loop where the output of one iteration may affect the input of the next. The compiler looks at it and says: I can't do anything, because I don't know whether you truly meant for the write into z to overwrite something in x. It doesn't know what you actually intended, so it plays it safe; the compiler's job is to produce correct code, so it skips the optimization.

The way to get around this is to provide hints to the compiler. We can tell it in our code: look, I know what's actually going on here and I know what data I'm going to pass in, so it's okay to go ahead and vectorize this. It may not always do it, but in general it will. There are two ways to do this: we can give it information about what we know about the memory locations being passed into the function, or we can tell it to skip the loop-dependency checking it would normally enforce and just vectorize. The first way is to use the restrict keyword from C. I don't think restrict is in the C++ standard, it's a C99 feature, but I compiled this with the usual compiler extension and it works; good enough. With the restrict keyword you're telling the compiler that the memory regions behind x, y, and z are guaranteed to be distinct and not overlap, so it's okay to vectorize. Intel's compiler will definitely do this, and I think GCC will handle it too if you pass in the vectorization flags. The other approach is to use a pragma to tell the compiler that, for the loop right below it, it should ignore the dependency checking. That's the ivdep pragma, which means ignore vector dependencies. The compiler then says: fine, I don't care whether x, y, and z alias; I'm going to vectorize this exactly as you asked. There are other directives too: I've seen OpenMP SIMD directives, and there's #pragma simd and a bunch of others that I think all do the same thing. The key point is that it's now up to you as the programmer to make sure the result is actually correct, because the compiler will do exactly what you told it to. You have to go back and decide: is this speeding up my program, and is the result still correct? In a database system, unlike machine learning, you can't be loosey-goosey about this. In machine learning you can sometimes be sloppy with the weights and numbers because it doesn't really affect the final outcome; there isn't always a single correct answer, since it's doing approximation. But if you're computing salary totals or bank account balances, you want those computations to be precise, so these hints can cause problems if you get them wrong.
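Roughly, those two hints look like this on the earlier loop (a sketch; __restrict__ is the common compiler spelling of the keyword, and #pragma ivdep is the ICC form, with GCC spelling it #pragma GCC ivdep):

    /* Hint #1: promise the compiler that x, y, and z never alias each other. */
    void add_restrict(int *__restrict__ x, int *__restrict__ y,
                      int *__restrict__ z, int n) {
        for (int i = 0; i < n; i++) {
            z[i] = x[i] + y[i];
        }
    }

    /* Hint #2: tell the compiler to ignore assumed loop-carried dependencies. */
    void add_ivdep(int *x, int *y, int *z, int n) {
    #pragma ivdep   /* Intel ICC; GCC spells it "#pragma GCC ivdep" */
        for (int i = 0; i < n; i++) {
            z[i] = x[i] + y[i];
        }
    }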
The last way to get vectorization is to put explicit SIMD instructions in your code that tell the CPU exactly what you want it to do. The idea is that you don't bother with pragmas or directives or hints; you just write your code to directly use the SIMD operations you know you want. This is essentially the approach the Columbia group takes in the paper: if you look at the appendix, they use CPU intrinsics in all the different algorithms. There are two downsides. First, it's hard to write; the syntax for the intrinsics is a bit arcane, and you constantly have to look things up in the reference tables because the names are abbreviated. Second, it's not portable, and I'll show what that means in a second: if you write against the 128-bit SSE instructions and then want to move to AVX-512, the intrinsics all have to be changed.

So this is the same loop we had before. In the first part I load the vectors being passed in into 128-bit SSE registers, and then in the loop I invoke the vectorized addition of 32-bit integers on my two vectors and write the result out to the output vector. This is me telling the compiler exactly which instructions I want: it sees these intrinsics and converts them into the inline assembly for those exact instructions. And the portability issue I mentioned is that, since I'm using 128-bit registers here, if I moved to AVX with 256- or 512-bit registers I'd have to change these types and widths. I think there may be libraries that abstract this away, but I haven't looked closely enough to know whether there's something good there.
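Here's a sketch of that explicit version with 128-bit SSE intrinsics. This isn't the paper's exact code, just the shape of it; porting it to AVX would mean swapping in the 256- or 512-bit types and the corresponding _mm256_*/_mm512_* names:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Explicitly vectorized addition with 128-bit SSE registers: four 32-bit
     * integers per iteration. Assumes n is a multiple of 4; a real implementation
     * would also need a scalar tail loop. */
    void add_sse(const int *x, const int *y, int *z, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128i xv = _mm_loadu_si128((const __m128i *)(x + i));
            __m128i yv = _mm_loadu_si128((const __m128i *)(y + i));
            __m128i zv = _mm_add_epi32(xv, yv);   /* 4 adds in one instruction */
            _mm_storeu_si128((__m128i *)(z + i), zv);
        }
    }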
Any questions about this? So there are three approaches: you can hope the compiler figures it out for you, you can provide hints to get the compiler to vectorize certain parts, or you can write explicit instructions to do exactly what you want.

All right, so now we can talk about how to use these. For our purposes, assume we're always using intrinsics, so we know exactly what our system is going to be doing, and the question is how to use them to do vectorized execution in a database system. The way the paper structures this is that they start with some basic primitives or fundamentals of vectorized execution, things like predicate evaluation and dealing with compression, and then build those up into more complex things like sorting and merging, and from there combine everything to implement the more complex operators we want in our relational algebra, things like joins and bucketized hash tables. So there are simple building blocks, and we build on them to do the more complex things. This is why I had you read this paper: it starts at the beginning and builds up to the kinds of operations we've been talking about all semester in our relational operators. There are two guiding principles they want their vectorized algorithms to follow.

The first is that they favor vertical vectorization over horizontal vectorization, which lets us process a different data item in each lane. Think of a lane this way: if I have a 128-bit SIMD register and I can store four 32-bit integers in it, each of those integers sits in a lane, so I have four 32-bit lanes in a 128-bit register. The idea behind vertical vectorization is that we're always processing new data in all our lanes, rather than recomputing things we already know the answer to. The second principle is to keep the lanes fully utilized by letting them work on different data items from the input sequence, possibly out of the sequential order in which the data was handed to our operator. I'll cover this more when we talk about hash table probing, and it will become more obvious there.

All right, so first the basic fundamental operations that can be vectorized: selective load and selective store (there should be a 't' on that slide), and then gather and scatter. We'll build up from those to do vectorized scans, vectorized histograms, and vectorized joins. For a selective load, the basic idea is that we want to take some contents of memory and write them only into particular lanes of a register. In the picture we're going from the top to the bottom: the target is the SIMD register we want to write data into, and the data comes from some location in memory. We have a mask with zeros and ones, one bit per lane of the SIMD register. If a lane's bit is zero, we don't write anything into it; if it's one, we do, and we walk along memory in sequential order, handing out values to the enabled lanes. So in the first step here, the first lane's bit is zero, so we don't write anything into the first lane. The next lane's bit is one, so we take the value at the first memory location at our starting address and write it into that lane. The next bit is zero, so we skip that lane, and the last one is one, so the value at the second memory position gets written into that lane. Selective store is the reverse: now the top part, the SIMD register, is the source, and we want to write its contents out to memory. Again there's a mask that tells us which lanes should be written out. If a lane's bit is zero we skip it; if it's one, that lane gets written to the next position in memory. So the first enabled lane goes to the first position in memory, we skip the zero, and the last lane in the register gets written to the second position in memory. We can build more complicated things on top of these, but as I said, depending on where the memory actually is, you can only do one or two accesses per cycle, so this may not happen entirely in parallel.
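On hardware that has AVX-512, like the Xeon Phi, these two primitives map fairly directly onto the mask-based expand-load and compress-store intrinsics. A minimal sketch, assuming 16 lanes of 32-bit values:

    #include <immintrin.h>   /* AVX-512F intrinsics */

    /* Selective load: for each lane whose mask bit is 1, pull the next value from
     * 'src' (in order); lanes whose bit is 0 keep their old contents. */
    __m512i selective_load(__m512i target, __mmask16 mask, const int *src) {
        return _mm512_mask_expandloadu_epi32(target, mask, src);
    }

    /* Selective store: write only the lanes whose mask bit is 1 out to 'dst',
     * packed contiguously in lane order. */
    void selective_store(int *dst, __mmask16 mask, __m512i value) {
        _mm512_mask_compressstoreu_epi32(dst, mask, value);
    }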
Now, gather and scatter. This is slightly different from the scatter/gather you'd see in a distributed database system, where the term usually means that a query has to touch data across multiple nodes, so you break it into smaller tasks, scatter those tasks to the different nodes, each node processes its local data, and then the results are gathered back at a single node. Here we're not scattering queries; we're moving data items back and forth between memory and our SIMD registers. For a gather, the target is a SIMD register we're going to write into, and we have another SIMD register, the index vector, that tells us, for each lane, which location in memory to read from. In this case, for the first lane the index is two, so counting zero, one, two, we take that value and load it into the first lane. For the second lane the index is one, so that value goes straight into the second lane, and so forth for five and three. The inverse is a scatter, where we still have an index vector corresponding to the lanes, plus a value vector, and the index tells us at which memory position to write the contents of each lane. Here the first lane's index is two, so counting zero, one, two, its value goes there; the second lane's index is one, so its value is written there, and so forth.

So this looks awesome, right? It looks like we can do complex data movement in parallel using SIMD instructions. Well, it turns out these are not actually atomic or fully parallel, because, as I said, you can only do one or two loads or stores to the L1 cache per cycle. The other caveat is that not all CPUs support vectorized scatter and gather. The gather instruction only shows up in the newer Haswells with AVX2, so an older CPU like a Nehalem doesn't have a vectorized gather at all. And the scatter isn't in any current Xeon CPU; it's only on the Xeon Phi. Furthermore, the selective loads and stores aren't fully vectorized on the regular Xeons either. For the scatter, the compiler just emits sequential assembly: when you use the intrinsic, the compiler says, I don't have a vectorized instruction for this, here's the assembly that does the same thing, but it runs in serial order. For the selective loads and stores it's not quite that bad; they're implemented with some vectorized permutations that give you the answer you're looking for, but it's not the single instruction you'd expect from the picture I'm showing. So selective loads and stores may be slightly slower than a truly vectorized version, and the scatter has no vectorized version at all on the Xeons; on the Xeon Phi all of these are vectorized.
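And the gather and scatter primitives correspond to intrinsics like the ones below (again a sketch assuming an AVX-512 target; AVX2 has a 256-bit gather, but the 32-bit scatter only exists with AVX-512):

    #include <immintrin.h>   /* AVX-512F intrinsics */

    /* Gather: lane i of the result is base[index[i]]. */
    __m512i gather32(const int *base, __m512i index) {
        return _mm512_i32gather_epi32(index, base, /*scale=*/4);
    }

    /* Scatter: base[index[i]] = lane i of value. */
    void scatter32(int *base, __m512i index, __m512i value) {
        _mm512_i32scatter_epi32(base, index, value, /*scale=*/4);
    }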
All right, so now that we have selective loads and stores plus scatter and gather, we can look at how to do the more complicated things we expect a database system to do. We'll talk about selection scans, hash tables, and partitioning. In the paper they also cover joins, sorting, and Bloom filters. We covered the sorting material before when we talked about sort-merge joins. The Bloom filter part is kind of obvious: you compute a bunch of hashes on your input key and do a bunch of vectorized lookups into the Bloom filter to see whether the bits match. For joins, I'll show the slides at the end, but actually I should just say this up front: the spoiler is that a lot of this stuff doesn't actually work in practice. It's not just that the paper uses 32-bit keys and 32-bit tuple IDs; it's also that they assume everything fits in your CPU caches. We'll see this again when we look at the join numbers. Even if you had AVX-512, once you start exceeding the CPU caches you're always going to be bottlenecked on stalls to memory, so vectorizing doesn't actually help. On Tuesday next week I'll talk about a technique from our own research here where we rely on software prefetching to mask some of the latency of those memory stalls and actually make some of these vectorized operations pay off. So it's like when we talked about sort-merge joins: I said yes, it's an interesting approach, but you can't really use it as-is. I still think it's worth studying, because whether or not you go build a database system in the future, you may come across a problem where you say, oh, I know how to do this from class, I don't have to worry about 128-bit key-value pairs, so I can use SIMD the way we're discussing here. But up front: this stuff doesn't always work as presented, because they make a huge assumption that the dataset, or at least whatever you're sorting or operating on, is small enough to fit in the caches. It is what it is.

For the first example, I'll show you how to do SIMD selection scans. This one actually does work in existing systems; whenever you see vectorized execution in practice, SIMD selection scans are probably the most common use. I think Apollo does this, VectorWise gets it automatically from the compiler, and I think HANA does this as well; several in-memory systems can do it. Our sample query is a simple SELECT * on a table with a predicate where the key is greater than or equal to some low value and less than or equal to some high value, so we're doing a range scan over a subset of the table. First I want to show how to do this in a scalar, sequential manner, in two different ways: one with branching and one without. Once you've seen the non-vectorized implementation without branching, you'll be able to recognize how to do it in the vectorized world, because you can't vectorize the branching version, but you can vectorize the branch-free one. If you were building a database system from scratch for the very first time, the branching version is basically how you would implement a sequential scan; it's how our own system currently works. The basic idea is that you have a for loop over the table.
You iterate over every single tuple, extract the key for that tuple, and then do the predicate comparison: key greater than or equal to the low value and less than or equal to the high value. If the predicate evaluates to true, you branch into the if-clause, copy the tuple into your output buffer, and increment the output position by one. So, given what I said earlier about out-of-order execution, what's the problem with this approach? I realize you're reading source code on a slide, but you kind of have to here. What would happen, say, if my query's predicate has 50% selectivity, meaning that out of a billion tuples half evaluate to true and half to false? What happens in the actual CPU? [A student answers: branch mispredictions.] Exactly. The hardware wants to do out-of-order execution; it sees this if-branch and guesses that it's probably going to evaluate to true, so it goes ahead and executes the instructions inside it and starts copying the data. But if the guess is wrong, it has to undo that work, flush the instruction pipeline, and jump back to do what it actually should have been doing. At the two extremes, if the selectivity is 0%, meaning no tuples match, or 100%, where every tuple matches, the branch predictor will almost always be right. It's the cases in between where it's essentially a coin toss whether the prediction is correct. So the key problem is this if-clause: depending on the query and the data, we may or may not go down the branch, and the CPU has a hard time predicting it. That's the thing we want to avoid.

We can actually rewrite this sequential scan without any if-clause at all. In the for loop I still grab every single tuple, but the very first thing I do is copy the tuple into the output buffer, before I've even looked at whether the key matches. Then I extract the key, and I compute a match flag, an integer, by evaluating the predicate: key greater than or equal to low, and key less than or equal to high. If both of those evaluate to true, the flag gets set to one. These are written as ternary-style expressions that the CPU can execute with a handful of instructions and no branches. Then, depending on whether the flag is 0 or 1, I either leave the output index where it is or advance it. If it's 0, then the next time around the loop my next copy just overwrites the tuple I shouldn't have kept; if it's 1, I've moved the index forward, so the tuple I wrote the first time is retained. This seems kind of crazy: we copy first and decide afterwards whether the tuple should be there. As humans writing code, it's counterintuitive, but remember we're an in-memory database; we're not fetching things from disk, so the penalty of branch misprediction is actually greater than the penalty of doing this redundant copying. Another way to think about it is that the CPU executes exactly the same set of instructions for every iteration of the loop; there are no conditionals, so there are no branches to predict, and it just blasts through the data.
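In code, the two scalar variants look roughly like this (a sketch with made-up names; to keep it short, the "copy" just records the tuple's offset rather than the whole tuple):

    /* Branching version: the if-clause is what the branch predictor has to guess. */
    int scan_branching(const int *keys, int n, int low, int high, int *out) {
        int j = 0;
        for (int i = 0; i < n; i++) {
            if (keys[i] >= low && keys[i] <= high) {
                out[j] = i;   /* copy the matching tuple (here, just its offset) */
                j++;
            }
        }
        return j;
    }

    /* Branchless version: always copy, then conditionally advance the output index. */
    int scan_branchless(const int *keys, int n, int low, int high, int *out) {
        int j = 0;
        for (int i = 0; i < n; i++) {
            out[j] = i;                                     /* copy unconditionally */
            int match = (keys[i] >= low) & (keys[i] <= high);
            j += match;                                     /* keep it only if it matched */
        }
        return j;
    }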
To see how this affects performance, there's a graph in one of the optional papers from the VectorWise group where they vary the selectivity of the query along the x-axis, so 0% means none of the tuples match and 100% means every tuple matches, and they measure the number of CPU cycles spent to process the tuples for the two variations. The flat red line across the middle is the no-branching approach, and as I said, its performance is always the same because it always executes the same instructions regardless of the query or the data. The other curve is the branching case. What you see is that at very low selectivity the branching case is better, because you're not paying the penalty of copying things you don't need. But as the selectivity increases, it rises above the branchless line and actually becomes worse, because you're paying the branch misprediction penalty. Eventually the CPU gets better at predicting, because it can assume most tuples will be selected, and the curve converges back toward the no-branching case. Yes? [A student asks: couldn't you get a more direct comparison if you rewrote the branching case to do the copy unconditionally as well, leaving only the increment inside the branch?] Right, so the suggestion is to hoist the copy out of the if-clause so the branch contains only the increment. The compiler might actually be smart enough to do that rewrite already, because it can see that the only thing inside the branch would be i = i + 1. What I'll say, though, is that when we look at the Columbia results, they have one graph that doesn't follow this pattern, which I don't fully understand; I've seen the same shape in other experiments and other systems, but for whatever reason the Columbia numbers don't exhibit it, and that part of their results I don't understand.

So now let's look at the vectorized version of the branchless sequential scan. In the pseudocode I'm using a v subscript to denote the vectorized version of a variable. As I iterate over the table I get a vector of tuples, then do a vectorized load to put the keys into my SIMD register. Then I do my comparison, and that generates a mask like the one we saw for selective load and store. If at least one bit in the mask is set, I do a selective store of that data into the output buffer and advance the output position. So there's still an if-branch here, and I think it could be rewritten without one, but it's not as bad as in the scalar approach because it covers multiple tuples at a time. To walk through it: say you have a simple query, SELECT * FROM table WHERE key >= $low AND key <= $high, and this is our input sequence. We start by loading all of these values into the key vector. This is another place where DSM makes a big difference: if the data is stored as a column, I can jump to the starting point and load that whole run into my register in a single load operation, whereas with NSM I may be jumping to scattered locations that don't all fit in a single cache line. Then I do my SIMD compare, which generates the mask, and the mask is used together with a vector of offsets, 0 through 5 here, in a SIMD selective store that writes the matching offsets into the output vector. Note that I'm not copying the key values I evaluated, or even the IDs of the tuples; what I'm writing out are the corresponding offsets into the column, so a one refers to this position and a three to that one.
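Putting the primitives together, one way the vectorized branchless scan could look with AVX-512 intrinsics is sketched below. This is my rendering rather than the paper's code, and it assumes a columnar layout of 32-bit keys, late materialization of offsets, and an input length that's a multiple of 16:

    #include <immintrin.h>

    /* Write out the offsets of all keys in [low, high]; returns the match count. */
    int scan_simd(const int *keys, int n, int low, int high, int *out) {
        __m512i lo = _mm512_set1_epi32(low);
        __m512i hi = _mm512_set1_epi32(high);
        /* lane offsets 0..15, advanced by 16 each iteration */
        __m512i idx = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
        __m512i step = _mm512_set1_epi32(16);
        int j = 0;
        for (int i = 0; i < n; i += 16) {
            __m512i k = _mm512_loadu_si512(keys + i);            /* 16 keys at once   */
            __mmask16 m = _mm512_cmpge_epi32_mask(k, lo)
                        & _mm512_cmple_epi32_mask(k, hi);         /* predicate mask    */
            _mm512_mask_compressstoreu_epi32(out + j, m, idx);    /* selective store   */
            j += __builtin_popcount((unsigned)m);                 /* advance by #hits  */
            idx = _mm512_add_epi32(idx, step);
        }
        return j;
    }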
So now we can compare all these different approaches from the paper. Here I have two graphs: this one is for a Xeon Phi, which has 61 cores with four hyper-threads per core, and this one is a regular Xeon, I think a Haswell, a single-socket CPU with four cores and two hyper-threads per core. Along the x-axis they vary the selectivity, and they measure throughput in billions of tuples per second. That seems like a lot, but remember they're not implementing this inside a real database system; these are little toy implementations of just the algorithm, so you're not paying to move things up through an entire query plan pipeline, you're just measuring what it takes to do the selection scan itself. That's why the numbers are so high.

The first thing we see is that for the scalar branching case on the Xeon Phi, you don't really see a difference in performance as you change the selectivity, because it's doing in-order execution, so there are no branch mispredictions to pay for. The reason it falls off at the end is that you essentially run out of cache; you start saturating the memory bandwidth trying to write things to your output buffers. On the regular Xeon, this is where I said I don't fully understand why they don't exhibit the big branch misprediction penalty: at 0% selectivity, or in the middle range here, you'd expect the curve to get worse and dip down, but it only drops a little. I would expect it to dip lower and then arc back up; and of course once you get over to the right side you saturate memory bandwidth, which is why it trends down there. Then you can compare this with the branchless approach. On the Xeon Phi this one makes sense to me: with in-order execution the branchless version always copies every single tuple, so you get no benefit over the branching one, which is why it's lower, and again everything saturates memory bandwidth on the right. On the Xeon, I would have expected the two lines to cross in the middle; they both hit the memory bandwidth bottleneck on the right, which is why they converge, but I'd expect them to flip somewhere in the middle as the selectivity increases, and you don't really see that. Actually, maybe you do see it around 50%, so maybe that makes sense: maybe this is the low-selectivity part of the graph I showed earlier, and because the x-axis jumps from 10 to 20 to 50 percent you don't see the crossover arc clearly. And of course at 100% selectivity they're all saturating memory bandwidth.
Then you have the vectorized versions: one with early materialization, where you actually copy the tuple into the output buffer, and one with late materialization, where you only pass along the offsets. The Xeon Phi results make sense: with late materialization you just copy the offsets out and don't have to do anything else, and if you instead go back and copy the actual data into your output buffer, that slows you down. I don't fully understand why the two lines match up over here on the Xeon; I would expect late materialization to separate from early materialization the same way, so maybe I have a detail wrong, but I don't think they should be identical.

So now let's look at how to do vectorized hash table probing. Again, first the scalar version, and then we'll see how to make it run in parallel. In the scalar version, for each input key in your for loop, iterating over every tuple, you hash the key and that gives you some offset into your hash table. For this I'm using the open-addressing, linear-probing hash table we talked about before: instead of having linked lists of buckets, the hash points to a single slot, and if you find a match there you're done; if not, you do a linear scan down the table until you either find the key you're looking for or hit an empty slot, which means your key isn't in the table. So in this first lookup, key 1 doesn't equal key 9, so we know we need to keep scanning down and doing additional comparisons. Again, this is the scalar version: a single tuple at a time, a single key at a time, and a single comparison at a time as we scan through the table.

The way to do this with horizontal vectorization is to change the hash table so that every slot stores multiple keys with their corresponding multiple payloads. In this case we have four keys per slot and therefore four payloads per slot. We still process a single probe key at a time: we hash it just like before and it points to some slot, but now at that slot we read back a whole vector of keys, and we can use a SIMD compare to evaluate our input key against every key in that vector at once. That produces a mask that tells us whether we found a match or need to keep scanning down. The way inserts work with this layout is that when you hash a key and land at a slot, if all the positions in it are full you go to the next slot, and if a position is available you write into it; so during a probe, as soon as your comparison finds an empty position you know you can stop scanning. This is nice because it parallelizes the comparison, one probe key against multiple stored keys, but the problem is that we're still processing one key at a time from our input sequence.

With vertical vectorization, the idea is to take a batch of probe tuples and process them in parallel against the table. We hash them all at once, producing a hash vector, and those hashes point to different locations in the table. Then we can use a SIMD gather to pull the values at those positions into a SIMD register.
Then we do our SIMD comparison, and that produces a zero or a one in each lane based on whether we have a match. If they're all ones, that's the easy case: we just go back to the input sequence and move on to the next group of keys. But it's very likely they won't all be ones. In this example the first lane and the last lane are ones, but the middle lanes are zeros. The naive way to deal with this is to say: for those two unmatched keys, I'll just increment their addresses in the hash table and fetch new values for them, keeping whatever the first and last lanes had, and run the same comparison again; I don't want to advance the matched lanes, because that would produce an incorrect result, so if I know those lanes already matched, I avoid looking any further for them. But that gives us bad lane utilization: the matched lanes are essentially doing wasted work, because we already know they match and don't need to keep comparing them. It gets worse: if, say, the second key then matches and only one lane is left, we're essentially back to doing the work sequentially, and it's even worse than the scalar version because we're paying to move things in and out of SIMD registers, which doesn't come for free.

The way the paper handles this, and this is the vertical vectorization, is to recognize that the first and last lanes have matched, go back to the input sequence, load the next two keys we need into those lanes to replace them, hash those new keys to find their starting locations, and leave the unmatched lanes alone except for incrementing their offsets by one. Then we go back to the hash table, do the same gather-and-compare probe again, and now every lane is looking at new data. This isn't free; there's real bookkeeping to track which lanes have already matched and which haven't, but you get much better lane utilization. They also point out in the paper that this approach makes the algorithm unstable, meaning that if I run the same query on the same database through this vectorized algorithm, it may produce results in one order one day and a different order the next. That's not necessarily a bad thing, and in a database system we mostly don't care, because we're dealing with bag algebra over relations; we don't guarantee the order of tuples in the output. If you care about the order and want the result to always be stable, you should put an ORDER BY in the query and we'll do an order-by. So I don't think this makes a big difference; it doesn't affect correctness, though it may make things harder to debug, because every time you run this vectorized probe in your join it may do things in a different order. But, you know, that's why they pay us a lot of money to work on these things; we'll deal with it.
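Here's a rough sketch of what one version of that vertical probing loop could look like with AVX-512. This is a simplified rendering, not the paper's code: the table layout, the hash function, the empty-slot sentinel, and the handling of the input tail are all placeholder assumptions.

    #include <immintrin.h>

    #define EMPTY_KEY  0x7FFFFFFF   /* hypothetical "slot is empty" sentinel      */
    #define HASH_MAGIC 0x9E3779B1   /* simple multiplicative hash constant        */

    /* Vertically vectorized probe of a linear-probing hash table.
     * ht_keys/ht_vals: the open-addressing table, capacity = cap_mask + 1 (power of 2).
     * keys: probe keys; matched payloads go to out_vals. Returns #matches written.
     * Simplifications: 32-bit keys/payloads, no miss output, and the final partial
     * group of lanes is not drained once fewer than 16 input keys remain. */
    int probe_vertical(const int *ht_keys, const int *ht_vals, int cap_mask,
                       const int *keys, int n, int *out_vals)
    {
        const __m512i one   = _mm512_set1_epi32(1);
        const __m512i cap   = _mm512_set1_epi32(cap_mask);
        const __m512i empty = _mm512_set1_epi32(EMPTY_KEY);
        const __m512i magic = _mm512_set1_epi32(HASH_MAGIC);

        __m512i k   = _mm512_setzero_si512();   /* probe keys currently in the lanes */
        __m512i off = _mm512_setzero_si512();   /* current table offset per lane     */
        __mmask16 done = 0xFFFF;                /* lanes that need a fresh input key */
        int i = 0, j = 0;

        while (i + 16 <= n) {
            /* Refill finished lanes with the next input keys (selective load). */
            k = _mm512_mask_expandloadu_epi32(k, done, keys + i);
            i += __builtin_popcount((unsigned)done);

            /* Fresh lanes get a new hash bucket; the others advance by one slot. */
            __m512i h    = _mm512_and_epi32(_mm512_mullo_epi32(k, magic), cap);
            __m512i next = _mm512_and_epi32(_mm512_add_epi32(off, one), cap);
            off = _mm512_mask_blend_epi32(done, next, h);

            /* Gather the table keys at each lane's offset and compare. */
            __m512i tk = _mm512_i32gather_epi32(off, ht_keys, 4);
            __mmask16 match = _mm512_cmpeq_epi32_mask(tk, k);
            __mmask16 miss  = _mm512_cmpeq_epi32_mask(tk, empty);

            /* Matched lanes: gather the payloads and selectively store them out. */
            __m512i tv = _mm512_mask_i32gather_epi32(_mm512_setzero_si512(),
                                                     match, off, ht_vals, 4);
            _mm512_mask_compressstoreu_epi32(out_vals + j, match, tv);
            j += __builtin_popcount((unsigned)match);

            done = match | miss;   /* these lanes take new keys next iteration */
        }
        return j;
    }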
Alright, we're running out of time, but I'll go through this quickly. For this experiment they compare the performance difference between the scalar approach, the vectorized horizontal probing, and the vertical probing, and along the x-axis they increase the size of the hash table, which makes a big difference, as you'll see in a second. In the scalar version, what you see is that on the Xeon Phi there's not really a big performance drop as you... actually, I take that back: you see the same drop on both of these architectures, because at this point here, around 1 megabyte, the hash table becomes larger than your cache. The Xeon Phi cores don't have an L3, so you're exceeding your L2 cache and then it flatlines; in the case of the Xeon, going beyond 4 megabytes is when you exceed your CPU caches, and that's why it falls off here. Then in the case of the vectorized versions, you don't see a major difference with horizontal vectorization on the Xeon Phi; I actually don't know the reason why that's the case, I'd have to think about it for a bit, but you definitely see a huge performance improvement with the vertical one. In the case of the Xeon, they flip going from 64 kilobytes to 256 kilobytes, where the horizontal one actually performs better than the vertical one, but again, they all essentially become the same once you exceed your CPU caches. This is what I said before: all these algorithms seem really sophisticated, but once you're past the CPU caches it doesn't actually matter; beyond the caches they're all the same.

Okay, so I'll quickly show you how to build a histogram when you're using scatter and gather to increment things, and then how we can replicate the histogram to handle collisions. Let's say this is our input key vector, and we use the SIMD radix extraction that we talked about before with the hash joins to generate a hash vector. Each of these hashes points to a position in our histogram, and all we do is increment the count at that position by one. But this actually doesn't work, because the second key and the last key hash to the same location. If you try to do a vectorized add on this, whichever value gets written last is what ends up there; you can't do this atomically with a compare-and-swap, so one lane writing its value will just override the other lane's. The way you handle this is to replicate the histogram, that is, have multiple copies of it, one per lane. Each lane in our hash index vector corresponds to its own copy of the histogram, so the first lane only writes to its copy, the second lane only writes to its copy, and so on; they're always writing to separate locations. Then you just do a SIMD add at the end to combine them together and produce the correct result. You have to do an extra add and you need more registers to do this, but you get the correct result. Note that this doesn't work if you have composite keys; it only works if the key you're building the histogram on is a single scalar. If you want to build a histogram on five attributes, this doesn't work, but for single scalars it does.
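Here is a minimal sketch of the replicated-histogram trick, again with plain loops standing in for the SIMD gather/add/scatter; the 8-bit radix, the lane count, and the function name are assumptions made for illustration.

```cpp
// Minimal sketch of conflict-free histogram building: one private histogram copy per lane
// so two lanes can never scatter to the same location, then a final add merges the copies.
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr int LANES = 4;                         // assumed vector width
constexpr int BUCKETS = 256;                     // assumed 8-bit radix -> 256 partitions

void build_histogram(const std::vector<uint32_t>& keys, uint32_t hist[BUCKETS]) {
  uint32_t lane_hist[LANES][BUCKETS] = {};       // one zeroed copy per lane
  for (int b = 0; b < BUCKETS; ++b) hist[b] = 0;

  size_t i = 0;
  for (; i + LANES <= keys.size(); i += LANES) {
    for (int l = 0; l < LANES; ++l) {            // one SIMD gather/add/scatter in practice
      uint32_t bucket = keys[i + l] & (BUCKETS - 1);   // radix extraction: low 8 bits
      lane_hist[l][bucket] += 1;                 // each lane only touches its own copy
    }
  }
  for (; i < keys.size(); ++i)                   // scalar tail for leftover keys
    lane_hist[0][keys[i] & (BUCKETS - 1)] += 1;

  for (int b = 0; b < BUCKETS; ++b)              // the extra add that merges the copies
    for (int l = 0; l < LANES; ++l)
      hist[b] += lane_hist[l][b];
}
```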
Alright, real quickly, I'm not going to go through how to do the joins in detail. In their comparison they're looking at the no-partitioning case, min partitioning, and max partitioning. Min partitioning is where you partition the build side of the join and build one hash table per thread; max partitioning is sort of like the Grace hash join, where you partition both the probe side and the build side of the join operator. Both of those are fully vectorized, and this one is only partially vectorized. The main takeaway is that with full vectorization and partitioning both sides, you get the best performance over all the other variants. If you try to do the Grace hash join with just scalar operations, you actually end up performing worse than everyone else, and again, that's not surprising, because you're doing more work to end up with the same result as the others. And again, if you look at the size of the keys and payloads, they're 32-bit integers for both keys and payloads, so this isn't realistic; you couldn't actually do this in a real system. I'm also only showing the Xeon Phi numbers here; I forget what happens on the regular Xeon, but I think the vectorization may not make as big a difference there.

Alright, so I sort of rushed through the end, but as I was going along, I think the way this paper describes how to do vectorization is really interesting, including in aspects outside of database systems, so I think it will be relevant to your careers once you leave CMU and go on to other things. But the main takeaway I want you to remember, beyond just the size of the keys and payloads, is that none of this actually works unless everything fits in your CPU caches, and as you can imagine, in a real-world system that's going to be very unlikely. So none of these algorithms would actually work directly as they're written; you have to do some extra stuff, which I'll talk about on Tuesday, to make this actually work with databases or data sets that exceed the size of your CPU caches.

The other aspect that's very important is that all the other parallelization techniques we've been talking about for our algorithms are still applicable; we can combine them with the vectorization stuff we talked about here, and the speedups are multiplicative. So if our algorithm can run on 32 cores and we get a 4x speedup from vectorization, we could get a 128x speedup overall. That's the theoretical upper bound; obviously it's not always going to happen, because going from one operator to the next there's a lot of marshaling of data back and forth, and not everything is going to be vectorizable; sometimes you have to do sequential operations to put the data into the order you need. Also, there's no reason a compiled plan, using all the stuff we talked about last class, couldn't also use vectorized operations. To the best of my knowledge, though, our system is the only one that actually does this. HyPer does compiled plans, or rather a combination of interpreted plans and compiled plans, and they don't do any vectorization in the compiled parts. Vectorwise doesn't do explicit vectorization; they rely on the compiler to do that for them. And as far as I know, because of this CPU cache size limitation, nobody else actually does all of these things. On Tuesday next week I'll spend the first half talking about another vectorization technique, and then I'll talk about the technique we've developed here that combines compilation and vectorization for data sets that exceed your CPU caches. The secret sauce that's going to make this work is software prefetching, which we haven't really talked about in this course.
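As a small preview of that, here is a sketch of what software prefetching looks like in a probe loop, assuming GCC/Clang's __builtin_prefetch; the prefetch distance and the table layout are illustrative, not the design I'll present on Tuesday.

```cpp
// Sketch of software prefetching: compute the hash for a probe key several iterations
// ahead and prefetch its slot, so the cache miss overlaps with work on earlier keys.
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr size_t DIST = 8;                       // assumed prefetch distance (tuning knob)

inline size_t hash32(uint32_t k, size_t n) { return (k * 0x9E3779B1u) % n; }

size_t probe_all(const std::vector<uint32_t>& table, const std::vector<uint32_t>& probes) {
  const size_t n = table.size();
  size_t matches = 0;
  for (size_t i = 0; i < probes.size(); ++i) {
    if (i + DIST < probes.size())                // prefetch the slot we will touch later
      __builtin_prefetch(&table[hash32(probes[i + DIST], n)], 0 /*read*/, 1 /*low locality*/);
    matches += (table[hash32(probes[i], n)] == probes[i]);   // the actual random access
  }
  return matches;
}
```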
It's also not really used, as far as I know, in any other in-memory database system, but we're finding that it's actually the way to make this all work. In terms of project three, there's a bunch of stuff coming up in the next couple of weeks that you need to be aware of. First, we'll have the first code submission for doing reviews on April 11th, and I'll put up instructions on how you match up with your team and what you're expected to do. Then on the 13th there won't be a class; instead, we'll arrange 10-minute meetings, or maybe 15 minutes might be better, for every group to come meet me in my office. We'll meet individually and talk about the status of your project, what problems you're having, and where you think you can go from here. Then when we come back on the following Tuesday, we'll have the in-class updates, where you present just like before with your proposals and tell everyone what the status of your system is, what things you've fixed, and what problems you're having, so again everyone knows what's going on. I think the first review will be due on the 18th as well, since it's due seven days after you submit your code to each other. Okay, any questions? So again, I will send the email out today; there are more bug fixes we have to take care of, including how to rebase and get the latest version. I know some of you also need the new catalog; you should talk to those guys in the back about how to get their code so you can start playing with it. We're pretty close to merging it in, but it may not be ready for everyone else; if you do actually need it, you can get started on it now. Okay, alright guys, have a good weekend, although it's going to snow tomorrow, and I'll see you all on Tuesday.