So today's class, we're continuing with the more advanced topics on execution. Today we're talking about vectorized execution, which is about how to parallelize on a single core, whereas in the previous lectures we've been parallelizing across multiple cores. In my opinion this is one of the most advanced topics in the class, because the SIMD stuff at a high level is easy to understand, but seeing how you actually apply it to relational operators in a database system can be tricky. We'll start with some background about what vectorization looks like and what the hardware looks like, and then we'll get into the details of the algorithms you read about in the paper from Columbia. In my opinion that paper is a really good instructional guide on how to build vectorized algorithms for a database management system, but the spoiler, and we'll see this as we go along, is that it doesn't actually work. It doesn't work for two reasons: one we've already talked about before, and one will come up as we go along today. The paper you'll read for Monday's class (today is Wednesday) is our implementation of a vectorized database system, Prashant's work, showing how we can still do some of the things the Columbia folks describe even though there is a restriction that they ignore. The high-level idea for today is that we want to vectorize the algorithms in our database system. What does that mean? We want to take an algorithm in our system that would normally be implemented by performing a single operation on a single data item at a time, and generate a vectorized version that can apply one operation to multiple pieces of data at the same time. Again, this is different from what we talked about before: parallel joins, concurrency control, and logging in parallel were all about taking an algorithm or component of the database system and scaling it out horizontally across multiple cores to get parallelism that way. Now we're taking a single core, which could still be running inside one of those parallel join algorithms, and showing how we get parallelism within that single execution thread. The reason this matters is that the speedup from vectorization is multiplicative with the speedup from multi-threading. Say we have some algorithm, it doesn't matter what it is, that we can parallelize to run concurrently on 32 cores. If each of those cores has a four-wide SIMD register, meaning we can process four data items with a single SIMD instruction, then each thread gets a 4x improvement on top of the 32x improvement from multi-threading. Our total speedup is 128x, more than two orders of magnitude. You don't see that very often in algorithms or in systems. That potential speedup is massive, and it's why we want to do vectorization.
Now of course, as we saw when we talked about joins, there are often times when the threads have to synchronize or write into a shared buffer, so we're not always going to get a 4x improvement even though we can process four things at a time in our SIMD registers. That's the upper bound of what we can achieve; in practice it will be much lower, because not every instruction we execute can be vectorized, but it can still help a lot of things. We introduced SIMD a couple of classes ago, and today we'll go into more detail. For our discussion I want to look more closely at what modern CPUs look like and how we can design algorithms that take advantage of how they actually execute instructions, so that our vectorized algorithms avoid the pitfalls and performance problems you hit if you don't use the hardware correctly. There are essentially two classes of CPUs. The first is what we most often think of as a CPU, like an Intel Xeon or an AMD Ryzen. These are characterized by a small number of high-powered cores: high-powered meaning they have a more complex instruction set and can do more complex things than, say, your cell phone processor, and they obviously also use more energy. The latest Intel Xeons are emblematic of these high-powered multi-core CPUs. Skylake is, I think, the previous generation and Kaby Lake is the current one; the paper, I think, runs on Haswell, which is a few generations back, but it's the same class of CPU. The key distinguishing characteristic of these CPUs is that they are superscalar and support very aggressive out-of-order execution. Everyone has probably taken a basic architecture course at some point, but as a refresher: superscalar means the CPU can execute multiple instructions from its pipeline within the same clock cycle, and out-of-order means it does not have to execute the instructions in its pipeline in the order they were issued when you run your application. The idea is that the CPU wants to be doing useful work at all times. Rather than executing instructions strictly one after another, if it knows an instruction is going to block because of a cache miss or because it has to go out to memory, it can peek ahead and execute other instructions that hopefully don't depend on the output of the instruction it's stalled on, and run them in parallel. At the end you still get the same correct answer; you're just utilizing the CPU better because you can jump ahead instead of waiting on whatever is in front. To make this out-of-order execution work, the CPU needs extra mechanisms to figure out the dependencies between instructions, so that if something relies on the output of something else, it isn't executed preemptively; it has to wait for the instruction it depends on to finish first.
These high-powered CPUs also try to predict which branch you're going to take, like whether you'll go into the body of an if clause or jump around it. They do this so they can execute instructions before you actually need them, in the hope that you will need them, so by the time you reach that point the result is already done. If the branch is mispredicted, the CPU has to undo those changes and go back to executing things sequentially. This is why people say that if you have an if branch inside a tight loop, or make a function call that jumps to another location in the address space, your program slows down: these CPUs are trying to do all this out-of-order work. We'll see later how to design our vectorized algorithms to be mindful of this and avoid these performance problems. The other class of CPUs, which the paper discusses, is the many integrated core, or MIC, class. These are CPUs with a larger number of low-power cores, more cores than you'd get in a Xeon. The latest Xeon gets you something like 20 or 24 cores on a single CPU (I forget the exact number), whereas these MICs have something like 64 or 72 cores on a single CPU. Each core uses less power and takes less physical space on the socket than the high-powered cores on the previous slide. But to make them useful, they expand the instruction set with new SIMD operations, because that's important for the kind of processing you'd want so many cores for. The processor the paper targets is the Intel Xeon Phi. You can more or less think of it like a GPU, except that the Xeon Phi actually runs x86 instructions, so you could compile your program for a Xeon and run it on the Xeon Phi without many changes. One thing I want to say up front about this paper, and we'll come back to it: the paper says their version of the Xeon Phi was not superscalar and only supported in-order execution. That's because when the paper was written in 2015 they were using the first version of the Xeon Phi, whose cores are based on the P54C microarchitecture, which is the original Pentium from the 1990s. That's how things were up until 2016. If you Google what the latest Xeon Phi microarchitecture looks like, it is now superscalar and supports out-of-order execution like the regular Xeon. So the paper used the older version, and the newer versions look much more like a Xeon; Intel reports something like a 3x improvement over what the old one could do thanks to out-of-order execution. The cores on the Knights Landing version are roughly equivalent to an Intel Atom processor now.
Again, it's not as full-powered as a Xeon, but it makes up for the lower clock rate, smaller CPU caches, and so on by having expanded support for SIMD. If you've never seen a Xeon Phi, you can actually get them in different form factors. The first is almost like a GPU: something that sits on the PCI Express bus, with a host CPU that issues commands or instructions down to it; the Phi crunches on the data and sends the results back. I believe this version has on-board memory you load things into; it can't read the host's memory directly, and it's not cache-coherent with the memory on the host CPU. But you can also get what are called self-boot versions where the Phi is the host itself: it sits in a socket like a regular Xeon, runs your operating system, and is the control CPU for the entire machine. There are two of those: one that just sits in the socket, and another with an integrated fabric interconnect coming off the chip, so you can do RDMA-style fast memory connections with other machines without going through a separate bus; it streams directly off the CPU. Think of these as GPUs, except instead of thousands of cores that can only do really limited things, you get something closer to full-fledged Intel x86 cores, just a lot more of them. I think the latest Knights Landing gives you around 72 cores. So again, for this paper the Xeon Phi is non-superscalar with in-order execution, and we'll see performance results that reflect those characteristics, but the latest versions look like regular Xeons. We talked about SIMD a little bit last class; we'll go into more detail now. SIMD is a class of CPU instructions that lets us execute vectorized operations: take a single instruction and multiple data items, and apply that operation to all of them at once. Pretty much every major chip vendor has their own version of this. AMD has its own names for things; Intel started with MMX, and the latest version is AVX-512, where the 512 means 512-bit registers, which is what we're going to need for the parallel sort-merge join. PowerPC has AltiVec, and ARM has its own SIMD instructions called Neon. Here's the same example we saw last time, where we want to compute x + y = z: add two vectors together and write the result into a new vector. The simplest way to implement this is a simple for loop that iterates over every element, assuming x, y, and z are the same length, adds the pairs together, and writes them out. The SISD version (single instruction, single data item) takes that for loop literally: go through every offset of the two vectors, add the two elements, and produce a single output. With SIMD, you can instead group four elements together and write them into a single SIMD register. In this case we're using 128-bit SIMD registers and assuming 32-bit integers, so we have four lanes into which we can write four different numbers.
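As a minimal sketch (my own, not the slide code), the scalar version of that example is just the loop being described, assuming x, y, and z all hold the same number of 32-bit integers:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar (SISD) version: one add per element, one element per instruction. */
void add_scalar(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```

The SIMD version runs this same loop four lanes at a time; we'll see an intrinsics version of it in a bit.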
Each slot is called a lane in SIMD parlance. With a single SIMD instruction, we can add the elements at the same lane offset across the two vectors and write the result out, then gather up the next four elements, put them into our register, and do it again. This will come up a little later, but one of the things SIMD gives you is fine-grained control over whether and where you write things into your cache. As I'm adding these numbers together and need to get the result out of the SIMD register, I can either put it into the CPU cache if I need to process it again, or I can have the CPU write it directly to memory, bypassing the cache entirely. I'd want to do that if I know I'm not going to read those values again anytime soon. (Can you guys close the door? Thanks.) This is useful for things like joins: you do your SIMD operation, you produce some output, and you know you're not going to read it back for a while because you still have to process the rest of the data, so you tell the CPU that when the values come out of the register they should go directly into memory. That's the streaming store; you do it to avoid polluting the cache, and it will come up again in a few of these algorithms. In the paper you'll read for Monday, this is something you do at the end of a pipeline breaker: when you reach the end of the pipeline and you have your tuples in your vector, you don't want to pollute your cache, you just want to write the result out to memory and go process the next vector. Question: does "direct to memory" mean without the cache? Yes. There's a class of instructions called streaming instructions (that's actually the next slide). Streaming here doesn't mean streaming video or a streaming data system; it means that instead of writing to the CPU cache and letting it eventually get flushed to memory when the CPU decides, I can tell the CPU to write directly to memory and avoid the cache entirely. I mentioned last class that Intel first put out their MMX instructions in the 1990s, but they were pretty much unusable from a database system perspective: they were difficult to use, you had to write assembly, and you didn't have intrinsics to make it easier. It wasn't really until SSE came out in 1999 that this became something we could apply to a database system and leverage. The first version of SSE had 128-bit SIMD registers, and in a few slides we'll see how, over time, they expanded the number of items you can store in a register and the operations you can do on them. The classes of SIMD instructions are the standard things you'd expect: moving data in and out of registers with loads and stores, getting things from your cache into your registers; all the arithmetic operators like addition, subtraction, and multiplication; and logical instructions for Boolean logic.
But the ones most interesting to us are the shuffle instructions, which let us move data from one SIMD register to another directly without having to put it back in the CPU cache. This is going to be really important, because what we want is to load our data into SIMD registers and keep it entirely in the SIMD registers; then we don't have to transfer anything back to our caches and it'll be really, really fast. The shuffle operations let us do this. There are also conversion instructions, because how the CPU represents things in regular x86 registers versus the SIMD registers is slightly different, so there are instructions to transform between them. And as I said before, there's cache control: when you move something out of a SIMD register, do you put it in the CPU cache or directly into memory? Depending on which algorithm we're running in our database system, we may or may not want to bypass the cache. I like this table because it shows the history of Intel's SIMD development. It's from James Reinders, a former Intel director who was one of the lead people promoting and working on SIMD at Intel for several years. As I said, it started in 1997 with MMX, which was limited to 64-bit registers; the SSE instructions went to 128 bits; AVX came out in 2011 with 256 bits; and now we're at AVX-512, which just came out last year. You can also see how, as the registers widen, the number of lanes they can support for single- and double-precision floating point grows; they've always supported integers. I'm often asked whether we should expect 1024-bit SIMD registers, and as far as I can tell from browsing the internet, there's no fundamental reason they couldn't do it. It takes more real estate on the chip and becomes more expensive to implement, and from a database perspective 512 bits is exactly what we need. Once we have that, we can start parallelizing more things. Question: why is 512 exactly what we need? Because, remember, when we talked about the parallel sort-merge join we said you have 64-bit keys and 64-bit pointers to tuples, so every tuple you want to join can be represented in 128 bits, and the bitonic sort network operates on four elements at a time: four 128-bit items is 512 bits. If you had 1024 bits you could sort even more entirely in SIMD registers, so it would certainly be an improvement, but it's not a fundamental limit the way 512 is: without 512 bits we can't do it at all, and with 512 we can. And this is actually one of the big limitations of this paper, and the same thing we saw in the parallel sort-merge paper: they assume 32-bit keys and 32-bit pointers to tuples, so every object they deal with is 64 bits. That fits nicely in 256 bits, because you get four 64-bit integers, whereas we really need four 128-bit items, which is 512 bits. Now you may be thinking: if we want to vectorize everything, and we want a lot of cores and a lot of these wide registers we can pack things into, isn't that essentially what a GPU is? A GPU is a lot of cores processing things in parallel in vectorized form, so why not just use that instead of all this Intel SIMD stuff?
Well, the answer is that, like the Xeon Phi sitting on the PCI Express bus, a GPU is not cache-coherent with the in-memory database up on the CPU. That means any time you want to run a query on the GPU, you have to stream everything down to the GPU, crunch on it, and stream everything back up. As far as I know, that's why none of the major commercial database systems support GPUs for native execution. In the last two years, though, a new group of database systems has come out that are designed specifically for GPU execution. Probably the most famous are MapD and Kinetica (Kinetica used to be called GPUdb). As far as I know, these systems try to load as much of the database as possible, if not the entire database, onto the GPU and run all your queries there. Again, because GPU cores are much simpler than what you get in an Intel Xeon or Xeon Phi, they can't do complex branching and other things very well, so you can't really have an index; it would be hard to maintain. In the case of MapD, they just run everything as a sequential scan: you load the entire database onto the GPU, and whatever your query is, it scans the whole thing, and because there are so many cores it screams. There are a few other systems working in this area, so we'll see how these startups do, but in general the major commercial database systems don't actually use GPUs. One thing that may change this is new co-processors that are cache-coherent with the CPU. The easiest one to understand is the APU from AMD: it's essentially a GPU that sits on the socket with the regular CPU cores and has access to all of the main memory of the regular CPU. And again, it's cache-coherent, meaning that if a transaction updates some memory location, the APU will see that, so you could push query processing down to the APU. There was another version, I think it's out of date now, where Intel had a prototype sitting on the socket, but it got discontinued. You could do something like this, and there are some papers in this area, but as far as I know nobody is actually looking at it. Question: why don't they just put the database on a CPU that has the same number of cores as a GPU? Because a GPU has thousands of cores, and you can't buy a CPU like that. It's a good question, and I have questions too; I want to learn more about these things. We'll cover a little bit about GPU databases at the end of the semester, but if you're not graduating and you're coming back in the fall, I'm organizing another seminar series for databases. This one will be entirely about hardware-accelerated databases: all the companies we just talked about, the major databases on GPUs, databases on FPGAs, databases on ASICs, they're coming in the fall, so come check it out. And if you graduate and leave CMU, everything will be on YouTube; you can watch it at home in your underwear. So now that I think I've convinced you that vectorization is a good idea, that it can potentially help, and that it's something we're interested in, the question is how we're actually going to do this.
Right, we're the database developers, we get paid a lot of money: how do we actually build a system that can take advantage of SIMD? At a high level there are essentially three ways to do this. At the top is automatic vectorization, where you just hope the compiler does it for you. In the middle, you provide hints to help the compiler figure it out. And at the bottom, we explicitly write either intrinsics or assembly to perform exactly the vectorized operations we want. I'll go through each of these in more detail, but the way to think about it is as a spectrum of how easy it is to implement versus how much control we have over the vectorization. At the very top it's easy to use, because you just pass a flag to the compiler and hope for the best, but the compiler isn't good enough to find all the opportunities for vectorization, so you don't have complete control and you may not get good performance. At the bottom, if we write assembly we can do exactly what we want, and writing intrinsics is just as good, so we have complete control, but it's harder to write. The paper you read from Columbia, if you looked at the appendix, is full of code written with intrinsics; that's how they wrote it, explicitly, and that's also how we do it in our system. So let's go through each of these. With automatic vectorization, we pass a flag to the compiler and hope it can find loops in our code that are eligible for vectorization, and that it knows how to convert the scalar code in the loop into a vectorized version. This only works for really simple loops, and in the context of a database system the amount of vectorization you get by hoping the compiler figures it out is quite low. That's because the kinds of loops and the kinds of things we do in a database system are very complex, and the compiler won't know how to reason about them and identify how to vectorize them. But even with simple loops it often can't vectorize, because it has to make sure that vectorizing the code doesn't have unexpected side effects, and that's really hard to figure out. Let's go back to the example from the beginning where I wanted to add two vectors together and write the result into another vector. Say I encapsulate this in a function that takes pointers x, y, and z as arguments and just iterates, computing x + y = z. If anybody has taken a compiler class, can you tell me whether this can be vectorized automatically by a compiler? He says no. Why? [Student: the compiler doesn't know that this iteration doesn't depend on the previous one.] Right, the compiler doesn't know that an instruction within the loop doesn't depend on a previous one. Can you be more specific?
[Student: z could be the same pointer as x or y.] That's right. You can't vectorize this, because these pointers might actually point to the same locations in memory. This is not like loop unrolling, where I could just make copies of the body for i+1, i+2, i+3 and we're guaranteed no side effects as long as we execute them in sequential order. As he said, z might point to the same memory locations as x or y, so the effect of one iteration could affect the next iteration. Say z is just x offset by one element (four bytes for 32-bit values). In that case, if I vectorize this automatically, I won't get the same result as if I executed it sequentially, because the SIMD operation happens all at once and won't see the effect of the previous iteration. This is a byproduct of writing things in C or C++: when we write it this way, we're describing a sequential ordering of instructions to the compiler, but the compiler doesn't know anything about the dependencies between the memory addresses, and therefore it can't automatically vectorize this. It's a good example of why, if you just hope the compiler figures it out, it probably won't, because it knows nothing about this memory. The way we help the compiler is to provide hints: extra information about our code and about the memory addresses we're going to operate on, so it knows it's safe to vectorize. There are two approaches. The first is to provide information about the memory addresses you're going to use in a loop. The alternative is to just tell the compiler to skip checking for dependencies and vectorize whatever it can. For the first approach, you can use the keyword restrict; I think it's in the C99 standard and not actually in any C++ standard, but as far as I know every major compiler supports it. With restrict you're telling the compiler that the memory locations behind these pointers are distinct and don't overlap, so there can't be any cross-iteration side effects in your loop. Essentially, restrict says that for a pointer of a given type, say an int*, you're guaranteeing it won't overlap with the others, and therefore all loads and stores to the addresses it points to will go through that pointer and not some other one. That's one way to do it. The other way is to use pragmas that tell the compiler to skip checking for dependencies between the vectors and try to vectorize everything. Here you'd use #pragma ivdep, meaning ignore vector dependencies, and the compiler knows that in the for loop that follows it doesn't have to worry about dependencies and can try to vectorize it. There are other pragmas, like OpenMP's, or #pragma simd, that do the same thing. Again, these are only hints: just because we tell the compiler it can ignore vector dependencies doesn't mean it will actually vectorize everything.
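As a sketch of what those two kinds of hints look like on the add example (my own code; restrict is the C99 keyword, spelled __restrict on most C++ compilers, and the pragma shown is the ICC spelling, while GCC's equivalent is #pragma GCC ivdep):

```c
#include <stddef.h>
#include <stdint.h>

/* With restrict we promise the compiler that x, y, and z never point to
 * overlapping memory, so it is free to auto-vectorize the loop. */
void add_restrict(const int32_t *restrict x,
                  const int32_t *restrict y,
                  int32_t *restrict z, size_t n) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

/* Alternatively, tell the compiler to skip its dependency checks for the
 * next loop. If our no-aliasing assumption is wrong, we silently get the
 * wrong answer. */
void add_ivdep(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
#pragma ivdep
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```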
The other thing is that with hints, and also with explicit vectorization, it's up to us as programmers to know what we're doing and to guarantee that we get the correct result. The compiler wants to be super conservative; it doesn't want to generate code that produces the wrong result. But once we start using these hints, it's not going to check all of this for us, so we could end up with a side effect we don't actually want. And since we're building a relational database system, it's really important that we produce correct results. Unless you're doing approximate query processing, nobody wants to run a SELECT query and get the wrong answer; people get upset about that. For machine learning it's okay if the weights drift and things run asynchronously, because there's no single exact final answer, just confidence in the result; here we're dealing with hard values, so we need to make sure we compute the right ones. So it's up to us, and it's super important, to make sure we're doing things correctly. The last approach is explicit vectorization using CPU intrinsics, which I think we talked about before when we covered compare-and-swap. Intrinsics are essentially syntactic sugar in the compiler that let us invoke explicit instructions: rather than dropping down to assembly, we use these intrinsics that look like functions even though they're built into the compiler. These are going to be hard to write because the syntax is kind of gnarly; I'll show an example on the next slide. Every time I write SIMD code I have to go look at the Intel docs, because everything is always abbreviated and you have to remember what the hell each one actually does. For simple things it's okay; for more complex things it gets harder and harder. The other problem is that these are potentially not portable, because you're invoking explicit instructions. If you write AVX-512 intrinsics and then compile on a machine that only has AVX2, the compiler will throw an error, because it doesn't know how to map what you want onto the older instruction set. We see this sometimes in our system: on Jenkins, our build farm, I think we ended up on a VM that only had SSE3, not SSE4, and we have SSE4 intrinsics in our code, so it throws a compiler error because it can't find those instructions. So these are not portable, and furthermore they're not portable across architectures: if you compile for POWER or ARM and use their SIMD intrinsics, it definitely will not work on Intel CPUs. There are some libraries that try to help with this, but I don't think any of them are widely used. Question: is it portable at the operating system level, say between Linux and OS X? If it's the same compiler and the same CPU, yes; the CPU is the main thing.
Question: the first approach, automatic vectorization, relies on the hardware supporting vectorized instructions; is that true for all of these? Yes, that's true for all of them, although in the case of hints, if you use #pragma ivdep and the CPU you're running on doesn't have SIMD instructions, the compiler should just ignore it. Another question: how does the compiler actually auto-vectorize a loop like this; does it convert it to intrinsics? Intrinsics are just there for us as humans, a way to invoke the instructions; they're basically synonyms for assembly. So here's an example of using SSE intrinsics to do the same vectorized addition. We still pass in the same pointers to our arrays, but now we cast them into the 128-bit integer vector type, and in the for loop, instead of iterating over every single element, I iterate four at a time. (There should be increments on the other pointers here too; as written on the slide this won't actually work, so assume they advance as well.) I pack four elements into my register, invoke the intrinsic that does the addition on these vectors, and then store the output of the addition into the output vector. As I said before, it's all these _mm_store-style names with leading underscores; all the intrinsics look something like this. Any questions about how we write vectorized code? From this point forward we're going to assume we do explicit vectorization, because we need complete control over everything we're doing. All right, so now that we know how we're going to write our vectorized operations, at a high level we need to think about how to organize our vectorized algorithms to make use of these instructions. There are two ways to do this, and this is called the direction of vectorization. The first approach is horizontal vectorization, where you take all the elements within a single vector, apply some operation to them, and produce a single result. Say I have a four-wide SIMD register: I take all the lanes and add across them to produce a single result, so 0 + 1 + 2 + 3 produces 6. This kind of vectorization is only really found in the newer instruction sets; I think SSE4 and AVX2 have it, but the earlier ones didn't. What's more common, and what's used almost exclusively in the Columbia paper, is vertical vectorization, where you perform the operation across elements in the same lane of different vectors and produce the output in a new vector. For a SIMD add, I compute 0 + 1, 1 + 1, 2 + 1, 3 + 1: I add the items in the same lane and produce a vector of the outputs. With explicit vectorization we can start building up basic primitives in our system and then compose them to do the more complicated things our algorithms in the database system need.
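Going back to that explicit add example for a second, here's a cleaned-up sketch with the missing increments put back in (my own code, assuming n is a multiple of four and 32-bit integers); it's also a concrete example of vertical vectorization, since each lane adds the elements at the same offset in the two inputs:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

void add_sse(const int32_t *x, const int32_t *y, int32_t *z, size_t n) {
    /* Process four 32-bit lanes per iteration. */
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i xv = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i yv = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i zv = _mm_add_epi32(xv, yv);          /* lane-wise add */
        _mm_storeu_si128((__m128i *)(z + i), zv);
    }
    /* (A real version also needs a scalar tail loop for the n % 4 leftovers.) */
}
```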
What I really like about this paper is that it starts off with low-level vectorized primitive operations for predicate evaluation, then shows how to build on them to do sorting and merging, which we more or less covered last class, and then composes all of these to build more complex data structures like multi-way trees and bucketized hash tables for hash joins and things like that. The paper lays out principles for developing the efficient vectorized algorithms we need to actually process queries, and there are two main high-level ideas that apply to all the algorithms they develop. The first is to favor vertical vectorization by processing different input elements within each lane, working across registers rather than within a single register. The second is to maximize the amount of useful work done within the lanes by always processing data that produces a new result. We'll see this later when we talk about hash probes, but the way to think about it is: whenever I execute an instruction on a vector in my SIMD register and produce output, the next instruction I invoke should compute something new for all of my lanes, rather than reusing some lanes or reprocessing the same input I just handled. They do some extra bookkeeping to keep filling the lanes with new data, so that every single instruction does something useful. This may not make sense at a high level now, but it will be obvious when we get to the hash table. Before we get to the more complex things, let's talk about the fundamental operations they build upon: selective load, selective store, and selective gather and scatter. For selective load, the basic idea (and again, this is a basic building block we need for the vectorized algorithms) is that we take the contents of some location in memory and write them into a vector, one of our SIMD registers. The vector at the top is the target we want to write into. We have another vector that is the mask, telling us whether we should write something into a particular lane. The mask is just a bitmap of 0s and 1s: a 0 means we don't write anything into the corresponding lane, a 1 means we do. So starting with the first lane: it's 0, so we skip it. When we come to a 1, we want to write something, so starting from the given memory address and moving along, we copy the next unconsumed element into that lane, and we do the same for all the other lanes: the next one is 0, so we skip it, and for the next 1 we take the next element in memory and write it there. This can always be done by reading data within a single cache line, so it should be fast: you can't jump way far ahead to some other location in memory, and at most these four contiguous elements will already be in your cache.
Selective store is essentially the reverse. At the top we have the memory location we want to write into, then a bitmap mask again, and then the data stored in our vector, and again the 0s and 1s say whether each lane should be written out to memory. The first lane is 0, so we skip it; the next is 1, so we take the second lane and write it to the first position in memory; the next is 0, skip it; and the next 1 means we take the fourth lane and write it into the second position in memory. As far as I can tell, this is not supported on Xeons; I looked this weekend to see whether the newer Xeons support these operations. The problem is that if you Google "selective store," only three things show up: first the Columbia paper, second my slides from last year, and third a dude in Korea who stole my slides from last year. So as far as I can tell it's not directly supported on the Xeon, but they can emulate it with vector permutations; the permutations are done in SIMD, it's just not a single instruction that does this. Question: how do you specify the memory address? For the instruction itself, you say: here's my target vector, here's my mask vector, and here's my starting address in memory. Is the memory in the form of a vector? No, the vector is the SIMD register; the memory is DRAM or the CPU cache, you just give a starting point and it reads or writes sequentially from there. So yes, the memory address is specified as an argument. The next operations are scatter and gather. One thing I'll say: if you come from a distributed database background, scatter and gather mean something different from what we're talking about here. In a distributed database, scatter/gather is when you take a query, break it into subtasks, scatter them across a bunch of nodes, and they all process locally; we sort of saw that when we did the parallel join algorithms. Here it means something different. For a selective gather, at the top we again have our target vector, but instead of a mask vector we have an index vector that gives, for each lane, the offset in memory we want to read from. We have a starting address, offset zero, and the indexes are offsets from it: the first lane wants the element at offset two, so that content gets written into the first lane, and likewise for all the other lanes. The idea is that our data is laid out in memory in some other form, and we want to align it the way we're going to need it in our vector; this operation does that for us. Scatter is the reverse: it takes an index vector and a value vector, and the index tells us what offset in memory each lane's value should be written to. The first lane says it wants to write its value to position two, so it goes there (zero, one, two); the next one says position one, so it goes there; and so on. Is it clear what's going on here? Again, it's just a different way to get data in and out of memory and our registers, and it puts the data into the form we're actually going to need.
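To pin down what these four primitives compute, here's a scalar reference sketch of their semantics (my own illustrative C, not the actual instructions; on real hardware AVX-512 exposes these as masked expand-loads, compress-stores, gathers, and scatters, and older Xeons have to emulate the first two with permutes):

```c
#include <stdint.h>

enum { LANES = 4 };  /* assume 4-wide vectors of 32-bit values */

/* Selective load: fill only the lanes whose mask bit is set, consuming
 * consecutive elements starting at *src, one per enabled lane. */
void selective_load(int32_t vec[LANES], const uint8_t mask[LANES], const int32_t *src) {
    int pos = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (mask[lane]) vec[lane] = src[pos++];
}

/* Selective store: write only the enabled lanes, packed contiguously into *dst. */
void selective_store(int32_t *dst, const uint8_t mask[LANES], const int32_t vec[LANES]) {
    int pos = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (mask[lane]) dst[pos++] = vec[lane];
}

/* Gather: each lane reads from base[index[lane]]. */
void gather(int32_t vec[LANES], const int32_t *base, const int32_t index[LANES]) {
    for (int lane = 0; lane < LANES; lane++)
        vec[lane] = base[index[lane]];
}

/* Scatter: each lane writes its value to base[index[lane]]. */
void scatter(int32_t *base, const int32_t index[LANES], const int32_t vec[LANES]) {
    for (int lane = 0; lane < LANES; lane++)
        base[index[lane]] = vec[lane];
}
```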
This will be useful when we actually build the algorithms we're going to need. Question: when do we need to scatter values according to an index like this? We'll see it in some operations later on. The issue, to be up front about it, is that gathers and scatters are not really executed in parallel in the way I'm showing, with all the arrows happening at once. There's a fundamental limitation of the CPU's L1 cache: you can only do one or two distinct accesses per cycle. So going back to the picture where I take four different locations in memory and write them into my register, I can really do at most two per cycle, because that's the limitation of how the L1 works. From the programmer's standpoint it looks like a single atomic operation, but under the covers it can take multiple cycles. And of course, with gather and scatter you may not be touching a single cache line: the memory addresses can be spread out and may not fit in one cache line. The other issue is that gathers are only supported on newer CPUs; I think the paper mentions that Haswell doesn't have this but the Xeon Phi does, and as far as I can tell AVX-512 does support it, so newer CPUs will have it. As I said before, the selective loads and stores are not actually supported on regular Xeon CPUs, but you can get a pretty close approximation using vector permutations. All right, now we can build on these primitives to construct a bunch of vectorized relational operators, the things you need to process relational queries. The most common thing people do in a database system is vectorized predicate evaluation: if a database system says it supports SIMD in query execution, it's probably doing this, and it's where you get the most bang for the buck. It's actually very hard to do SIMD hash joins; it doesn't work out very well. We already saw how to do SIMD partitioning, the paper talks about joins, we saw sorting before, and there are Bloom filters as well. Going forward I'll cover scans, hash tables, and partitioning and histograms in vectorized form. I've already said this, but the big limitation of the algorithms in this paper is that they assume 32-bit keys and 32-bit pointers, because in 2015 they didn't have AVX-512. The other big assumption is that the database fits entirely in your CPU cache, meaning the L3, and that's obviously not realistic. When you actually run these algorithms on data that exceeds the cache, SIMD doesn't help at all, because the cache misses are what kill you. As we'll see in the paper you're reading for next week, one way to overcome this is software prefetching, which is the technique we use in our system: you explicitly prefetch ahead to grab the data you'll need to process next, so that when you finish your current batch the next batch is already there; that hides the cache misses and lets you pretend everything fits in your CPU cache while doing vectorized execution. But you can't do this for everything. All right, so let's look at selection scans.
For these, I first want to show how to do predicate evaluation in a sequential scan with scalar code, the normal way people do it, the way you'd probably do it the first time you write a database system, and you'll see a fundamental problem that occurs in query processing; then we'll see the version that avoids it, which is also the one that works for SIMD. Take a really simple query over the entire table, no indexes: for every tuple, check that the key is greater than or equal to a low value and less than or equal to a high value. When you first build a database system, just like we did, the way you'd probably implement the sequential scan is essentially this: a for loop, or an iterator, over every single tuple in your table; you extract the key you need to evaluate; you check whether the key is greater than or equal to the low value and less than or equal to the high value; and if it matches, you drop into the if clause, copy the tuple into your output buffer, and increment your output offset by one, so that when you come back around you write out the next matching tuple after it. I realize we're reading code in class again, which I always say is terrible, but we kind of have to in order to understand what's going on. So what sucks about this? I already said it earlier in the class. [Student: the if clause.] Exactly, the branch. As this executes on a superscalar, out-of-order CPU, the CPU sees a pipeline of instructions (it doesn't matter whether the compiler unrolls the loop or not) and it has to guess whether the if clause will evaluate to true. If it guesses true, it goes ahead and preemptively does the copy and the offset update; if it's wrong, it has to throw that away, restart the pipeline, and proceed as if it had skipped the branch. For the extreme cases, where the predicate is 100% selective or 0% selective, we're golden, because the branch predictor will almost always get it right. It's when the selectivity is in the middle that the problems start: if every other tuple satisfies the predicate, the predictor is going to be wrong about half the time no matter which way it leans. That's bad, because we do work we don't actually need, and then we have to unwind the instruction pipeline and start over. So we want to get rid of this if branch. The way to do that is to just do the copy first, unconditionally, as if the predicate were going to be satisfied; then we extract our key and do the evaluation, but we're clever about how we do it: instead of an if clause, we use comparisons that evaluate to 0 or 1 (each can be done in a single instruction), AND the two results together, and that tells us whether to advance our output offset by one or by zero. So we always do the copy, evaluate the key, and if the predicate is true we move the offset forward by one; if it's not true, the offset stays where it is, so when we come back around the next copy just overwrites it.
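Here's a minimal sketch of the two scalar variants (my own code, assuming an array of keys and an output buffer of matching offsets; in the real system you'd be copying whole tuples or record IDs):

```c
#include <stddef.h>
#include <stdint.h>

/* Branching version: the if inside the loop is what the branch predictor
 * keeps guessing on. */
size_t scan_branching(const int32_t *keys, size_t n, int32_t low, int32_t high,
                      uint32_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (keys[i] >= low && keys[i] <= high) {
            out[m] = (uint32_t)i;
            m++;
        }
    }
    return m;
}

/* Branchless version: always do the copy, then advance the output offset
 * by 0 or 1 depending on the predicate. A non-matching copy is garbage,
 * but it either gets overwritten or falls beyond the returned count m. */
size_t scan_branchless(const int32_t *keys, size_t n, int32_t low, int32_t high,
                       uint32_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        out[m] = (uint32_t)i;                       /* unconditional copy  */
        m += (keys[i] >= low) & (keys[i] <= high);  /* 0 or 1, no branch   */
    }
    return m;
}
```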
Of course, you need a little fix-up outside the loop so that the very last copy isn't incorrectly included when it shouldn't be; otherwise the last tuple would always show up in the output no matter what. But this does essentially the same thing as the if clause, just without the branch, which is what we want on a superscalar, out-of-order CPU. [Student: but we're executing more instructions, doing more copies.] Yes, but as I said, depending on what the branch predictor does, the CPU might be doing that copy speculatively anyway, and this way we never pay for the hardware rollback. There's a great graph from, I think, a VectorWise paper from a few years ago, and I've seen it reproduced in other experiments, comparing the branching and branchless versions of a sequential scan operator. It's a little hard to see, but the red line is the branchless (no-branching) case: as you'd expect, the cost is the same no matter what the selectivity of the predicate is, because you're always doing the copy, so it's essentially a straight line. In the branching case, at the extremes, when the selectivity is at or near 0% (or near 100%), it performs best, because the CPU's branch predictor is almost always getting it right; but there's a huge arc in the middle where it's actually worse, and that's where the hardware is mispredicting which branch to take. 50% selectivity is the apex of the arc, because the predictor is potentially wrong half the time and you end up rolling back and undoing all that work, or doing the copy anyway. Again, this is a good example of being aware of the hardware we're running on and designing the algorithms in our database system to use it correctly. Now we apply the same technique to a vectorized sequential scan, where we don't want any if clauses at all. For this I'm using very loose pseudocode to simplify the explanation: instead of writing out the intrinsics, I have functions like simd_load and simd_store (assume they invoke the selective load/store and vector intrinsics we talked about before), with a subscript to indicate that we're operating on a vector. I can do essentially the same thing as before: I always copy the tuples, I load a vector of the keys I want to compare, I figure out which of those keys evaluate to true, I use that to figure out which offsets those tuples are at, and I only retain those in my output; when I loop back around, I overwrite the ones that shouldn't be there. So at a high level it's the same as the branchless scalar case, but done entirely with vectorized instructions. We want to structure it this way because loading data into SIMD registers has its own cost; if we don't get real work out of each vector, it can end up more expensive than the scalar case. So we process a vector of tuples at a time: we load them in, load our keys, do the evaluation with SIMD instructions, and then use our selective store to figure out what we actually want to retain.
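Here's a rough sketch of what that could look like with AVX-512 intrinsics, where a mask compare plus a compress store plays the role of the SIMD compare and selective store in the pseudocode (my own sketch, assuming n is a multiple of 16 and 32-bit signed keys; it is not the paper's code):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t scan_avx512(const int32_t *keys, size_t n, int32_t low, int32_t high,
                   int32_t *out_pos) {
    const __m512i lowv  = _mm512_set1_epi32(low);
    const __m512i highv = _mm512_set1_epi32(high);
    const __m512i iota  = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15);
    size_t m = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m512i kv = _mm512_loadu_si512(keys + i);
        /* keys >= low AND keys <= high, as a 16-bit lane mask */
        __mmask16 ge  = _mm512_cmp_epi32_mask(kv, lowv,  _MM_CMPINT_NLT);
        __mmask16 le  = _mm512_cmp_epi32_mask(kv, highv, _MM_CMPINT_LE);
        __mmask16 hit = ge & le;
        /* positions i..i+15 of this vector */
        __m512i pos = _mm512_add_epi32(_mm512_set1_epi32((int32_t)i), iota);
        /* selective ("compress") store of only the matching positions */
        _mm512_mask_compressstoreu_epi32(out_pos + m, hit, pos);
        m += (size_t)__builtin_popcount((unsigned)hit);
    }
    return m;
}
```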
To see at a high level how this works, let's run that same SQL query: SELECT * FROM table WHERE key >= 'O' AND key <= 'U'. Say my database is a single table with an integer key (you can also think of it as the offset or ID) and a value that's just a single character. I load the keys into my key vector in a register, do my SIMD compare, and that generates my mask, where the 1s tell me which lanes evaluated to true for my predicate. I can then use that, together with a vector of offsets, as a mapping table to figure out which tuple each lane corresponds to. Then I use the selective SIMD store: wherever there's a 1 in the mask, I match it up with the corresponding offset, and that tells me that after the predicate I want to retain the tuples at offsets 1, 3, and 4, which correspond to those positions in the table. So again: my table sits in memory, I load the keys into my target (key) vector, I do the SIMD compare against my predicate to produce the mask, and I use the selective store to extract the offsets I want. That becomes my output, and now I just have to retain those tuples. I like this because it's like a puzzle: how do you take these low-level SIMD constructs and do as much of the processing as you can inside the registers without bringing things back into the cache? Question: how do you actually compute this step? I'm showing pseudocode; yes, in the real code you have to add things together to form the offsets. All right, let's look at some performance numbers. This is from 2015, running on the older version of the Xeon Phi, which did in-order execution and was not superscalar, so be mindful of that. The way to read this: one side is the Xeon Phi with in-order execution, the other side is the Haswell Xeon doing out-of-order execution. The first result is the scalar version of the sequential scan with branching, and what you see is that, as expected, performance gets worse as the selectivity increases. On the Phi this looks exactly as you'd expect for in-order execution: as the predicate becomes less selective you execute more instructions, you copy more things, and performance goes down. On the Xeon side I would have expected the branching version to arc much higher than it does; it drops off at the high end because I think you're saturating your memory bandwidth. Now the scalar branchless version: on the Phi, because it's in-order execution and we're always copying things no matter what, this is, as expected, actually worse, since you're always doing more work. On the Xeon, the branchless version actually gets better, because we're not paying for mispredicted branches. The reason they essentially converge at 100% selectivity is that you max out the memory bandwidth: every tuple satisfies the predicate, you always have to write it out, and you saturate the channel. For the vectorized versions, starting with the 0% selectivity case, the distinction between early and late materialization matters. That means: do you materialize the tuple immediately in the operator, or do you just pass the offset up into the query plan and assume it gets materialized further up?
All right, so let's look at some performance numbers. This was run in 2015 on the older version of the Xeon Phi, which did in-order execution and was not superscalar, so be mindful of that. The way to read these graphs is that one side is the Xeon Phi doing in-order execution, and the other side is the Haswell Xeon doing out-of-order execution.

The first result is the scalar version of the sequential scan with branching, and, as expected, performance gets worse as selectivity increases. The Xeon Phi result looks exactly as you'd expect: because it's in-order execution, as more tuples qualify you execute more instructions and copy more things out, and that's why performance goes down. On the Haswell side I would have expected the line to be higher than it actually is, and then it drops off at the end because I think you're saturating your memory bandwidth.

When we do the scalar branchless version, on the Xeon Phi it's actually worse, as expected, because with in-order execution you're always copying things no matter what, so you're always doing more work. On the Xeon, the branchless version actually gets better, because we're no longer paying for branch mispredictions. The reason they essentially converge at 100% selectivity is that we're maxing out the memory bandwidth: every tuple satisfies the predicate, we always have to write it out, and we're just saturating the channel.

For the vectorized versions, starting with the Xeon Phi, the 0% selectivity case makes sense whether you're doing early materialization or late materialization. That distinction is: do you materialize the tuple immediately after the operator, or do you just pass the offsets up the query plan and assume they'll get materialized further up? (This is not a real system, by the way; it's just a toy test bed that contains the algorithms.) With late materialization they never actually materialize anything, so at 0% selectivity no tuple satisfies the predicate and there's nothing to materialize, which is why the two get exactly the same performance; and as selectivity increases everybody converges at the same point because you max out your memory bandwidth. For the other machine, I actually don't know why late materialization at 0% selectivity would be better than early materialization, because it's not like early materialization has to copy anything there; but again, as you get less selective you do more and more work and they converge. And again, as expected, the branchless version does worse on the in-order machine, whereas on an out-of-order superscalar CPU going branchless makes a big difference.

All right, so let's talk about how to do hash tables. For probing, the scalar way to do this, assuming a linear-probing hash table, is: we take our input key, hash it, that gives us a hash index, and that jumps to some offset in the table. There's a key at that offset, and we do a scalar comparison between our key and the key that's there to see whether we have a match. If not, we jump down to the next slot and keep scanning linearly until we either find the exact key we're looking for or we hit an empty bucket, meaning nothing else could possibly be there. So that's the scalar version: deal with one key at a time and do a sequential scan into the hash table.

So let's see how to do this in a vectorized manner, starting with horizontal vectorization. Horizontal vectorization means taking a single element of our input and processing it against multiple data items at the same time. What we're going to do is create a bucketized hash table where each bucket holds four keys instead of one. The idea is that when we hash an input key and land in a bucket, we extract all four keys in that bucket and do one SIMD comparison against them, which produces a mask telling us whether any lane matched. To insert, when you land in a bucket you just write into the first empty position, and if all four slots are full you jump down to the next bucket. So it works the same way as before; it's just that a single bucket now packs four keys together.
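Here is a minimal sketch of that horizontal, bucketized probe using SSE intrinsics. Again, this is my own illustration rather than the paper's code: the `Bucket` layout, the `EMPTY` sentinel, and the multiplicative hash are assumptions made just to show the one-probe-key-versus-four-slots compare, and the table is assumed to have been built with the same hash and never be completely full.

```cpp
// Sketch of a "horizontal" probe: one probe key compared against the 4 keys
// of a bucket with a single SIMD compare. Requires SSE2 + BMI1 (_tzcnt_u32).
#include <immintrin.h>
#include <cstdint>

constexpr int32_t EMPTY = INT32_MIN;        // assumed sentinel for an unused slot

struct Bucket {
    alignas(16) int32_t keys[4];            // four keys packed per bucket
    int32_t payloads[4];
};

// Assumes num_buckets is a power of two and matches the hash used at build time.
inline uint32_t hash32(int32_t k, uint32_t num_buckets) {
    return ((uint32_t)k * 0x9E3779B1u) & (num_buckets - 1);
}

// Returns the payload for `key`, or -1 if the key is not in the table.
int32_t probe_horizontal(const Bucket* table, uint32_t num_buckets, int32_t key) {
    const __m128i vkey   = _mm_set1_epi32(key);    // broadcast probe key to all 4 lanes
    const __m128i vempty = _mm_set1_epi32(EMPTY);
    for (uint32_t b = hash32(key, num_buckets); ; b = (b + 1) & (num_buckets - 1)) {
        __m128i vbucket = _mm_loadu_si128((const __m128i*)table[b].keys);
        int hit = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(vkey, vbucket)));
        if (hit)                                   // some lane matched the probe key
            return table[b].payloads[_tzcnt_u32((unsigned)hit)];
        int empty = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(vempty, vbucket)));
        if (empty)                                 // an empty slot means the key is absent
            return -1;
    }
}
```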
To do this vertically, the idea with vertical vectorization is that we want to process multiple data items from our input at the same time. We take our input vector of four keys, hash them to produce a hash index vector, and those indexes all point to different offsets in our linear-probing hash table. Then, using a SIMD gather, we grab the values at those different memory locations, pack them into a single vector, and do a SIMD compare to produce our match mask. So in this case, key one matches the key at its slot, so that lane is a 1; the middle two lanes don't match, so they're 0; and key four matches, so that lane is a 1.

Yes? [Student question] The question is whether we can combine the two methods together. Let's keep going, we're short on time, but I don't think it's that easy.

At this point the first key and the last key matched, but the middle two didn't, so we need to keep processing them: we have to keep scanning down until we find what we're looking for. One way to do this is to have the middle two lanes increment their position in the hash table and figure out the next key to compare against, while keeping the top and bottom lanes exactly the same, since we already found what we were looking for there. But remember, in the beginning I said I wanted to maximize lane utilization. This is an example of what I was talking about: as we keep scanning down, we're wasting computation by re-comparing keys we already know match. We get no new information from those lanes, but we still have to issue the instruction because there's something else in the vector we need to keep scanning for.

So instead of re-computing the same keys over again, what they do is go back to the input vector, grab the next items, and fill them into the lanes that already matched. The first and last lanes already matched, so the next two keys, key five and key six, get loaded into those lanes of the input vector, and then we go back through, compute their hashes, and generate the next offsets we want to jump to. This ensures that every time we probe the hash table we're always looking for something new, not reusing lanes we've already produced the answer for. Of course, this means we have to do some additional bookkeeping to keep track of all this.

Yes? [Student question] His question is whether we're still wasting work, because before we were hashing four keys and now we're only hashing two. To be clear, the hash computation itself is not the part being vectorized here; if you wanted to parallelize that, you'd have to do it across cores, or otherwise it's done sequentially. It's really this probing part that is vectorized.
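Here is a rough sketch of what that vertical probe with lane refill can look like, written with AVX-512 so 16 keys are in flight per vector rather than the four in the example. Again, this is my own illustration under simplifying assumptions, not the paper's implementation: one 32-bit key per slot, a power-of-two table built with the same made-up multiplicative hash, and every probe key assumed to be present in the table (miss handling is omitted to keep the sketch short).

```cpp
// Sketch of a "vertical" probe with lane refill: gather 16 table slots, compare
// all lanes at once, emit matches with a compress store, and refill finished
// lanes from the input with an expand load. Requires AVX-512F.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

constexpr uint32_t HASH_CONST = 0x9E3779B1u;   // illustrative hash, same as build time

// Writes every matched key to `out`; returns how many were written.
size_t probe_vertical(const int32_t* slots, uint32_t num_slots,
                      const int32_t* probe_keys, size_t n, int32_t* out) {
    const __m512i vmul  = _mm512_set1_epi32((int32_t)HASH_CONST);
    const __m512i vmask = _mm512_set1_epi32((int32_t)(num_slots - 1));
    const __m512i vone  = _mm512_set1_epi32(1);

    __m512i vkeys  = _mm512_setzero_si512();   // probe keys currently in flight
    __m512i vslots = _mm512_setzero_si512();   // slot each lane is currently examining
    __mmask16 refill = 0xFFFF;                 // lanes that need a fresh input key
    size_t i = 0, out_pos = 0;

    while (i + 16 <= n) {                      // enough input left to refill all 16 lanes
        // Pull new keys into the finished lanes; unfinished lanes keep their key.
        vkeys = _mm512_mask_expandloadu_epi32(vkeys, refill, probe_keys + i);
        i += (size_t)_mm_popcnt_u32(refill);
        // Refilled lanes start at their hash slot; the others advance to the next slot.
        __m512i vhash = _mm512_and_si512(_mm512_mullo_epi32(vkeys, vmul), vmask);
        __m512i vnext = _mm512_and_si512(_mm512_add_epi32(vslots, vone), vmask);
        vslots = _mm512_mask_blend_epi32(refill, vnext, vhash);
        // Gather the table keys at each lane's slot and compare all 16 lanes at once.
        __m512i vtable = _mm512_i32gather_epi32(vslots, slots, 4);
        __mmask16 hit  = _mm512_cmpeq_epi32_mask(vkeys, vtable);
        // Matched lanes emit their key and become free for the next refill.
        _mm512_mask_compressstoreu_epi32(out + out_pos, hit, vkeys);
        out_pos += (size_t)_mm_popcnt_u32(hit);
        refill = hit;
    }
    // Tail: finish the keys still in flight plus the remaining input with scalar code.
    unsigned in_flight_mask = (~(unsigned)refill) & 0xFFFFu;
    alignas(64) int32_t leftover[16];
    _mm512_mask_compressstoreu_epi32(leftover, (__mmask16)in_flight_mask, vkeys);
    int in_flight = _mm_popcnt_u32(in_flight_mask);
    auto scalar_probe = [&](int32_t key) {
        uint32_t s = ((uint32_t)key * HASH_CONST) & (num_slots - 1);
        while (slots[s] != key) s = (s + 1) & (num_slots - 1);   // assumes key is present
        out[out_pos++] = key;
    };
    for (int j = 0; j < in_flight; ++j) scalar_probe(leftover[j]);
    for (; i < n; ++i) scalar_probe(probe_keys[i]);
    return out_pos;
}
```

The `refill` mask is the bookkeeping mentioned above: it records which lanes just produced an answer and may therefore take a fresh key from the input on the next iteration, so no lane ever re-probes a key it has already resolved.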
So there's actually one problem with this; it's a bit nuanced, see if anybody can pick up on it. It makes the algorithm unstable. What I mean by that is, depending on what our hash table looks like, we may not evaluate the keys in our input in the exact same order every single time. Say that in the first round key one and key four match and we produce them in our output vector, and then in the next round key five, key two, key three, and key six all match and get put into the output buffer. If we had executed this sequentially, we would expect the output buffer to maintain the order the keys were evaluated in: key one, key two, key three, key four. So if we run this on a different day, we may actually produce a different result. Now, the great thing about relational databases is that since we're unordered, we're using bag algebra, there is no ordering to the data and we don't actually care; if you really cared about the order you would add an ORDER BY clause. But one potential downside of doing this in a vectorized manner is that you may not get the exact same result: you'll have the same contents, but maybe not in the same order from one day to the next.

All right, so to finish up real quickly, we can look at the results figure from the paper. You see that the Xeon Phi is slightly faster than the regular Xeon, because it has way more cores, but you see a definite difference in performance between the vectorized versions and the scalar version. On the Xeon Phi, vertical vectorization for probing the hash table is much, much faster than horizontal, and you see the same thing on the other machine. But there are these points where they basically converge, and that's when you run out of CPU cache. This is what I was saying before: all of this vectorization stuff gets thrown out the window as soon as you fall out of the CPU cache. When everything fits in the CPU cache you get a 4x improvement over the scalar version; when you miss the cache, everything falls apart. That's one of the main problems we'll try to overcome in the next class.

I'm going to skip histograms, and we're out of time, so we'll skip joins as well. The only performance number I want to show is that their vectorized join can outperform all of the regular parallelized hash joins we talked about before, but again, this only works when everything fits in your CPU cache.

All right, so to finish up: vectorization is super important for OLAP queries, but most of the systems out there that do some kind of vectorization only do it for predicate evaluation, because you don't need everything to fit in your CPU caches for that. You can just grab a bunch of tuples, do your evaluation in the SIMD registers, move the result up the query plan, and go on to the next batch. And if you had to read from disk, it all falls apart anyway, so maybe it doesn't matter that much there. The thing we talked about at the beginning is that we can combine this with all the intra-query parallelism we covered before, like parallel joins and parallel logging: we can include the SIMD stuff inside of them and multiply the performance benefit, because we're maximizing the performance on a single core and then multiplying that across all the cores. The paper you're reading for next week shows that you can actually do vectorization carefully in a compiled plan, and the magic sauce that makes it all work is a software prefetching technique that hides the cache miss latencies.

All right, so next class will be split into two parts: I'll teach BitWeaving in the beginning, which is an idea from the University of Wisconsin, the Quickstep guys, and then Prashant will cover the paper you're assigned to read in the second half, because I have to fly out. Any questions?
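Since software prefetching came up as the key trick for Monday's paper, here is a toy illustration of the general idea in a batched scalar probe. This is only my own sketch of the concept, not the technique from that paper: the prefetch distance, the hash, and the table layout are all made up, and the table is assumed to contain every probe key and to have been built with the same hash.

```cpp
// Toy illustration of software prefetching in a batched hash probe: while
// probing key i, issue a prefetch for the slot that key i+DIST will need,
// so its cache miss overlaps with useful work on the current keys.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

constexpr size_t DIST = 8;   // assumed prefetch distance; tuned empirically in practice

size_t probe_with_prefetch(const int32_t* slots, uint32_t num_slots,
                           const int32_t* keys, size_t n, int32_t* out) {
    auto slot_of = [&](int32_t k) { return ((uint32_t)k * 0x9E3779B1u) & (num_slots - 1); };
    size_t out_pos = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + DIST < n)   // start pulling in the slot a future key will touch
            _mm_prefetch((const char*)&slots[slot_of(keys[i + DIST])], _MM_HINT_T0);
        uint32_t s = slot_of(keys[i]);
        while (slots[s] != keys[i]) s = (s + 1) & (num_slots - 1);   // assumes key is present
        out[out_pos++] = keys[i];
    }
    return out_pos;
}
```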