So today we're talking about vectorized execution. We touched on this a little bit already when we talked about sort-merge join, but now we want to go into more detail and talk about how we can do vectorized execution for more operators in the system. So we'll start out with more background about SIMD in general and the hardware we're going to be looking at. And then the paper you guys were assigned to read is, in my opinion, the definitive guide from these guys at Columbia on how to do vectorized algorithms for all the different operators you'd want in a relational database system. Now, I think it's a good instructional guide that tells you how to do this. The spoiler, and I'll just tell you upfront, is that it doesn't work: the assumptions they make about the operating environment are not realistic. Anybody want to take a guess at what the two major assumptions are that, given what we've talked about so far, we know are not going to be true? One is they assume the data sits in your CPU caches. That's not real. The second one is that they assume you have 32-bit keys and 32-bit pointers, so the things they're processing are going to be 64 bits. That's not going to work either, because we're going to need 128 bits: we need at least 64-bit keys and 64-bit pointers. Now, back then they didn't have AVX-512. We have that now, so that's not that big of an issue. But the main one is that everything has to fit in CPU cache. And we'll see this next class — Wednesday's class, and then Monday's — why the cache misses make vectorization fall apart and not work, and then on Monday I'll share a technique we developed to overcome that limitation and hide it. All right, so again, we've already sort of covered this. We already covered, at a high level, what vectorization is doing. But now we want to go into a bit more detail and talk about what's actually happening in the hardware and how we want to design our algorithms to use it correctly. So the idea of vectorization is that we're taking an algorithm where we would normally process a single data item at a time with a single operand — this was SISD, single instruction, single data, that we showed before — and instead we use a single instruction to process multiple items of data at the same time, right? And it should be obvious, given what we've talked about so far, why we actually want to do this, but to put it in context of everything else we've covered: the speed-up we can get if we can vectorize our operations is going to be quite significant. So we've already seen how to do the hash join, how to do our scans, how to do our sort-merge joins. We already know how to do parallel scans, right? So say we have 32 cores; we can have all 32 cores scan our tables at the same time. But now say that every single core has four-wide SIMD registers, where four-wide means it can process four data items at a time per instruction. So if we can run on 32 cores and get a 4x speed-up on each core, then it's multiplicative: every single core is processing four data items at a time, so across all 32 cores we get a 128x speed-up, which is massive, right? So again, this is why we want to do this.
Now, in reality, we're never going to achieve this; this is the upper bound. We're never going to get this exactly because, as we'll see as we go along, we have to move things in and out of the SIMD registers. Sometimes we have to execute extra instructions to put data into a form that can be loaded into a register, and then more to get it back out, right? So we're never going to achieve this exact level of parallelization, but it's the upper bound, and it's quite a lot, so it's something we want to work towards. So the thing we need to talk about a little bit — and this is mostly relevant because of the paper you guys read, but it highlights, again, the differences between the types of hardware that could exist and that we have to run on — is that the type of hardware we've been assuming we're going to run on is our standard Xeon processors, right? Skylake, Kaby Lake, I forget what the latest one is, Coffee Lake. And in these modern CPUs you have a small number of cores, but each core is super high powered, right? They have the full instruction set, a bunch of SIMD stuff they can do, and a bunch of other stuff we don't care about in databases, but basically every single core is high powered. And as part of that, the Xeons are doing superscalar execution, which we've already talked about. That means we can execute multiple instructions at a time within a cycle in our pipelines. And then they also do out-of-order execution, meaning they don't have to execute instructions in the order that they appear in the instruction stream for the pipeline. The CPU can be smart and say, well, I know this one's going to have a cache miss, so let me go ahead and start loading that thing into my caches from memory, but execute these other instructions while I'm doing that. And it makes sure that if there are any dependencies — this instruction depending on that instruction — it stalls and does not execute the dependent one until the data it needs has been produced by the other instruction. So there's a bunch of magic going on underneath the covers to execute multiple instructions out of order, and then the CPU makes sure it appears as if they executed in order, right? Because again, you don't want this thing to produce incorrect results. So this means they have a long instruction pipeline, and there's a bunch of extra stuff they have to do — like the branch prediction stuff we talked about before, which we'll see again in a second — to make this all happen. And that means the way we design our algorithms should be aware of what's going on, so that we don't end up having the CPU do stupid things, or slow things, for us. Now, in the paper you read, they also talked about this other type of processor. The brand name from Intel is the Xeon Phi, but this is what's considered a many integrated core CPU, or MIC, and the idea here is that rather than having a small number of high-powered cores as you do in the Xeon, they have a larger number of sort of low-powered cores.
So it means that the clock frequency of every single core is lower than the Xeon's, and the set of specialized instructions they support is much more limited than the Xeon's. So in the paper you guys read, they ran on an older version of the Xeon Phi called Knights Ferry. And again, Intel is afraid of getting sued over names, so these all correspond to geographical areas or lakes or rivers or things like that, right? That way no one can claim that Intel stole their name, because Intel can say it's a river, of course we didn't steal it, right? So the Knights Ferry that was in the paper you guys read was an older version of this: these were non-superscalar, in-order execution cores, so it didn't do all the stuff the Xeon did, and every single core follows this ISA, the Intel P54C — that was the same as the original Pentium from the 1990s — they're just running a ton of them on a single form factor for the processor. The later version that came out, after the paper you guys read was published, is this thing called Knights Landing, and with this one they said, oh, actually we do want to have the superscalar, out-of-order execution the same way the Xeons do. So the newer Xeon Phis that you can get have that, and those are based on the Atom ISA, not the Pentium. So the key thing, though, that's going to be relevant to what we're talking about today, is that although the cores themselves are low powered, they give them a lot of really wide SIMD registers, so on every single core you can do vectorized operator execution. And in the paper from Columbia, because these cores are non-superscalar and do in-order execution, they're going to favor things that are branchless, because that's how these cores execute things as fast as possible. If you've never seen what these things actually look like, you can get them in three form factors. The most common way is to get one that looks almost like a GPU — something that sits on PCI Express while the main socket CPU is the one actually controlling the OS. You can also get one as a self-boot CPU, so it can actually run the operating system and control the whole motherboard. And then they have a version with this little connector called Omni-Path; think of that as like RDMA, a messaging fabric that allows you to read memory on another machine very efficiently, right? So I have to update this slide every year: Intel killed this, they don't make these anymore. They didn't really sell — yeah, they didn't sell at all. People wanted to use GPUs, because these things give you like 60 or 70 cores, while a GPU gives you like 5,000 cores, so for things like neural network training and machine learning, GPUs are way better than these guys, right? What Intel's actually doing now, or the future, is something called Configurable Spatial Accelerators. Sort of think of that as like a TPU in a PCI Express form factor, or something that sits up on the socket, right? So that thing's not cache coherent, right?
It's like a coprocessor: the main CPU has to send data back and forth to it, whereas the Xeon Phi, I think, was cache coherent with the main CPU, right? All right, so SIMD — we've talked about this before, so I don't want to spend too much time on it — but again, the basic idea is that SIMD is a class of instructions that allow us to do vectorized execution of some basic primitive operation, and all the major ISAs have their own versions of this. It comes up under different names, different flavors, and in x86 it goes back to the 1990s with MMX, when the first SIMD stuff came out. Over time they add more instructions, more operations that are vectorized, as well as expanding the register size, meaning the lanes get wider and you can operate on larger pieces of data. So we already showed this example: we want to do this addition between these two vectors, and we would implement it with this for loop. If we were doing SISD instructions, our thread would just go through one at a time, looking at each data item in our array, and then invoke the SISD instruction to add them together. But if we want to do this in SIMD, we collect some data into — in this case — a 128-bit SIMD register, so a four-wide, four-lane register. We put our data items in there, we invoke one SIMD instruction, and it spits out, or writes, the result into another register. Again, what we'll see is that the data starts in memory, we load it into a register, we invoke our instruction on that register, it writes to another register, and then we have to go get it out and put it back into memory if we want to use it in the rest of our program. Or, depending on what algorithm we're doing, that result register can be used by other SIMD operations without having to put it back into the cache, and that's sort of the best-case scenario, because if we can keep things in our registers as long as possible, that's how we get the best performance. So, as I said, Intel has been adding to and improving SIMD over many years. The first time the SIMD instructions actually became usable in the context of databases, or things we actually care about, was 1999 with the first implementation of the Streaming SIMD Extensions, SSE. In this first version there were these 128-bit registers that you could pack with four 32-bit scalar values, and this first version of SSE allowed you to invoke SIMD instructions while the main CPU was doing other stuff. My understanding is that in the very first version, MMX, when Intel first announced they had vectorized instructions, the main CPU would basically stall: it could not execute SISD instructions at the same time it was executing SIMD instructions; it had to switch back and forth. Whereas with SSE you can sort of do both in parallel. And so the classes of operations you get are sort of low-level primitives that we're going to need to build up into more complex things in our database system. So, of course, we can move data in and out of our registers. We can do the arithmetic operations. We can do all our Boolean operations for comparisons and predicates. And then we can move data between the different SIMD registers without having to put it back in the cache.
But some of the more important things are that they give you the ability to convert data between the scalar x86 format and the format that the SIMD registers want, and they also give you fairly fine-grained control over how you move data back and forth between the registers and memory. So the streaming instructions in SSE — streaming instructions in general — are instructions where you can bypass the CPU cache and write directly to memory, so you don't pollute your cache. Say I have a data item that I run whatever instructions I want on to crunch it. Normally I would write the result back to my CPU cache. But if I know I'm never going to read it again, then I can use a streaming instruction to dump it off to memory right away and not worry about polluting my CPU cache. So there will be some techniques where we can take advantage of this: where we know we don't need the data ever again, and we can just shove it off. Yes, sorry? Are selective store and gather from the paper, and not actual instructions? Yeah, so the question is whether selective store and gather are fundamental operations, actual SIMD instructions. Correct — they are not. The way you have to implement them is as a small kernel that does a series of SIMD instructions. There's no selective store instruction that does exactly what the paper describes. And every year I Google this to see whether it exists yet — does the Xeon actually support this? As of last night, it still doesn't. If you Google "SIMD selective store," you get three things: you get my slides, you get the Columbia paper, and then you get this guy in Korea who stole my slides. So, as far as I know, yeah, there's no single instruction to do this. All right. And somebody at Wisconsin took my slides too, but that's okay. All right. So this is just the history of SIMD in x86, and you can see that over the years the registers get wider. In the original MMX it was 64 bits, then it went to 128, which is what actually makes it usable for us in databases. Then 256 bits with AVX and AVX2. And then as of 2017 they added AVX-512, and we have one machine in the database group that has this, right? And this is, again, what I said when we talked about sort-merge join: all that sorting stuff didn't work if you could only sort 32-bit numbers, because you need a pointer to the tuple that you're sorting; you can't just sort the key. With 512 bits, you can actually do it. Now, I'd also love to see whether Intel is going to add 1024-bit SIMD instructions. There's no reason they couldn't; I just haven't seen anything that says they're doing it. But for our purposes in databases, 512 bits is sort of the threshold we need to actually use this in a real system. All right, so let's talk about how we can actually take advantage of SIMD. It clearly seems useful, we want to use it, it'll speed things up. So how do we use it inside our database system? There are three approaches. There's automatic vectorization, getting it from the compiler. There's us telling the compiler what it can try to vectorize. And then the last one, which is what they did in the Columbia paper and what we do in our own system, is to invoke the intrinsics directly and do explicit vectorization.
So there's this great talk — the link here will take you to the video — from James Reinders, who was one of the lead architects, or the lead proponent, of the SIMD vectorization architecture at Intel. He lays out how you can actually do vectorization in these different ways. But at a high level, the way to think about this is that there are two ends of a spectrum: at the very top you get ease of use, for us as the database system programmer, and at the bottom you get the guarantee that things actually will be vectorized, plus complete control over what the CPU is doing. So the top one is basically: we write our code without thinking about vectorization and hope the compiler finds it for us. The bottom one is: we tell the compiler we absolutely want this vectorized. Do they cover this kind of stuff in 418, 618, or no? Yes, okay. But not everyone has taken it. All right, so let's go through each of the three approaches. So again, automatic vectorization is us getting down on our hands and knees and praying that the compiler can identify the parts of our code that can be vectorized. It's basically looking for loops where it knows the iterations are independent, and those can therefore be replaced with vectorized instructions, right? The downside, of course, is that the compiler is never going to be perfect, and in many cases in our database system the candidates for vectorization are going to be rare, because our loops are usually not that simple. Now, remember when I talked about VectorWise: they had those loops, those primitives, doing things like a scan over a column and applying different predicates. We'll see that approach next class, in the paper you'll read, and that's a good example — that paper tells you what percentage of the operators can be automatically vectorized, and it's not 100%, right? So let's look at a simple example here. We have a simple function, we take in three vectors, and we just add them together. It's the example I showed at the beginning, right? So take a guess whether the compiler can automatically vectorize this for us. Who says yes? Raise your hand if you say yes. Raise your hand if you say no. Raise your hand if you don't know. Good, all right. The answer is no, right? Why? Well, what are we doing here? The pointers might alias each other. What's that? The pointers might alias. Correct, yeah. X, Y, and Z might actually point to the same addresses, right? This is something we don't know until runtime, because we don't know what values are going to be passed in to us. So Z could just be the address of X plus some offset. Then in one iteration of this loop I write to Z, but that also writes to something in X, right? So this is not like unrolling, right? Unrolling would just replace this with the instructions executed all in a row. This is trying to take a bunch of data at the same time and invoke a single instruction on it. So if there's a side effect — if one of these addition operations modifies the contents of one of our arrays — then we won't see it in our vectorized operation, because it happens atomically, right? So this is sort of the downside of C or C++: as humans we write our code in a way that describes what the algorithm is doing in a sequential manner — iterate over this and do it one item at a time.
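As a rough sketch, the loop being asked about looks something like this (hypothetical names; the slide's exact code may differ):

```c
/* A sketch of the example from the slide: element-wise addition of two
 * arrays. The compiler cannot prove that x, y, and z never overlap
 * (alias), so it will usually refuse to auto-vectorize this loop. */
void add_arrays(int *x, int *y, int *z, int num_elems) {
    for (int i = 0; i < num_elems; i++) {
        z[i] = x[i] + y[i];
    }
}
```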
But what we really want is for the hardware to be able to execute this in parallel, in a vectorized form, right? So again, this is what I'm saying: it seems like this should be super trivial — add two arrays together and produce a new array — it seems like something the compiler could find for us, but it actually can't. So how can we fix this? One approach is to use compiler hints. Basically, we provide information to the compiler about the code it's examining as it builds it, to let it know that it's safe to vectorize. This can be done in two ways. One is we give it explicit information about the memory locations of the data we're operating on, so it can tell they're not going to overlap and therefore it's safe to vectorize. Or we just take the seatbelt off while driving the car and say: ignore any dependencies you might find, you'll be fine, just go ahead and vectorize this, right? Of course, that may not be what we actually want, because if we do this and get it wrong, then our SELECT queries are going to produce incorrect results. And unless we're doing approximate query processing, people are going to notice that — you know, we're losing money in the database, right? Or we're getting more money than we should have. So the first approach, the first way to handle this, is the restrict keyword. This is in the C99 standard; I don't think it's in the C++ standard, although I haven't checked C++17 or the newer ones, but every compiler actually supports it, so it's not a big deal, right? This is basically telling the compiler, in our function, with this restrict keyword, that these memory locations are distinct, they don't overlap, and therefore it's safe to vectorize, right? So where before the compiler couldn't prove that X, Y, and Z are not aliases of each other, now we're saying that explicitly, right? The other approach is to use these pragmas, and this is how you tell it to ignore dependencies. I don't know whether this is in the standard or not, but it's basically IVDEP — ignore vector dependencies. It says that for the loop in this code block, the compiler can ignore dependencies, which allows it to go ahead and vectorize it. Of course, this relies on us being disciplined programmers. If we start doing this, it's like taking the seatbelt off, or taking the safety off our gun, right? We can still pass in pointers that do overlap, and then this vectorized instruction stuff gets messed up, but it's up to us to make sure that doesn't happen in the rest of our code, right? There are other libraries and frameworks, and other compilers have different pragmas — OpenMP has a simd pragma, for example — and as far as I can tell they all basically do the same thing, right?
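Here is a sketch of the two hinted variants of that same loop; the restrict spelling is C99, and the pragma spelling varies by compiler, so treat these as illustrative:

```c
/* Hint #1: C99 restrict promises the compiler that x, y, and z do not
 * alias, so it is free to vectorize the loop. */
void add_arrays_restrict(int *restrict x, int *restrict y,
                         int *restrict z, int num_elems) {
    for (int i = 0; i < num_elems; i++) {
        z[i] = x[i] + y[i];
    }
}

/* Hint #2: tell the compiler to ignore assumed vector dependencies for
 * the next loop. The exact spelling is compiler-specific: icc accepts
 * "#pragma ivdep", GCC uses "#pragma GCC ivdep", and OpenMP 4+ offers
 * "#pragma omp simd". */
void add_arrays_pragma(int *x, int *y, int *z, int num_elems) {
#pragma ivdep
    for (int i = 0; i < num_elems; i++) {
        z[i] = x[i] + y[i];
    }
}
```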
All right, so the last approach is explicit vectorization, and this is what we're mostly going to talk about today. The idea is that we use CPU intrinsics to invoke the exact instructions we want for our vectorized operations, our vectorized primitives, right? Now, the downside of intrinsics: intrinsics are basically syntactic sugar for humans. It looks like a function, but it's really just the low-level assembly instruction that does the SIMD operation. And the downside is that they're not portable. All the examples I'm going to show you here are x86, either AVX2 or SSE intrinsics. If you try to plop that onto your PowerPC machine or your ARM machine, they have different instructions, different intrinsics, and it won't work, right? There are libraries that can hide this for you, but there isn't one that everyone adopts; it's varied, and as far as I can tell, most people writing database code write x86, and they use these intrinsics if they do SIMD at all. It's also hard to write these things, because they're always abbreviated, and every single time you have to figure out what you're looking at, you have to go look at the table on Intel's website. It's not intuitive, at least for me, because I'm not writing SIMD stuff all the time. All right, so here's the intrinsic way to do that addition between two arrays. The first thing you see is that we pass in our regular integer arrays — our pointers to integer arrays — but then we cast them and convert them into the SIMD data types for x86, right? And this __m128i is telling us it's for the 128-bit SIMD registers, so we can take four 32-bit integers and stick them into one register at a time. Then in our for loop, for every four elements in our arrays, we load them into SIMD registers, we invoke the addition operation between the first register and the second register, and we write the result into the z register, all right? So this is sort of clear. The underscore-underscore stuff is what it is, it's fine, okay. All right, so there are different types of SIMD operations, or instructions, that we can do, and the way to think about it is: in what direction are we applying our vectorized instruction and producing the output? The first is horizontal SIMD instructions. The idea here is that within a single SIMD register, I apply the instruction to all of the elements inside that register and write the result out somewhere else, right? So say I have a SIMD register that holds the numbers zero, one, two, three. If I have a SIMD addition doing the horizontal operation, it adds zero plus one plus two plus three and writes out six, right? It's horizontal within the register. The vertical instructions take elements from two different registers in the same lane, apply the instruction, and write them out to another location, right? So this takes the zero in the first position here and the one here, adds them together, producing one; one and one produces two; and so on across the other lanes, right? The horizontal approach is only really found in newer CPUs — it shows up in SSE4 and AVX2, so "new" meaning the last ten years or so. The vertical one is supported by pretty much everyone, and in the Columbia paper they chose that approach: they always do the vertical SIMD stuff, right? But just keep in the back of your mind that you could do the horizontal one; for their purposes, they do nothing but the vertical one.
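For reference, here is a sketch of that intrinsic version with SSE; it uses unaligned loads/stores and assumes the array length is a multiple of four to keep it short, so it is an illustration rather than the slide's exact code:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Vectorized element-wise add using 128-bit registers (four 32-bit
 * lanes per instruction). Assumes num_elems is a multiple of 4; a real
 * version would need a scalar tail loop for the leftovers. */
void simd_add_arrays(int *x, int *y, int *z, int num_elems) {
    for (int i = 0; i < num_elems; i += 4) {
        __m128i xv = _mm_loadu_si128((__m128i *)&x[i]);   /* load 4 ints */
        __m128i yv = _mm_loadu_si128((__m128i *)&y[i]);
        __m128i zv = _mm_add_epi32(xv, yv);               /* vertical add */
        _mm_storeu_si128((__m128i *)&z[i], zv);           /* write back  */
    }
    /* Horizontal operations also exist on newer ISAs, e.g. _mm_hadd_epi32
     * (SSSE3) adds adjacent pairs of lanes within the registers. */
}
```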
So the kinds of things we can support with explicit vectorization in our database system are essentially all the low-level primitives — not even whole operators in the query plan, but the stuff we have to do inside a single operator to execute and process data. We can start vectorizing those and use them as building blocks for more complicated things, right? So we can do vectorized predicate evaluation and compression. We saw last class how to do vectorized sorting and merging, and then we can do even more complicated things like hash tables, which we'll cover here today. So we start small, with small vectorized operations, and then use those to build more complex algorithms, more complex techniques, right? And again, that's part of the reason I like this paper so much. All right, so again, the Columbia paper: it's not a real system. It's just a test bed that really only has the algorithms. They're not actually supporting SQL, they don't have a query processor or anything like that. It's just: here's a bunch of data in a columnar format, and here's the little kernel of the algorithm that's been vectorized. But the paper lays out a clear example of how to vectorize a bunch of the core things we care about in our database system. And the two overarching principles in their decisions about how they designed their algorithms are, first, as I already said, that they use vertical vectorization over horizontal vectorization — meaning they process data in two registers at the same lane offset — and second, that they maximize lane utilization by making sure that every time they invoke a SIMD instruction, every lane is producing useful work. What I mean by that is they don't leave useless data sitting in their SIMD registers. Say I do some operation in a SIMD register, and some lanes hold data I want to keep while others still need more processing. They could just keep reusing the same register over and over, essentially doing wasted work in some of the lanes, but instead they go back and fetch more data so that for every single SIMD instruction they invoke, every lane is actually doing something useful. If this doesn't make sense now, it'll make more sense when we talk about the hash table stuff — it should click then. All right, so the things I want to talk about first are the low-level fundamental operations that we're going to vectorize; again, we'll build up from these to do more complex things like the hash tables or the histograms. So we have selective load and store, and then selective gather and scatter. And as Tini already asked, for the first two there are no SIMD instructions that actually do them; you have to emulate them with multiple SIMD instructions. For the gather and scatter, I don't think the CPU in the paper you guys read supported them, but I believe AVX-512 does have them in the newer CPUs. All right, so the first one is selective load. The basic idea is that we want to take some contents we have in memory and load them into specific lanes of our SIMD register. So again, this is memory — or really, for our purposes, the CPU cache, but I'll just call it memory — and then this is the four-lane vector that we want to write into.
Then we have a mask that tells us where we actually want to store data. The way to think about it is that the mask positions match up with the lanes: whatever offset you're at in the mask corresponds to that lane in the register. So for this first one here, the mask is set to zero, so we skip it; that tells us we're not going to write anything into this lane. And then every time we hit a one, we take whatever our current position is in memory, write that value into that particular lane, and move the cursor over by one. So this one here is a one, so we write whatever our cursor is pointing at into this lane; the next one is a zero, so we skip it; the next one is a one, so we take the V here and write it out to that lane there. Selective store is the reverse. The top is our target: we want to take the contents of our vector and write them out into memory, and again we have a mask that tells us whether we actually want to write data from each lane. The lanes match up just like before. The first one's a zero, so we skip it. The next one is a one, so we take whatever is in that lane and write it out to the first position in memory. The next one's a zero, skip it. Then we have a one, and we write that lane out to the next position. So again, ideally we'd want this to be vectorized and have one instruction do it for us; in practice, that's not going to be the case. Next, selective gather. This is different from how we describe gather and scatter in a distributed database. In a distributed database, scatter/gather means you take a single query, break it up into plan fragments, scatter them across different machines, and then gather the results back. This is within a single memory context. All right, so for this one we have a value vector and an index vector, and the indexes correspond to offsets in memory. So this one here tells us we want to take whatever is at offset zero, one, two, and write it into this lane here; same thing for all the other ones. So this basically lets us take some chunk of memory — I think it has to fit in a cache line, because we can't jump too far ahead — and take values that can be in any order in memory and write them out in the order that we want in our register, like that. And I think this can now be done in newer CPUs with a single instruction. Scatter is the reverse: the index vector tells us where to take the contents of our SIMD register and where to write them into memory. All right, so this is clear? So, actually, let me take back what I said before: the memory itself, in both cases, might not fit in a single cache line, so this may take multiple instructions — or sorry, multiple cycles — to perform. What I'll also say is that with L1 you can only do one or two loads and stores per cycle, so even if it did fit in a single cache line, it may take two cycles to actually do this, right? And that's what I'm saying right here: the gathers and the scatters are not really fully parallel, because you can only do so many things against L1 per cycle.
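On hardware that does have gather (AVX2 and later), it looks roughly like this; this is a sketch, not the paper's code:

```c
#include <immintrin.h>   /* AVX2 intrinsics */

/* Selective gather: pull eight 32-bit values from arbitrary offsets in
 * 'base' into one 256-bit register, in the order given by 'idx'.
 * The reverse direction (scatter) only appears later, in AVX-512. */
__m256i gather8(const int *base, const int *idx) {
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4 because each element is a 4-byte int */
    return _mm256_i32gather_epi32(base, vindex, 4);
}
```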
The gathers are only supported in newer CPUs, and as I said, the selective loads and stores you have to implement using multiple SIMD instructions; there's no single selective load or store instruction you can just use, all right? All right, so now, using selective load and store and scatter and gather, we can talk about how we actually want to do SIMD stuff in our database system, for the things we care about. We've already talked about how to do parallel partitioning — that was the radix partitioning stuff. The scan stuff we'll talk about here today, and the hash tables as well, as well as histograms. The paper also provides a bunch of other stuff: joins, sorting, and Bloom filters. The Bloom filter one is fairly straightforward, because you're just doing multiple probes into your bitmap. The join stuff is a bit more complicated, and I will say that in our own research we found that once you go beyond the CPU cache, SIMD doesn't help at all, right? The vectorized version actually ends up being worse than the scalar version, just because of all the instructions needed to prepare things to put into SIMD registers; you get no benefit. The sorting one is the bitonic merge sort stuff that we talked about last class, all right? And as I already said, they're assuming everything fits in CPU caches, and they're assuming 32-bit keys and 32-bit pointers. With AVX-512 we can bump that up to 64-bit, and then everything actually works. So let's talk about how to do a vectorized selection scan. We already talked about this before: the issue of branchless versus branching scans, right? So again, we're doing a simple select: we have a low key and a high key, and you want to check whether the tuple matches the predicate. In the branching case you have this if clause, and depending on whether it's satisfied, you go down into this block here. In the branchless case, you always do the copy, and then the predicate result decides whether you move the offset in your output vector forward or not. And then we said that on a superscalar CPU that supports out-of-order execution and uses a branch predictor to speculatively execute instructions, the branching approach can hurt you: if you predict incorrectly, which is often the case because the data is completely different for every single tuple, then this ends up being slower. Whereas the branchless one always does more work, but you don't have that branch misprediction issue, so you're basically running at bare-metal speed. So we saw this graph before as well. As you change the selectivity of the predicate along the x-axis, the branching version of the selection scan can be either better or worse, whereas the no-branching one always does the same amount of work, no matter what the data actually looks like. Yes? Can you go back to the previous slide? Yes. In the branchless version, that logical AND will still short-circuit, so it's effectively a branch and goes through branch prediction — you'd have to replace it with a bitwise AND. This thing here? Yeah. Yeah, yeah, yeah. So he's basically saying you have to do a bitwise AND, not a logical AND — remove one ampersand. Good, thank you. I will fix that. All right.
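Here is a rough sketch of those two scalar inner loops, including the bitwise AND that was just pointed out (hypothetical names, illustrative only):

```c
/* Branching version: only copy the offset when the predicate matches. */
int scan_branching(const int *keys, int n, int low, int high, int *out) {
    int j = 0;
    for (int i = 0; i < n; i++) {
        if (keys[i] > low && keys[i] < high) {
            out[j++] = i;
        }
    }
    return j;
}

/* Branchless version: always copy, then advance the output cursor by the
 * predicate result. Note the bitwise & (not &&), so no short-circuit
 * branch hides in the condition. Because it always writes out[j], 'out'
 * needs one slot of slack beyond the number of matches. */
int scan_branchless(const int *keys, int n, int low, int high, int *out) {
    int j = 0;
    for (int i = 0; i < n; i++) {
        out[j] = i;
        j += (keys[i] > low) & (keys[i] < high);
    }
    return j;
}
```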
So in the SIMD version, we're going to be branchless, because there are no if clauses in SIMD instructions, right? So for this one, to keep things simple — again, we're iterating over vectors of tuples in our table; this is super pseudo-code, it obviously won't compile — instead of the actual intrinsics I'm just going to say SIMD load, SIMD compare, SIMD store. So we load the keys into our key vector, we do the comparison in SIMD, and that gives us a mask that tells us which tuples matched the predicate. Then the selective store writes out only the matching tuples, and that tells us how far to move our output buffer ahead. So the key part is that this can all be done in a vectorized manner, but we're relying on the selective store to make sure that only the tuples that actually match our predicate — per that mask — get written out to our output buffer. It's sort of like we're always loading things in and doing the comparison for everyone, and this just makes sure we only write out the small amount of data that we need. And then this last part is a little bit complicated, but basically, for every element in the mask vector that is a one, we increase our output position by that amount. So if nobody matches, the mask is all zeros, this returns zero, and we don't move forward; and for every one in the mask, we advance by that count. There's a single instruction for that — I believe it's the population count, popcount — that counts the ones. Okay. All right, so let's actually see visually what's going on here. So this is the predicate — it's the same query from before, but now we're going to fill in the values O and U. Say our data looks like this: we have a simple table with the keys J, U, Y, S, U, X. So again, the first step is we load that into our key vector; that's the SIMD load. Then we do the SIMD compare here, and we get our output mask, which is one if a lane matches and zero if it doesn't. And then we can use a pre-computed SIMD register, or vector, that says: here are the offsets these lanes correspond to; and we can use that with our selective store to write out just the keys that matched. Right, is this clear? So like I said, when you Google selective store or selective load, all you find are slides or discussions about the Columbia paper itself. And I know people are copying my slides, because they also copy the "Joy sucks" joke in here, right, without knowing what it is. Anyway, Joy is my first PhD student. He doesn't suck, he's awesome.
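To make that walkthrough concrete, here is a sketch of the vectorized scan using SSE intrinsics and an emulated selective store; the paper targets different hardware and wider registers, so treat this as an illustration rather than their code:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Vectorized selection scan: for keys in (low, high), write the matching
 * offsets to 'out'. Four keys are compared per iteration; the selective
 * store is emulated with a small per-lane loop because there is no single
 * instruction for it (pre-AVX-512). Assumes n is a multiple of 4, and
 * 'out' has a few slots of slack because of the unconditional writes. */
int simd_scan(const int *keys, int n, int low, int high, int *out) {
    __m128i lo = _mm_set1_epi32(low);    /* broadcast bounds to all lanes */
    __m128i hi = _mm_set1_epi32(high);
    int j = 0;
    for (int i = 0; i < n; i += 4) {
        __m128i k = _mm_loadu_si128((const __m128i *)&keys[i]);
        /* all-ones per lane where low < key AND key < high */
        __m128i m = _mm_and_si128(_mm_cmpgt_epi32(k, lo),
                                  _mm_cmplt_epi32(k, hi));
        int mask = _mm_movemask_ps(_mm_castsi128_ps(m)); /* 4-bit mask */
        /* emulated selective store of the matching offsets */
        for (int lane = 0; lane < 4; lane++) {
            out[j] = i + lane;
            j += (mask >> lane) & 1;
        }
        /* equivalently, j could be advanced by popcount(mask) after a
         * real selective/compress store on hardware that has one */
    }
    return j;
}
```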
All right, so let's look at some results here. So again, this is an older version of the Xeon Phi, which means it does in-order execution, and therefore it gets no benefit from the branchless scan, because it doesn't do the speculative execution that would create branch prediction problems, right? So when you look at the branching version, this one actually runs faster because it has way more cores. What's surprising is that this is not a typo: every year I look at it and think 61 cores, that doesn't sound right — but it really is 61 cores. I don't know why it's an odd number; you'd think it'd be 62 or 60, but no, it's 61. So the Xeon Phi performs better because it has way more cores than the Xeon. But when you look at the branchless version on the Phi, it does worse than the branching version, because it's always copying things even when it doesn't need them. Whereas on the Xeon, which has out-of-order execution and speculative execution of instructions, the branchless version does better up until the point where the selectivity crosses this threshold, and after that they're doing the same amount of work and performing the same — here you're hitting your memory bandwidth, right? So again, I love this graph, because it clearly shows you the difference between branchless versus branching on an in-order versus out-of-order CPU pipeline. So let's look at the vectorized execution. They have two implementations: one that does early materialization and one that does late materialization. Late materialization means you don't actually copy the tuple that satisfied the predicate into the output buffer of the scan operator; you materialize it at some later point in the query plan. Of course, as I said, this is not a real database system, so they never actually do that later materialization, because they're not executing anything beyond this filter. Whereas with early materialization, after the filter matches, they have to copy the tuple. All right, so here the vectorized one with late materialization works the best. The vectorized one is branchless, but that's okay, because you can vectorize everything. And the difference between early and late materialization here — actually, I don't know why this one is better at low selectivity, because if your selectivity is zero then nothing matches, so you're not really copying anything. I don't know why that's the case. But as more things match, you have to copy more, and that's why they converge. In the other case there's no difference, because the copying is not the most expensive part. So again, looking at this, remember I said before that you're never actually going to achieve exactly the speed-up that vectorization promises. So in this case, the branchless scalar version is doing maybe 2.25 or 2.3 billion tuples per second, which is a lot. And if we were really getting 4x from vectorization, you'd imagine this thing would be up around 8 or 9 billion, but it's not — it's closer to 6. So even though this is not a full database system, it's not doing much other stuff that could interfere, it's just doing the scan, you're not getting the true 4x speed-up, because there's extra work you have to do to prepare things to put into registers and to get the data back out. Right? And again, yes, 61 cores is weird. Okay, so any questions about the scans? For more complex scans it may be difficult to vectorize, but you can basically break the predicates up and do them piece by piece. And the goal, if you have more complex predicates or more complex WHERE clauses, is to keep the data in the SIMD registers for as long as possible, and only when you're done evaluating, and you've decided whether it matches or not, do you bring it back out into your CPU caches and memory. Okay? All right, so let's talk about how to do hash tables now. For the scalar, non-parallel, non-vectorized version, it's pretty straightforward, right?
We have an input key, we hash it, the hash gives us some offset into our hash table, and assume we're doing linear probing, right? We jump to that location and compare our key against whatever key is in there, and if it doesn't match, we just keep scanning down until we find the match we're looking for, right? Again, we can do this in parallel — we can have multiple threads do this probing at the same time — but now we're talking about, within a single core, how we make that core itself run in parallel with vectorization. So there are two approaches to do this, right? The first one is horizontal vectorization, and the idea here is that we expand things out so that each slot in our hash table — each bucket — now contains four elements, four data items. So we have a vector of four keys, and then a vector of four payloads for the actual tuples themselves. So for a given single input key, we get our hash index, we jump to that location, and we get back four keys at a time that we can apply our SIMD compare to. We get back a mask that tells us which of those keys actually match. If they're all zero, we go down to the next bucket and do another comparison. If we find at least one match, then we know we found the key we're looking for, and we can do some arithmetic or bit shifting to figure out which lane our one is actually in, and that tells us which entry we want. So this one, I think, is actually pretty decent. What are some downsides to this? No, I mean, this works, this is fine. Would this require some locking if you're doing concurrent inserts? The question is whether this requires locking if you're doing concurrent inserts. No, because you can do a compare-and-swap on the key slots; if you get it, then you're in. Yeah, this seems reasonable. I forget why this didn't work as well when we actually ran it. The other approach is to do this the vertical way. Actually, going back to this: as I said, only the newer CPUs support the horizontal instructions, but that's okay, because the new CPUs have them and we're not living in the past. I'll have to check with my student whether he actually implemented it this way and what happened, okay? I'll try to find that out by Monday.
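Here is a rough sketch of that horizontal probing idea, with a hypothetical bucket layout of four keys plus four payloads; it is illustrative, not the paper's implementation:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Bucketized hash table for horizontal probing: each bucket holds four
 * keys and four payloads, so one 128-bit compare checks a whole bucket. */
typedef struct {
    int keys[4];
    int payloads[4];
} bucket_t;

/* Probe with linear probing over buckets; returns the payload or -1.
 * 'empty_key' marks unused slots, and the modulo hash is a stand-in --
 * both are assumptions of this sketch. */
int probe_horizontal(const bucket_t *table, int num_buckets,
                     int key, int empty_key) {
    __m128i probe = _mm_set1_epi32(key);       /* broadcast key to 4 lanes */
    __m128i empty = _mm_set1_epi32(empty_key);
    int b = (unsigned)key % num_buckets;
    for (;;) {
        __m128i bk = _mm_loadu_si128((const __m128i *)table[b].keys);
        int hit = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(bk, probe)));
        if (hit != 0) {
            int lane = __builtin_ctz(hit);     /* which lane matched */
            return table[b].payloads[lane];
        }
        /* an empty slot in the bucket means the key is not in the table */
        if (_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(bk, empty))) != 0) {
            return -1;
        }
        b = (b + 1) % num_buckets;             /* linear probe to next bucket */
    }
}
```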
All right, so let's see how we do this using vertical vectorization. For this one, instead of hashing and looking up a single key at a time in our probe, we take a vector of keys and do the lookups in parallel in our hash table. So we have four keys here, and we hash all of them. That part is not going to be in parallel; it has to be done sequentially — as far as I know, there's no single SIMD instruction for a hash function we'd actually want to use; something like MurmurHash is more complicated than what you can do in one instruction. So the hashing is done in scalar form, but then we fill out our hash vector, and those hashes point to a bunch of different locations. So we do a SIMD gather to pull the keys at those locations into a SIMD vector, and now we can do a SIMD compare across lanes to see whether we have matches, and that gives us our output mask, which tells us which ones actually matched. So in this case, for keys one, two, three, and four, we only had a match for key one and key four. So we have a one here and a one here, and two zeros in the middle. So this goes back to that point I was trying to make before about always utilizing our lanes. What could we do here? We know that key two and key three didn't match, so we need to scan down, bring in the next candidate keys, and do the comparison again. But we know the first lane and the last lane already matched. So if we don't replace them, we're essentially going to be doing wasted work — we'd only get 50% lane utilization, because for those lanes we already have the answer we want; we're only looking for the middle two. Unless we replace them with new data, they end up being wasted work. So what they argue you really want to do is go back and get two new keys to fill into our vector — replace keys one and four with keys five and six — and for the middle two lanes, H2 and H3, we just increment the offset by one, because we're doing linear probing. So there's actual bookkeeping they do: when this mask vector comes out, they recognize which lanes already matched, and they move the cursor forward in the input keys to grab two new keys that they can fill into the SIMD register before doing the next comparison. You're making a face as if you're confused, or like this seems like a bad idea. It seems like a bad idea? Yeah — well, it is a lot more work, right? Although, when you think about it, if you don't do this, and say key three doesn't match and never matches, then I'm still going to rip through and keep invoking these instructions without getting anything useful out of those lanes. You could say you have to do that in the scalar case as well, but still. Was this also on the in-order CPUs? Say it again? I said, was this also on the in-order CPUs? The question is whether this is also on the in-order CPUs — we're going to see both; that's the next slide. And his suggestion was: if I'm 50% utilized, then I go back and refill, but if I'm 70% utilized, just keep going. You could do that. I don't think they did, though. That just sounds like more instructions, but it's not hard either, yeah. All right, so what's another potential problem with this, other than it seeming like more work? It's a bit more nuanced, so it's okay if you don't get it. The issue — and this mostly matters for us from a software engineering standpoint — is that our query results are not going to be stable, meaning if we execute our join today and then execute the same join immediately afterwards, we may get a different ordering of what we put in our registers, and therefore we may get a different output order. Now, in the relational model, results are unordered anyway, so this doesn't matter, who cares? But in terms of actually debugging this, trying to reproduce funky results, it can be problematic, right? I sort of buy that argument, but again, from an end-user standpoint it doesn't matter; from an implementation standpoint it can be tricky. All right, so let's look at the results. Again, same setup: the Xeon Phi versus the regular Xeon. They do the scalar version, and then the vectorized horizontal and the vectorized vertical ones. So again, not surprisingly, the Xeon Phi — even though each core is less powerful than the regular Xeon's — has way more of them, so it can get better performance. And the numbers look like this.
The difference, though, is that in the vectorized case, the vectorized horizontal one doesn't actually do much better than the scalar version here — I forget why the paper said that's the case — but the vectorized vertical one, even though it seems like it's doing a lot of extra work, gets full lane utilization, so it's able to rip through much more quickly, right? In the case of the Xeon, again, the vertical one is better up to this point here, whereas this one is almost always better. Again, I forget exactly why that is; I think at this point, as the hash table gets bigger, the number of probes you do per key can increase, so there's more bookkeeping you have to do to go back and get new keys, right? And then at this point here, the reason they all converge is that we're out of the CPU cache, and then everything falls apart, nothing works, right? Because the dominant cost in executing these things is going to memory, not how fast I'm crunching the instructions. And this illustrates exactly what I was saying: once everything's out of the cache, who cares, right? Notice that the horizontal one actually does worse than the scalar version out here. So this is why nobody actually does this. All right, so let's finish up quickly with the histogram one. This is another good example where you can show how to use the primitives we built up to do more complicated things; we can use them to do histogram lookups and histogram updates. So we're going to use scatter and gather to increment our counts in parallel, but the tricky thing is going to be collisions, and the way they handle those is replication. So this is our input vector — these are the keys we have — and we want to update our histogram to say we saw an instance of each key. So they do a SIMD radix operation to extract some bits of the key, and then we use that as our hash index, and we do an update — an add — into this vectorized histogram here. But the problem is that two of these keys both want to update the same slot. Remember I said this won't work, because we're doing a single operation to take all of this and update the histogram, which means one of the updates will overwrite the other; we can't update the same slot twice in one vectorized step. So in this case we'd miss an update, because two lanes are writing to the same location. The way they handle that — and it looks a lot like a Bloom filter in some ways — is that they replicate the histogram, so that every column here corresponds to one lane that we want to update. This lane writes into this column, this one writes into the second column, the third column, and the fourth column, like that. The number of copies corresponds to the number of vector lanes. And then at the end we just use SIMD adds to compress all of this down into the single histogram form that we want. So I like this, because it's like a game: how do you take these primitives, take your existing relational database operators, and figure out how to use the SIMD primitives to do the things you want to do? So again, I really like this paper.
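Here is a sketch of that replication trick in plain C, just to show the layout; a real implementation would do the per-lane updates with SIMD gather/scatter and the final reduction with SIMD adds, and the modulo hash here is a stand-in:

```c
#define NUM_LANES 4   /* matches the number of vector lanes */

/* Replicated histogram: each lane gets its own copy of every counter, so
 * two lanes that hash to the same bucket never touch the same slot.
 * Layout: copy 'lane' of bucket 'b' lives at hist[b * NUM_LANES + lane].
 * Assumes n is a multiple of NUM_LANES to keep the sketch short. */
void update_histogram(int *hist, const int *keys, int n, int num_buckets) {
    for (int i = 0; i < n; i += NUM_LANES) {
        /* conceptually one vectorized step: each lane updates its own copy */
        for (int lane = 0; lane < NUM_LANES; lane++) {
            int bucket = (unsigned)keys[i + lane] % num_buckets;
            hist[bucket * NUM_LANES + lane] += 1;
        }
    }
    /* final pass: collapse the per-lane copies into copy 0; this is the
     * part that maps to a handful of SIMD adds per bucket */
    for (int b = 0; b < num_buckets; b++) {
        for (int lane = 1; lane < NUM_LANES; lane++) {
            hist[b * NUM_LANES] += hist[b * NUM_LANES + lane];
        }
    }
}
```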
To finish up with joins real quickly: we've already talked about how to do partitioning before, but now they show how to do it with SIMD. We're not going to go into the details; I just want to show you the results, and the bottom line is that if you do the vectorized one, this approach over here, you get amazing performance. But again, everything fits in your CPU cache, so it's not really real. Okay, so to finish up quickly: the vectorization stuff is interesting, and we need it for OLAP. As far as I know, the systems that do vectorization only do the selection scans — they only do the vectorized predicates. Nobody, as far as I know, does anything else. Maybe some one-off things, like the histograms potentially, but everyone we talk to doesn't actually do vectorized joins or anything else. The other thing — and we've already said this before — is that you can take all the things we've talked about so far in the class, and these things aren't mutually exclusive. So we can take the compilation stuff, we can take the parallel scans and the parallel joins, and we can use all of that plus vectorization to do even better. So I would say, if you're going to build a modern system, you want to do compilation, you want to do vectorized predicates, and you want to do parallel operators across multiple cores while being aware of the NUMA regions the data you're operating on lives in, so that you schedule things the right way. I can't point to one thing and say this is better than that, go do this first; you sort of want all of these. But I would argue that compilation has a bigger bang for the buck than vectorization, because it optimizes all possible queries, whereas this only speeds up the predicates — the catch being that building a code generation engine is expensive from a software engineering standpoint. Okay? So next class, we're actually going to read a new paper that I helped out with last year. This one compares the benefits you get from compilation versus vectorization, and in particular looks at the HyPer approach versus the VectorWise approach: do you have these pre-compiled primitives, or do you try to vectorize and compile things using the JIT stuff in LLVM? Okay? And then we'll also talk a little bit about the technique that Prashanth developed, called relaxed operator fusion, that allows you to do vectorization and compilation for things that exceed the CPU cache. Okay? Yes? Why can't the pre-compiled primitives have vectorization in them? The question is: why can't the pre-compiled primitives have vectorization in them? You can — there's no reason you can't. So why do you think they're exclusive? Sorry, they're not mutually exclusive. Oh — yeah, sorry, did I say that? No, I just meant it's more about how much benefit you get from one versus the other. Yeah. Okay?