Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. All right, so today we're going to talk about vectorized query execution. Again, this is the thing we've been leading up to the entire semester; we've been saying this is one of the ways that a modern OLAP system is going to get good query performance. We'll see why, and why it doesn't always pan out. So last class we talked about how to take a query plan, divide it up into pipelines, and run them in parallel. That method is called task parallelization: take a query plan, break up the tasks, and run those in parallel. We haven't said how to schedule them and where to schedule them; that'll be next week. But at a high level, we understand we can run things in parallel and coalesce things with exchange operators. And then we also discussed how the DBMS would actually evaluate any kind of expression in a WHERE clause or a join clause. We saw that as a sort of preview to the query compilation, or just-in-time compilation, or code generation stuff we'll talk about on Wednesday this week. And then we also introduced the idea of query adaptivity. We're not going to push on this too much just yet, but it's the idea that the optimizer spits out a query plan, and then at runtime, while the DBMS is executing that query, it can decide whether that query plan was a good idea or not, and can make some changes: either change the order in which it checks predicates, or what code path it uses to do certain things. Later in the semester we'll see how to do bigger things, like changing the actual query plan on the fly while we're running. So today's class is going to be about vectorization, and the idea here is that we want to take the scalar algorithms that we discussed in the introductory class, where we operate on a single tuple at a time, and in some cases even a single operand at a time, and convert them into a vectorized form that relies on the SIMD instructions the CPU provides for us, so that we can run multiple operations within an operator or expression at the same time. So this is a notion of data parallelization: we're going to have multiple computations occurring at the same time on multiple pieces of data, and SIMD is the way we're going to achieve this. So why does this matter? Well, the same way that scaling out across multiple threads, processes, or nodes gives us additional performance because we're not restricted to what a single thread on a single core can do, with SIMD we can get an even bigger speedup, because SIMD also runs in parallel across multiple cores, and then the speedup we get is multiplicative: every core can run a data-parallel algorithm, and across all those cores they're all running at the same time. So let's say I'm on a machine that has 32 cores. Assuming I can scale out perfectly linearly, I can divide my task up into 32 discrete tasks, so that's a 32x speedup. And then if a portion of that computation — ignoring how we get tuples in and out for now — can run using SIMD and process four tuples at a time, then it's 32x times 4x.
So in theory, for this scenario here, we could get up to a 128x improvement in performance, and that's just on a single node. That's pretty significant. Anything that's at least an order of magnitude is a huge win; two orders of magnitude is almost unheard of. Now, we're never going to come close to this, because as I was saying, there's a bunch of stuff we have to do to get things in and out of the registers, in and out between operators, copying things from disk, sending things over the network. We're never even going to come close. In some best-case scenarios, when we look at some vectorized algorithms, you might get a 1.4x speedup if we're lucky. But that doesn't mean we shouldn't be doing this. So we covered this early in the semester — I did a quick preview of what SIMD actually is — so I don't want to spend too much time on it. Again, the classification of these instructions goes back to the 1960s. There was this thing called Flynn's taxonomy, where he described what SISD instructions are, SIMD, and I think MIMD as well. And I think at the time they were all theoretical — you couldn't have these things in the '60s — but obviously now in the 2020s these things have been around for quite a while, so we can exploit them and use them inside our database systems. So SIMD is a class of CPU instructions that allow the processor to perform the same operation on multiple pieces of data at the same time. And the way this works is that we rely on special SIMD registers as the way to get things into and out of these instructions. The overall goal, as we go through this, is that we want to keep things in the SIMD registers for as long as possible and do as much processing there as we can — the paper you guys read talked about how, with AVX-512, we can achieve this better than we used to — and only bring things back to the CPU cache or memory when we're done with whatever we're doing. So we're going to focus most of this lecture on AVX-512, but this slide is showing that every other ISA has its own variants of these. And in the case of Intel, it goes back to the 1990s, when they first put out the MMX stuff. Yes? Apart from ARM and RISC-V, with PowerPC, is that still a thing? The question is, is PowerPC still a thing? I mean, what do you mean, still a thing? Like, does it exist? Yes. Are people paying a lot of money for it? What's the market share of PowerPC for databases? I mean, pretty small. But there's enough legacy software running on some really old systems that needs to run on PowerPC. I mean, IMS is still out there. I don't know if this is still true, but I saw some reports saying IBM makes most of its money from IMS, more than from any other piece of software. And they invented that for the Apollo Moon mission in the '60s. There are all these banks that are still running on this stuff. Again, it's mission critical. You don't want to mess around with, yeah, let me just switch to something else, because that's a major engineering effort, and if it fails, your business is screwed. PowerPC has some other advantages over x86 for a variety of things. But if you were a brand-new startup today, would you use PowerPC? I mean, you can't even get it from any of the cloud vendors. So, okay, right, yes.
I mean, so again, this is just saying that there are other releases of SIMD instructions for different platforms and ISAs, not just the AVX stuff. But we're going to focus on AVX-512, because when Intel put it out, they added some additional things that make it better for database systems in a way we didn't have before — we used to have to emulate this stuff ourselves. All right, so this is the example I showed before. We want to do a simple operation: take two arrays, x and y, add them together, and produce a new array, z. If you write this in scalar code, using SISD instructions, you just have a for loop that iterates over every element of x and y, and then writes out to z. So you're literally going through each element of the two arrays one by one, running one add instruction, and then one store instruction to put the result into the output buffer z, right? And the compiler can be smart about this — it can unroll the loop to speed things up — but at the end of the day, it still has to execute a single instruction to add two numbers together and another to write the result out to a register or to memory. With SIMD, what we can do instead is take a vector of values. Assuming here we're using 32-bit numbers and four elements, that's a 128-bit register, and AVX-512 is going to have 512-bit registers, so we can put more things in there. So now it's one SIMD instruction to add the matching offsets across the two registers and produce a single output, and then the same thing for the next four elements. So what took eight addition instructions before, we can now do in two, right? This is why this is going to be important for databases, obviously: we're trying to get through columns and columns of billions of tuples, and we want to take advantage of this. So there are two types of vectorization we can have in our data system. The first is what is called horizontal vectorization. The idea is that you have some instruction that takes all the elements within a SIMD register and produces a single scalar output. Like, if I want the summation of all the elements within this four-lane register here, there's some instruction that can do that and produce a scalar output. Early CPUs didn't support this; it's mostly found in newer CPUs, at least on the x86 side in ones that can do AVX2, which is the precursor to AVX-512. But this is not going to be all that useful for the stuff we want to do in databases. The one we care about is vertical vectorization. The idea, again, is that we have two registers and they're lined up across lanes — assume the values are all fixed length and the same size — and we do one instruction that performs some operation on the combination of the two and produces a new output. This is way more common. This is the technique we're mostly going to be using in our databases going forward. But you could use the horizontal kind as well. Think of a summation: if I want to sum all the values in a column, you could use horizontal vectorization for that. So this is a table just showing that. Yes? The question is, is this used in a real system? I think so, yes.
Yeah, I think we have an example from ClickHouse — I think ClickHouse is doing this for summation. So this is a table showing the history of the different SIMD extensions that Intel has put out over the years. And again, the one we care about is at the bottom, AVX-512, which came out in 2017. The registers are 512 bits. It supports integers and single- and double-precision floating-point numbers. And then the big one, which you read about in the papers, is that it supports these permutations and predicate masks that allow us to specify which lanes an operation should actually apply to. Prior to AVX-512, this was something we had to do ourselves, basically by using a separate register to store a bit mask. Now, with AVX-512, there are explicit registers to do those things. So this link here will take you to a great presentation by James Reinders. He was an Intel fellow; it's from, I think, 2017 or so. He gives a good history of all these things, why this matters, and some of the cool things in AVX-512. So if you're interested in this kind of stuff, you can go check it out. All right, so as I said, AVX-512 is the one we care about, right? It's not that people weren't doing vectorization in databases before this; it just makes everything a lot easier. In addition to the new instructions, new data conversions, and scatter operations, which we'll cover in a second, the permutation stuff is the big one: being able to say, here's a bit mask that says the operation I'm going to apply should only occur in these lanes, right? The downside, though, is that unlike AVX2 and SSE2/3/4 — the earlier SIMD extensions to x86, which were all or nothing, meaning if my CPU said it supported AVX2, I got all the capabilities and instructions I would expect to have in AVX2 — for whatever reason, and it's an Intel thing, when AVX-512 came out, they broke it up into groups. So now, when you buy a processor, you have to go check the CPU flags to see what instructions it actually supports. And we'll see an example, again, from ClickHouse: they have if blocks in their code that say, am I compiling for AVX-512 with this group or that group? Because those have different instructions and different capabilities. So to give you an idea of how confusing this is, this is from Wikipedia, showing all the different groups you could have for AVX-512, and which iterations of the ISA, going back to the Xeon Phi, actually support them. As you can see, not everyone has everything. There's another chart here, from one of the papers I think, showing how these things have come out over time. Newer versions don't necessarily include everything from the earlier versions. So even though a system says it supports AVX-512, it has to go check what it actually has. Again, we'll look at ClickHouse in a second — they have if clauses in their source code that figure out what CPU capabilities are available. There's also another issue with AVX-512 that I'll get to in a bit, but I don't want to spoil it just yet. So even though I'm going to spend most of this time saying, hey, great, you can do this with AVX-512, keep in the back of your mind that you may not always be able to do this.
In some cases, you may actually run slower if you use AVX-512. I'll explain why in a second. All right, so how do we actually use this? There are basically three approaches. Do I want the compiler to figure out what it can vectorize? Do I want to give hints to the compiler about how to vectorize things? Or do I want to do the vectorization myself? The way to think about these three approaches is that the top one is the easiest to use, because you mostly don't have to think about it — sometimes you do, sometimes you don't, right? You just hope the compiler can figure out how to vectorize your algorithm. And if you design your database system in such a way that you break things up into small enough chunks that are just looping over arrays, then the compiler could potentially figure it out. But not always. Compiler hints are just giving a little nudge to the compiler to say, hey look, you really can vectorize this, I think you should, and hoping it figures it out. And the last one is where you write the actual instructions in your code to invoke the exact SIMD instructions you want. So let's go through these one by one. With automatic vectorization, the idea is that the compiler can potentially identify when certain instructions inside a tight loop can be rewritten as vectorized instructions. My example from the very beginning — iterating over two arrays and adding them together — is something the compiler obviously should be able to figure out. But this only works for simple loops, and in database systems it doesn't always pan out. This has gotten better: GCC and Clang, and certainly ICC, have gotten a lot better at figuring these things out without hints. But maybe five years ago, this was an issue. And obviously, if the CPU you're compiling on doesn't have certain SIMD instructions, the compiler is not going to try to use them. So if you compile on your laptop that doesn't have AVX-512, and you take that binary and plop it onto your enterprise-grade Xeon server, even though the Xeon server has AVX-512, the binary was compiled without it. So you've got to be mindful of where you're actually compiling and running things. So this is our example from before, where we're now passing in pointers to arrays X, Y, and Z, and we loop over them up to some max value and add them together, right? Can we auto-vectorize this? She's shaking her head. Raise your hand if you think yes. Raise your hand if you think no. Why no? You need restrict? Well, he says you need restrict. What does that mean? Why? Yes — he said if the pointers ever alias, then there's a dependency. So again, think about compile time: do I know what the pointers X, Y, and Z are pointing to? No, right? That's a runtime thing. So in this case, the compiler is going to say, hey, X, Y, and Z could actually be pointing to the same thing, so I can't vectorize this. Because let's say that Z is just one element past the memory address of X. Now if I'm ripping through my code at runtime, in the scalar version, one iteration of the loop will overwrite what the next value should be, and so the next iteration will produce a different computation.
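To make that concrete, here's a tiny sketch of the kind of overlap being described — the one-element offset and the names are my example, not the slide's:

```cpp
#include <cstddef>

int main() {
  // Hypothetical overlap: z aliases x, shifted by one element.
  int buf[5] = {1, 1, 1, 1, 1};
  int* x = buf;      // reads buf[0..3]
  int* y = buf;      // reads buf[0..3]
  int* z = buf + 1;  // writes buf[1..4]

  // Scalar semantics: iteration i reads the value iteration i-1 just
  // wrote, so results feed forward (buf becomes 1,2,4,8,16). A 4-wide
  // SIMD add would read all four x lanes *before* any store happens
  // and produce a different answer (1,2,2,2,2).
  for (size_t i = 0; i < 4; i++) {
    z[i] = x[i] + y[i];
  }
  return 0;
}
```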
But if I vectorize that with SIMD, then when I do the computation of the second iteration, it won't see the effects of the first iteration, so it will actually produce a different result. The compiler is going to be very, very careful to make sure that if it vectorizes your code, it doesn't produce something that generates a different value or different computation than the scalar code would have. Yes? Can't the compiler do loop unrolling and then auto-vectorize that? But again, you don't know what Z is actually pointing to, right? So it's going to be very conservative; it wants to avoid any kind of problem. So in this case, it's going to say, I don't know what X, Y, and Z are actually pointing to, so I can't vectorize this. Sorry, question? Is it the same for Rust — you can't do that? Yes, we'll get to that in a second. For now we're in C++ with Clang, OK? All right, so Patrick said you could use the restrict keyword, and that's an example of a compiler hint. It's us as programmers telling the compiler something about our code to make it more likely to auto-vectorize. The restrict keyword, which we'll see in a second, is an example of giving explicit information about memory locations to say: these things can't overlap, they're not going to change while this loop is running, therefore you can auto-vectorize it. A more brute-force approach is to just tell the compiler, hey, turn off any checks for dependencies or aliasing here and just vectorize it, trust me — driving without a seatbelt. So going back to the function from before: if you add the restrict keyword — which is in C99, not in the C++ standard, though pretty much every C++ compiler supports it — that's telling the compiler that these arrays are distinct memory locations, and for the lifetime of the pointers they're not going to overlap, at least within this function. Therefore it knows it's safe to vectorize this. This approach is widely used. If you go looking in DuckDB, just search for restrict — in C++ it's written with underscores, __restrict__ — and you'll see all these functions set up to do this kind of stuff, right? The goal here is that DuckDB wants the compiler to figure out how to auto-vectorize this, so it's passing that hint to it. One more point here: you can see two versions of this check. If not everything is valid, then I have to check my bit mask to see whether each entry is valid; if I know everything is valid, I can skip that extra check, right? We saw that sort of technique when checking for nulls in Velox. So even though there's a conditional here, it's worth it to avoid that additional per-row check. This is ClickHouse — ClickHouse does the same thing up above. This is code to do an aggregate sum computation, which I think would be horizontal vectorization. Again, you see this __restrict on the pointer. But then they have this other beast in here: they're actually checking which AVX-512 group the CPU has, and they have different implementations of the computation for each. Yes? The question is, shouldn't you be able to #ifdef this? These are all macros too, yeah. Oh, they're all macros. Yeah, these are all crazy macros that auto-generate the code.
I don't know — I think that's probably also an #ifdef. And up above, you have this use-multitarget-code macro. No, it could still be dead code. Yeah, it would be dead code, yes. Oh, that would be true. Yeah. But anyway, the point being, this is a good example of: there are multiple versions of AVX-512, there's AVX2, there's SSE4, the precursor of AVX2, right? So the main thing to take away here is that it's not just "AVX-512" — no, you have to check what group you actually have, right? All right, so restrict is probably the most common hint. An alternative is to use these pragmas: IVDEP, which basically means ignore vector dependencies, or vectorization dependencies. OpenMP, the big parallelization framework, has #pragma omp simd — there are different versions of this. These basically say: ignore your aliasing checks when you auto-vectorize this, right? And you end up with the same thing. Again, it's up to the programmer to make sure this is done correctly, because the compiler will just do whatever you tell it to do. The last alternative is to do explicit vectorization. For this one, we're going to rely on what are called intrinsics, or CPU intrinsics. Think of an intrinsic as a sort of fake function in your C++ code: it looks like a function, but it has an underscore or a double underscore in front of it, and it really translates into the exact SIMD instruction that you want the compiler to emit for that line of code, right? That's how you explicitly call the SIMD operation you want, or put things into the exact registers you want to touch, and so forth. If you want exact control in your database system, this is what you need to use. And talking to friends in industry, this is what BigQuery does, this is what Redshift does, and some other systems. In that environment, because they're hosted database systems, they control the hardware — they know what VMs they're running on in the cloud, so they can make that choice, and they're not trying to run on PowerPC, for example. But obviously, if you use an x86 intrinsic, you can't run on ARM or some other CPU. Now, there are some libraries that can hide these SIMD intrinsics behind a portable interface and have ways to step down to smaller register sizes as needed, based on what groups or extensions you support. Google Highway is probably the most common one. I don't know of any database system that actually uses it — I guess we could grab the source code of the open-source ones and figure it out. LibSimd is another one that's widely used, though again, I'm not sure about inside databases. Rust has its own SIMD library, but I think it's only available on experimental nightly builds; I've never used it. The student that was here before says he lets auto-vectorization handle everything in Rust, because the compiler is in better shape to understand whether things will collide, since there's more explicit control over memory locations. So if you were going to use intrinsics without one of these libraries, it would essentially look like this. You have these types with a double underscore, then some prefix for what group of SIMD extensions you're using, then the size of the register you want, and "i" means you're storing integers.
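Roughly, a minimal sketch of that kind of loop using 128-bit SSE2 intrinsics — four 32-bit lanes; the function name and the assumption that the size is a multiple of four are mine:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Minimal sketch: add two int arrays four lanes at a time.
// Assumes size is a multiple of 4 and the arrays are 16-byte aligned.
void simd_add(int* x, int* y, int* z, size_t size) {
  // Cast the plain int arrays to vectors of four 32-bit integers.
  __m128i* xv = reinterpret_cast<__m128i*>(x);
  __m128i* yv = reinterpret_cast<__m128i*>(y);
  __m128i* zv = reinterpret_cast<__m128i*>(z);
  for (size_t i = 0; i < size / 4; i++) {
    // One instruction adds all four lanes at once.
    zv[i] = _mm_add_epi32(xv[i], yv[i]);
  }
}
```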
So all we're doing here is casting the integer vectors we were given, putting them into the SIMD registers, and then doing our loop with a SIMD addition and storing the result into the output vector we want. And you can see that in our loop we're doing four additions at the same time, so we divide the number of iterations we would have had by four. So this is roughly what it looks like. So which one do you think is the best? Explicit? I mean, most control, but maybe not the best — let me rephrase that: what is the best for performance? Explicit. What's the easiest to write? Not that one. Yes, as I said before. So let's see what performance difference you get from explicitly writing vectorized code. This is the paper we did with the Germans a few years ago, where we compared the vectorized approach to query processing against the HyPer approach, which we'll cover next class. The student in Germany wrote one system that supported both of these techniques, and we did a bake-off between the two of them. The idea is to strip out all the extra stuff that differentiates Vectorwise or HyPer or any other system, and get them down to a common substrate, to the extent you can, for these two approaches. That way you have a pure apples-to-apples comparison between the different approaches. Because otherwise other things come up — like the way HyPer does numerics is different from Vectorwise, and for some queries in TPC-H that would actually make a big difference. So it was a single testbed system that did both Vectorwise and HyPer, which we'll cover next class. And we just wanted to measure how well a compiler can auto-vectorize a bunch of the Vectorwise primitives. Again, think of a primitive as a single function that takes an array or vector of tuples of a certain type and runs a check like: is something less than something, is something greater than something. How well could it vectorize those small loops of code, much like what I showed before? And so we compared Clang, GCC, and ICC, which is Intel's compiler. ICC is not free, it's not open source, but it was obviously way better at auto-vectorizing, at least a few years ago, than GCC and Clang were. Again, GCC and Clang have gotten better, but at the time, ICC was much, much better. And you pay for that — Intel controls the hardware, so they can obviously write really good compilers for it. So we're going to do a comparison on hashing, selection, and projection. There are some other operations you have to run for the full query that we didn't vectorize, because you can't. All right, so this is running a select number of TPC-H queries. The first bar here is complete auto-vectorization: let the compiler do everything. The black bar is if you just do it by hand. And the red bar is the combination: let the compiler auto-vectorize everything, then go check which functions didn't get auto-vectorized, and go back and do those manually. And what we're measuring here is the reduction in the number of instructions versus a scalar approach, where you don't let the compiler do any vectorization, right? The main takeaway here is that, in this case, higher is better.
But there are some cases where the manual one, for whatever reason — because it was so complicated to actually write — didn't always get as big a reduction in the number of instructions. But the combination of letting the compiler do what it does and then going back as a human and cleaning things up was actually the best approach for all of these, right? That's not actually feasible, right? Like, you don't know the queries before you develop it, unless you're specifically optimizing for TPC-H. So the statement is, this wouldn't work in a real system, because you don't know the queries ahead of time. But again, we're trying to vectorize the primitives, and they're not specific to any one query. Like: take a column of integers, check whether each number is less than a single value. That's what we were auto-vectorizing. It wasn't hard-coded exactly for Q1, Q6, and so forth, right? So technically it was still a general-purpose system; we were just trying to auto-vectorize the actual low-level primitives within it. Yes? Q6 was the worst. Yeah, I forget why that was the case. Again, I think it's in the paper; I forget why. Yes? Why is Q6 worse than Q3? I'd have to go look — I read the paper a while ago, I'd have to go check. It doesn't matter. Yeah, it might be a typo; maybe this is really Q3 and this is Q6, but it doesn't matter. So the key takeaway is that you should do automatic and manual? Yeah, the main takeaway is that you should do both. Yes? So, was it more that you took away some of the manual instructions, or did you add or change things? The question is, what is auto plus manual? You auto-vectorize everything. Then you look at what was actually generated in the assembly, figure out which functions and primitives were not auto-vectorized, then go back and rewrite the actual C++ code to put in the intrinsics. I mean, they're Germans, so I'm assuming it's done right. And again, I'd have to go look at what exactly this query was. In theory, you could have also looked at what the compiler vectorized things to, and then written the equivalent intrinsics for that, but I don't think he did that. I think the idea was: okay, if you bring in a German who knows what they're doing, how well can they do implementing it themselves? Again, we can go look at the results; we can go into more detail next class. The reason I didn't have you guys read this paper is that it's compilation plus vectorization at the same time in one single evaluation. I cherry-picked this result because it's focused just on vectorization. I wanted to cover compilation first, and then we can talk about it a little more. So I can follow up and figure out what's actually going on here — everything's open source online as well, so we can check out what happened. All right, but now we can check what the performance difference actually is. In this case, we're measuring the reduction in time for the system running these queries between the different implementations. It's all relative to the scalar implementation. So if you're above zero, it's faster; if you're below zero, it's worse.
And you can see in some cases here, especially for Q6, that even though the code he wrote by hand had more instructions, it was actually faster than the combination of auto-vectorization and manual. Or in the case of Q3, going back, the combination reduced the number of instructions, but in such a way that it was actually slower. Manual is the best, though? Manual is always the best, yes. But the point I'm trying to make is that writing that is hard. So if you have the capability to do it — if you have a German in-house, or you can just spend the time doing it — then manual is probably what you want. I forget what percentage of it he actually had to go touch up. But if you spend the time and effort, you can get there, because it's almost equivalent to writing assembly, and that will beat any compiler. Manual is the last bar, explicit vectorization calling intrinsics. This one is compiler hints. And this one is compiler hints, and then whatever doesn't get vectorized, you go in and put in intrinsics. Yes? We'll cover this in a second, yes. Does anyone know why? There's a footnote in the paper you guys read that explains it; we'll get to it at the end. Is it the same reason why most compilers default to -O2 instead of -O3? Bingo, that's it. So he said it's because the newer versions downclock, or downcycle, the CPU. When you call AVX-512 instructions, they turn down the clock speed, right? And some compilers will actually not auto-vectorize to AVX-512 but always choose AVX2, because of this exact reason. Because of heating issues, yes. It's heating issues, if I understand correctly, yes. In the early versions of SIMD, like the MMX stuff in the '90s, it was literally the case that you would run scalar instructions, but when you called SIMD instructions, it would stop all the SISD instructions, switch over to SIMD mode, run that, then switch back. Now with superscalar architectures, we can run these things in parallel. But as I said, I don't know whether it's all AVX-512 instructions, but at least for enough of them, the CPU gets downclocked. I think the current crop of Intel x86 CPUs all have this issue. And so Intel — we'll cover this at the end — actually turns off AVX-512, fuses it off, on consumer-grade CPUs, because they don't want people to get downclocked and think the CPU is running slower than it should. Yes? The question is, does AMD also do this? I don't think AMD has AVX-512. They do? The new ones? Okay. I don't know whether those get downclocked. Why the heat? Yeah — he uses the scientific term: SIMD is "doing a lot of stuff." Is it because they're doing a lot of stuff? Yes. This is not a class about Intel's design decisions; I don't know the answer, I'm only telling you what you can read. Sorry, yes. I'd have to double-check; they might be using AVX2, I don't know. We can cover this in class. Okay. Again, this is not "let's bash on Intel." But this is what I said in the beginning: just because it's there doesn't mean it's always going to work, and in some cases AVX2 is going to be better, because it won't have that downclocking issue. All right.
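As a rough illustration of the kind of runtime check this fragmentation forces on a system, here's a minimal sketch using the GCC/Clang-specific __builtin_cpu_supports builtin — the particular groups queried are just examples:

```cpp
#include <cstdio>

int main() {
  // Query the CPU at runtime for individual AVX-512 groups; a system
  // would use checks like these to dispatch to different code paths.
  if (__builtin_cpu_supports("avx512f")) {
    std::printf("AVX-512 Foundation available\n");
  }
  if (__builtin_cpu_supports("avx512bw")) {
    std::printf("AVX-512 Byte/Word group available\n");
  }
  if (__builtin_cpu_supports("avx2")) {
    std::printf("AVX2 available (fallback path)\n");
  }
  return 0;
}
```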
So now, let's go through the primitives that we're going to use as building blocks, which we can put together into more complex functionality to actually start running queries. This is a combination of what was in the paper you guys read and an earlier paper that I'll cover a little bit as well. These are the basic primitives that SIMD provides for us, which we can then compose into the larger database operations or algorithms that we need. So the big one, as I said, that AVX-512 added — ignoring the downclocking issue — is that all the instructions now have these predication variants where you can pass in a bit mask that says which lanes you want the operation applied to. Prior to AVX-512 you could do this, but you had to use one of the regular SIMD registers available to you to hold the mask. Now there are specialized registers just for the bit masks. The number of vector registers in the latest version is 32 — so we're not talking thousands and thousands of registers; going to 32 in AVX-512 is a lot, I think it used to be 16. So there's more available to us, but it's still not infinite. So the idea is: say I have two vectors here that I want to do some operation on, and think of the offsets as lining up across the lanes. And say I have this bit mask here, with some lanes set to 1. That says: whatever my instruction's output would be, only apply it in the lanes where the mask is actually set to 1. So say I'm just doing addition: the output would be 3 plus 2 and 3 plus 2, producing the 5s here. And for the lanes where the mask is 0, you pass in this merge-source register, which is just used to fill in a value where the zeros are. You can put any value in. There's also the zero-masking variant, where you don't pass this explicit register and it just puts zeros in those lanes. So that's the basic idea. And this bit mask is something we can generate, because in some of our algorithms, when we apply filters, the ones and zeros correspond to which tuples or offsets actually satisfy the predicate. So that's the basic construct we can carry along in our operations to determine whether a tuple is even valid or not. So the first thing we want to do is permute. The idea here is that we want to copy values from an input vector, specified at some offset, to some other destination vector. Prior to AVX-512, the way you had to do this was to take things out of the vectors, put them into memory, and then put them back into the vectors in the right order. Now, with AVX-512, we can do all of this directly register to register. And that's way faster — we don't pollute the CPU caches or slow things down. So the idea here is: here's our input vector, and here's the index vector that says where we're going to write things to. In this case, the first index value points to this position here: we want the value in the input vector at offset 3, which is d, so that gets written here. I don't know why the arrows aren't lining up, but it does this all down the line and populates the output. And again, that's all done in a single instruction, even though I'm showing it in different steps in PowerPoint. The next one we have is a selective load.
The idea here is that we want to take some contents we have in memory and load them into a vector register. Yeah, wow, I don't know why these aren't lining up — that's weird, whatever. So again, we have our mask. What's going to happen is that in this first position here, where the mask is zero, it just skips — it doesn't overwrite whatever is in the vector right now. It only loads where the ones are. You give it the starting location of the input memory buffer, and every time it sees a one, it writes the next value from memory and increments over by one. So in this case, we write u to the second slot, we skip this one here and leave it alone, and then go to the next one and write v into that slot. Again, this all happens within a single instruction. Selective store goes in the opposite direction. The top is our target — we want to write out into memory. Same thing going across: we skip the zero, the lane with a one gets written to the first memory position, then we skip that zero, and that one is written to the second position. And then we're done. So this is how we get things into the registers and then out of the registers. But we can do more than blind copies — we can be clever about how we write things. We can use compress to move things across the different vectors in different ways. In this case, our target vector is the value vector at the top; we have the input vector and then this mask vector. The idea is that wherever there's a one, we write the corresponding value out to the next position at the top. So same thing here: we write the d up to that first position, and everything else is just left as zeros. We're basically compressing down whatever is in our input vector, filling in from the beginning until we run out of space or have no more items to put in. Expand is the reverse. We have the one here, so the first value within our input vector gets written to that position; same thing with the next one over there, and the rest are all just zeros. So that's taking what was potentially compressed on this side by the compress operation and expanding it back out to its original form. Again, they're just reverses of each other. Then we can do selective gather and scatter, and the idea here is: how do we actually get the specific things we want out of memory into the registers, or from the registers back into memory? For the gather, I take the offset given by the index vector, jump to that offset in memory, and write the value there into the corresponding lane. So two would be this position here, and that's written to the first lane, and so forth for all the other ones. So you're basically picking values from wherever they sit in memory and lining them up the way you want inside the vector. Selective scatter is the reverse. We're taking a value vector and specifying what memory locations we want to write things into. So in this case, the index vector wants to write to position two: we take the value in the first lane, a, and it gets written to memory position two. Now, I don't know whether they require the memory locations you're writing to to fit within a single cache line — there are alignment issues.
I think the hardware takes care of all that for you, because this index vector can't be a million elements — you're not going to be writing out to all different locations of memory. This thing roughly has to fit into a single cache line, because with L1 you can do, I think, one or two loads and stores per cycle, and you obviously don't want to spend a lot of cycles just taking things out of the vector and putting them back. So again, these are the basic constructs. I'm going through them quickly just to show that there are ways to pass in these bit masks or index vectors to specify where you want things to go and where you want things to come from when you move data between the vectors and memory. So now, how do we actually put this into the database system? I'm going to go through some basic operations that we can build using SIMD vectorization, and in most cases we're almost always going to favor vertical vectorization: different tuples within the different lanes of our SIMD register, so that we can process them in parallel. Again, horizontal vectorization would be something like summing all the values within a vector, or doing a string comparison where one long string is broken up across different lanes — we're going to ignore all of that. And our goal here is to maximize lane utilization, meaning we don't want the computations in our SIMD instructions to operate on tuples we know have already been invalidated or discarded. If something does not evaluate to true, it doesn't make sense to do a bunch of more expensive computations on it; ideally, we can fill that lane in with something else that's useful. The paper you read talked about that, and we'll see some other ways to do it as well. All right, so we'll first talk about the basic selection scan. Then we'll talk about how to do vector refill, then two variants of hash tables for joins, and then — this is not from the paper you guys read — partitioning histograms. That one is a really simple idea that I think is pretty clever, and it comes from this paper out of Columbia, from 2015, from some researchers there. I used to have students read it, but you don't read it anymore; you read the German one instead, because this one makes a bunch of assumptions that aren't realistic — it was 2015, so they assumed all your values were 32 bits and that your pointers were always 32 bits, which in real workloads with real data is not always true, and they also assumed that everything fits in the L3 cache, which obviously is also not always true. All right, so let's go back to how to do a basic selection scan operation. This is the code I showed before for a branchless scan: we always copy whatever tuple we're given to the output buffer, but then we run this check here, and whether it evaluates to 0 or 1 after we AND the comparisons together determines whether we move our output offset up by one. There are no if clauses in SIMD, so we can't run the if/else version of this code — we basically always have to run this one. So the way to vectorize it is pretty easy, right? Because now, instead of getting a single tuple, I'm getting a vector of tuples.
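To make that concrete, here's a minimal sketch of the kind of primitive this becomes with AVX-512 mask intrinsics — using 16 32-bit integer keys rather than the characters on the slide, and with names of my own choosing:

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: evaluate (low <= key <= high) over 16 keys at once with
// AVX-512F; returns a 16-bit mask of the lanes that qualify.
__mmask16 range_predicate(const int32_t* keys, int32_t low, int32_t high) {
  __m512i k  = _mm512_loadu_si512(keys);           // load 16 keys
  __m512i lo = _mm512_set1_epi32(low);             // broadcast constants
  __m512i hi = _mm512_set1_epi32(high);
  __mmask16 ge = _mm512_cmpge_epi32_mask(k, lo);   // key >= low
  __mmask16 le = _mm512_cmple_epi32_mask(k, hi);   // key <= high
  return ge & le;                                  // AND the two masks
}
```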
I load the keys I want to evaluate into some SIMD vector — or SIMD register; I'm not specifying what size, it doesn't matter. Then I can run the comparison operations on the keys, which produce bitmasks that I can then AND together, and that determines whether a tuple satisfies the predicate or not. Again, I'm not showing you the code that makes sure we remove things when we come back around for the second iteration; we can ignore that for now, right? So, again, this is me walking through what I just said — skip all this. Instead of using placeholders like low and high, let's use real values and some real data here, right? Think of this as eight tuples, where the key is a single character — dictionary codes, it doesn't matter. It's not one string spread across the elements; each tuple has a single one-character string value. So to do this in SIMD, you would first do the SIMD compare, and that's the first step here: is the value of a given key greater than or equal to the low value? That's a single SIMD compare instruction that produces a bitmask. Then I run the second half of the comparison and produce another bitmask: is the key less than or equal to the high value, the letter U? That produces another bitmask. So now I have two bitmasks sitting in mask registers, and I can run a single SIMD AND instruction on those two bitmasks. It produces a new bitmask, and that tells me: here are all the tuples that actually satisfy the predicate. And then, if I want to convert that back into which offsets in my input vector were set to true, I can pass in a sequence, 0 to 7, and for any bit that's set to 1, do a SIMD compress operation to produce a single SIMD register that has these offset values here, right? There are other tricks you can do. For instance, if I can run a popcount-style instruction to determine how many ones I actually have in any of these bitmasks, and they're all set to 0, then I can bail out and skip the other steps. Yes? That's all offsets, not a bitmask, right? Yes — it's all offsets; the output is just a register. And there are SIMD instructions that can do this conversion directly; I'm just showing it visually. So, how do you actually implement this in a real system? Again, if you take the vectorized approach, which we'll cover more in the next class, you would have an explicit function that takes an input column of eight elements — or some number of elements of a certain type — and runs the greater-than-or-equal-to comparison operator against a given constant. You invoke that function with the pointer to the column and the constant value, and it just loops through and does the comparison one by one. The compiler can then auto-vectorize that: put the data you're comparing into a SIMD register, run the SIMD compare, and take the output. Yes? Doesn't the selective store take a bitmask and simply store things into memory? Why do you have to do the compress step? The question is, doesn't the selective store take a bitmask and store things into memory where you want them?
I'm just showing you how to take this and convert it into a position list. Actually, he brought up a question earlier: how could I generate all the primitives for all possible variations of WHERE clauses? And this is a good example where maybe auto-vectorization isn't exactly what we want. Because this primitive that does the evaluation — if it produces the matched-offset list, what I really want is the bitmask, so that I can take the two outputs and run the SIMD AND myself. So there are going to be variations of the primitives: sometimes you want to produce the matched-offset list immediately, and other times you actually want the bitmask out, because you'll feed it into some other operation that takes two bitmasks and ANDs them together. So how to auto-vectorize all of this is actually not trivial — you have to figure out how to compose these operations together based on what additional things you need to do in the query. Again, we'll cover more of that next class. So we can now go back to that paper I mentioned before, from the Germans, plus me — and Peter Boncz; he's Dutch, the Vectorwise guy. Now we can run their testbed version of Vectorwise, and it uses AVX-512 for everything, because it's easier with the bitmask registers to do vertical vectorization. So I'm going to show results for three different operations within a scan: hashing — computing the hash for something that would go into a hash table, without actually putting it in — a gather operation, and then join probing. And then we can see how much the SIMD stuff helps over scalar instructions. Again, the point is to strip out the rest of the system and ask: for the core algorithms of the scan operations and other things in a query plan, how much does SIMD help? And what you see across hashing, gather, and join probing is that if you vectorize, you get a bigger win for hashing than for the join, relative to the scalar version — up to a 2.3x improvement in performance. But again, that's doing only the bare minimum you need within that scan operation: the hashing or the join probing. When you put it into the rest of the system, and you start worrying about getting data in and out of the registers, materializing results, going from one operator to the next, then you see the performance difference is not that significant anymore. In a full query, the difference between the scalar operations and the vectorized ones is actually not that much. And this is the best-case scenario: hand-written code, everything's in memory. I forget whether it's scale factor one — most of it is going to fit in the CPU cache; it's not that big. So what gives, right? Well, what's going on is Amdahl's law: what portion of the query is actually the part that can be vectorized for the biggest win? It's not all of it; it's not even a sizable chunk. So you're only going to get maybe a 10% bump for vectorizing that one small piece of the code. It's all the materialization overhead that slows us down, and that you can't vectorize. So this is somewhat deflating — after I just spent the entire lecture on how great vectorization is and how much it can help.
But, you know, it doesn't actually make a big difference when you run a full query. That's true for a lot of things in databases. But these things are cumulative. You can build the greatest query optimizer, but if your query engine sucks, it's going to run slow. And if you have an amazingly fast query engine but a bad query plan, it's going to run slow. All the lectures put together is what you need to get things to run fast — to get that order-of-magnitude performance difference. Okay. So one of the big problems in the paper you guys read — they spent, I think, two sections on it — is the problem of lane underutilization, where you have some lanes containing tuples that have been invalidated or should be discarded. But because we don't want to constantly move things in and out of the registers, you may have to continue processing dead tuples, so to speak, and you're essentially wasting resources. So the situation would be something like this. Say I have a query: select count(*) from table where age is greater than 20. And in my pseudo-code for this — I realize this is the branching version rather than the branchless one, but for now it's fine — as I'm scanning along the table, I may have a bunch of tuples here that get invalidated, and I don't want to include them in my aggregation. If it's scalar code, no big deal, right? I just loop back around and get the next tuple. But with vectorized code, I may have four, eight, twelve tuples in my vector, and some of them might not satisfy this predicate, yet now they're going to be strung along in my vectors. So think of this piece right here as the first stage of the pipeline, and the second stage is this piece here. We want to avoid passing dead tuples along through this, right? And so the idea I'm going to show you is that instead of having the materialization point be at the pipeline breaker, we can actually introduce artificial pipeline breakers, or synthetic pipeline breakers, where we materialize some results, go back in our loop, get more data, and keep filling up this mini buffer, if you will. Then, once that's filled up, we know none of the tuples in it are dead — they're all useful — and we can proceed to the rest of the computation in the pipeline. So this is a paper — I think it's citation 16 in the paper you guys read — that we worked on here with my PhD student Prashanth; he's now working on the Photon vectorized engine at Databricks. The idea is that we decompose pipelines into sub-stages that operate on vectors of tuples, with vectorized processing using SIMD when possible. But then we can start staging things in buffers, fill up a SIMD register's worth, and then move on to the next stage, so we don't have wasted computation, wasted resources. It's called relaxed operator fusion because you're taking the operator fusion approach from the HyPer guys and relaxing it a little by introducing these breakpoints. So the first thing is to figure out the vectorization candidates: I want to vectorize the filter operation, but before I do the aggregation step, I want to materialize some results and make sure the buffer gets filled up; then I can do the aggregation computation using SIMD and vectorize that without worrying about throwing away unneeded results.
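Here's a rough scalar sketch of that staging idea — a minimal rendering of relaxed operator fusion for the count query above, where the buffer size, the names, and the branchless-append trick are my own choices:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr size_t kVectorSize = 8;  // hypothetical vector width

// Stage 1 filters tuples into a staging buffer; stage 2 (the aggregate)
// only fires on a full buffer, so every lane it consumes is live.
void scan_count(const int32_t* age, size_t n, int64_t& count) {
  int32_t buffer[2 * kVectorSize];  // staging buffer between sub-stages
  size_t buffered = 0;

  for (size_t i = 0; i < n; i += kVectorSize) {
    size_t end = std::min(i + kVectorSize, n);
    // Stage 1: branchless filter; append qualifying tuples to the buffer.
    for (size_t j = i; j < end; j++) {
      buffer[buffered] = age[j];
      buffered += (age[j] > 20);
    }
    // Stage 2: only runs once we have a full vector of survivors.
    if (buffered >= kVectorSize) {
      count += kVectorSize;  // stand-in for a SIMD aggregate over full lanes
      // Slide any overflow back to the front of the buffer.
      std::copy(buffer + kVectorSize, buffer + buffered, buffer);
      buffered -= kVectorSize;
    }
  }
  count += buffered;  // drain whatever is left over at the end
}
```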
So the code basically looks like this. I'm scanning through a vector of tuples at a time and doing my comparison. If the buffer gets full, then I go to the next stage within my pipeline and do the aggregation; otherwise, I loop back around and get the next batch. So this buffer incrementally fills up with values, and then I can fire off the next stage, again in a vectorized manner. This is the first part here, and this is the second part here, and obviously we emit at the end. One of the tricks we figured out with this, though, is that because you have this staging point in a really tight loop, you can actually start doing software prefetching. There's hardware prefetching, where the CPU tries to figure out what pieces of memory you're going to need next and starts bringing them into the CPU cache — if you're scanning along some long stride of memory, it brings in cache lines ahead of what you actually need. But on x86 you can also pass hints to the CPU and say, hey, I'm going to need this memory region pretty soon. It's not required to actually obey your request — it's a hint — but in some cases it can make a big, big difference. And this staging stuff, because you're breaking the pipeline up into sub-stages, gives you a nice natural boundary for prefetching operations. So, jumping ahead a bit to the query compilation stuff we talked about before: this is showing that if you do holistic query compilation the same way HyPer does, which we'll read about next class, but you also introduce these relaxed operator fusion stages, you can get pretty good performance. In this case here, software prefetching doesn't help, because there's no join — there wasn't really a good place to say, okay, let me go prefetch. But over here it does make a big difference, because query 19 can be broken up into these sub-stages. Q1 has high selectivity, so you're not discarding anything; it's basically taking everything. OK, let me skip ahead — I want to get to the hash tables. But basically, this is the old Peloton system. Our interpreter engine was total crap — it was garbage — and we converted it to compilation, so you got this amount of improvement. And then adding relaxed operator fusion with SIMD, and relaxed operator fusion with SIMD plus prefetching, you get a pretty significant win. Again, this will be next week. But let me say this: don't get the impression that, oh, if we switch to compilation, you get a 97% improvement. This is crappy student code versus high-end Prashanth code — he's now at Databricks — and that will get you 97% any day of the week. The thing I really care about is down here: you can still get a pretty significant bump by introducing these stages and vectorizing as much as possible. The newer versions of HyPer, and Umbra, which came after this paper, can actually use SIMD and vectorization. But at the time, in 2016 or 2017, they didn't support that, because they were doing push-based execution with complete compilation of the queries. All right. So this is one way of making sure that we're always fully utilizing our vectors. But again, we did this before AVX-512. And in the paper you guys read, they called this the materialization approach.
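For reference, the software-prefetch hint mentioned above is exposed on x86 as the _mm_prefetch intrinsic. A minimal sketch of issuing it a fixed distance ahead while probing — the table layout, hash-to-bucket mapping, and names here are my assumptions:

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>
#include <cstdint>

void probe_with_prefetch(const uint64_t* hashes, size_t n,
                         const char* table, size_t bucket_size,
                         size_t num_buckets) {
  constexpr size_t kDistance = 8;  // how many probes ahead to hint
  for (size_t i = 0; i < n; i++) {
    if (i + kDistance < n) {
      // A hint, not a command: the CPU is free to ignore it.
      const char* next =
          table + (hashes[i + kDistance] % num_buckets) * bucket_size;
      _mm_prefetch(next, _MM_HINT_T0);
    }
    // ... probe the bucket for hashes[i] here ...
  }
}
```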
They also discussed two different algorithms you could use that try to be clever about deciding when to go back and get more tuples from the operator below you. I think somebody asked a question about this, and I said most systems don't do this, but this is one way to do it. And the challenge, of course, is the bookkeeping to keep track of where you left off in the operator below you and where you can write results into. And they can do this with AVX-512 because there are a lot more registers now. So the idea is that while my operator is running, if I realize that I have under-utilized lanes, I can just leave all that data in that register, go execute another part of the query and have that write to other registers, and then once that thing gets full, I can combine the two of them together. At a high level, that's what they're doing with these refill algorithms. The question is whether you go get more tuples within your own operator by iterating over the loop again, or whether you jump out of that operator, go below it in the query plan, and let the operator below you start producing tuples up the query plan. The buffered one is the one where you stay in the same operator, and the idea is that you use additional registers to stage results, so the next iteration doesn't overwrite them; it writes into another register, and once that gets full, you can use some of the instructions to combine them. The partial one is where you basically spill out all the results of the current operator to a bunch of registers, and once those are full, you combine them together. You'd think the top one is simpler, because it's just: let me call next on the loop within my own operator, and make sure I don't write to the register I wrote to before; I don't need to keep track of where I'm writing other than that. With the other one, you're trying to be clever: I know there are things up above that I could write into, but I can't right now because those lanes would be invalidated. So you're trying to fill things in at a more fine-grained level. So again, other than Umbra, I don't think anybody actually does this. I think everyone just naively carries along the unused lanes, carries along the dead tuples, and then filters them out at the end. Okay, so far we've covered selection scans and vector refills. I want to quickly go through two variations of hash tables and then finish up with partitioning histograms. So with hash tables, the challenge is that we have this data structure that is not really SIMD friendly, because it's this long stride of memory, but we need to be able to do comparisons on contiguous regions of memory, not on different lanes that contain different elements at the same time. The scalar approach would be: you have some input key, you hash it with some hash function, you jump to that offset, and then you do a linear scan looking at all the keys within the hash table until you find your match, and then you're done. The way to use horizontal vectorization to make this run faster is that within each slot in the hash table we're actually going to store four keys with four corresponding values.
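A minimal sketch of that horizontal probe, under the assumption of a bucketized table with four 64-bit keys per slot (this layout is for illustration, not the paper's exact structure): broadcast the probe key into all four lanes and compare against the whole bucket in one instruction.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

struct Bucket4 {
    alignas(32) uint64_t keys[4];  // four keys per slot
    uint64_t values[4];
};

// Returns the matching value through *out; false if the key is absent.
bool HorizontalProbe(const Bucket4 *buckets, size_t num_buckets,
                     uint64_t key, uint64_t hash, uint64_t *out) {
    size_t slot = hash & (num_buckets - 1);  // assumes power-of-two table size
    __m256i probe = _mm256_set1_epi64x(static_cast<int64_t>(key));  // 4 copies
    for (;;) {
        __m256i stored = _mm256_load_si256(
            reinterpret_cast<const __m256i *>(buckets[slot].keys));
        __m256i eq = _mm256_cmpeq_epi64(stored, probe);  // one big comparison
        int mask = _mm256_movemask_pd(_mm256_castsi256_pd(eq));
        if (mask != 0) {                       // at least one lane matched
            int lane = __builtin_ctz(mask);    // index of the first match
            *out = buckets[slot].values[lane];
            return true;
        }
        // Illustrative open-addressing step; a real table would also check
        // for an empty slot to terminate the search early.
        slot = (slot + 1) & (num_buckets - 1);
        if (slot == (hash & (num_buckets - 1))) return false;  // wrapped around
    }
}
```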
So now when I do a lookup on a single key, I hash it and mod it by the number of buckets or slots, I land at some memory address, and now I get four keys. And if I want to see whether I have a match, I just duplicate my probe key in a single register, make four copies of it, do one big comparison, and that produces a bit mask that says whether I have a match or not. Then, depending on whether the mask is all 0s or has some 1s set, I either take the match or loop around to the next slot, right? So that's kind of cute, that's kind of clever. The problem with this one, though, is that these slots may be partially empty, so I may land at some location and there are only two out of four keys there. So I can't guarantee that all my lanes are fully utilized when I do the comparison. The alternative is to do vertical vectorization, and the idea is that now I want to probe four keys at the same time. The hash table is laid out like it was before we started storing multiple elements per slot; it's back to a single key per slot. So I take my four probe keys, and there are SIMD hash functions, I think there's a SIMD version of MurmurHash2 that Vectorwise used, and I produce a hash vector. Then I use a SIMD gather to grab those different memory regions and pull them into a SIMD vector, then do my SIMD compare to see whether I have any matches, right? Of course, now the challenge is that some of these tuples will match and some won't. So in the next iteration, the ones that didn't match all need to step down by one in the hash table to check for a match, but again, I don't want to be doing the same computation over and over for tuples that already matched. So I want to go back and get new keys to fill in the lanes that did match before, run another round, and just keep track, for each lane, of what iteration it's on and what location in the hash table it should look at next. Maybe there's some wasted computation for the lanes that are still in flight, but that might be okay, that might be good enough, right? Then I do the same thing: gather to bring them into SIMD registers and do the comparison until I've satisfied all my checks. So the question is: is vertical clearly better? Most times, yes. Then what is the benefit of the horizontal one? Again, the paper was basically trying, for every single core operator in a database system, to build a vertical and a horizontal variant to show that it could be done, right? And the measurements determined that the vertical one is almost always better. There's something else that's tricky about this, though, that might not be obvious, and that is that there's no guarantee the output tuples will come out in the same order every single time you run this algorithm. Because the output is going to be in a different order than the keys as they came into the operator, and under relational algebra that's okay, right?
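Here's a simplified sketch, not the paper's exact algorithm, of a vertical probe with AVX-512: eight probe keys live in one register, a gather pulls their candidate slots, one compare checks all eight lanes, and the resulting 8-bit mask says which lanes matched. The per-lane refill bookkeeping is only hinted at in comments; the table layout and names are assumptions.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Probes 8 keys at once against an open-addressing table of 64-bit keys.
// Compile with -mavx512f. num_slots is assumed to be a power of two.
void VerticalProbe(const uint64_t *table_keys, size_t num_slots,
                   const uint64_t *probe_keys, const uint64_t *probe_hashes,
                   uint8_t *matched /* one flag per probe key */) {
    const __m512i slot_mask =
        _mm512_set1_epi64(static_cast<long long>(num_slots - 1));
    __m512i keys  = _mm512_loadu_si512(probe_keys);
    __m512i slots = _mm512_and_epi64(_mm512_loadu_si512(probe_hashes),
                                     slot_mask);
    __mmask8 active = 0xFF;  // lanes still searching
    while (active) {
        // Gather one candidate key per lane from the hash table.
        __m512i cand = _mm512_i64gather_epi64(slots, table_keys, 8);
        // Compare all eight lanes at once, restricted to active lanes.
        __mmask8 hit = _mm512_mask_cmpeq_epi64_mask(active, cand, keys);
        // Record hits; in the full algorithm these lanes would now be
        // refilled with fresh probe keys instead of sitting idle.
        for (int lane = 0; lane < 8; lane++)
            if (hit & (1u << lane)) matched[lane] = 1;
        active &= static_cast<__mmask8>(~hit);
        // Misses step to the next slot (linear probing); a real table also
        // needs an empty-slot check so absent keys terminate the search.
        slots = _mm512_and_epi64(
            _mm512_add_epi64(slots, _mm512_set1_epi64(1)), slot_mask);
    }
}
```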
There is no ordering in relational algebra, but if you're trying to debug this, sometimes you'll run the same query on the same data and produce output in a different order, and that makes it hard to track down problems you're trying to figure out. I would not say that's enough to discourage people from doing this; it's just something to be aware of. Hashing always sort of randomizes things, but this takes it to a higher degree and makes things even more challenging to work through, because again, the SIMD version is doing multiple comparisons at the same time. All right, I'm going to skip the results from the paper, I don't want to spend too much time on them, but basically everything converges once you run out of CPU cache for these different implementations. And this is running on the Xeon Phi, which is an older Intel accelerator, Intel's version of a GPU from the 2010s; they don't exist anymore. With SIMD over here, running on Xeons, once you run out of cache there's basically no difference. But if your hash table is small enough, and again, that's why you always put the small table on the build side, then you might be okay. And CPU caches have gotten way, way bigger. I think there's one AMD chip with something like 800 megs of L3. That's insane, right? Almost a gigabyte of L3 cache on a single socket. All right, let me show one other cool thing I like, and this is a really simple way to see how to parallelize another basic operation in a data system with SIMD: building a histogram. The problem is that if we just do the naive thing, say these are our input keys and we use the radix, which we'll cover in a few weeks, but basically think of it as a poor man's hash function: you just grab the first few bits of the key and that tells you where something goes. So we compute the radix on our keys and then we fill out a histogram. The problem is that two keys can map to the same location in the histogram, and they're going to clobber each other when I try to update the counts, because both lanes try to write to the same position. To get around this problem, I can just replicate my histogram: for every single lane in my SIMD register, I have a separate copy of the array. So now I know lane 0 always writes to this column here, lane 1 writes to this column here; for each column there's one entry per key in my histogram. Then at the end I just use a SIMD add to sum across the lanes and produce the final counts, right? Again, there are a bunch of clever ways that you can combine SIMD operations and structures to produce results while keeping everything in SIMD registers, and this one I like. Okay, so we've covered this a lot already, but to put it on the slide: AVX-512 is not always going to be faster than AVX2. As I said, in the paper you guys read there's this little footnote down here where they mention that in their experiments they didn't see any downclocking issues with either the Skylake Xeon CPU or the Knights Landing Xeon Phi, and that it was always running at a stable 4 gigahertz clock speed. But there are a lot of blog articles and Stack Overflow posts out there saying, hey, my program is running slow, why? And I traced it down to my CPU clock getting downclocked. Why is this
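A minimal sketch of the replicated-histogram trick, with names and sizes as illustrative assumptions: each SIMD lane owns a private column of the histogram so concurrent updates can never clobber each other, and a final pass sums the columns. The per-lane update is written as a plain loop for clarity; a real implementation would do it with SIMD instructions.

```cpp
#include <cstdint>
#include <cstddef>

constexpr int LANES = 8;        // lanes per SIMD vector (AVX2-sized, 32-bit keys)
constexpr int RADIX_BITS = 4;   // "poor man's hash": just the low few bits
constexpr int PARTITIONS = 1 << RADIX_BITS;

void BuildHistogram(const uint32_t *keys, size_t n,
                    uint32_t out[PARTITIONS]) {
    // One histogram column per lane: hist[partition][lane].
    uint32_t hist[PARTITIONS][LANES] = {};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            uint32_t radix = keys[i + lane] & (PARTITIONS - 1);
            hist[radix][lane]++;  // each lane only touches its own column
        }
    }
    for (; i < n; i++)            // scalar tail for leftover keys
        hist[keys[i] & (PARTITIONS - 1)][0]++;
    // Final reduction across lanes: a SIMD add in the real version,
    // a scalar sum here for clarity.
    for (int p = 0; p < PARTITIONS; p++) {
        uint32_t sum = 0;
        for (int lane = 0; lane < LANES; lane++) sum += hist[p][lane];
        out[p] = sum;
    }
}
```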
the case? The issue, in the case of Intel, is that they classify instructions as either light or heavy, and if you run too many heavy instructions, they dial down the clock speed and your thing actually runs slower. If the CPU recognizes that it's getting too hot because the fan's not running or something, it'll downclock itself so it doesn't damage itself. A student suggests it needs a bigger heatsink. I don't think it's even that; I think it's hardwired to always downclock on these instructions, I don't think it's trying to sense the temperature. Yes? So his statement is that newer Intel chips are a little better at this. I know for the consumer ones they now turn it off by default; AVX-512 is actually fused off, and I don't think you can even turn it back on. And AMD doesn't really do true 512-bit execution; they do two 256-bit halves and put them together, and they say that's always faster. It should be about the same in a lot of cases, but for encoding and decoding it's a little faster. But does it have the bit masking, the mask registers, which are the key difference in terms of databases? I don't know whether it has those capabilities. Anyway, I can post on Piazza: there are some blog articles from the Clang people or the GCC people saying they will always try to use AVX2 instead of AVX-512 to avoid these issues. Right now, you may be careful and say, okay, if I'm using intrinsics, I'll make sure I only use AVX2 to avoid this downclocking issue. But you may link in some library that then gets auto-vectorized with AVX-512, and now your database is running slower because of some third-party thing that you didn't expect, right? I don't know when this all gets fixed, who knows, but the safe bet is probably always going to be AVX2. I do know some of the commercial systems run AVX-512, and maybe they just try to be more careful about if and when they use it. Okay, so to finish up: vectorization is obviously super important, even though it isn't always going to be the biggest win. Ideally we want to rely on the compiler to auto-vectorize as much as possible, but in some cases we do have to come in ourselves, either using intrinsics, which is more common, or one of those libraries that can mask the details of whatever SIMD extension we're using. And again, all the things we've talked about so far about intra-query parallelism work in conjunction with the SIMD stuff: every core has its own set of SIMD registers, so we want to use data parallelism within each core as much as possible. And again, we'll cover query compilation next class, but that's another tool we can use to control the movement of data within our query plan, so that we have precise control over when things go into registers, when they come out of registers, and how things move through memory or the CPU caches. Okay, so next class will be compilation. It's going to be a German paper; it's very dense, it's a lot of LLVM IR. Don't sweat the details of that. The main thing I want you to get out of it is this notion of what he calls data-centric computation, which really just means the push model of query processing, and again how he gets fine-grained control of what goes into the CPU registers as things move up the query plan. Okay. And then we'll talk a little bit about the project status in preparation for the status update later this month, and next class as well. Okay, any
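One way systems can hedge against the downclocking problem, sketched here under stated assumptions: compile the kernel for both ISAs and pick at runtime, defaulting to AVX2 unless AVX-512 has been measured to help on the target machine. `__builtin_cpu_supports` is a real GCC/Clang builtin; the kernel names and stub bodies are hypothetical stand-ins.

```cpp
#include <cstdint>
#include <cstddef>

// Stand-ins for the real vectorized kernels (hypothetical names); each
// would contain the AVX2 / AVX-512 intrinsics respectively.
static void FilterAVX2(const int32_t *in, size_t n, uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (in[i] > 20) out[k++] = static_cast<uint32_t>(i);
}
static void FilterAVX512(const int32_t *in, size_t n, uint32_t *out) {
    FilterAVX2(in, n, out);  // placeholder body for the sketch
}

using FilterFn = void (*)(const int32_t *, size_t, uint32_t *);

FilterFn ChooseFilterKernel(bool avx512_known_beneficial) {
    // Prefer AVX2 by default; only opt into AVX-512 when the CPU supports
    // it *and* we've established it doesn't downclock this workload.
    if (avx512_known_beneficial && __builtin_cpu_supports("avx512f"))
        return FilterAVX512;
    return FilterAVX2;
}
```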
final questions? All right, guys, see you.