So today's lecture is going to be on examining the difference between vectorization and compilation. Over the last few lectures, we've been looking at different techniques to speed up query execution: we looked at vectorization, and we looked at compilation, and as I showed, you can get better performance with both. So the question now is, which one is better? Which is the right approach to focus on if you're building a database system?

Now, for the vectorization stuff, we talked about how, from the Columbia paper, none of the techniques actually work once you exceed your CPU cache. Even though the paper you read today uses AVX-512, the newer version of SIMD, that still holds true: just because you have wider registers doesn't mean SIMD suddenly works once you exceed your CPU cache. So we'll dig into that a little more today, and we'll also look at what types of queries and what conditions make one approach better than the other. This is more of a micro-benchmark paper that says: here are two ways you can implement a database system, let's compare them along a bunch of dimensions. Which I actually really like.

The two approaches they're going to look at are Vectorwise and HyPer. As a refresher, Vectorwise was the commercial version of the MonetDB/X100 paper you read; it was later bought by Actian and is now called Actian Vector. The way Vectorwise works is that they don't do compilation on the fly. Instead, the database developers, the people actually building the database system itself, write a bunch of primitives that perform simple operations on data. Each one is basically a simple for loop that does one thing, like comparing whether the values in a column are less than some other value. They have a bunch of these, all pre-compiled, and they ship in the database system binary. Then at runtime, the database system figures out, here's what my query plan looks like, here are the primitives I have, and stitches those pre-compiled primitives together into the full query plan. And yes, you may be thinking: now you're making a bunch of function calls, isn't that going to be expensive? But it's called Vectorwise because the primitives operate on vectors of tuples, so the function call overhead is amortized over a batch of tuples. If you did this on a per-tuple basis, the approach would be terrible; they avoid that by processing vectors at a time.
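To make this concrete, here's a minimal sketch of what one of these precompiled primitives might look like. This is illustrative, not Vectorwise's actual API; the function name and signature are assumptions.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical Vectorwise-style selection primitive: compare a vector of
// column values against a constant and emit the positions that qualify.
// One of many such precompiled kernels shipped in the system binary.
// Returns the number of qualifying tuples (the selection vector length).
size_t select_lt_int32(const int32_t* col,   // input column vector
                       int32_t      val,     // constant to compare against
                       uint32_t*    sel_out, // output selection vector
                       size_t       n) {     // vector size (e.g., 1024)
  size_t k = 0;
  for (size_t i = 0; i < n; i++) {
    if (col[i] < val) sel_out[k++] = (uint32_t)i;  // record qualifying position
  }
  return k;
}
```

The cost of invoking this function is paid once per vector of, say, 1024 tuples rather than once per tuple, which is the entire point of the design.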
The other alternative is the HyPer approach using just-in-time compilation, which we talked about before. The idea is that when the query shows up, you generate code and compile it for that single query. So you're producing machine code that is hard-coded for that one query plan, and that way you have very few conditionals and very few branches, because the exact steps needed for that query are baked into the machine code itself. But beyond just doing compilation, HyPer also uses a bottom-to-top, push-based query processing model, as opposed to the vectorized model from Vectorwise. That means that starting from the bottom, your scan operators emit a single tuple and push it up through the steps of the pipeline, and you keep going as far up the pipeline as you can until you hit a pipeline breaker, meaning you have to go back and get the next tuple before you can move on to the next pipeline. So you try to ride that tuple all the way up, keeping things in CPU registers, and that way you have fewer cache misses.

So the question we're trying to figure out is which of these two approaches is actually better. I'll also point out that the HyPer implementation in the paper you read, like the original HyPer compilation paper, doesn't vectorize any operations: it does scalar operations on single tuples. We'll see how to fix this at the end, with the work we've done here at Carnegie Mellon.

The title of this lecture isn't exactly correct. It's not really vectorization versus compilation, because in all the examples here, everything gets compiled: the primitives in Vectorwise are pre-compiled, and whole query plans are compiled in the HyPer approach. So it's really a higher-level question: is it better to stitch together these smaller primitives and do batch processing on vectors, or is it better to compile the entire query plan and use the push-based processing model? That's the right way to understand this. And it's not a strict dichotomy between the two approaches. As we'll see with our relaxed operator fusion paper, the work we've done here at Carnegie Mellon, you can actually get the best of both worlds: we do vectorized processing the same way Vectorwise does, but we also do push-based processing and compilation the way HyPer does. A hybrid approach. They briefly mention this in the paper; of course they say it's hard to do, and that's why they didn't evaluate it. It remains to be seen how difficult it actually is. It worked for us, but that was for a single paper; as we go off and build the bigger system, we'll see what actually happens.

All right, so in this paper, they implement a single testbed system. Rather than taking Vectorwise (or MonetDB/X100) and HyPer and trying to compare the two actual systems, they build one testbed that implements both approaches. The obvious reason to do this is that there's a bunch of other stuff a database system does, as we've been talking about this entire semester, that would prevent a true apples-to-apples comparison. The way Vectorwise parses SQL may be different from how HyPer does it. HyPer uses an ART index while Vectorwise might use something else, right? These are differences we don't care about in this evaluation, so rather than using the full systems, they write a testbed system from scratch. At a high level, all the algorithms in the two implementations are the same: the same kind of hash join with linear probing, the same kind of scan operators. The difference is in the low-level implementations, which will be slightly different.
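To contrast with the primitive above, here's a rough sketch of the kind of fused, tuple-at-a-time code a HyPer-style engine would generate for a scan → filter → aggregate pipeline. HyPer actually emits LLVM IR per query; this C++ just stands in for that generated code, and the table layout and aggregate are simplified assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>

// Hypothetical column layout; a real engine compiles against its catalog.
struct Table {
  size_t         num_tuples;
  const int32_t* col_key;
  const int32_t* col_x;
};

// Pipeline 1: scan(A) -> filter(x < 42) -> hash aggregate (SUM per key).
// The hash table is the pipeline breaker; everything before it is fused
// into one loop so the tuple's values stay in CPU registers.
void pipeline1(const Table& A, std::unordered_map<int32_t, int64_t>& agg) {
  for (size_t i = 0; i < A.num_tuples; i++) {  // one tuple at a time
    int32_t x = A.col_x[i];
    if (x < 42) {                              // filter, fused into the scan
      agg[A.col_key[i]] += x;                  // materialize at the breaker
    }
  }
}
// Pipeline 2 would then iterate over agg and emit each group as output.
```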
The paper discusses some of those low-level differences. The most obvious example is which hash function to use when building and probing hash tables. Vectorwise uses Murmur2: although it uses more instructions than CRC, it has higher throughput, and that gives better performance in Vectorwise because they separate the hashing phase from the probing phase. In HyPer, it's all fused into a single pipeline, and there they use CRC, because with CRC instead of Murmur2 they get 40% better performance: CRC uses fewer instructions, and therefore the CPU gets better speculative execution. So again, at a high level the algorithms are the same, they both do a hash join with linear probing, but the low-level steps inside those algorithms differ slightly; each side picks the approach that's best for its architecture.

So the two implementations in the testbed are colorfully named Tectorwise and Typer. Quick quiz: any guesses why they're called Tectorwise and Typer? I was actually curious whether anybody knows. What's that? He says "test"? No. The T is for Timo, T-I-M-O; they're just named after the first author, Timo Kersten. That's all it is, all right? Not "test", not my dog.

All right, so Tectorwise is based on Vectorwise, and again, we break up the operations in the query plan into these precompiled primitive steps. The key thing you need to understand about how Vectorwise, and therefore Tectorwise, works is that these primitives have to materialize the result of every function invocation. I pass a vector into the function, I crunch on it, and I write the result to an output vector that has to live somewhere in memory, and that gets passed back as the function's return value. That output has to be materialized in memory and then handed to the next primitive I invoke. So there's quite a bit of memory copying going on between the different primitives.

In the case of Typer, because HyPer does this push-based approach where it tries to maximize pipeline length, the same tuple hits every step as far up the pipeline as it can go. Under this approach, you can keep the tuple, and the intermediate values you generate at each step of the pipeline, in CPU registers, so it's super fast, and you won't have as many cache misses as you potentially would in the Vectorwise case. So the difference is that Tectorwise processes a vector at a time, and therefore it can use SIMD vectorization, whereas Typer, as described here, can't, because it's a single tuple at a time.
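Going back to that hash function point for a second, here's a sketch of the two styles. The multiplicative constant is the real one from MurmurHash64A; the CRC seeds and the combining step are illustrative, not the exact code from the paper.

```cpp
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2: _mm_crc32_u64 (compile with -msse4.2)

// CRC-based hash of a 64-bit key, in the Typer spirit: few instructions and
// a short dependency chain, which helps speculation in a fused pipeline.
inline uint64_t hash_crc(uint64_t key) {
  // Two CRC rounds with different seeds, combined; seeds are arbitrary here.
  uint64_t h1 = _mm_crc32_u64(0xDEADBEEFull, key);
  uint64_t h2 = _mm_crc32_u64(0xCAFEBABEull, key);
  return h1 | (h2 << 32);
}

// Murmur-style 64-bit mixer, in the Tectorwise spirit: more instructions,
// but high throughput when hashing a whole vector of keys in a tight loop.
inline uint64_t hash_murmur(uint64_t key) {
  uint64_t h = key * 0xc6a4a7935bd1e995ull;  // MurmurHash64A constant
  h ^= h >> 47;
  h *= 0xc6a4a7935bd1e995ull;
  h ^= h >> 47;
  return h;
}
```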
So in their evaluation, they look at five queries from TPC-H, hand-selected to be representative of the different types of work your database system has to do for analytical queries. The original TPC-H benchmark has 22 queries; they're just selecting five. There's a great paper, also written by the Germans, Peter Boncz and Thomas Neumann, of Vectorwise and HyPer respectively, where they analyze TPC-H: they go through every single query, discuss what's actually going on in the query plans, and identify the pain points inside an in-memory database system. So if you want to see why these five are representative of what real-world workloads look like, it comes from that paper, published in 2013.

We're doing Q1, Q6, Q3, Q9, and Q18. Q1 does fixed-point arithmetic followed by a group-by, but it only has four groups. Q6 does predicate evaluation, so it has filters that prune out tuples you don't need. Q3 and Q9 do joins, but Q9 has a much larger build side relative to the probe side. And Q18 does a high-cardinality aggregation: a group-by with 1.5 million groups, which is pretty massive. So it's a big hash table. We can use these to figure out the workload scenarios, the types of queries, where one approach might be better than the other.

The first experiment is single-threaded performance. This runs on a 10-core machine with hyper-threading, so 20 hardware threads in total, but here they run on a single thread and just look at the runtime difference between the two implementations. The first thing to point out is that for Q1 and Q18, Typer actually performs better, while for Q6, Q3, and Q9, Tectorwise performs better. So that's the selective-filter query and the two joins for Tectorwise, and the aggregation queries, Q1 and Q18, for Typer.

So what do these graphs tell us? Can we infer anything? It depends, right? No: these are just, here's the runtime, poof, it's done. What you actually have to do is look at the hardware performance counters. For this they measure cycles, instructions per cycle (IPC), total instructions, L1 and last-level cache misses, and branch misses. The little star is just there to remind you which implementation ran fastest. These numbers are normalized by the number of tuples processed per query, so you read them as, say, this query took 56 cycles per tuple. And they're running at scale factor 1, which is a one-gigabyte database.
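As a sketch of how a cycles-per-tuple number like that is derived: count raw cycles across the run and divide by tuples processed. The paper reads the full set of hardware counters (instructions, cache misses, branch misses) through perf, which plain rdtsc can't give you; the row count and the stand-in "query" here are assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>  // __rdtsc

int main() {
  const size_t num_tuples = 6'000'000;        // ~lineitem row count at SF 1
  std::vector<int64_t> col(num_tuples, 1);
  uint64_t start = __rdtsc();
  int64_t sum = 0;
  for (size_t i = 0; i < num_tuples; i++) sum += col[i];  // stand-in "query"
  uint64_t cycles = __rdtsc() - start;
  printf("sum=%lld cycles/tuple=%.2f\n", (long long)sum,
         (double)cycles / num_tuples);
}
```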
So the first thing to look at is the two cases where Typer performs better. The obvious observation is that the number of instructions Typer executes versus Tectorwise is half, or here even less than half. This is an example where clearly executing fewer instructions per tuple gets you better performance. These queries are more computationally expensive: Q1 does fixed-point arithmetic, and it does in-cache aggregation because there are only four groups. So how many instructions you have to execute per tuple is the thing that matters most, and that's why Typer wins here: it executes fewer instructions, and therefore it does better.

But can we apply the same metric to the other queries? You might think that getting more instructions per cycle would be another metric you could use to determine who does better. But look at Q1: Typer is doing 1.6 instructions per cycle, and Tectorwise 2.1. Tectorwise is technically doing more work per cycle on the CPU, but it performs worse, because it's executing way more instructions in total. The reason Tectorwise gets better IPC is that it has those tight loops: little kernels that do one operation on a vector of tuples at a time, whereas Typer has more complex loops because it's pipelining a tuple all the way up the plan. So in Q1, Tectorwise's instructions per cycle is about 40% better than Typer's, but it runs about 75% worse, because it's simply doing more work per tuple.

So can we apply that same criterion to the other queries and see whether it holds? Look at Q3, the join with the larger probe side relative to the build side. Typer executes fewer instructions, 21 versus 42, and again the IPC is different: Tectorwise gets much better CPU utilization, and this time Tectorwise actually performs better. Up in Q1 and Q18, Typer executed fewer instructions, Tectorwise got more instructions per cycle, and Typer still won. Here, Typer is still executing fewer instructions, Tectorwise is still getting better IPC, but now Tectorwise wins. Take a guess why. Branch misses, right? Look at the difference: Tectorwise has 0.08 branch misses per tuple, Typer 0.27. The scan and filter primitives don't have any if clauses; there are no conditionals, it's just ripping through as fast as possible, and the loops are really tight. Whereas in Typer, the longer pipelines mean that when branch prediction goes wrong, it's more expensive: you have to roll back a bunch of work you did that you're not actually going to need and then go back down the right path. And because there are more conditionals and more indirection in the code HyPer generates, they have higher branch misses.
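Here's the kind of branch-free selection loop that keeps those branch-miss counts that low; a sketch, not the paper's exact primitive.

```cpp
#include <cstdint>
#include <cstddef>

// Branch-free variant of the selection primitive: instead of an if, the
// comparison result is added to the output cursor, so there's no branch
// for the predictor to miss regardless of selectivity. This is the kind
// of tight, predictable loop behind Tectorwise's low branch-miss numbers.
size_t select_lt_branchfree(const int32_t* col, int32_t val,
                            uint32_t* sel_out, size_t n) {
  size_t k = 0;
  for (size_t i = 0; i < n; i++) {
    sel_out[k] = (uint32_t)i;  // speculatively write the position
    k += (col[i] < val);       // advance only if the predicate held
  }
  return k;
}
```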
All right, so maybe you say: is it branch misses that matter the most? Well, look at Q9: Typer has fewer branch misses, Typer executes fewer instructions, and Typer still has worse instructions per cycle. So is IPC the thing that actually matters? Q9 is just another join, like Q3, except the build side is now larger. And in this case it's not any single counter; it's really just the cycles. The hash table is larger because there are more tuples on the build side, so the entry we're looking for is likely not in the CPU cache, so we get memory stalls, and that drives up the number of cycles per tuple. In the case of Tectorwise, those simple probe loops run over vectors: take the tuples, no predicates to apply, just compute the hash and check whether it matches the hashes in the vector. So the CPU can speculate far ahead, effectively unrolling the loop and saying, here's a bunch of stuff I think is going to happen. It's not prefetching per se; the CPU speculatively executes the instructions, and those speculative loads bring in the data you're going to need, which masks the memory stalls.

So the main takeaway from all these results is that neither approach is strictly better than the other. We have a bunch of queries that do different things in TPC-H; in some cases the HyPer approach is better, in some cases the Vectorwise approach is better. And when one is better than the other, there isn't one single CPU performance counter we can point to and say, this is why one is running slower or faster. It depends. Very dissatisfying, I'm sorry, but that's the way it is.

So the main finding is that both models produce efficient query execution and get roughly the same performance. And to be clear, going back to the runtimes: yes, it looks like Typer is much better on some queries, but we're talking about, what, 50 milliseconds versus 85 milliseconds? A 30-millisecond difference on a one-gigabyte database is basically nothing. As the paper points out, both of these systems are still two orders of magnitude faster than what Postgres or another disk-based system can do. We're really shaving off cycles and milliseconds here, and at some point you hit diminishing returns. After the ROF paper we'll talk about next, my student Prashanth spent about half a year trying to make hash joins even faster, and it got down to, ah, theirs takes 12 cycles and now ours takes 11. It's so fast at that point that it doesn't matter. So in both cases there's no glaring difference between the two approaches; they'll both crush the traditional disk-based systems, and on some queries one will beat the other.

The other main takeaway is that the HyPer approach, the data-centric push model, is better for computational queries with few cache misses, because you can ride along and execute the whole pipeline of instructions for a single tuple as far as possible. But as soon as you have cache misses, the vectorized approach is better. Okay?
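To make that takeaway concrete, here's roughly the shape of a Tectorwise-style probe loop. Each iteration is independent, so the out-of-order engine can speculate several iterations ahead and overlap their cache misses; the Entry layout and open-addressing scheme are simplified assumptions, not the testbed's actual code.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified slot for an open-addressing hash table.
struct Entry { uint64_t hash; uint32_t row; };

// Probe a whole vector of precomputed hashes. Because iteration i doesn't
// depend on iteration i-1, the CPU can have several of these random-access
// loads (and their cache misses) in flight at once, masking memory stalls.
size_t probe_vector(const Entry* table, uint64_t mask,
                    const uint64_t* hashes,   // output of the hash primitive
                    uint32_t* match_out, size_t n) {
  size_t k = 0;
  for (size_t i = 0; i < n; i++) {
    const Entry& e = table[hashes[i] & mask];               // random access
    if (e.hash == hashes[i]) match_out[k++] = (uint32_t)i;  // candidate match
  }
  return k;
}
```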
All right, so the next question is: how much does SIMD vectorization help the vectorized approach? In all the results I just showed you, SIMD was enabled in Tectorwise. So how much does SIMD actually contribute to the performance of the vectorized approach? We probably should have spent more time covering this last class, but they're using AVX-512, which came out in 2017, after the Columbia paper you read; it has a bunch of new instructions that make it easier to implement the algorithms we talked about. Everything here is vertical vectorization; they don't do any of the horizontal vectorization.

The first experiment, running on a single thread: they take the Tectorwise primitives used to execute a query, Q3 and Q9 here, and implement each in a scalar version and a SIMD version, then measure the performance benefit of executing just that one primitive, with everything fitting in CPU cache. And this corroborates the results from the Columbia paper: hashing gets a 2.3x speedup, SIMD with gathers gets 1.1x, and the SIMD join gets 1.4x. So for the individual operations, vectorizing with SIMD gives better performance, as expected. The problem is that when you put it into the full query, you don't see that big a speedup. Q3 and Q9 are the joins, and the end-to-end benefit is marginal at best. Again, this corroborates what we said before: once you exceed your CPU cache, the memory stalls kill you, and vectorization doesn't help.

The next question about this paper is what it actually takes to vectorize the code, the primitives in Tectorwise. The original Vectorwise paper claimed that the compiler can figure out how to vectorize your primitives automatically. But remember from last class, there are essentially three ways to get vectorization: you can pray and hope your compiler does it for you, you can provide hints, or you can write explicit vectorization with intrinsics. So we want to figure out how good the compiler can actually be. For this, they compare the three major compilers: GCC, Clang, and ICC, Intel's custom compiler that they build and sell. They found that the Intel compiler worked much better than the others. This is pure auto-vectorization, no hints; we're just seeing how well the compiler can recognize that these little Tectorwise primitives are vectorizable. ICC compiled everything down to AVX-512, which I think is interesting. It was able to vectorize the hashing, selection, and projection primitives, but it wasn't able to vectorize hash table probing or aggregation. And that's expected, because those are random memory lookups that can't easily be expressed with SIMD instructions; both of them operate on hash tables.
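For a feel of the explicit-intrinsics route, here's a hypothetical AVX-512 version of the selection primitive, using the mask-compress store that AVX-512 added; it assumes n is a multiple of 16 to keep the sketch short.

```cpp
#include <cstdint>
#include <cstddef>
#include <immintrin.h>  // AVX-512 (compile with -mavx512f)

// Compare 16 int32 values per instruction, then compress-store the
// qualifying positions, so there's no scalar inner loop at all.
size_t select_lt_avx512(const int32_t* col, int32_t val,
                        uint32_t* sel_out, size_t n) {
  size_t k = 0;
  __m512i vval = _mm512_set1_epi32(val);
  __m512i vidx = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                   8, 9, 10, 11, 12, 13, 14, 15);
  const __m512i vstep = _mm512_set1_epi32(16);
  for (size_t i = 0; i < n; i += 16) {
    __m512i keys = _mm512_loadu_si512(col + i);
    __mmask16 m  = _mm512_cmplt_epi32_mask(keys, vval);  // 16 compares at once
    _mm512_mask_compressstoreu_epi32(sel_out + k, m, vidx);
    k += _mm_popcnt_u32(m);                 // advance by number of matches
    vidx = _mm512_add_epi32(vidx, vstep);   // advance the position vector
  }
  return k;
}
```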
So now let's see how well the whole system performs, using only ICC (we can ignore GCC and Clang): the auto-vectorized version, versus manually adding intrinsics to vectorize the primitives, versus a hybrid approach that combines handwritten code with auto-vectorization. First, we measure the reduction in instructions that the three approaches achieve. So the first one is ICC auto-vectorization, then the handwritten version, then the hybrid of auto and manual. For Q6, the selective scan, auto-vectorization actually made things worse: it increased the number of instructions. But I think it's interesting that the other two approaches match exactly, and that's sort of expected, because Q6 is just doing selective scans, nothing fancy. So this shows that auto-vectorization does reasonably okay: for not having to do any extra work, getting almost as good as, and in some cases better than, what you can get manually is not bad. But this is only measuring the number of instructions, not actual performance.

When we measure performance, the differences are marginal at best and in some cases negative. This is TPC-H at scale factor 10, and we're comparing the runtime against the non-vectorized version. It's hit or miss: auto-vectorization actually does worse for the two joins and the final aggregation, whereas the handwritten versions do slightly better. This matches what we saw before, where you only get a 1.x improvement. So the main takeaway is that you probably don't want to rely on auto-vectorization alone if you're building a vectorized engine, because performance can actually get worse, and the cache misses are still going to overwhelm just about any benefit you get.
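To see why the compiler succeeds on some primitives and fails on others, compare these two loops: the first is the kind of straight-line, alias-free loop a compiler can auto-vectorize at -O3; the second does the data-dependent pointer chasing of a hash-table probe and mostly can't be. Both are illustrative, not the testbed's code.

```cpp
#include <cstdint>
#include <cstddef>

// Auto-vectorizable: contiguous loads and stores, no data-dependent control
// flow, and __restrict promising the compiler the arrays don't alias.
void project_add(const int32_t* __restrict a, const int32_t* __restrict b,
                 int32_t* __restrict out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}

// Not (easily) vectorizable: each step is a data-dependent random load, and
// the chain length varies per key. This is the shape of hash-table probing,
// and it's why ICC gave up on the probe and aggregation primitives.
uint32_t find_in_chain(const uint32_t* keys, const uint32_t* next,
                       uint32_t head, uint32_t target) {
  uint32_t e = head;                     // head of this bucket's chain
  while (e != 0 && keys[e] != target) {  // walk until match or end (0)
    e = next[e];                         // serial pointer chase
  }
  return e;                              // 0 if not found
}
```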
Any questions about this work, Tectorwise versus Typer? Okay. So the next thing I want to talk about is the comment in the paper where they say the vectorization and compilation approaches, as in Tectorwise versus Typer, are essentially mutually exclusive. They acknowledge you could have a hybrid approach, but say the software engineering overhead of supporting it would be quite high. So to understand how we can combine the two, we need to understand what HyPer is actually doing when it fuses operators together into these long pipelines. The big idea in HyPer is to fuse the operators together into one giant for loop that does all the operations for a single tuple until you hit a pipeline breaker. This maximizes CPU register reuse, because you don't spill anything out to cache; you take the same tuple and keep pushing it up. It also minimizes cache misses, because you're not doing one small thing to a tuple, then going back for the next tuple to do that same operation over again, the way Vectorwise does; you take the tuple and process it all the way up.

They have this nice chart showing the trade-off between vectorization and compilation. Along the x-axis you have interpretation versus compilation: with interpretation, something has to figure out on the fly what to do for a particular query, whereas with compilation everything is baked in. The other axis is tuple-at-a-time versus vectorization. The two extremes are HyPer in one corner and Vectorwise in the other. And I want to talk about us over here. I don't know why we're not in the corner, but whatever, I'm okay with that.

Right, so let's go into more detail on what HyPer is actually doing. Again, the idea is that you take a pipeline of operators, some slice of your query plan tree, fuse them together, and compile that as a single loop, and under the HyPer approach, this runs a single tuple at a time. Say your query plan is a scan, a filter, and an aggregation. The generated code looks like the sketch I showed earlier: in the first pipeline, you iterate over every tuple in A, apply your filter, and store the result into your aggregation hash table; in the second pipeline, you iterate over every group in the aggregation after you've computed it and emit it as the output of the pipeline.

So the issue, again, is that while we showed you can get some benefit from vectorization, the cache misses overwhelm you. So the question is whether we can rethink how we organize these pipelines so that we mask those cache misses while still being able to do vectorized execution. To do that, we need to understand where the problems are with the tuple-at-a-time approach. Our pipeline has three operations: scan, filter, aggregation. The first issue is the filter: because we're operating on a single tuple at a time inside the for loop, we can't vectorize the comparison. You can't use SIMD, because it's one tuple, do the evaluation, then go back and get the next one. The second issue is the aggregation: we can't prefetch anything for the hash table lookup, because again we're operating on a single tuple. Put simply, the tuple-at-a-time processing model does not expose any inter-tuple parallelism: we can't run a step across a bunch of tuples at the same time in SIMD. As written, it's a scalar for loop doing one tuple's operations at a time.

So the technique my student Prashanth came up with is called relaxed operator fusion (ROF). The idea is to identify the steps in the pipeline, in the code I just described, where we can introduce buffering stages: points where we store intermediate results in vectors and can prefetch as we iterate over the loop. Again, the goal is the best of both worlds: the vectorization of Vectorwise and the compilation benefits of HyPer. Each stage still contains multiple operators. We take a pipeline and break it up, but each stage fuses several operators, so it's not like we're going back to the Volcano iterator model, with one operator call per tuple or per vector. And just as in HyPer, instead of passing data between operators, well, across a stage boundary we can no longer pass everything through registers, but we can at least pass it through our caches, and that's nearly as fast.
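Here's a minimal sketch of what that staged pipeline looks like; the buffer size, the dense-keyed aggregate array, and all the names are simplifying assumptions, not Peloton's actual code.

```cpp
#include <cstdint>
#include <cstddef>

// ROF-style staged pipeline. Stage 1 (scan + filter) runs over a vector of
// tuples and writes qualifying positions into a small, cache-resident stage
// buffer; stage 2 (aggregation) drains it. Each stage still fuses multiple
// operators; only the stage boundary materializes, and only into cache.
// For brevity, keys are assumed dense so `agg` can be a plain array.
constexpr size_t kVec = 1024;

void rof_pipeline(const int32_t* col_x, const int32_t* col_key,
                  size_t num_tuples, int64_t* agg) {
  uint32_t stage[kVec];  // stage buffer: fits comfortably in L1/L2
  for (size_t base = 0; base < num_tuples; base += kVec) {
    size_t n = (num_tuples - base < kVec) ? num_tuples - base : kVec;

    // Stage 1: scan + filter over the vector (branch-free, SIMD-friendly).
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
      stage[k] = (uint32_t)(base + i);
      k += (col_x[base + i] < 42);
    }

    // Stage 2: aggregate only the qualifying tuples from the stage buffer.
    for (size_t j = 0; j < k; j++) {
      agg[col_key[stage[j]]] += col_x[stage[j]];
    }
  }
}
```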
So going back to our example, these are our two pipelines. We identified that the filter is the part we want to vectorize. So we introduce a staging buffer into the pipeline: the first pipeline is the scan, then the filter, then the aggregation, and we put the stage between the scan/filter and the aggregation. This buffer sits in our CPU caches; we write a bunch of tuples at a time into it, and when it fills up, we push it to the next stage. So in the pseudocode, the outer for loop now iterates over table A in steps of 1024: we grab 1,024 tuples at a time as a vector, do the SIMD comparison, because we can shove the whole vector into registers and crunch on it, and then take the selection map that comes out and use it to update the hash table. The first stage grabs the tuples and does the comparison, writing its output into the stage buffer; the second stage does the aggregation; and the top of the plan stays the same.

So this is awesome, it lets us do vectorization, but it doesn't by itself solve the problem from before: if we have cache stalls or cache misses, we still get bad performance, and SIMD won't help us. It's really the for loop, fetching the vector from the table, where cache misses creep in and cause problems; the hash-table accesses as well, but the fetch is the main thing we want to deal with here. So the second thing ROF introduces is explicit software prefetching in the stages: as we make a pass through one iteration of the loop, we tell the CPU, hey, we're going to come back around and execute this loop again, go prefetch the data we'll need next and put it in the CPU caches. That avoids the cache miss when we come back around. HyPer doesn't benefit from this the same way, because with a single tuple you do a bunch of work on it before going back for the next one; it can still have cache misses, but each tuple you bring in amortizes its miss over a lot of work. Whereas here, because we now have multiple for loops, we can have cache misses in each of them. So the idea is to use software prefetching to tell the CPU: we know what we're going to need, please go get it for us.

I'm not going to go into the details of how software prefetching works. I will say this is not new; we're not the first database system to explore it. There was a bunch of work in this area around 2005 and 2006, and somebody from Postgres actually messaged me on Twitter saying they're thinking about adding it to Postgres now as well. The reason it works now when it didn't quite work before is that the number of outstanding prefetch requests your CPU can have in flight has increased; Intel Xeons have raised the number of memory locations you can have prefetched at once. It used to be much smaller, so for this kind of vectorization it wouldn't work: you couldn't keep enough prefetches in flight, and when you came back around you still had stalls.
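And here's a sketch of what the prefetching looks like on the aggregation stage: before touching the slot for tuple j, issue a prefetch for the slot that tuple j + 8 will need, so that cache line is in flight while we do useful work. The prefetch distance and the flat table layout are assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch

constexpr size_t kDist = 8;  // prefetch distance, tuned empirically

// While we update the slot for tuple j, the cache line for tuple j + kDist's
// slot is already being fetched, hiding the memory stall behind useful work.
void aggregate_with_prefetch(const uint64_t* hashes, const int64_t* vals,
                             int64_t* table, uint64_t mask, size_t n) {
  for (size_t j = 0; j < n; j++) {
    if (j + kDist < n) {
      // Hint to the CPU: bring this future slot's line into all cache levels.
      _mm_prefetch((const char*)&table[hashes[j + kDist] & mask], _MM_HINT_T0);
    }
    table[hashes[j] & mask] += vals[j];  // simplified aggregate update
  }
}
```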
So, real quick, let's look at the evaluation of this. This is the same graph I showed before when we talked about compilation in Peloton, comparing the interpreted version against the LLVM version, but now also including LLVM plus relaxed operator fusion. The first point: for Q1, ROF doesn't help, because that query is computationally expensive and doesn't suffer many cache misses, so the prefetching and the operator fusion stuff don't buy us anything. You see a bigger difference on the other queries, the ones with longer scans and more complex aggregations; that's where the prefetching helps.

To break down what benefit we get from relaxed operator fusion versus compilation versus prefetching, there's a breakdown chart for Q19, again on the old Peloton system. The interpreted engine is what we used to do before; it's why we threw away all that code and built the LLVM engine. It's just crap; there's nothing more to say. Going from the interpreted version to the compiled version gets a 97% speedup. In a well-written interpreted database system it would not be 97%, it'd be something smaller; this one was so bad that the improvement looks enormous. Adding ROF with SIMD on top gets a 65% speedup, and then prefetching gets another 3.5%. So again, this is what I was saying before: we're shaving off cycles trying to get this thing running as fast as possible, and we're down to 189 milliseconds; there's not much more room to slice off. The main thing I wanted to show here is that you can do SIMD and compilation together. The paper you read just didn't do that comparison; it is what it is.

Okay, so to finish up: Tectorwise versus Typer, is one approach better than the other? The answer is no, it's inconclusive. There's a bunch of other material in the paper that I don't think is really relevant to our discussion in this class, like looking at different CPU architectures and seeing whether there's a difference. I think all of that is very interesting, but at a high level, most database systems are going to run on Intel Xeons, so it makes sense to target those architectures. And again, the ROF work we came out with lets us actually do both. The main cost is that the stages we generate mean more instructions: there's more work to prepare the data, put it into the stages, keep track of them, and read them back as we come around. But in exchange we get better instructions per cycle. IPC, not CPI. Okay?

All right, so next class we're going to talk about query optimization, which is probably the hardest topic we're going to discuss all semester. There's a great quote in the database community: query optimization is not rocket science; if you flunk out of query optimization, then you go build rockets. This is the hardest part of a database system to actually get right. The paper you're reading for Wednesday is a description of the Cascades architecture, the Cascades model that our query optimizer is based on.
In my opinion Cascades is better, but the Germans disagree: next Monday's paper, from the Germans, will refute the paper you're reading this Wednesday and explain why it's bad and why the German way is better, all right? But it's hard, right? So I'll try to guide you through the process as gently as possible. Then, after you understand the different ways to implement an optimizer, we'll discuss how you actually do query cost estimation, which is another hard part of this; none of the optimizer models we'll talk about work unless you have a cost model. So for the next three weeks, it sucks that this comes at the end of the semester, but given where we are in our system, I kind of had to put it here, because we don't have a query optimizer in the new system yet. In my opinion this is one of the most interesting parts, and as I said before, if you can do query optimization well, you can get a job anywhere, right? You can get a job tomorrow, okay? Because everyone has query optimization problems. Okay? Any questions?