Welcome to lecture 16. I am obviously not on campus; Carnegie Mellon shut down all in-class lectures. I can't even go into classrooms anymore, the campus is shut down, I can't even go to my office. So I'm here at home recording this in my own office. It's just me, and then I have the terrier over here sitting in the audience, watching this the entire time. So just the two of us are going to go through this together.

Today's lecture is a follow-up on where we left off before spring break, discussing the two main methods to speed up query execution performance: vectorization and compilation. In both cases we saw that you can get better query execution performance with either approach, but we mostly looked at them in isolation from each other. Today's lecture, or today's paper you were assigned to read, is about understanding under what circumstances one approach would be better than the other. The reason we want to discuss these two things together is that this is a major architectural decision you have to make about the design of a database system: whether you're going the compilation route or the vectorization route. So it's important for us to understand the trade-offs of each of them, so that if we're building a brand new system we can make this decision early on, because it would be very difficult to go back and retrofit an existing database system to introduce either of these concepts. I will say, unscientifically, that I think adding vectorization to an existing system would be easier than adding compilation, but either way both are pretty significant changes. So today's paper is essentially comparing the VectorWise approach versus the HyPer approach. Now, they sort of claim that the two are mutually exclusive, and we'll see something at the end that we've done here at CMU that shows you can actually blend the two together. But at least initially, let's assume it's either one or the other.

So let's go back and do a quick refresher of what VectorWise, the vectorized execution model, looks like, and what the compilation execution model in HyPer looks like. The VectorWise approach is that they have all these primitives written in the database system that each do a small portion of the execution of a query. A primitive might be evaluating a predicate with an inequality, less than or greater than, right? You have separate primitives for all of these, and then you have one for each data type your database system supports. The idea is that the database developers write all these primitives ahead of time and compile them into the database system binary that's shipped to customers or users. Then at runtime, when a query shows up, the way they execute the query plan is that they essentially stitch together all these primitives in an overall plan that's interpreted. That allows them to invoke these primitives at the right time in order to execute the query and produce the correct result. Now you might be thinking, well, these primitives are all just function calls; isn't this going to get expensive? Well yes, because that's a jump in your execution code, and that can be expensive on modern superscalar CPUs.
But the way they avoid having these function calls slow them down is that they amortize the cost: for each function call to a primitive, they pass in a vector of values that can then be processed as a batch. So you aren't invoking a primitive for every single tuple. The output of these primitives is a vector of offsets of the tuples that matched, or that satisfy whatever condition the primitive is actually checking.

So let's look at a high-level example. Say we have a select statement, SELECT * FROM foo, with a predicate on a string column equals 'ABC' and an integer column equals 4. The query plan for this is super simple, right? It's just a sequential scan on the table, and then I have my predicate operator, my filter operator, that applies the conjunction of these two clauses. The way VectorWise would work is that each of the predicates I want to evaluate would be a separate primitive based on the column type. The first one here is a string column, so I would have a primitive that can process a vector of strings from the column I'm scanning, passing in the value I'm looking for. Then I can just do a for loop over that column and check whether the value I'm passing in matches, and if so, I append the offset of the matching tuple to my output vector. Calling this an output vector of offsets is a synthetic thing; they're usually just integers, but it indicates that these are offsets, not regular data values. Now for the second predicate I have another primitive, but in this case, instead of taking just the column and the value I'm looking for, it also has to be provided the vector of offset positions that satisfied my first predicate. So now in my for loop I'm scanning over those positions rather than over all the offsets of the column. I'll show a rough sketch of what two such primitives could look like in a moment.

What's that? Okay, all right. So the terrier asked the question: do I have to size these vectors ahead of time? Because in this case I don't know how many tuples are going to match; should my offset vector be a fixed size or a variable-length size? Again, we are doing this on batches of tuples, so you can just allocate it to be the maximum possible output size, and then you record where you left off. This is a simplified view, so I'm not really showing that. It's a good question, terrier, thank you.

All right, so the HyPer approach, again, is to compile queries. Instead of having these precompiled primitives, you generate on the fly, at runtime, for every query, the code you then need to compile for that particular query plan. So you no longer have to do any interpretation to figure out "I'm accessing this column, this column is this type, and therefore I need to invoke this primitive." You're baking directly into the query plan exactly what it should be doing, based on what you know the schema of the table is from the catalog. The HyPer approach is more than just compilation, and this paper gets into that: they're doing this bottom-up or push-based model where, at the bottom of every pipeline, you scan over the input table, or whatever the output of the pipeline below you is, and you go over the tuples one by one.
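Here is a minimal sketch of what those two VectorWise-style primitives could look like. This is my own illustration, not VectorWise's actual code; the function names and the convention of returning the number of matches are assumptions, and the caller is assumed to allocate the output vector to the maximum possible size, as discussed above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical primitive: scan a vector of string values and emit the offsets
// of the tuples whose value equals the constant we are looking for.
size_t select_eq_string(const std::string* col, size_t n,
                        const std::string& constant, uint32_t* out_offsets) {
    size_t matched = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] == constant) {
            out_offsets[matched++] = static_cast<uint32_t>(i);
        }
    }
    return matched;
}

// Hypothetical primitive: same idea for an integer column, but it only looks at
// the offsets that survived the previous predicate (the selection vector).
size_t select_eq_int32(const int32_t* col, const uint32_t* in_offsets,
                       size_t n_in, int32_t constant, uint32_t* out_offsets) {
    size_t matched = 0;
    for (size_t i = 0; i < n_in; i++) {
        uint32_t off = in_offsets[i];
        if (col[off] == constant) {
            out_offsets[matched++] = off;
        }
    }
    return matched;
}
```

The interpreter would call the first primitive on a vector of tuples, then feed its output offsets into the second one, which is exactly the stitching-together described above.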
Back to HyPer: what happens is, instead of scanning one tuple, doing one operation, and then going to the next tuple, for that pipeline you actually ride that tuple up the pipeline as far as you can until you reach the top or hit a pipeline breaker, and then you loop back down and go get the next tuple. The idea is that because we're pushing that tuple all the way up, we can maximize cache locality. In some cases we can actually guarantee that the tuple resides in the CPU registers, which are even faster than the CPU caches. And that's how they get better performance. As described in the 2011 HyPer paper, and in the newer system, Umbra, that the Germans have been building, this compilation model does not assume you're doing any vectorization: the push-based model grabs one tuple and rides it up the pipeline. But at the end of this lecture we'll see a technique that we developed that can combine both of them.

So this is just an example of what that query would look like in HyPer, just to do the scan. Now, instead of having two separate functions for each predicate, I'm going to have one generated function that takes in the string column, the integer column, and the two constant values I'm trying to compare against. Then I just iterate over one tuple at a time, and I do all the predicate evaluation for that one tuple before I go on to the next one. All right?

Okay, so again, the paper you were assigned to read is trying to answer this question: is it better to stitch together those precompiled primitives, as in VectorWise, or should we do the holistic compilation, as in HyPer, with that push-based processing model? The first part of the lecture is about understanding these two distinct design approaches, and the second part will finish up with a way of blending them together into a hybrid query processing model, okay?

All right, so the paper was about evaluating these two different approaches. The way they did this was to build a single test-bed system that allows a true apples-to-apples comparison of the two approaches, without the results being encumbered or affected by other features or implementation aspects of the database systems. What I mean by that is that they implement all the algorithms for the queries they're going to evaluate exactly the same, at a high level, for the two different approaches within this single test system. Then we don't have to worry about other things like how they do thread scheduling, how they represent integers or numerics or floating-point numbers, or how they compare strings. All of those things are the same across both systems, and that way we can focus on exactly how the two different approaches perform. So all the high-level algorithms are the same, but there will be some variations in the implementation details that are specific to each approach for these algorithms. The threading approach is the same, how they represent numbers is the same, but certain aspects of the operator implementations they're evaluating will be slightly different.
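To make the contrast with the primitives above concrete, here is a rough sketch of the kind of single fused loop a compiling engine could generate for that same scan-and-filter query. Again, this is my own illustration, not HyPer's actual generated code; the function name and output representation are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical compiled plan for:
//   SELECT * FROM foo WHERE str_col = 'ABC' AND int_col = 4
// One tight loop evaluates both predicates on a tuple before moving on.
void scan_filter(const std::string* str_col, const int32_t* int_col, size_t n,
                 const std::string& str_const, int32_t int_const,
                 std::vector<uint32_t>& out) {
    for (size_t i = 0; i < n; i++) {
        if (str_col[i] == str_const && int_col[i] == int_const) {
            out.push_back(static_cast<uint32_t>(i));  // tuple satisfies both predicates
        }
    }
}
```

Nothing is interpreted at runtime here: the column types and constants are baked directly into the generated code, and there is no intermediate offset vector between the two predicates.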
A good example is the hash join. The high-level hash join algorithm is exactly the same: you build a hash table and then probe it, and both systems use linear probing for the hash table. But the two systems end up using different hash functions for the hash table, because certain characteristics of those hash functions make better use of the two different architectures. In the case of the vectorized engine, they use Murmur2. It executes roughly twice as many instructions as CRC32, but it gets better throughput, which is better for a vectorized architecture. In the case of HyPer, since they're looking at a single tuple at a time when they hash it and put it into the hash table or do the probe, they want a really fast hash function that can operate on one tuple at a time. That's what CRC32 gives you, right? It executes about 40% fewer instructions than Murmur2, and that's better for the simple tuple-at-a-time loops that HyPer generates.

Okay, so the two systems they implement are variations, or simplified versions, of VectorWise and HyPer. The first one is called Tectorwise and the second one is called Typer. What's that? So the terrier's question is: why are they called Tectorwise and Typer, with a T, instead of the original names? The name of the first author is Timo, so you just put the T in front of both names. Thank you, terrier.

Okay, so in the case of Tectorwise, again, they break up all the operations you can have when executing queries into these precompiled primitives, and then stitch together the query plan in this interpreted model, where I know which primitive I need to invoke in order to execute the query based on the type of the column I'm processing. They also need to materialize the output of all the primitives at each step. So if I invoke a primitive and it generates a vector of offsets, I have to write that somewhere, and then I have to feed it as input into the other primitives that take in the previously matching offsets. In the case of Typer, or HyPer, they do LLVM JIT compilation with the push-based query processing model, where you take a single tuple and ride it up the pipeline as far as possible. For each step within the pipeline you don't need to materialize results in a separate buffer; the output of whatever step you're doing is used immediately as the input of the next one, without having to do a memcpy.

For the evaluation, they use five queries from TPC-H. TPC-H is a widely used analytical benchmark. They picked these five queries because, between them, they cover the different aspects that are emblematic of what real workloads look like. There's a great paper written by Thomas Neumann and Peter Boncz, the HyPer and VectorWise guys, where they analyze the different types of TPC-H queries, compare them against what real workloads look like, and come up with roughly five categories of standard or common patterns you see in queries, and which TPC-H queries cover each category. So these queries are not just picked at random.
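As a quick aside on the hash-function trade-off mentioned above, here is a rough sketch of the two styles. This is my own simplification, not the testbed's code; the seed constants and exact mixing steps are arbitrary, illustrative choices.

```cpp
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2 CRC32 intrinsic (compile with -msse4.2)

// A CRC32-based hash is only a handful of instructions, which suits a
// one-tuple-at-a-time probe loop like Typer's.
inline uint64_t hash_crc32(uint64_t key) {
    uint64_t lo = _mm_crc32_u64(0x8648DBDBull, key);  // arbitrary seeds
    uint64_t hi = _mm_crc32_u64(0x2F5870A5ull, key);
    return (hi << 32) | lo;
}

// A Murmur-style finalizer needs more instructions, but they pipeline well
// when you hash a whole vector of keys back to back, which suits Tectorwise.
inline uint64_t hash_murmur(uint64_t key) {
    uint64_t h = key * 0xC6A4A7935BD1E995ull;
    h ^= h >> 47;
    h *= 0xC6A4A7935BD1E995ull;
    h ^= h >> 47;
    return h;
}
```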
These five queries were selected for very specific reasons. Q1 is a really simple one: it's just a scan with fixed-point arithmetic, followed by an aggregation with a group-by over only a handful of groups. Q6 is just a scan with some selective filters. Q3 and Q9 both do joins, but they vary in the size of the build side and the ratio between the probe side and the build side: in Q3 the build side is much smaller than the probe side, while in Q9 the build side is still smaller, but the difference is not as significant as in Q3. And finally Q18 does a high-cardinality aggregation, which just means there are a lot of unique keys in the hash table for the group-by clause. It's going to have about 1.5 million groups, which is pretty massive, right?

So the first graph we're going to look at is just the raw performance difference between the two approaches. The first thing to point out is that for Q1 and Q18, HyPer is actually doing the best, right? HyPer is outperforming VectorWise. But for Q6, Q3, and Q9, the selective filtering scan and the two joins, VectorWise is actually doing better. These graphs are interesting: they tell us which one is better, but they don't really explain why. To understand why, to understand what these different models are actually doing as they execute these queries, we need to look at the low-level performance counters that the hardware provides us during query execution.

So this table shows the results of running this while collecting the hardware performance counters from Linux: cycles, instructions per cycle, total instructions, L1 cache misses, last-level cache misses, and branch mispredictions. The little star indicates which of the two systems from the previous graph had the best performance. This is again running the TPC-H queries at scale factor 1, so it's a database with about one gigabyte of total data. And all of these results are normalized by the number of tuples processed per query. So in the case of this one up here, HyPer shows 34 cycles, which means that for every tuple the database system had to process in order to execute this query, it took 34 cycles to complete whatever processing it needed to do.

The next thing to point out is that for Q1 and Q18, where HyPer was performing the best, we see a clear indication that the two systems are executing a different number of instructions per tuple. In the case of Q1, Tectorwise is executing 162 instructions per tuple versus 68 for Typer, and for Q18 it's 102 versus 46. So this is explaining why HyPer is performing better: it's simply executing fewer instructions than VectorWise. And this is to be expected, because these queries are computationally heavier than the other queries. HyPer gets better performance from the fixed-point arithmetic and from keeping the aggregation in cache, and because it doesn't have to materialize any intermediate values as it goes along, where VectorWise does, it ends up executing far fewer instructions.
It's also interesting to point out that while the instruction count is lower, the instructions per cycle is actually worse for HyPer. For instructions per cycle, higher is better; it means that for every cycle you're getting more done. HyPer is executing 2 instructions per cycle where VectorWise is doing 2.8. So you would think that VectorWise would be better here, because it's doing more work every single cycle, but the difference is that it's just executing way more instructions in total to process the query. That's why it loses to HyPer.

All right, the next ones we're going to look at are the two joins, Q3 and Q9. Again, if we look at instructions per cycle and the total number of instructions, just like in Q1 and Q18, HyPer is executing fewer instructions and doing fewer instructions per cycle. But we saw that VectorWise was actually doing the best here. So this shows that you can't just look at instructions per cycle or instruction count to understand why performance is good or bad. If we assumed that fewer instructions means faster, as we saw with HyPer on Q1, it's lower here too, so it should be faster; but it's actually not, right? VectorWise is faster here. So there's something else we need to look at to explain this performance difference.

If we look at the branch mispredictions from the CPU, we see that for Q3 the difference between the two systems is actually quite significant, right? So you would think, all right, this is now explaining why there's such a difference in performance for these joins. But the same observation cannot then be applied to the other join, Q9, because if we look at its branch mispredictions, they're about the same, and the difference in instructions per cycle and total instructions is about the same as for Q3. So there's something else going on with this query that's causing it to do better on VectorWise. The difference is that the total number of cycles per tuple is much higher than for the other queries. So what's going on here?

Well, when you're doing the hash probe for the join, because there are way more tuples in Q9 than in Q3, those probes are almost always going to be cache misses, because you're going out to a random location that may not be in your CPU cache. If your hash table is really large, you get a stall on the CPU because you have to go fetch things from memory, and there's not really any way for speculative execution to hide that. In HyPer, since you're trying to execute one tuple at a time, there's nothing else to speculatively execute: I can't proceed until I go get that next memory location in the hash table for the tuple I'm trying to probe or trying to insert, right?
In HyPer, the probe loop is more complex than it is in VectorWise, even though it's using a more efficient hash function. But because of the way VectorWise is set up, processing these vectors of data, it's able to amortize those memory stalls across every single vector. So again, the main takeaway from this table is that there isn't one approach, and there isn't one metric we can look at, to explain why one database system is going to perform better than another on these different queries. We have to take it on a case-by-case basis to understand what's going on. So the main findings, again, are that both of these models are actually quite efficient, and the explanation for their performance discrepancies, where one might be better than the other, can vary from one query to the next.

I should also point out, going back here, that the difference in performance between these two engines is actually quite small. This is probably the largest gap: say HyPer is doing 50 milliseconds for Q1 and VectorWise is doing a little over 80 milliseconds, so we're really only talking about a roughly 30 millisecond difference. In this other case, it's around 150 milliseconds, about a 50 millisecond difference. In the grand scheme of things, compared to existing or traditional database systems, this is a drop in the bucket. As the paper mentions, either of these approaches for in-memory databases is going to be two orders of magnitude faster than traditional database system approaches, like Postgres using the Volcano model, things like that. So we're really starting to split hairs here. The difference between these two systems in absolute numbers is not that great, although in relative numbers it can be significant. But the main takeaway is that maybe it doesn't matter that much. When you start counting cycles, things don't matter that much.

To give an example: my PhD student, Prashanth, after he finished the ROF paper, which I'll talk about next, spent maybe half a year trying to make hash joins go faster for in-memory databases. It got to the point where we were saying, oh well, the state of the art can do 12 cycles per tuple and we can do 11 cycles per tuple. At that point you don't really see a significant difference.

The other two main takeaways are that the data-centric approach in HyPer was better for the computation-heavy queries, where you have fewer cache misses, because you can have these tight loops that process the same tuple sitting in a register over and over again, and that's really fast. And the vectorized model was a bit better at hiding the cache misses when you're doing the hash join, on the build side or the probe side. Okay.

All right. So the next thing we want to talk about is how much SIMD is actually helping in the case of VectorWise. They make a big deal about the fact that because we're doing vectorized primitives and processing columns of data, we should be able to take advantage of SIMD to get better performance and use more efficient instructions to execute our primitives.
And so now we want to understand what benefit we're actually getting from using SIMD. For this, they implemented all the algorithms using AVX-512, so 512-bit registers. Contrast this with the Columbia paper we read, which did everything with 256-bit registers and AVX2, because AVX-512 only came out in 2017; this paper was done around 2018, and the Columbia paper was from 2016, so the Columbia paper didn't have AVX-512 available at the time. The other thing to point out is that with AVX-512, in addition to larger registers, Intel added additional instructions that make it easier to vectorize more of the implementations of our operators or primitives. So this goes a bit farther than what the Columbia guys were able to do as well.

There are really only two graphs to look at. The first breaks down the performance benefit you get from a vectorized, SIMD primitive compared to the scalar one. They take the primitives for hashing, gather, and the join probe; for each they have one implementation that's scalar code and one that does explicit vectorization with SIMD intrinsics. They run this on a one-gigabyte database, process the tuples with the two approaches in a single thread, and see what the difference in performance is. Just like in the Columbia paper, you can actually get a significant improvement in performance when you use SIMD. In the case of the hashing primitive you get a 2.3x improvement, which is actually a lot over the scalar version. For the gather, not so much, but for the join probe you get about 1.4x. Again, that's roughly what the Columbia paper reported, and that's a pretty good number for just changing this one primitive that you use all the time.

The problem, though, is that once you put this into a full system and the data you're processing no longer fits in your CPU caches, just like we said with the Columbia paper, all of this starts to fall apart. This graph shows the performance of executing Q3 and Q9, the full query plans now, invoking all of the different primitives. And now you see that if you incorporate the vectorized primitives we just looked at, you're not getting that over-2x performance improvement; you're getting at most about 1.1x. Because again, once things no longer fit in the CPU caches, once you have to incorporate the overhead of materializing the output, copying it from one buffer to the next or one SIMD register to the next over and over again, all of that coordination to use SIMD to execute the query becomes the main bottleneck, and you don't get the great performance benefit you saw when looking at the SIMD primitives by themselves. Right? So again, the main takeaway here is that, yes, SIMD is going to help, certainly when you look at the primitives by themselves, but when you look at the entire query holistically, it's not as significant.

All right, so the last thing we're going to look at for this paper is how well the compiler can support auto-vectorization. Remember when we talked about vectorization before, we said there were three ways to get it, right? You can do explicit vectorization, where you use intrinsics in your code.
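As an illustration of that explicit-intrinsics style, here is roughly what a gather primitive could look like with AVX-512, next to its scalar equivalent. This is a sketch of my own, not the testbed's code; it assumes 32-bit column values and a selection vector of 32-bit offsets, and it needs to be compiled with AVX-512F support (e.g. -mavx512f).

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX-512 intrinsics

// Gather primitive: for each offset in the selection vector, fetch the 32-bit
// column value at that position.
void gather_scalar(const int32_t* col, const uint32_t* offsets, size_t n,
                   int32_t* out) {
    for (size_t i = 0; i < n; i++) out[i] = col[offsets[i]];
}

// Same operation, but 16 offsets are gathered per iteration.
void gather_avx512(const int32_t* col, const uint32_t* offsets, size_t n,
                   int32_t* out) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i idx = _mm512_loadu_si512(offsets + i);       // load 16 offsets
        __m512i val = _mm512_i32gather_epi32(idx, col, 4);   // 16 gathered values
        _mm512_storeu_si512(out + i, val);
    }
    for (; i < n; i++) out[i] = col[offsets[i]];              // scalar tail
}
```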
You can also use compiler hints, like pragmas, the preprocessor directives that say, hey, for this function, don't worry about memory overlapping, go ahead and vectorize everything. Or you can rely on the compiler to figure out on its own that a certain function, a primitive in the case of VectorWise, can be vectorized, and hope that it does it for you. They evaluated three different compilers: GCC, Clang, and ICC. It turns out that ICC, the Intel C++ compiler, was actually the best at auto-vectorizing the different primitives in the vectorized implementation, and it was able to vectorize them with AVX-512 instructions. So you're using the widest registers, with the most lanes, for your SIMD evaluation, which means the amount of useful work you're doing per instruction or per cycle is maximal. It was able to auto-vectorize the hashing, the selection, and the projection. It was not able to automatically vectorize the hash table probing or the aggregation, because those are random lookups into a hash table and it's hard for a compiler to vectorize that.

For this graph, what they're showing is the reduction in the number of instructions executed for these different queries relative to an implementation that doesn't have any vectorization: auto-vectorization is turned off and there's no manual instrumentation of the source code, so it's just the scalar implementation of each primitive. What you see is that, with the exception of Q6 in the manual implementation, you get quite a significant reduction in the number of instructions you actually execute, right? In some cases, like Q9, you're getting an 82% reduction in instructions. This bar is the auto-vectorization, this one is manually written with intrinsics, and this one is a hybrid approach that uses the intrinsic implementation for the cases where the compiler couldn't figure things out. So across the board, especially in Q9, you get quite a reduction in the number of instructions.

But the problem comes when we actually run the queries: just because we execute fewer instructions doesn't mean we get better performance. The reduction in time here is again relative to the scalar implementation, and higher is better; anything above the middle line at 0% means I'm getting better performance. In the case of Q6, Q3, Q9, and actually Q18 as well, for many of these you're actually getting worse performance. In the case of auto-vectorization on Q9, the Intel compiler generated code that ended up being 14% slower than the scalar implementation, right? And the reason is that you pay a penalty: because the vectorized implementation is more complex and tries to do a bunch more work inside every single loop iteration, the cache misses just crush you. A more simplistic implementation that just iterates over one tuple at a time actually turns out to be better than what Intel can generate. Manually written vectorized code still does better for these other queries, but for this reason ICC actually generates bad code for us, for databases.
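Going back to the compiler-hint flavor mentioned at the top of this discussion, here is a small sketch of what a hinted primitive could look like. This is illustrative, not the paper's code: the pragma shown is the OpenMP SIMD hint (it needs -fopenmp-simd or similar to take effect), and the restrict qualifiers are the no-aliasing promise that makes it safe for the compiler to vectorize.

```cpp
#include <cstddef>

// Hypothetical projection primitive, e.g. price * (1 - discount).
// The __restrict qualifiers promise the compiler that the arrays do not
// overlap; the pragma asks it to vectorize the loop. Whether this actually
// becomes AVX-512 code depends on the compiler and its flags.
void project_discounted_price(const double* __restrict price,
                              const double* __restrict discount,
                              size_t n, double* __restrict out) {
#pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = price[i] * (1.0 - discount[i]);
    }
}
```

Loops like this, with no loop-carried dependencies, are the easy case for auto-vectorization; the hash-table probes and aggregations mentioned above are the hard case.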
I don't remember exactly what the paper said about Clang and GCC; I think in those cases the compiler couldn't automatically vectorize the primitives, so the result would be equivalent to the scalar implementation.

All right, so the next thing I want to talk about is that this paper makes a big assumption, at least until you get to the very end: that vectorization and compilation are mutually exclusive. If I'm going to build a database system, I either go vectorized or I go compilation, and I can't have the two together. There's a little blurb at the end, though, that says, well, you could combine them: Peloton, Carnegie Mellon's old database system, does actually do this, but they claim the engineering overhead of doing this is quite significant. And they have this graph, a quad chart, that shows the different design decisions over the spectrum: whether you're doing tuple-at-a-time or vectorized execution, and interpretation versus compilation. HyPer is over here because it's doing compilation, but it's tuple-at-a-time. VectorWise is over here because they're doing vectorization, but it's all through interpretation. And Peloton is sort of in the middle. That's roughly true, yes, because we are doing compilation and we do get some vectorization, maybe not to the full extent that VectorWise does, because they try to vectorize everything, but it shows that you can actually do this; you can combine the two of them.

All right, so let's look at how HyPer fuses pipelines. Again, they're doing operator fusion, which means that within a pipeline they want to fuse together, combine together, the different operators in that pipeline, so that within one iteration of a for loop in the pipeline, we execute as many steps on a tuple as possible before we go back and get the next tuple. The idea, again, is that we're riding the tuple up the query plan as far as possible. This also means that the pipeline itself is a tuple-at-a-time process, because for one tuple we execute the bunch of operators that are fused together, and then we go to the next tuple.

So take a really simple query: we do a scan, a filter, and then an aggregation, and then we emit the tuples as the output. HyPer could generate code that essentially looks like this: we scan through table A, evaluate the predicate on the age column, and then update the counter for our aggregation. The first pipeline, the first for loop, scans through every tuple in A, applies the filter, and then updates the aggregation as needed. The second pipeline just iterates over all the entries in our aggregation hash table, or dictionary, and produces that as our output.
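To ground this, here is roughly what that fused, two-pipeline code could look like. This is my own reconstruction, not HyPer's actual generated code; the column names (age, dept) and the COUNT(*)-style aggregate are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Sketch of fused pipelines for something like:
//   SELECT dept, COUNT(*) FROM A WHERE age > 30 GROUP BY dept;
void fused_pipelines(const int32_t* age, const int32_t* dept, size_t n) {
    std::unordered_map<int32_t, int64_t> agg;

    // Pipeline 1: scan + filter + aggregate, one tuple carried all the way up.
    for (size_t i = 0; i < n; i++) {
        if (age[i] > 30) {
            agg[dept[i]]++;  // hash-table update: the pipeline breaker
        }
    }

    // Pipeline 2: iterate over the aggregation table and emit the output.
    for (const auto& [key, count] : agg) {
        std::printf("%d,%lld\n", key, static_cast<long long>(count));
    }
}
```

So what are some problems with this approach if you want to try to vectorize it? Well, the first problem is that we're not able to look ahead in our tuple stream, so we can't overlap the computation work we're doing with the memory accesses we need to do. In the case of the filter, because we're looking at a single tuple per iteration of the for loop, we can't vectorize it; there's only one tuple available per invocation.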
The next issue is that the aggregation is a random lookup into memory, into this hash table, to update the counter. And when we're done, we loop back and go get the next tuple. So there are actually two places where we could have cache misses: one when we do the probe to update the hash table, and another one when we come back around and go get the next tuple. The first one may be unavoidable, because we don't know what that address is until we actually do the lookup, but we could certainly prefetch the next tuple, or the next vector we're going to process, so that by the time we come back around after doing all this computational work, the next tuple we're looking for is already available to us.

So that's the goal of relaxed operator fusion: we're trying to get the best of both worlds. We want vectorized processing, as in VectorWise, but in a system that uses query compilation and the push-based model that HyPer uses. The way this works is that we take a pipeline and decompose it into stages at the right locations, which allows us to build buffers of tuples, vectors of tuples, as intermediate output. We can then process that vector within a stage, and potentially prefetch the next vector we need for that stage while we go ahead and process the current one, right? So just like with operator fusion, we combine multiple operators within a stage, and we pass data from one stage to the next through the CPU cache by having these stage buffers reside in the CPU caches. Our vectors are not going to be so large that they have to spill out to memory, so we're not worried about stalling on memory. The idea is that the stages become our natural boundaries for how we do vectorization and how we do fusion.

So go back to our example from before, with our two stages. The filter is the part we can actually vectorize, because we know we can vectorize filters; we saw that in the Columbia paper and in the evaluation paper from today. So that's what we want to vectorize. To make this work, we introduce a stage buffer that allows us to process the tuples in stage one in a vectorized manner, and we write the tuples that survive the filter out into the stage buffer, because we're processing multiple tuples at a time. Then we can take this buffer and invoke the next stage in our pipeline to do the aggregation. We can do that in either a vectorized or an unvectorized manner; in this case it would be unvectorized because it's doing a hash table lookup, a hash probe.

So if we go back to our query code, in the first stage we iterate over table A in chunks of 1024 tuples, do a SIMD comparison across that vector, and write the matches out into the stage buffer. In the next stage we just iterate over the stage buffer produced by the previous stage and update our aggregation; in this case that's a scalar implementation. And the last pipeline only has one stage: we iterate over the dictionary, the aggregation table produced by the previous pipeline, and emit it as our output.
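Here is a sketch of what that staged version could look like, again my own illustration with the same made-up column names as before; the 1024-tuple stage buffer matches the vector size mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

constexpr size_t kVectorSize = 1024;  // assumed stage-buffer size

void rof_pipeline(const int32_t* age, const int32_t* dept, size_t n) {
    std::unordered_map<int32_t, int64_t> agg;
    uint32_t stage_buf[kVectorSize];  // offsets of tuples that passed the filter

    for (size_t base = 0; base < n; base += kVectorSize) {
        size_t chunk = std::min(kVectorSize, n - base);

        // Stage 1: filter over the chunk. This loop is the part a compiler or
        // explicit SIMD can vectorize, many tuples per instruction.
        size_t matched = 0;
        for (size_t i = 0; i < chunk; i++) {
            if (age[base + i] > 30) {
                stage_buf[matched++] = static_cast<uint32_t>(base + i);
            }
        }

        // Stage 2: scalar hash-table update over the surviving offsets.
        for (size_t i = 0; i < matched; i++) {
            agg[dept[stage_buf[i]]]++;
        }
    }

    // Final pipeline: emit the aggregation table.
    for (const auto& [key, count] : agg) {
        std::printf("%d,%lld\n", key, static_cast<long long>(count));
    }
}
```

The stage buffer stays small enough to remain cache-resident, so handing tuples from stage one to stage two does not go out to memory.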
So the magic that makes this work, that avoids the cache-miss stalls we saw, is that for this stage we're going to use software prefetching. With software prefetching, the idea is that on x86 we can explicitly tell the CPU, hey, we're probably going to need this chunk of memory next, go ahead and prefetch it for us. This is different from hardware prefetching. Hardware prefetching is where the hardware tries to recognize, oh, I see you're doing a scan over some stride of memory, you accessed this chunk just now and it's very likely you're going to access the next chunk, because it infers that you're in some kind of loop, so it goes ahead and prefetches it for you. Here, we're explicitly telling the CPU: we want this chunk of memory, go ahead and prefetch it for us.

Now, this doesn't come for free. On x86 there's a limit on the number of outstanding prefetch requests you're allowed to have before the CPU starts dropping or ignoring them. And you can certainly do this incorrectly, so that you prefetch things you don't actually need, or you prefetch them too soon or too late; by the time you come back around in your for loop to actually process that next chunk of data, either it was there and got evicted, or it hasn't arrived yet. So it's not a magic thing you just invoke; you have to be mindful of the amount of work you're going to do before you need the next piece of data, and you have to time it correctly. But in a database system we know what the query is trying to execute, we know what the data looks like, and we know what the hardware looks like, so we can actually code-gen these prefetch sizes and the timing just right, on a per-query or per-system basis. The idea is that in our for loops we prefetch the next chunk of data we're going to access within a stage, so that while we do the work for our fused operators and write into the stage's output buffer, by the time we come back around, the data we're looking for is available. There are different software prefetching techniques you can use; here we just do group prefetching because that's the simplest, right? Just saying, here's the chunk of data, go ahead and prefetch it for us.

All right, so now let's look at the performance we get in our system when we do the regular push-based implementation, like in HyPer, versus introducing relaxed operator fusion with the prefetching. We're running Q1, Q3, Q13, Q14, and Q19 on a 10-gigabyte database. In Q1 you don't actually see that big a difference, because this query is computationally heavy and we're already trying to do as much work as possible within a single for loop, so the prefetching and the fusion stuff doesn't really help you here. But in the case of Q3, where there are joins, it actually does help, and the same for these other ones, right? So this one is an example where prefetching doesn't help, and this one is an example where prefetching does help. I should also be clear that these bars here may just be using SIMD without necessarily using prefetching, while this one is using SIMD with prefetching.
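To make the prefetching idea above concrete, here is an illustration of explicit software prefetching inside a probe-style loop. This is my own sketch, not Peloton's generated code; the prefetch distance and the hash-to-bucket mapping are assumptions, and in a real system the distance would be tuned (or code-generated) per query.

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_prefetch

constexpr size_t kDistance = 8;  // assumed prefetch distance (a tuning knob)

// While working on the current element, ask the CPU to start pulling in the
// bucket we will need a few iterations from now.
int64_t probe_with_prefetch(const int64_t* buckets, size_t num_buckets,
                            const uint64_t* hashes, size_t n) {
    int64_t result = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + kDistance < n) {
            size_t future = hashes[i + kDistance] % num_buckets;
            _mm_prefetch(reinterpret_cast<const char*>(&buckets[future]),
                         _MM_HINT_T0);
        }
        result += buckets[hashes[i] % num_buckets];  // the random access we hide
    }
    return result;
}
```

If kDistance is too small the data hasn't arrived when you need it; if it's too large it may be evicted before you get there, which is exactly the timing problem described above.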
So let's look now, across all the different things you can do in a compilation-based engine, at SIMD, ROF, and ROF with prefetching, and how these techniques accumulate to get better performance. Again, this is running in Peloton, which is our old system. This column here is saying: if you don't do any compilation, with just the interpreted engine, what's the upper bound on how bad the system can be? And this is actually terrible, right? It runs in 21,000 milliseconds, 21 seconds. That is stupidly slow for a 10-gigabyte database that's all in memory. When you switch to the LLVM compilation, you reduce the execution time by 97% relative to the interpreted engine. Now, this result is real, it actually did happen, but don't get the impression that this is indicative of what you'd get if you take any interpreted engine and add compilation. This was a crappy interpreted engine against a really well-written compilation engine; in reality it's usually more like a 50% reduction.

But within the compilation engine we can see how much better things get as we add these techniques. For Q19, this one query, when you add ROF with SIMD, so the stage buffers and vectorized execution for some stages, you reduce the execution time by 65%, and then when you add the software prefetching you reduce it by a further 3.5%. Now, 3.5% compared to a 65% drop or a 97% drop doesn't sound that significant, but at this level, with such short query execution times, you're just shaving off milliseconds, shaving off cycles; there isn't that much more room to strip query execution down further and get better performance.

Okay, so just to finish up: we saw that in the case of VectorWise and HyPer, neither approach was significantly better than the other, and the differences in performance that you did see were attributable to the penalties paid for branch mispredictions, cache misses, and the instructions per cycle, which vary per query. What I would say is that these results show that either one is a reasonable choice, but with ROF you can get the best of both worlds: the push-based compilation approach you saw in HyPer, and also the SIMD vectorization you saw in VectorWise, blended together nicely. And I disagree with the conjecture they made in the paper, even though I was a co-author on it, that implementing something like ROF would be too difficult. Sure, in the case of the testbed from that paper, maybe it was too hard to add, but if you're building a new database system to use ROF, it's not something everyone has to reason about and re-implement over and over again. Once you've implemented it as part of the codegen engine you've built, it sort of takes care of itself. So in the new system we're building, currently temporarily named after the dog, we support this now.

All right, so again, this has been awkward, this has been tough. Hopefully everybody is safe and feeling healthy and taking care of everything.
For the class, again, we will have office hours today at 1:30 over on Zoom, but if you need to meet with me beyond that, just send me an email and we'll try to find a time to get together, and post any of your questions about the projects on Piazza. So next class, on Wednesday, we'll discuss hash joins. This will be a two-part lecture: we'll do hash joins on Wednesday, and the following week we'll do sort-merge joins and how to do parallel sorting. So we'll go a little further and ask: all right, how do you actually implement modern join algorithms using either the vectorized or the compilation approach? Okay. All right, hopefully everyone's taken care of and you have enough food and toilet paper and everything's fine. Okay, see you guys.