Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Today we're going to start talking about how to actually execute queries. The last two classes focused on what the data is going to look like: we were designing our encoding schemes so that when we start running queries, we minimize the amount of data we have to fetch from disk or a remote object store and bring into memory. And we can be clever about encoding our data in ways that ideally let us do our processing on it in its encoded or compressed form; RLE is an obvious one, but we showed other ways to do this as well. So for the next couple of weeks we're going to talk about how we're actually going to execute queries. And again, in the OLAP world it's all about sequential scans. We're not going to do index lookups, and we're ignoring bitmap indexes and other things. There isn't going to be a B+Tree to go find us single records. We're going to have to scan through large chunks of data. So this is the list I showed a couple of weeks ago: the bag of techniques or tricks we can use to make sequential scans run faster. As I said, we're not going to discuss materialized views or clustering and sorting this semester. But we've already discussed data encoding and compression. We've already discussed a little bit about data skipping, like how to use zone maps that record the min/max of a giant block of data, so you can check whether the tuple you're looking for, or any tuple that could match, might even be in that block. For the rest of the semester we're going to go through all of these, and it's not going to be "here's a lecture on task parallelization, here's a lecture on query parallelization." These will come up at different points throughout the entire semester, because sometimes we'll delay discussing certain things and then bring in this bag of tricks when we discuss how to do joins efficiently, or how to do code specialization, and so on. So at its core, what this class is really about is how to build a database system that runs efficiently on your data for a given set of queries. The idea is that we want to make full use of the hardware available to us so we can run queries fast and at a lower cost than doing something naive. And of all the things I showed on the last slide, there's not one I can point to and say: hey, if you're building a brand new system, here's the one thing you want to do above everything else. All of these things matter. They're cumulative, or multiplicative, in that we can stack them on top of each other and get better results and make things run faster. So it's really about understanding, from an engineering perspective, the trade-offs each of these techniques makes, both in terms of performance and in terms of engineering cost: how much time is it going to take someone to build this, and maintaining it is another big problem as well. The spoiler is that just-in-time query compilation gives amazing results — you'll get really fast — but it's a big pain to build and maintain. And so most systems are going to choose pre-compiled primitives or operations instead.
This is actually what the VectorWise paper does; I don't think they mention it explicitly in that paper, but it's one of the techniques they used. So this isn't a scientific ranking, but from the previous list, these are the three optimizations that I think are going to matter most for query execution. Obviously data skipping helps too: if you're looking for something that can't be in any block and your zone map lets you avoid reading any data, nothing can go faster than reading nothing. But if you actually do have to run queries, these are the ones I think matter the most. So we'll spend a little bit of time talking about vectorization — not the actual VectorWise algorithms for query operators, that comes in a week. Today is really about setting up the query processing model so that we can then organize data in a way that lets us vectorize operations over it. Task parallelization, which we'll talk a little bit about today, is basically how to take a query, break it up into disjoint tasks, and run them in parallel on different cores, different threads, different nodes. And then code specialization will be a big thing starting next week: basically, how can we avoid giant switch statements and indirection in our database system by having exactly the instructions the query wants to use to process the data? We'll see the two ways to do this. So at a high level, the optimization goals are the following three things. To get queries to run fast, the most obvious thing we can do is reduce the number of instructions we have to execute: we want to use fewer CPU instructions to do the same amount of work. The compiler can help a little bit. If you pass the -O2 flag — I don't know what the equivalent is in Rust, but if you pass -O2 to GCC or Clang — it'll be more aggressive in trying to optimize things so that you get fewer instructions. As far as I know, people typically don't ship production databases compiled with -O3, because things can get kind of hairy: it can reorder things in ways that end up being incorrect. And so instead, what we're going to do is try to design our databases and our execution engines to just use fewer instructions. And if you don't take my word that you don't want to ship -O3, this is from the Linux kernel mailing list two or three years ago, where Linus says he thinks -O3 is generally unsafe. So again, as far as I know, most database systems ship with -O2 compilation. All right, so after we've reduced the number of instructions we use to execute queries, the next thing we can do is try to reduce the number of cycles per instruction. The idea here is that when we actually execute instructions to run a query, we want the data they're going to operate on to be in the L1 or L2 cache — even better, in CPU registers. That means we want to reduce cache misses due to memory loads and stores. We want to maximize the locality of the data we process in our operators and our query plan so that it sits in the CPU caches, and we'll see how to do this through pipelining and, more aggressively, with operator fusion in push-based query execution.
So the thing that makes this one tricky — everyone can reason out the first one: yeah, run fewer instructions, don't do stupid things, don't make library calls and start computing pi unnecessarily while running a query. That's obviously dumb. This second one is a bit more subtle. The reason is that the way we as humans naturally write systems code is not always the way the CPU actually wants to run instructions — because of the out-of-order superscalar CPU, which we'll cover in a second. What is ideal for humans to reason about and maintain may actually be the worst thing for the CPU. So we have to look at the algorithms we use when we run our queries or build our system, make sure we account for what the CPU expects or wants, and try to design the code in such a way that it generates that for us — because the compiler isn't always going to magically do it for us. All right, and the last one is sort of obvious as well: we want to parallelize execution. Moore's Law is more or less ending. We're not getting faster clock speeds, although Intel has more recently been ratcheting that up, but we are going to get a lot more cores. And the cool thing is that on newer CPUs there's a mix of high-performance cores and efficiency cores, so now we can, in theory, start scheduling things on one kind of core versus another. And then you throw in GPUs, and those things have tens of thousands of cores, which is insane. So we're going to cover all of these throughout the semester. Today we'll talk a little bit about the second one, and then this last one a little bit; the first one we'll see when we talk about code specialization, so query compilation — pre-compiled and JITted. All right, so to make sure we're all using the same terminology or vernacular when we're describing queries and what we're actually executing: you can think of a query plan as a DAG of operators. So we have a SQL query here, and we converted it into a physical plan. We have scans on the bottom, then filters feeding into a join, and then a projection on top. So these are the operators we're going to have in a query plan, and then the database system is going to convert them into operator instances, which are invocations of that operator. The reason we have to distinguish between operator instances and operators is because we could have an operator run in parallel. So if this table is a billion tuples, I could divide this scan operator into 10 operator instances that each scan different row groups or different files in S3. And a task is going to be a sequence of one or more operator instances. Again, this will often be the same thing as a pipeline, but not always. Basically, you're going to recognize: oh, as soon as I do the scan, I immediately want to do the filter, so I can combine those two operator instances together into a single task, and that gets scheduled by the system to run. And then a task set is just the collection of these executable tasks for the pipeline that we can ship out to the different cores. So pipelines are going to be an important part of what we talk about today and going forward.
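To make that breakdown a bit more concrete, here's a minimal sketch (not from the lecture) of how a scan might be split into operator instances that get grouped into tasks; the `RowGroup` handle and the functions passed in are hypothetical placeholders:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: a scan operator split into operator instances, each
// covering a disjoint slice of the table (e.g., a row group or a file in S3),
// and grouped into tasks that a scheduler can hand to different workers.
struct RowGroup { size_t first_row; size_t num_rows; };   // placeholder handle

using OperatorInstance = std::function<void(const RowGroup&)>;

struct Task {
  RowGroup input;                        // the slice this task works on
  std::vector<OperatorInstance> ops;     // e.g., {scan, filter} fused into one task
  void Run() const { for (const auto& op : ops) op(input); }
};

// A task set is just the collection of tasks for one pipeline.
std::vector<Task> MakeScanTasks(size_t table_rows, size_t rows_per_task,
                                OperatorInstance scan, OperatorInstance filter) {
  std::vector<Task> tasks;
  for (size_t start = 0; start < table_rows; start += rows_per_task) {
    size_t n = std::min(rows_per_task, table_rows - start);
    tasks.push_back(Task{RowGroup{start, n}, {scan, filter}});
  }
  return tasks;
}
```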
And so again, a pipeline is a segment of the query plan through which we can push a single tuple, or a batch of tuples, until we reach an operator that needs to see all of the other tuples in the pipeline before anything can proceed further up the plan. So on this side here, we're doing the scan on A, then the filter, and then the build side of the hash join. The build side is part of this pipeline: I can't send any tuple up beyond the join until I see what comes in on the other side. So I execute pipeline one — whether it's a single task or multiple tasks running in parallel doesn't matter at this point — and once that's complete, I can run pipeline two: I do the scan on B, the filter, and then the probe of the hash join. Now any tuple that matches in the join can be pushed up to the projection operator as part of the output. I don't want to start probing the join for this query until I've populated everything from the A side, because otherwise I could have false negatives. Now, I'm showing the pipeline going all the way up on one side. That's ideally what we want, to maximize the reuse of data and minimize the number of cycles per tuple as we go along. But I could instead have done the scan on A, filtered it, materialized the output; scan on B, filtered it, materialized the output; and had those be two separate pipelines, with a third pipeline that actually does the join. That would be slower, because you're moving a bunch of data between those pipelines, whereas it's better to have a pipeline that tries to get all the way to the top. We'll see why this matters when we get to operator fusion and other techniques. Yes? [Question] Wouldn't it allow you to potentially start executing the join on another CPU core while you're still executing the filter? So you're asking: couldn't you start running the join at the same time as you're still scanning A and B? [Student] Yeah, there would be more memory traffic, certainly, but the whole thing might finish faster because you're taking better advantage of parallelism. So let's say — ignore parallel threads, ignore multiple operator instances, this thing is running by itself — that the very last tuple you see in A is the only tuple that's going to match in B. If I start probing into the hash table before I finish scanning A, I could get a false negative, because I didn't put that last tuple in. So no, you can't do that. What's that? You're still building A, but you can't check whether something exists until it's fully populated. And his question was: can you start pipeline two before you finish pipeline one? That runs into the problem I mentioned earlier: if I start scanning B and filtering it, what do I do with that output? It has to go somewhere, so I start writing to disk or memory, and that creates pressure on the overall system, when it would have been better to populate the hash table first and then run B. Okay. So, lots of stuff today, but I want to first start talking about the paper you guys read.
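Here's a minimal sketch of that pipeline boundary, not the lecture's code: pipeline one builds the hash table from A, and pipeline two only starts probing once the build has seen every tuple of A. The `TupleA`/`TupleB` structs, the filters, and the use of `std::unordered_multimap` as the hash table are all assumptions for illustration:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct TupleA { int32_t id; int32_t val; };
struct TupleB { int32_t id; int32_t val; };

std::vector<std::pair<TupleA, TupleB>>
HashJoin(const std::vector<TupleA>& A, const std::vector<TupleB>& B) {
  // Pipeline 1: scan A -> filter -> build. The build side is the pipeline
  // breaker: nothing can move past the join until every tuple of A is in.
  std::unordered_multimap<int32_t, TupleA> ht;
  for (const auto& a : A) {
    if (a.val > 0) ht.emplace(a.id, a);          // some filter on A
  }
  // Pipeline 2: scan B -> filter -> probe -> emit, all in one pass, with no
  // materialized intermediate result between those operators.
  std::vector<std::pair<TupleA, TupleB>> out;
  for (const auto& b : B) {
    if (b.val > 100) {                           // the WHERE clause on B
      auto [lo, hi] = ht.equal_range(b.id);
      for (auto it = lo; it != hi; ++it) out.emplace_back(it->second, b);
    }
  }
  return out;
}
```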
This is an older paper, but it's very important — very seminal — about why the designs of database systems at the time the paper was written, in 2005, are insufficient if you want to run high-performance OLAP queries. Now, you may be wondering why I'm making you read a paper that is almost as old as some of you here — it's 19 years old now. It's because that paper is seminal: every OLAP system around today, for the most part, followed the design guidelines laid out in that paper. With some exceptions, like the Itanium stuff, which we can talk about in a second, the core ideas still matter a lot. Then we'll talk about processing models; the plan processing direction, whether it's bottom-up or top-down, push versus pull; and filter representation — we'll talk about vectorization next week, but basically, when I start applying predicates and some tuples don't match, what do I actually store when I go from one operator to the next? And if we have time, we'll finish with the different modes of parallel execution. The idea, again, is how you architect the system so that you can run these tasks, these operator instances, in parallel. And then in a few weeks we'll cover how you actually implement the algorithms themselves to do parallel execution, like for joins and sorting. All right, so the MonetDB/X100 paper is from 2005, and it's essentially a low-level analysis, for in-memory workloads, of the bottlenecks you're going to face when you run OLAP queries. The big idea was that they looked at all the existing systems at the time and showed that if you want to run OLAP queries — large scans, joins, and so forth — the existing systems were not well designed for the modern out-of-order superscalar CPU architectures that Intel was putting out at the time. And the point is that if you redesign your database to better target what the CPU wants from you — how data flows through the system, what instructions you call and when — you get much, much better performance, because you're designing the system in a way that makes the CPU happier. Instead of you as a programmer making things easier for yourself and harder for the CPU, you make life slightly harder for yourself and the CPU can run much more efficiently. The background of the story is that there was this project out of CWI, where this paper came from, called MonetDB, which had started in the '90s. They were doing experiments on it and realized that the way its processing model sent entire columns from one operator to the next was terrible for the CPU: there was all this indirection, and things were much slower. So they built this X100 prototype. They then spun it out as a startup called VectorWise, which was acquired by Actian in 2010. It then got rebranded as Vector, which I think is a terrible name for a database, because if you search for "vector database" you're not going to get this thing — you're going to get Weaviate and Pinecone and all these other ones.
But then the cloud version of VectorWise is now called Avalanche. Then Actian got bought by, I think, an Indian holding company — HCL — two years ago-ish. And so it's still there. Actian is also what Ingres became: Ingres got bought and sold over the years and at some point got rebranded as Actian, and they kept these older databases around. VectorWise was the high-performance columnar engine for Ingres, and then it got rebranded. Anyway, the reason I had you read this paper, even though it's from 2005, is because this is how you want to design a system even today. Now, there's all this other stuff about Itanium, which I'm assuming is a CPU architecture no one remembers — actually, who here has heard of Itanium? One. Or three, right? It was another superscalar CPU from Intel, in collaboration with HP, from the 2000s. It was meant to replace x86. It had this massive pipeline and did things a bit differently from how Xeons do today. But it didn't go anywhere; they killed it off. We'll see this with some other papers too: there's other Intel hardware that people targeted that doesn't exist anymore, and we don't care about it. But the high-level ideas still matter. And just to show you that this paper, even though it's from 2005, is still timely: earlier this year at CIDR, this paper won the Test of Time Award, because the database research community recognized how important it was and how much influence it has had in the database marketplace for OLAP queries. So that, right there, is Peter Boncz — the guy who did the early work on MonetDB and the early work on VectorWise; now he's technically an intern at MotherDuck, and he did early work on DuckDB. Niels is, I think, the CEO of the MonetDB company. That's Marcin Żukowski: after VectorWise got bought by Actian, he went and co-founded Snowflake, and a lot of the ideas in this paper are what Snowflake is based on. We'll cover that later. That's Magda; she's at UW. And that's Volker Markl — he invented Apache Flink. We're not going to cover Flink, but anyway: this paper is super, super influential. All right, so this is a crash course in what CPUs look like and what matters for us as database people — everything you need to know in two slides, at a high level. The CPU organizes the execution of instructions into pipeline stages, and its basic goal is to keep the pipeline busy at all times: there should always be something to do. That means that if an instruction can't complete in a single cycle — say there's a cache miss and it has to go fetch something from memory — the CPU is going to try to keep executing other things in the pipeline. And in a superscalar CPU architecture, there are multiple pipelines running at the same time in parallel. So instructions run slightly out of order, meaning they may execute in a different order than they appear in the code.
But then the CPU — at least in the case of Xeons — does a bunch of extra work to make sure that, once all the data you needed arrives, the output of those out-of-order instructions matches what you would have gotten if they had run in order. AMD does essentially the same thing; all the superscalar CPUs do. The GPU cores do not. [Question] So what precisely makes this a superscalar CPU — is it just multiple cores? No — multiple pipelines. Within one core, you'll have multiple pipelines. [Question] When did CPUs become superscalar? The 1990s. But GPU cores, I don't think, do this; they're in order. Right, because this is actually very complex to do. It's basically the same thing as optimistic concurrency control for transactions: you assume everything is going to be okay, you let things run sort of speculatively, and then you check at the end whether it actually matched. This is fantastic — everything works great when you get it right. But if the CPU recognizes there's a dependency, like it needs the output of one instruction before it can do the next, or if it makes a misprediction — meaning there's an if-clause, a branch instruction, and it tries to predict which path down the branch you're going to take and speculatively executes that path — then if it gets it wrong, it has to flush the pipeline, roll everything back, and restart. And that's really expensive. So these stalls can occur in the two ways I just described. One is dependencies: if one instruction depends on the output of another instruction, you can't immediately put it into the same pipeline; you have to stall and wait. Yes? [Question] Does one pipeline correspond to one core? No — every core has multiple pipelines. The pipelines are short; I think the latest Xeons have something like 20 stages. In the X100 paper they talk about how the Pentium 4, I think, had around 31 stages, which was insane. They're more reasonable now. But again, if the CPU predicts something wrong, you've got to flush the pipeline, undo things, and stall until you bring back the instructions you should have executed. In the case of the first one, dependencies, this could occur if we're scanning a tuple and we need to store the data in an output buffer before we can execute anything else: we need to know the result of that computation before we can go on to the next thing. Yes? [Question] So the first one says it cannot be pushed immediately into the same pipeline? Right — you basically have to wait until you figure out what's going to happen, then you put it in. Although, actually, I don't know whether it could keep a second pipeline staged and switch over as soon as the first result is known. I don't think it does that. All right.
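This isn't from the paper, just a tiny illustration of the dependency point: in the first loop every add has to wait for the previous one, while the second version keeps four independent running sums, giving the out-of-order core independent work it can keep in flight (compilers can sometimes do this rewrite themselves, but it's not guaranteed):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

int64_t SumDependent(const std::vector<int32_t>& v) {
  int64_t sum = 0;
  for (size_t i = 0; i < v.size(); i++) sum += v[i];   // each add waits on the last
  return sum;
}

int64_t SumUnrolled(const std::vector<int32_t>& v) {
  int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= v.size(); i += 4) {                  // four independent chains
    s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
  }
  for (; i < v.size(); i++) s0 += v[i];                // leftover elements
  return s0 + s1 + s2 + s3;
}
```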
The second one is branch prediction. This basically means that if there's an if-clause or some kind of conditional statement, the CPU is going to try to predict which way you're going to go. This part is super sophisticated in CPUs from AMD and Intel — it's the secret sauce of the CPU, and they don't share what exactly the branch predictor is doing. So the idea is that if we build our system in a way that has a lot of conditionals, we may make things worse for ourselves, because — think of scanning data — you don't actually know what path you're going to take: it depends on what the query is, on the conditionals in the WHERE clause, and on what your data looks like. There's no easy way to predict, for every single query, which path you'll take through the different conditionals. So, because the CPU has these long pipelines, it tries to speculatively execute branches — it wants to hide the long stalls between pending instructions while it goes and fetches things from the L3 or L2 cache into registers. The one spot in the database system where this comes up a lot is the basic filter operation when we do sequential scans. As I said, it depends on the filter conditional: "where something equals something" is an if-clause, and whether that predicate evaluates to true depends on the data. So this is nearly impossible for us, the people writing the database system, to predict — let alone for the CPU, which doesn't know anything about what a database system or a query is. Now, there is a compiler hint you can potentially use for this. I don't know how to do it in Rust, but in C++ there's something in the standard: likely and unlikely. These are compiler directives that let you specify whether a conditional clause, a code path, is likely or unlikely to be taken. C doesn't have this; I think they deliberately avoided it. Again, I don't know whether Rust has it. I did some quick searching to see which systems actually use it: ClickHouse has it, Postgres has it; I know DuckDB does not, because they're trying to be portable. But these are not hints to the CPU. You can't tell the CPU's branch predictor, "hey, I'm likely to go down this path." I think Intel had some capability for this in the early days, but around 2006 they took it out. So this is just a hint to the compiler, which may reorganize your code so that the likely path sits at the top. And if you read this blog article — from, I think, a compiler engineer at Intel — he basically says: don't use this; sometimes it actually makes things worse, and it rarely makes things better. The interesting thing is that you could take ClickHouse, or any of these systems that use it — they're all just #defines, so you can just override them — and see whether it actually makes a difference.
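Here's a minimal sketch of the two common spellings of that hint — the GCC/Clang builtin that the LIKELY/UNLIKELY macros in systems like Postgres and ClickHouse typically wrap, and the C++20 attribute. The function and its logic are made up for illustration; and, as the lecture says, these only influence how the compiler lays out the code, they do not program the CPU's branch predictor:

```cpp
#include <cstdint>

// GCC/Clang builtin form -- what LIKELY()/UNLIKELY() macros usually expand to.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int32_t ParseOrFail(int32_t raw) {
  if (UNLIKELY(raw < 0)) {
    return -1;          // rare error path, pushed out of the hot path
  }
  return raw * 2;       // common path falls through
}

// C++20 attribute form of the same hint.
int32_t ParseOrFail20(int32_t raw) {
  if (raw < 0) [[unlikely]] {
    return -1;
  }
  return raw * 2;
}
```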
[Question about whether the hint exists in Rust] I have no idea — okay, it's unstable. But it doesn't help the CPU branch predictor. [Question] Is it for ASICs, or for embedded? For ASICs — that's specialized hardware; I have no idea. For Xeons, this doesn't do anything for the CPU. [Student] For FPGAs — for the network boxes — they take heavy advantage of this. Yeah, but again, I'm telling you, you can't tell the CPU that, so I don't know what they're doing there. It's a compiler directive; it's up to the compiler. It's just changing the order of the assembly so that the likely paths are closer to the top when you go into the conditional. I don't know whether that's custom hardware or not, and I don't know whether this matters for database systems — not every system uses it; Postgres does, ClickHouse does, and a few others do. [Question] Are there, or were there, actual CPU hints? Yeah — there used to be an opcode you could use to tell the CPU, but that was 2005 and earlier; Intel supported it back in the day, but not now. Partly because people were stupid with it, too: the CPU's branch predictor, that secret sauce I've been talking about, has been designed over 20 or 30 years to figure out which paths you're actually going to take, so when you hand-hint it you can mess it up. Now let's look at a case where, even if you had it, it wouldn't help. Here's a simple query: SELECT * FROM table WHERE key > low AND key < high. This is how you'd probably write the code — I don't like showing code on slides, but this is simple enough. This is how you'd write it in BusTub or any basic implementation: you have a for loop that iterates over every tuple in the table, you grab the key, and then you have the if-clause — if key is greater than low and key is less than high, you copy the tuple into the output buffer and increment the buffer offset by one. So what's the problem for the CPU in this code? The if-clause. Right. The if-clause becomes a branch, and the CPU has to try to predict whether it's going down that path or not. So you can rewrite this as a branchless version, where the very first thing you do as you scan the table is immediately copy the tuple into the output buffer — you don't check whether it satisfies the predicate first. Then you have this expression where you check the low and the high, and these are comparisons that return one or zero; you combine the bits, and that tells you whether the delta is zero or one. If it's zero, then when you loop back around you just overwrite the last thing you copied. And again, we said all our columns are fixed length, so I don't have to worry about overflowing or underflowing the buffer from one tuple to the next: I just take the bits, plop them down, and overwrite the previous one.
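Here's a minimal sketch of the two versions just described — not the slide code — assuming fixed-length tuples that are just int32 keys and an output buffer pre-sized to hold every key:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// SELECT * FROM table WHERE key > low AND key < high
// Both functions assume out.size() >= keys.size() and return the match count.

// Branching version: the if-clause becomes a branch the CPU has to predict.
size_t SelectBranching(const std::vector<int32_t>& keys, int32_t low, int32_t high,
                       std::vector<int32_t>& out) {
  size_t n = 0;
  for (int32_t key : keys) {
    if (key > low && key < high) {
      out[n++] = key;
    }
  }
  return n;
}

// Branchless version: always copy, then advance the output offset by 0 or 1.
// A non-matching tuple just gets overwritten by the next copy, and since the
// offset only advances on a match, trailing copies never count as output.
size_t SelectBranchless(const std::vector<int32_t>& keys, int32_t low, int32_t high,
                        std::vector<int32_t>& out) {
  size_t n = 0;
  for (int32_t key : keys) {
    out[n] = key;                                              // copy unconditionally
    n += static_cast<size_t>(key > low) & static_cast<size_t>(key < high);
  }
  return n;
}
```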
Now, I'm missing a little piece at the end on the slide: when I come out of the for loop, I have to check whether the last delta was zero or one and make sure I don't include that last blind copy in the output. So this seems bizarre, right? As humans, we think this looks super wasteful — you're copying every single time; surely that's more expensive than the if-clause. But it's not, because if this is just deterministic straight-line code, the CPU can rip through it way, way faster than it can with branch prediction. Yes? [Question] Is there any advantage to using ternaries instead of just dealing with Booleans? So the question is whether there's any advantage to the ternary form — the compiler might just convert Booleans into ones and zeros anyway. Right; whether I write key > low and key < high as ternaries producing one or zero, or as Booleans, that part I'm not worried about — the compiler can fix that. The thing I'm trying to emphasize is that you're always copying. That's what matters. Again, you'd think this would be terrible because the CPU is blindly copying, but it actually helps. So this is from the VectorWise people, a few years later — 2012 or 2013, I think — and it shows, for the two approaches I described, the performance you get as you vary the selectivity of that WHERE clause. If no tuples are matching, on this side here, up to roughly 5% selectivity, the branching case is actually faster, because the CPU keeps predicting "not going to match, not going to match" over and over and avoiding the extra copy cost every time is a win. The red line is essentially flat, because you're doing the same work every single time no matter whether the tuple satisfies the predicate or not. But then you see this arch, where the peak is at roughly 50% selectivity: at that point it's a coin flip every single time, and the CPU keeps getting the prediction wrong over and over. And then as it becomes more selective again, the CPU can figure it out better. I had students reproduce this graph — I don't have it in the slides — about five or six years ago, and we basically saw the same thing on newer CPUs. So again, this shows that something that seems like a bizarre or wasteful thing for humans to do is actually better for the CPU. In this case we're counting CPU cycles per tuple, so it's not exactly runtime. Now, in terms of how to reduce instructions — and we'll talk more about this throughout the semester — the idea is that we want to specialize our database system's code so that when we operate on data, we know exactly the data type, the size, and the operation we want to perform. We don't want these giant switch statements that say: if my data type is int32 versus int64 versus float, here are my instructions to do addition or subtraction or whatever it is.
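Here's a minimal sketch, not any particular system's code, of that difference: the "generic" version re-checks the type on every single tuple, while the specialized version picks the right kernel once (the catalog tells us the column type at planning time) and then runs a tight, branch-free loop. The type enum and kernels are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

enum class TypeId { kInt32, kInt64 };

// Generic version: a switch (i.e., a branch) inside the per-tuple loop.
void AddGeneric(TypeId type, const void* a, const void* b, void* out, size_t n) {
  for (size_t i = 0; i < n; i++) {
    switch (type) {
      case TypeId::kInt32:
        static_cast<int32_t*>(out)[i] =
            static_cast<const int32_t*>(a)[i] + static_cast<const int32_t*>(b)[i];
        break;
      case TypeId::kInt64:
        static_cast<int64_t*>(out)[i] =
            static_cast<const int64_t*>(a)[i] + static_cast<const int64_t*>(b)[i];
        break;
    }
  }
}

// Pre-compiled specialized primitive: one instantiation per type, no per-tuple
// type checks inside the hot loop.
template <typename T>
void AddKernel(const T* a, const T* b, T* out, size_t n) {
  for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}

// The executor resolves the type once, outside the loop.
void AddSpecialized(TypeId type, const void* a, const void* b, void* out, size_t n) {
  switch (type) {
    case TypeId::kInt32:
      AddKernel(static_cast<const int32_t*>(a), static_cast<const int32_t*>(b),
                static_cast<int32_t*>(out), n);
      break;
    case TypeId::kInt64:
      AddKernel(static_cast<const int64_t*>(a), static_cast<const int64_t*>(b),
                static_cast<int64_t*>(out), n);
      break;
  }
}
```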
Likewise, I don't want to have to traverse an expression tree for my WHERE clause predicates to figure out whether it's a greater-than or a less-than — which, again, is usually implemented with giant switch statements. We want to avoid all of that as much as possible and have exactly the code we need to run our query, because then there are no conditionals and no branching: we're just feeding the CPU the exact instructions we want it to execute, over and over again. An example of doing this — not wrong, but wrong in the sense of being the worst thing for a modern CPU — is Postgres's numeric type. Look at the function that adds two numerics together. What do we see? A bunch of if-clauses checking whether it's a positive number, a negative number, not-a-number, and then a giant switch statement that has to handle all the different variations of how to do the addition. And this is just adding two numerics together. If I have a billion numbers in a column and I'm trying to add them to another billion numbers, I'm going to execute these instructions over and over again. This is terrible for a modern CPU. [Question] Is it the same thing as the match statement in Rust? That's a compile-time thing, right? It gets compiled down, but it doesn't go away: it's a jump table. And jumps are bad. Function calls are bad. Jumps are bad. [Question] But you do need to know the types — if you had to do this without the switch, I'm not even sure how you would, because you need to know the types of the things you want to compare. Well, ignore numerics — we know offline how to optimize that — and just think of a number plus a number. Is it int32? Is it int64? You'd need different branches for all of those. But SQL is a declarative language, we have our catalog, and we know exactly the data types. So if we set up the system so that we know exactly the instructions we want to execute, we can design things ahead of time and be way faster. And whether we pre-compile the primitives we want to use for the different data types, or just-in-time compile them, we'll cover that later. All right, so now, with all that in mind, we want to talk about how to design the execution engine and its processing model to lead us down the path of enlightenment, or whatever you want to call it, of achieving those three goals. This will be somewhat of a review from the intro class, but I'm going to go into a bit more detail about how the system actually operates, and that will segue into discussing the direction in which we move data between operators. So the processing model essentially defines how the database system executes a query plan: how it decides which operator runs next, and where that operator sends its data to, or where it gets its data from.
And there are different trade-offs for OLTP workloads versus OLAP workloads. We'll see how, in the case of the Volcano model — the iterator model — which is the default choice for most database systems, like row stores, it's great for OLTP but not so great for OLAP. Every processing model is defined in terms of two kinds of execution paths: the control flow and the data flow. The control flow is how the database system tells an operator or operator instance, "okay, now it's your turn to run." The data flow specifies, for each operator instance, where it sends data to and where it gets data from. The output of these operators can be whole tuples, in the case of a row store, or a subset of columns; in our case, in the OLAP world, we care about the columnar side. And we can ignore late materialization — whether it's all the columns or a subset of columns — we'll worry about that later. So the three processing models we care about are the iterator model, the materialization model — the columnar one — and then the vectorized model from the X100 paper. Every OLAP system, with a few exceptions, implements this last approach, because it's sort of the best of both worlds between the iterator model and the materialization model. The iterator model — also called the Volcano model or the pipeline model; I'll often just say Volcano model — is basically how every database system worked up until the MonetDB and VectorWise papers. This is how pretty much everyone implemented their query processing model. In your system's source code, you have all your operator implementations, and each one provides a Next function. Every time you want to get a tuple from an operator, on the control flow path you invoke Next on that operator, which is then responsible for producing either a single tuple or some kind of end-of-file or null marker that says, "I have no more tuples; never ask me for data again." You can think of it as: inside that operator there's a for loop that retrieves all the tuples it needs from its child operators. Depending on whether it's a pipeline breaker or not, it either gets all the data from its children (if it is a pipeline breaker), or it can just get one tuple, and as long as that tuple satisfies whatever the operator wants to do and it can produce output, it's done. The way you actually implement this is with Open and Close functions that act like constructors and destructors for your operators: you open an operator, call Next, Next, Next to get all the output you want, and when it says it's done, you call Close and that cleans everything up.
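Here's a minimal sketch of that Open/Next/Close interface in C++ — not BusTub's actual API — with a hypothetical `Tuple` type, a sequential scan, and a filter that keeps pulling from its child until a tuple satisfies the predicate:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

using Tuple = std::vector<int32_t>;               // stand-in tuple representation

class Operator {
 public:
  virtual ~Operator() = default;
  virtual void Open() {}                          // set up state (cursors, hash tables, ...)
  virtual std::optional<Tuple> Next() = 0;        // pull one tuple; nullopt = end of file
  virtual void Close() {}                         // tear everything down
};

class SeqScan : public Operator {
 public:
  explicit SeqScan(std::vector<Tuple> table) : table_(std::move(table)) {}
  std::optional<Tuple> Next() override {
    if (cursor_ >= table_.size()) return std::nullopt;   // no more tuples
    return table_[cursor_++];                            // remember where we were
  }
 private:
  std::vector<Tuple> table_;
  size_t cursor_ = 0;
};

class Filter : public Operator {
 public:
  Filter(std::unique_ptr<Operator> child, int32_t low)
      : child_(std::move(child)), low_(low) {}
  std::optional<Tuple> Next() override {
    // Keep pulling from the child until a tuple satisfies the predicate.
    while (auto t = child_->Next()) {
      if ((*t)[0] > low_) return t;
    }
    return std::nullopt;
  }
 private:
  std::unique_ptr<Operator> child_;
  int32_t low_;
};
```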
So here's a high-level example: we want to join R and S, with a join condition on R.id = S.id, plus an additional WHERE clause where S.value is greater than 100. Each of these operators has an implementation that makes calls to its child operators to pull data up and then produces output when it's available — the blocks of code on the slide are all the Next functions. We start when the database says, "okay, I want to run this query." Going from the top down, we call Next on the first operator, and immediately inside it there's a for loop that calls Next on its child. That's a blocking call, so control flow moves from the top operator to the second operator, the hash join, and immediately inside that one there's a for loop calling Next on its child, because it wants to build the hash table. Control flow takes us down here, and now we're just iterating over table R, calling emit — returning control and sending one tuple back up to the calling function. There's some state inside these operators that keeps track of where the cursor was the last time you called Next while scanning the table. We keep calling Next on the bottom operator until we get end-of-file; at that point the hash table has all the tuples we need, so we won't get any false negatives. Then I move to the next block in the join operator and call Next on the right child: I come down this side, scan S, pass each tuple up to the next operator, which applies the predicate; if the tuple satisfies the predicate, it gets sent up to do the probe. So you daisy-chain these calls together. Going back to the notion of pipelines: block three plus the first half of block two is pipeline one, and then blocks five and four, the bottom of two, and one make up pipeline two. I execute pipeline one, ideally, and then pipeline two. In this case there are pipeline boundaries, but you're not doing any special optimization around them: implicitly the code is set up so that you can't run the second pipeline until the first part is done. This will be different when we see the operator fusion technique, the push-based approach, because there you actually try to combine these into a single operator instance rather than having them nested inside each other. So that's it — this is pretty much what everybody implements in the first database system they build, and in row store systems; BusTub is based on this. Yes? [Question] Shouldn't you build the hash table on the side that has the filter? You don't know — first of all, this is a logical diagram, it's a PowerPoint slide, I don't have statistics here — but what if that side is only one tuple? So, anyway. With the iterator model, output control is really easy: if you know you've got enough tuples in the output, you just stop calling Next and you're finished. The downside is that we're basically mixing control flow with data flow, and they're tightly coupled: there's no clean way to say "I don't want to execute this part anymore because I've got enough data," or to stop executing certain subtrees, because calling Next on an operator implicitly calls Next on its children. That's how the query plan is set up, and that's how the iterator model lets you control the behavior of the execution.
It does allow us to do pipelining. The idea is that for every single tuple we get from a child operator, we do as much processing as we can up the query plan until we reach a pipeline breaker or produce the final output. That way we can achieve good cache locality: we bring a tuple in and do as much work as we can while it's in memory before we go off to the next tuple — until we hit a pipeline breaker, which is an operator that cannot complete until all of its children have emitted their tuples. So the build side of joins has this problem; subqueries, depending on whether they get rewritten; ORDER BY, since sorting obviously has this problem; and some aggregations, like min and max — same thing. The downside, though, is that you're calling Next for every single tuple. If I have a billion tuples, that's a billion function calls just to call Next, times however many operators are passing tuples up. So an alternative approach, pioneered by MonetDB in the late '90s and early 2000s, is the materialization model, where every operator produces all of its output at once: any time you call Next on it, it generates all its output and hands it off to the next operator, and once you've called Next you never go back to ask for more data. It's called the materialization model because each operator materializes all of its output as a single result, and the output can be a single row, a single column, or the entire table. Yes? [Question] Isn't a pipeline breaker basically materialization, since you end up materializing the entire result? Not necessarily. To your point, yes: whatever feeds into a pipeline breaker gets materialized as a complete result at that point. But the pipeline breaker could still be receiving one tuple at a time through Next — I wouldn't suddenly have a giant output get shoved all the way up and switch processing models. In the materialization model, every Next call always moves the entire result up. So it's not as simple as saying the materialization model is just "every operator is a pipeline breaker." I wouldn't describe it that way, but the intuition makes sense. So let's go back to our original query. Now, in our operator implementations, we have an output buffer, we keep adding tuples to it, and then there's a return at the end where all the output goes up to the next operator at once. Just like before, we start at the top: the root operator calls Output on its child, which has to build a hash table, which calls down to the scan on R; we populate the entire output buffer and then push the whole result up. And again, if I have a billion tuples — even if I only want one column of those billion tuples — my output buffer is going to hold a billion values in this approach.
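Here's a minimal sketch of that materialization-style interface — not MonetDB's actual code — where each operator's single `Output()` call consumes its child's full result and returns its own entire result at once; the `Column` alias and the filter predicate are assumptions for illustration:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using Column = std::vector<int32_t>;              // one fully materialized column

class MatOperator {
 public:
  virtual ~MatOperator() = default;
  virtual Column Output() = 0;                    // called once, returns everything
};

class MatScan : public MatOperator {
 public:
  explicit MatScan(Column col) : col_(std::move(col)) {}
  Column Output() override { return col_; }       // the whole column, even if it's a billion values
 private:
  Column col_;
};

class MatFilter : public MatOperator {
 public:
  MatFilter(std::unique_ptr<MatOperator> child, int32_t low)
      : child_(std::move(child)), low_(low) {}
  Column Output() override {
    Column in = child_->Output();                 // materialize the child's full result first
    Column out;
    for (int32_t v : in) {
      if (v > low_) out.push_back(v);
    }
    return out;                                   // hand the entire filtered column upward
  }
 private:
  std::unique_ptr<MatOperator> child_;
  int32_t low_;
};
```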
And then, same thing, I go down the other side: I call the filter, which calls the scan on S, and the data gets moved up the same way. An obvious optimization here is that on this side of the query plan I'm scanning S, materializing the result, and then needlessly handing it off to a filter operator that basically throws stuff away. So the obvious thing is to inline these two, fuse them together, so that as you scan S you evaluate the predicate, and only if it's true do you include the tuple in the output. And again, you could use the branchless technique we saw before — I'm showing it with if-clauses, but you could apply that optimization here as well. This is great for OLTP, because in that world queries access a single tuple: even though you're materializing the entire result, it's going to be one tuple, and it's just one fewer Next call before you hit end-of-file — you get everything an operator will ever need in one shot. You have to do that inlining to make sure you're not passing up more data than you actually need, but it works great; when we built H-Store, which became VoltDB, we used this approach. But I would argue — and the paper you read argues — that this is bad for OLAP, because even though you may throw away a bunch of data higher up in the query plan, you're still moving these large columns from one operator to the next. It's great that you have fewer Next calls, but you're moving more data than you potentially need. So the vectorization model is an obvious optimization — obvious now, but not at the time — that gets the best of both worlds. You still have a Next call that moves tuples from one operator to the next, but instead of moving a single tuple, as in the iterator model, you move a batch of tuples — a vector of tuples. Again, the naming is unfortunate: you say "vectorized query processing," and now with vector databases people may think you're passing around embeddings or something, which we're not. So we pass a batch of tuples instead of a single tuple, and we design the loops inside our operators to operate on these batches at a time. The size of the batch can vary depending on what the data looks like, what the query actually wants to do, or what the hardware looks like. I think the paper talks about 1024, and that's roughly what most systems use; sometimes it's a bit smaller, and we'll see examples later on. And the batch will be either one column or a subset of columns, depending on whether you're doing late materialization or have already applied projections. So if we go back to our query one more time: within our implementations we still have an output buffer that we add things to, but now there's a conditional that says, once we've accumulated enough tuples for the expected vector size, we emit it — send the batch up. Same as before: fill up the vector output buffer, send a tuple batch up, and the same thing happens on the other side. Yes — the question is, if I'm at the end of R and the output buffer holds fewer than N tuples but I'm done, then yes, you send it up anyway; you have a little bit of logic outside the for loop to flush whatever is left.
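Here's a minimal sketch of that vectorized interface, where `Next()` hands back a batch of up to 1024 values instead of a single tuple or the whole column; the `Batch` alias, the 1024 constant, and the filter predicate are assumptions for illustration, and the filter may slightly overshoot the batch size since it appends whole input batches:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr size_t kVectorSize = 1024;
using Batch = std::vector<int32_t>;               // one vector of column values

class VecOperator {
 public:
  virtual ~VecOperator() = default;
  virtual bool Next(Batch& out) = 0;              // fill `out`; false means no more data
};

class VecScan : public VecOperator {
 public:
  explicit VecScan(std::vector<int32_t> col) : col_(std::move(col)) {}
  bool Next(Batch& out) override {
    if (cursor_ >= col_.size()) return false;
    size_t n = std::min(kVectorSize, col_.size() - cursor_);
    out.assign(col_.begin() + cursor_, col_.begin() + cursor_ + n);
    cursor_ += n;
    return true;                                  // a partial last batch is still emitted
  }
 private:
  std::vector<int32_t> col_;
  size_t cursor_ = 0;
};

class VecFilter : public VecOperator {
 public:
  VecFilter(std::unique_ptr<VecOperator> child, int32_t low)
      : child_(std::move(child)), low_(low) {}
  bool Next(Batch& out) override {
    out.clear();
    Batch in;
    // Keep pulling batches until we have roughly a full output vector
    // (or the child runs dry), then emit what we have.
    while (out.size() < kVectorSize && child_->Next(in)) {
      for (int32_t v : in) {
        if (v > low_) out.push_back(v);
      }
    }
    return !out.empty();
  }
 private:
  std::unique_ptr<VecOperator> child_;
  int32_t low_;
};
```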
And then you basically keep track of which rows are actually active — whether you use bitmaps or offsets for that, we'll see in a second. Okay, so as I said, the vectorized query processing model is what every modern OLAP system uses today, because it greatly reduces the number of Next calls per operator, and it allows the out-of-order CPU to efficiently execute our operators over batches of tuples — assuming we've designed the system to operate on these vectors. The authors of the paper said they could have called it the array processing model, because that's essentially what it comes down to: these batches of tuples are just arrays, and within each operator, as you get the input vector from your child, you have a for loop going over arrays. That's the ideal scenario for out-of-order superscalar CPUs — they love processing arrays. You can take advantage of all the speculative execution tricks we talked about before, and of vectorization with SIMD, which we'll see more of next week, because we know we're doing the same operation over and over again within a tight kernel on data that's the same type and, for the most part, the same length — strings complicate that, but we can handle it. Yes — the question is, what's the fundamental difference between the materialization model and the vectorized model, if the size of the output isn't everything? With vectors we can take advantage of pipelining, because we're taking batches of tuples in digestible bites. Instead of having to process the whole thing in one operator before moving on to the next, I can have pipelined execution where I take a vector of tuples, rip through it, and only go back for the next vector when I've reached my final output or all the tuples got thrown away. [Student] In materialization you can use LIMIT to say you don't even want that many tuples, but here you're just... In the materialization model you can do that, but oftentimes the LIMIT clause is applied near the root of the query plan — "give me the top 10 accounts based on some column" — and I have to sort them before I can get the top 10. That means that even though I'm going to throw away most of the data, in the materialization model I still have to pump all of that data up the plan. Now, you could say that in the vectorized model you still have to do the work, but you're dealing with chunks of data that fit in your L2 and L3 caches; you're not dealing with a giant blob of data that's going to cause a bunch of cache misses. Yes — the question is, what's the difference between control flow and data flow? Control flow, again, is how the part of the database system that says "okay, time to execute this query" tells an operator to start running. In everything I'm showing here, we're going top to bottom: we call Next, and that's the control flow — we call Next on the top operator, which calls Next on its child operator. The data flow is where the data actually moves. We'll see a better distinction when we talk about push versus pull.
In the push-based approach, the execution of one operator is not embedded inside the execution of another: there's something centralized — a scheduler — that says, "okay, now it's time to execute this pipeline," and one operator or pipeline doesn't call another pipeline itself; something else is managing all of that. [Student] So you'd have a different approach to control flow if you're doing a push-based approach? We'll come to that in a second. And his question about the different options for data flow — whether you move a single tuple or the whole thing — you can think of the processing model as defining that as well; push versus pull is part of it, and the data flow is whether you move a single tuple, all the tuples, or a batch. Another great thing: because we have these tight kernels processing batches of tuples, all the instructions inside the kernel for every iteration are going to sit in the instruction cache, which is super fast, and we'll have very few data or control dependencies, because we don't need to see the output for one tuple in our batch to determine the next thing to execute. That's not always entirely true — whether a tuple satisfies a predicate doesn't depend on whether the last tuple did, but window functions complicate things — we can ignore that for now. And, as we'll see next week, the great thing about these tight for loops over arrays is that this is what the CPU wants, and what the compiler wants in order to vectorize the code with SIMD; we'll see how to do that explicitly next week. This is from Peter Boncz's slides from when he won the Test of Time Award for this paper, in the discussion of why they saw the vectorized model be so much faster than the iterator (Volcano) model and the materialization model used in MonetDB. In the Volcano model there's the interpretation overhead, which we'll cover later on, but the big thing is that you no longer have the per-tuple invocation of calling Next, Next, Next for every single tuple; with 1024-sized batches you've reduced the number of calls by a factor of 1024, which is pretty significant. Compared to MonetDB — and we'll see this more later — the query plans are also much simpler: it's just like the iterator model except you're passing batches of tuples, whereas MonetDB was implicitly keeping track of all these full columns being passed around, which was way more complicated. And all the optimizations you get from the compiler or from SIMD are in addition to everything else: just by designing the system to pass around batches of data, things run faster, and then, by the way, the compiler can also optimize it much better. Now, leading to his question about push versus pull: in all the examples I showed, there was this Next function — whether it's vectorized, materialized, or iterator, there's a Next call I'm making per operator, and I'm always starting at the top, calling the root, going down, and bringing things up. This is how most systems implement the execution engine, but it isn't the only way.
So, leading to his question about push versus pull: in all the examples I showed, there was this next function — whether it's vectorized, materialization, or iterator, there's this next call that I'm making per operator, and I'm always starting at the top, calling the root, going down, and bringing things up. Again, this is how most systems implement their execution engine, but it isn't the only way, and this gets into the distinction between the top-to-bottom, pull-based approach and the bottom-to-top, push-based approach.

So again, top to bottom is what I just showed: if you want to start executing the query, you call next on the root, which then calls next on its children and propagates down, pulling data from the bottom of the query plan up to the root, and that produces the final output. You're always going to be calling next to get the next tuple — unless there's a pipeline breaker, which stages data in an intermediate result that you then go access — but you're always passing data by calling next, and that's a function call, a jump instruction, and again that's bad for a superscalar CPU. The alternative approach is the push-based model, where you start with the leaf nodes of the query plan and have some outside controller or scheduler initiate the invocation of that operator or pipeline; it then takes the data it generates and pushes it to the next operator. Yes — we'll get to that in a second.

This used to be rare — it's probably more common now — but when the paper you read was written in 2005, this approach didn't exist. It shows up in a paper you're assigned to read in a few weeks, from the Germans, on the system called HyPer. The dude is insane — it's a single-author paper in which he shows how to do just-in-time query compilation with LLVM in HyPer, and he also invents the push-based model in the same paper. And he's got three kids, he teaches two classes a semester, and he doesn't do drugs. It's insane. He's the exact opposite of me. Anyway.

Let's see how we do this. Here's the same query we had before, and now, instead of having a bunch of different operators — separate blocks of code that implement the individual operators — we're just going to have two for loops. In the first pipeline we scan R and populate the hash table; in the second pipeline we scan S, and what we're trying to do is ride every single tuple all the way up to the top of the query plan before we go back for the next tuple — or the next batch of tuples; even though I'm showing this with a single tuple, you could do it on a batch as well. So for every tuple in S, you evaluate the predicate; if that matches, you probe the hash table; and if that matches, you emit it as output (there's a small sketch of these two fused loops just below).

Yes — this is fusion right here. The question is, how does that even end up happening? What you're expecting to be an operator... wouldn't you have to hard-code every single combination of operators? Okay, so his question is basically: how do you do this — would you have to hard-code every single possible combination of query plans to execute this? Give me two weeks. The answer is going to be that you just-in-time compile it: you literally generate the code on the fly for the query plan, code that fuses this together, then compile it with LLVM or GCC or Clang and run that. That's approach number one. Approach number two is that you recognize that I only have so many data types in my database system and there are only so many things I can do to them, so each of those is just a function; I put them in an array and execute them one by one. That's what Vectorwise does — that's code specialization; give me two weeks. But yes, your mind looks blown by the fact that you can compile this on the fly. That's what they do. They're German. The Swiss are good too. So it's hard. I don't want to get too far ahead.
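Here is a minimal sketch of what those two fused pipelines could look like as straight-line code, with a made-up schema and predicate — in a real system, code of this shape would be generated per query rather than written by hand:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical tables for a query shaped like:
//   SELECT ... FROM R JOIN S ON R.id = S.id WHERE <predicate on S>
struct RRow { int32_t id; int32_t val; };
struct SRow { int32_t id; int32_t val; };

std::vector<std::pair<RRow, SRow>>
fused_join(const std::vector<RRow> &R, const std::vector<SRow> &S) {
    std::vector<std::pair<RRow, SRow>> out;

    // Pipeline #1: scan R and build the hash table (a pipeline breaker).
    // Assumes unique join keys; duplicates would need a multimap.
    std::unordered_map<int32_t, RRow> ht;
    for (const RRow &r : R) {
        ht.emplace(r.id, r);
    }

    // Pipeline #2: scan S, evaluate the predicate, probe the hash table,
    // and emit -- each tuple rides all the way to the top of the plan
    // before we touch the next one. No next() calls anywhere.
    for (const SRow &s : S) {
        if (s.val > 100) {                       // hypothetical predicate
            auto it = ht.find(s.id);
            if (it != ht.end()) {
                out.emplace_back(it->second, s);
            }
        }
    }
    return out;
}
```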
Actually, it's even crazier than that. In this version, in HyPer, they would generate LLVM IR and then compile that; SingleStore generates C code, at least in the versions I've seen. In his new system he doesn't generate IR — he generates literally x86 assembly via C++ macros, runs that through the assembler, and then in the background he's running LLVM, which compiles the code into a shared object, and when that's done he links it in. Yeah. German.

Yes — for each of those, there's a fixed number of functions; you compile them ahead of time into a long array of function pointers, and when you want to use them together — I need function A, then function B — you arrange them in your array and execute them in order. Yes, the question is: if there are a bunch of function pointers, would it be an array where I put in, I need to do this, then filter by this, then filter by that — and would you invoke the function pointer on a per-tuple basis? If you do batches of tuples, as in the vectorized model, that amortizes the jump, and now you don't have giant switch statements deciding which branch to go down (there's a small sketch of this primitive-array idea below).

Yes — why can't you do this in a pull-based model? Good question. The fusion part — at a high level, is it the same? More or less, yes. You could say, okay, do the scan, then the next call does this, the next call does that. But in the pull-based approach, the way the software is actually engineered and designed, the abstraction is through these next functions. So could you take a pull-based model and codegen it into this? Yes. Would you say that's the main reason for this, or are there other factors? The control flow matters too: when I want to execute this, there's some outside scheduler that says, okay, run this first pipeline once, it populates the hash table; you can also specify where its output goes — some output buffer that the scheduler knows about; when that's done, it schedules the next pipeline, which produces the output. So you have complete control over everything.

Yes — and if you have a DAG? Since we're pushing the data, how does the scheduler specify where it goes — say R feeds into two different pipelines, so it needs to send data to two different places? So the question is, if R needs to go to two different places, how do we handle that — does the scheduler tell us? The scheduler, before we start executing, would specify where the output is going to go; if you tell it to send to two locations, then either the operator itself — the executing pipeline — is responsible for sending it to exactly the locations it needs to go to, or you can have a shuffle service, which we'll cover later, that says, okay, I don't know exactly how to get this where it needs to go, but if I send it to this other service, it will distribute it for me.

Okay, I don't want to get too far ahead — this stuff is really, really cool, but not everyone does exactly this. As I said before, codegen on the fly is going to be hard to maintain; the Germans can do it, a few others can, and... no, no, okay, I'll leave a little mystery — we'll see why. And to be honest, we implemented this here, twice, and even the second time, which was meant to be easier for other people to use, was still, you know — not impossible, just harder.
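To make the precompiled-primitives alternative from that exchange a bit more concrete, here is a minimal sketch — hypothetical primitive names, not Vectorwise's actual API — of a plan represented as an ordered array of function pointers, each invoked once per batch:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Each primitive is a tight loop over one vector, compiled ahead of time.
// The "plan" is just an ordered array of function pointers.
using Primitive = void (*)(const int32_t *in, int32_t *out, size_t n);

void mul_by_2(const int32_t *in, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] * 2;
}
void add_10(const int32_t *in, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] + 10;
}

void run_pipeline(const std::vector<Primitive> &plan,
                  std::vector<int32_t> &vec) {
    std::vector<int32_t> tmp(vec.size());
    // One indirect call per primitive per *batch*: the jump is amortized
    // over the whole vector, and there is no per-tuple switch statement.
    for (Primitive p : plan) {
        p(vec.data(), tmp.data(), vec.size());
        vec.swap(tmp);
    }
}
// Usage (hypothetical): run_pipeline({mul_by_2, add_10}, my_vector);
```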
All right, so again, this is just reiterating what I've already said. Top to bottom, we have complete control: we can control the output via LIMIT, and basically the parent operator blocks until the child comes back with results. But the next functions aren't cheap, because they're essentially going to be virtual function lookups in C++ — I'm stitching together this query plan with pointers, so at runtime I have to do a virtual function lookup to figure out the actual location of the function I want to execute for that given child operator — and of course these next calls are jumps, and that's going to suck for us on the CPU. In the bottom-to-top case, you have tighter control of the caches, the registers, and the pipelines; in the case of the HyPer paper, not only do they keep data in the L1 cache, they keep it in CPU registers — you can't go faster than that. The only challenge is that in some cases you may not have complete control over limiting the size of an output buffer. In the pull-based model, if I've gotten enough data at the top of my query plan, I just don't call next anymore; in the push-based model, even though I may be sending batches instead of the entire output, I may still get more data in a batch than I actually want.

Yes — is the only benefit you get from having output control at the top through limits? I think that's true, yes. And window functions? Superficially I'd say yes, but I might be wrong. In the case of push it's actually also tricky to do sort-merge join, because you need two iterators at the same time and you have to keep extra state — not impossible, you can do it, it's just a little more tricky because it's not nested for loops ripping through a single tuple or a single batch.

Yes — the question is, if we're doing push-based, is there still a distinction between tuple-at-a-time and batch-at-a-time? Absolutely, yes. In my example here I'm iterating over a single tuple: I call eval predicate, and, assuming these are function pointers, I'm jumping to eval predicate once for every single tuple. What you could do instead is pass a batch of tuples, call a vectorized version of eval predicate, and have it process the whole batch — so they're compatible.

Okay, I don't think we're going to get through parallel execution — we'll cover that next class — but let's finish up with how to represent filtered data. In the iterator model, because every operator processes one tuple at a time, if something doesn't match a predicate — if it's not meant to be produced as output — then we don't send it up; we either call next to get the next tuple from our child (or whatever local thing we're processing), or we return end-of-file, meaning we have no more. That means that at no point in the query plan will we send up data that we know has been disqualified or thrown out, because we simply wouldn't have emitted it. But in the vectorized model you can't do that, because you're operating on batches of tuples: depending on your predicate, or whatever operation you're trying to do, you may end up filtering out or throwing away some tuples inside the vector while other tuples still need to be passed up.
So now the question is: how do we handle that? Basically we have a vector where half the data we want to keep and half the data we know we want to throw away. Oops, sorry — so say we have a query like this, where the WHERE clause is "column zero is null and column one is LIKE some pattern with a wildcard". Say this is my data — I have column zero and column one — and this batch is coming in; if I do the filtering on it, this is the output I really want, because logically that's the correct result. But how do I get there? I don't want to have to copy everything out and put it back into another buffer, because that can be slow. So I need a way to represent, logically, that these are the tuples that have been filtered out, even though physically I may still be passing along dead tuples — tuples I don't need.

There are two approaches to do this. The first is to use what is called a selection vector, sometimes also called a position list. The idea is that it's just a densely packed array of the offsets of the tuples — within the vector I'm passing from one operator to the next — that are still valid, still alive, still active. Going back to my example, my selection vector would just be the list of offsets one, three, four, because those correspond to the tuples that satisfy the predicate. This is what gets passed along as the output of next — or, if I'm pushing along in the push-based model, this is what the next operator is going to process. So now when I get this batch of data, I also have this selection vector, and I account for the fact that some of the data may have been discarded while some of it is still active.

Yes — why do this; doesn't this affect SIMD? The answer is going to be yes, for SIMD and for other operations: it's actually going to be faster for us to pass along garbage. If the selection vector goes to length zero, then I know everything has been discarded, I throw the entire batch away, and I jump out. But it's faster not to have to repack the data after every single step. It's easier to visualize this in the fusion approach: going from one line to the next within my nested for loops, I don't have to go allocate memory or resize anything.

The alternative approach is to use bitmaps. This is just a bitmap with the same length as the number of tuples in the vector I'm passing along, and it's a zero or one that specifies whether the tuple at a given offset is valid or not. And again, as you brought up, this is going to make our life easier with SIMD: some SIMD instructions in AVX-512 will actually take this as input as a mask, and you can use it to say, hey, don't process the data in these lanes, because I don't care about the output. We'll see more of this later on — how to design SIMD-optimized operations and algorithms for a database system that can use all these things. The current research literature actually says the first one, the selection vector, is the faster way to do this; the Photon paper from Databricks, which we'll read later on, says this, and so does one of our own research papers.
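Here is a minimal sketch of the selection-vector idea — hypothetical kernel names, with a simple integer comparison standing in for the IS NULL / LIKE predicate on the slide. The filter writes a densely packed position list instead of copying the surviving tuples, and a downstream kernel only visits the selected offsets:

```cpp
#include <cstdint>
#include <cstddef>

// Filter kernel that produces a selection vector (position list).
// 16-bit offsets are enough for vectors of up to 65536 tuples.
size_t filter_gt(const int32_t *col, size_t n, int32_t bound,
                 uint16_t *sel /* preallocated, capacity >= n */) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = static_cast<uint16_t>(i);
        k += (col[i] > bound);   // branch-free append: only advance on a match
    }
    return k;                    // number of tuples still "alive"
}

// A downstream kernel then only touches the selected offsets.
int64_t sum_selected(const int32_t *col, const uint16_t *sel, size_t k) {
    int64_t sum = 0;
    for (size_t i = 0; i < k; i++) sum += col[sel[i]];
    return sum;
}
```

The branch-free append (always write the offset, conditionally advance the count) is a common trick in vectorized engines, since it keeps a data-dependent branch out of the hot loop.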
Yes — the question is, why would this even be relevant if you're doing fusion? If you're running a fused pipeline of operators, there's effectively a pipeline breaker at the end of it, so why do we need this vectorized data-flow abstraction at all? So your statement is: if we're doing fusion, why do I even need this — because at the end of it it's almost like we have a pipeline breaker anyway. Well, literally think of it like this: instead of it being a single tuple, it's a batch — a vector. You call eval predicate and it populates either the position list (the selection vector) or the bitmap. If it's the bitmap, I do a popcount that tells me how many ones I have; if I have at least one, then I know I want to keep going. If you don't do that, then I've got to iterate over every single tuple within my batch and populate a new output buffer. It's just faster to do zero-copy: operate on the data as it exists, and just update bits or a position list to say which tuples actually matter. And all the implementations of these operators, which we'll cover later, take the selection vector or the bitmask as input and know whether or not to even consider the tuple at a given offset.

Yes — what if S came in as a batch, and there was an outer loop iterating through a bunch of S batches — do you see how that could be much better? Yeah. With codegen we can do that, but on its own that's still scalar instructions; we can also vectorize all of this — we can vectorize eval predicate — absolutely. This is why this course exists and why people pay you a lot of money to do this stuff. It's hard.

Yes — could you, in a pull-based approach, just wait for the vector to fill up completely before you pass it from one operator to the other? So are you pulling a batch of tuples — a vector of tuples — or a single tuple? A batch of tuples. Okay, so the statement is: in my example here, three out of five match, so I have two empty slots — can I just call next on the operator below me, saying this is a two-slot vector, find at least two more matches, and put them into this vector? But that's more bookkeeping: I have to keep track of how many spaces I have, and I still have to maintain the intermediate result from the child I called before. It's not worth it — it has way more indirection, way more branching, way more conditionals than if I just blindly keep track of what does or doesn't match the predicate I showed before in the scan. You're all coming up with examples like, hey, this seems kind of wasteful — you're passing along tuples, my vector size is 1024, and all but one of them are thrown away. Well, again, it's just faster to pass along the other 1023 useless tuples than to do what he's proposing of gathering intermediate results and filling in the empty space. It's worth it to do the simple, straightforward thing: even though you may end up executing more instructions, you end up using fewer cycles, because it's set up the way the CPU wants.

Yes — the statement is that the selection vector keeps getting shorter and shorter. Well, it depends on the query and it depends on the data, but you would typically size it to the vector you're passing along — so, 1024 entries — and just keep a length that says where the end actually is. And these don't have to be 64-bit IDs: if you only have 1024 possible values, you can store them as 16-bit numbers, so it's not that big.
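And here is a minimal sketch of the bitmap alternative with the popcount check described above — again hypothetical names; on real hardware you would often feed a mask like this straight into AVX-512 masked instructions rather than scanning it in scalar code:

```cpp
#include <bit>
#include <cstdint>
#include <cstddef>

// One bit per tuple in the vector, 1 = still alive.
void filter_gt_bitmap(const int32_t *col, size_t n, int32_t bound,
                      uint64_t *bits /* preallocated, (n+63)/64 words */) {
    for (size_t w = 0; w < (n + 63) / 64; w++) bits[w] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t match = (col[i] > bound) ? 1u : 0u;
        bits[i / 64] |= match << (i % 64);
    }
}

// popcount (std::popcount, C++20) gives a cheap "did anything survive?"
// check before bothering to run the next kernel on this batch at all.
bool any_alive(const uint64_t *bits, size_t n) {
    size_t alive = 0;
    for (size_t w = 0; w < (n + 63) / 64; w++)
        alive += std::popcount(bits[w]);
    return alive > 0;            // if zero, skip the whole batch
}
```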
Yes — the size, meaning the allocated size or the actual contents? Why? I think what you're saying is: could you basically do a slab allocation — here's my 1024-value vector, here's my 32- or 64-value vector — and as one fills up you go use the other, so things hang around in L1 longer? But it's not that big: 1024 times 16 bits is not that big. What you're proposing, you could do if you wanted to, but the worst possible thing to do in any of this is call malloc — because who are we talking to when we call malloc? The operating system, and it's going to screw us over. We preallocate everything ahead of time. Yes, there will be wasted space, but that's better than having code that tries to be clever about it — simple is better in this case.

Okay, all right, we're over time. We'll cover parallel execution in the next class on Wednesday. All right, guys — see ya.