All right, today we're going to talk about query execution. Quick reminder: the last two lectures, we spent time talking about how to minimize the amount of data we have to read from disk in order to execute queries when we do sequential scans. We saw filters and indexes to identify and skip the data we don't need to read. And then compression allowed us to reduce the size of the data we do actually have to read, and to process it in its compressed form as much as possible. So going forward for the next couple of weeks, we're going to talk about how to actually start executing queries, and the ways we can improve performance by taking advantage of modern CPU architectures and modern algorithms. So I showed this list before, but these are a bunch of the different techniques we can use to speed up sequential scans. Again, data skipping was the technique for identifying what you don't need to read. Now I want to go through a bunch of these over the next couple of classes and build our system to take advantage of these methods. At a high level, what this class is about, and what database system engineering is all about, is the orchestration and implementation of a bunch of different optimizations that are designed to take full advantage of the hardware. There isn't going to be one technique that you can take away from this class and say, oh, this is the one thing I should always use every time I build a database system. It's going to be a combination of things. In some cases, we'll see that some techniques interface with each other quite nicely. In other cases, you have to do a little extra work to make one work along with another. An example: if you want to do vectorization plus the push-based method for query execution, as we'll talk about today, then you have to do some extra stuff to make that work. So here is my list of what I think are the top three optimizations we're going to go through that matter the most. This is not scientific; this is just my personal opinion from reading the papers, seeing what other systems do and how things are implemented, and some of the experiments we've done here. This ranking is not definitive, because obviously it depends on a bunch of different factors. But the three optimizations that are going to matter the most for us this semester are vectorization, parallelism with multi-threading or multi-core, and then code specialization or query compilation. The code compilation stuff is probably the hardest one to do, both to implement and to debug, and this is why most systems either don't do it or only do it for small parts of the query plan. PostgreSQL does this just for the WHERE clause, whereas something like SingleStore will compile the whole query. We'll cover what all that means as we go along. But vectorization is probably the biggest one here.
But the performance benefit you can get with compilation can be quite significant. All right, so if you want to optimize the execution of queries in a database system, what's the goal? What do we actually want to achieve, other than saying, hey, my query runs faster? What actually has to happen in the system to achieve that reduction in query execution time? We've already said that with data skipping, we're reading less data from disk, so of course that's going to be faster. But for the data we do have to go read, what can we actually do to speed things up? How are we going to achieve this? There are basically three goals here, three things we want to target. The first one is that we want to reduce the instruction count: as we execute queries, we want to use fewer instructions to execute them. The compiler can help with this, and code specialization can help with this. It's sort of obvious, right? If I can produce the same result with fewer instructions, of course I'm going to run faster. You can pass -O2 to your compiler, and the compiler will help with this. But there are other, more fundamental things we want to do in our database to achieve this. You typically don't want to ship your database system compiled with -O3. I don't think anybody can actually prove this, but it's sort of well known in systems that -O2 is about as far as you want to go on compiler optimizations. And to show it's not just databases, this is a mailing list post from Linus from 2021. It basically says here (sorry, I clicked ahead, that's a spoiler): I personally think -O3 is in general unsafe. So -O2 optimizations will help us reduce instruction counts, and maybe -O3 could be a little better, but we don't want to ship bad machine code and have problems. So -O2 is about as far as we're going to go, and there will be other things we have to do in our own implementation. So assuming we've done number one, now we want to figure out how we can reduce the number of cycles it takes to execute those instructions. Anyone want to take a guess at how we can do this? Nothing? Gee, I'm disappointed. SIMD? Well, SIMD gets you number one, right? Because if I apply a predicate, instead of doing it in SISD four times, I can do it all in a single instruction. Bingo, yes. Yes, so he said the problem you can have in your CPU is stalls, due to branching or because something's not in your cache. Cache misses are going to lead to more cycles spent per instruction. It's the same number of instructions, but whether or not the CPU has to stall because it has to fetch something from memory and bring it into the cache determines how many cycles you're going to spend. And the other one he mentioned was branching. We'll see this in a second: if the CPU tries to do speculative execution and predict what branch you're going to go down, and it gets it wrong, it has to roll things back, flush the pipeline, and bring things back in. That's going to take a lot more cycles. So again, we'll see an example in a second of how we can design our database system, as we're evaluating predicates, to reduce this. It's the same amount of work producing the same result; we're just taking fewer cycles.
And the last one is obvious: we want to parallelize the execution of queries, either through multiple cores or multiple threads. We're not talking about distributed systems just yet, where you go across different machines. Again, this is why I sent you guys that blog article on Piazza last night from the DuckDB people: the author, who was at BigQuery, basically says most people didn't actually need BigQuery at the scale it was being marketed at, and that DuckDB running on a single node is probably good enough for most people. Now, we're not going to get much faster clock speeds, although Intel is getting a bit aggressive; I think we're up to six gigahertz now for some of the latest ones. But in general, the way we're going to get more computational power out of modern CPUs is through more cores. So we want to design our database system to be able to run our queries on multiple cores. So, just to make sure we all have the same language going forward, we're going to define some basic terms about what's in a query. A query plan is a directed acyclic graph of operators. For now, we can ignore whether it's a physical operator or a logical operator; we'll cover that later on. Think a join operator, a scan operator, and so forth. Then we'll say an operator instance is the invocation of one of these operators in our query plan at runtime, as we actually execute the query. So if I'm scanning table A, the operator at a high level would be "scan table A," but there could be multiple operator instances, each scanning a different partition or segment of A. And then at some point the system has to be able to coalesce the results from the different operator instances and put things back together. We'll cover that in a second. Then the unit of computation, when we do scheduling and organize these operator instances, is what we'll refer to as a task, also called a pipeline. This is some sequence of operators that we can execute in sequential order, one after another, on some core or some thread. So in this example query here, where I'm joining A and B with a simple join clause and a where clause on the A value and the B value, there would be two pipelines. Pipeline one would scan A, do the filter, and then produce output to the build side of the hash join: build the hash table. And pipeline two would scan B, do the filter, probe the hash table, and if the tuple matches, do the projection coming out of it. So in this case, the system knows it has a dependency: pipeline two can't run until pipeline one completes, because I can't probe the hash table until the hash table is populated. Otherwise, I could have a false negative. So the system knows it has these tasks or pipelines, it can schedule them, and it can have multiple invocations of the same pipeline with different operator instances, each operating on different parts of the input data. OK? So today's class, we're going to talk about the MonetDB/X100 paper you guys read. And yes, it is from 2005. But as I said about the other papers we've read, these are the fundamental papers that lay out the key ideas that are the backbone of the design of modern OLAP systems. And this VectorWise paper is very influential.
Then we'll spend more time talking about the different processing models: how to actually organize the flow of data through the DAG of the query plan. And we'll finish up talking about the basic levels of parallel execution you can have. OK? So this paper, as I said, is old, but again, it lays out the fundamentals of this vectorization approach. The paper describes an improved version of MonetDB, which they called X100, where they showed how existing database systems at the time were not built to target modern superscalar CPUs, and that if you redesign the database system to better target what the CPU actually expects or wants, in terms of the instructions and data given to it, then you can get much better performance. Yes. The question is, does this satisfy the three optimizations we talked about? For this paper, yes. You can actually get fewer instructions too. The paper wasn't focusing on instructions, it was focusing on the cycles, but there's background on how VectorWise, which is what X100 became, does compilation for predicate evaluation, and that's going to allow them to execute fewer instructions. We can take that offline and look at it again. Yeah. So the main takeaway I want you guys to get from this paper is this idea that humans oftentimes build database systems, or software systems in general, in a way that's easy for humans to reason about and understand when they implement it. But oftentimes the way that's easiest for a human to understand and implement is actually the worst way to run it on the CPU. And although it's going to be more engineering work (which is fine, because it's hard to do, and therefore they hire CMU students and pay them a lot of money to do this), you're going to get way better performance if you write the system based on what the hardware actually expects and wants. Now, this paper is dated. They talk about the Itanium CPU, which I'm assuming most of you have never heard of. Actually, who here had heard of Itanium before this paper? Two of you. Well, you work on compilers, so that makes sense. Itanium was this collaboration between Intel and HP in the 2000s. It was going to be the replacement for x86. It had some interesting ideas, but the compilers never supported it well, nobody ever actually switched over, and x86 is still dominant, at least for now. Itanium was still a superscalar CPU, meaning it had these long pipelines of instructions that you could queue up, and the system was going to try to keep that pipeline always full, maybe speculatively executing instructions, so that it didn't have to flush the pipeline and go fetch a whole new batch of instructions from memory. Back in the day when this paper was written, CPUs like the Pentium 4's Prescott architecture had these massive pipelines, like 31 stages. It's gone down since; I think the latest is like 20 or something like that. It's more complicated today because the latest Intel CPUs, like Raptor Lake from 2022, have both efficiency cores and performance cores, with different numbers of pipeline stages. Things are getting really wild now, but the basic idea is still the same: on these modern superscalar CPUs, you never want to have to flush the pipeline and go fetch other instructions.
And so this paper describes the way you can design a system for analytical queries that is most efficient for what the CPU expects. Based on these findings and their experiments, they built this new prototype called X100. This was later renamed VectorWise, and they spun it out of CWI. VectorWise was acquired by Actian in 2010. When I used to teach this course, I used to say, oh yeah, Actian acquired it and killed it off, because when you tried to search for it you couldn't find it on their website. Then somebody from the company emailed me and complained and said it does actually exist; they brought it back. So now it's just called Vector, which I think is a terrible name: you search "vector database" and that's a whole class of databases. True story: the first name Stonebraker wanted to use for VoltDB was "Sequel," because it was supposed to be the sequel to Oracle. But you search "sequel database," S-E-Q-U-E-L, and that'd be a terrible thing to search for. Well, "vector database" is a terrible thing to search for, too. Whatever. Avalanche is the name of the hosted cloud version. Marcin Zukowski and Peter Boncz worked on this project. Marcin then went off and co-founded Snowflake, and a lot of the ideas in this paper you guys read are what showed up in Snowflake. And Peter Boncz is involved in DuckDB right now as well. So again, this paper is very influential. So I'm going to give you a quick crash course on everything you need to know about CPUs for databases, in two slides, and we'll see how it goes. Of course, things are becoming more complicated. We can ignore GPUs, and we'll ignore FPGAs. I don't think GPUs will take over, but FPGAs are showing up in databases a little bit more now; they might be something you have to consider, but let's assume everything is still running on the CPU today. Okay, so the way the CPU organizes instructions is through these pipeline stages. The goal of having a superscalar architecture is that, because you have all these different functional units in the CPU for doing computation, you want to keep them busy at every single cycle. You don't want the CPU to be stalled because it has to fetch something from memory, so it can peek ahead in the instruction stream and try to execute things ahead of time. That way you're using all parts of the CPU as much as possible. Modern superscalar CPUs have multiple pipelines, handle multiple instructions, and are allowed to execute things out of order. But x86 is really strict about making sure that the results in memory match what would have happened if you had executed everything in sequential order, so there's a bunch of mechanisms to make sure things always track correctly. I think ARM relaxes this a bit. But if you're going to allow out-of-order execution, you have to have checks to make sure things end up the way they would have if you had executed sequentially. It's similar to speculative execution and optimistic concurrency control in transactions. Right? So what happens is the CPU starts executing things out of order, say because there's a conditional branch and then a jump.
And so it makes a prediction about whether it's going to take that jump or not, and it starts executing things based on that prediction. If it then finds out the prediction was incorrect, it has to throw away all the results it computed, roll back, and jump back to where it should have been. That usually requires a pipeline flush, because you've got to go fetch the instructions you should have brought in instead of the ones you assumed you were going to bring in. So all this out-of-order execution is fantastic until there's a problem, until it gets it wrong. And this can occur in basically two ways. One is if there's a dependency between instructions: the output of one instruction is used as the input of the next instruction. Then you obviously can't do out-of-order execution, because you have to wait to see what the output is before you execute the next thing. There's no real easy way for a database system designer to avoid this. If the output of an operator is a tuple that the next operator wants to do some processing on, we can't speculatively execute the next operator's instructions, because we need to know what the output is. So there's not much we can do for the first one. The second one is the problem that he mentioned: if the CPU tries to predict which path we're going to go down at a conditional, and it gets it wrong, then it has to flush the pipeline, roll back, and start again. And refilling that pipeline is expensive. Now, we're talking nanoseconds here, but if you're doing this for a billion tuples over and over again during a scan, it becomes very expensive; the CPU can end up running 3x slower than it should have. So I've already sort of said where this is going to matter a lot: when we start doing scans and evaluating predicates. A predicate is clearly a conditional, and whether the conditional evaluates to true depends on what the data actually looks like (the database itself, the tables) and also on what the predicate is. The branch predictor in a CPU is super sophisticated and can do amazing things, but this is really hard for databases, because every query can be different and every database can be different. It's really hard to bake in logic that says, for this database, here's what you should be doing, and the branch prediction algorithms are secret; this is not something Intel or AMD share. So what happens when it goes wrong? Because we have these long pipelines, and speculating across branches is what hides the stalls when we have dependent instructions. But when we start doing a scan, as I already said, the CPU is going to have a hard time predicting whether we fall into the if clause or not, whether this tuple matches my predicate. So one way to get around this, potentially, in theory, is that in C++ there are these likely and unlikely keywords. GCC has its own built-in intrinsic, and now the C++ standard has this. In the old days this was actually a hint to the CPU, but since about 2006 the CPU ignores it: the compiler will generate a hint opcode for it, but x86 CPUs ignore it.
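Just so you can see what this looks like, here's a minimal sketch showing both the GCC intrinsic and the C++20 attribute spelling. The predicate itself is made up for illustration:

```cpp
// GCC/Clang intrinsic form (pre-C++20):
//   if (__builtin_expect(key <= low, 0)) { return false; }

// C++20 attribute form:
bool Passes(int key, int low, int high) {
  if (key <= low) [[unlikely]] {   // we claim most keys pass the filter
    return false;
  }
  if (key >= high) [[unlikely]] {
    return false;
  }
  return true;
}
```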
Again, this is just a hint to the compiler to try to place the branch that is most likely to happen contiguous with the main block of code, because a straight-line sequence of instructions is better for the CPU than having to do jumps. As far as I know, no database system actually uses this, and there's a blog article from a compiler engineer at Intel that says don't do this. So we're basically left on our own as the database system to figure out how to get around this problem, because we can't tell the compiler anything useful and we can't tell the CPU anything. The way we do it is by avoiding the thing that's causing the problem: the branches. Say we have a simple query like this: SELECT * FROM table WHERE key > low AND key < high. A basic implementation, probably what you implemented for project one, would look like this: iterate over every tuple in the table, grab the key I want to evaluate my predicate on, then I have my if clause. If the key is greater than the low value and the key is less than the high value, then copy the tuple into my output buffer. So what's the line that could be the problem for us here? I've already said it. Yes, it's the if statement, right? Think about it: I'm going through a billion tuples, and in the worst-case scenario, say the outcome alternates every other tuple, or it's completely random. Sometimes the CPU predicts the branch is taken and it's wrong; sometimes it predicts it's not taken and it's wrong too. So we can rewrite this simple sequential scan without any branches. Now what we have is: we always copy the tuple into our output buffer. Then we grab the key and do two ternary expressions: if the key is greater than the low value, one, otherwise zero; and we AND that with another ternary that says if the key is less than the high value, one, otherwise zero. ANDing those two together gives us a delta value, and that tells us whether we should increment our offset into the output buffer by one. So if the predicate doesn't evaluate to true, the delta is zero, and when we come back around we just copy over the same slot again. Now, I'm omitting the little bit of extra work you have to do outside the for loop: for the last tuple, if the delta was zero, then that copy shouldn't count, and I have to make sure I ignore it when I produce my output. But you get the high-level idea; I'll show a code sketch of both versions right after this. Again, it's the same operation; we've just designed our database system to evaluate this predicate in a way that's better for the CPU. And there's this graph, from the 2013 VectorWise paper, where for some large table they vary the selectivity of the where clause and compare the branching versus branchless approaches I just showed. What you see is that the red line, the no-branch line, is flat no matter what the selectivity is, because you're always copying things into the output buffer whether or not they're retained. The branching implementation actually performs better when the selectivity is very low, up to roughly 5%, but beyond that, branch misprediction becomes the problem and it just takes way longer to process every tuple.
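Here's that sketch: a hedged C++ rendering of both scan variants. The tuple struct and buffer handling are simplified stand-ins, assuming fixed-length tuples that can be copied with plain assignment:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { int32_t key; int32_t payload; };  // fixed-length tuple stand-in

// Branching version: the if statement is what the branch predictor
// has to guess for every single tuple.
size_t ScanBranching(const std::vector<Tuple>& table, Tuple* output,
                     int32_t low, int32_t high) {
  size_t n = 0;
  for (const Tuple& t : table) {
    if (t.key > low && t.key < high) {
      output[n++] = t;
    }
  }
  return n;
}

// Branchless version: always copy into the next slot, then advance the
// cursor only if the predicate held; a rejected tuple just gets
// overwritten on the next iteration. `output` needs capacity for
// table.size() tuples, since the slot past the last match is scratch space.
size_t ScanBranchless(const std::vector<Tuple>& table, Tuple* output,
                      int32_t low, int32_t high) {
  size_t n = 0;
  for (const Tuple& t : table) {
    output[n] = t;  // unconditional copy
    // Single '&' on the two ternaries rather than '&&': short-circuit
    // evaluation would quietly reintroduce a branch.
    int delta = (t.key > low ? 1 : 0) & (t.key < high ? 1 : 0);
    n += delta;
  }
  return n;  // any stale copy sitting at output[n] is past the logical end
}
```

Note that n never counts rejected tuples, so the "ignore the last copy" fix-up from the slides amounts to just not reading past index n in the output buffer.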
Back to that selectivity graph: there's obviously a huge hump in the middle, because at 50-50 it's a coin toss whether a tuple matches or not, and the CPU is just mispredicting all the time. And then there's a small flip at the far end, where at 100% selectivity the branching version just does better. Yes. So the question is, how do I make sure that I don't keep the things I shouldn't be copying? So the output is just a buffer, right? Just an array, assuming everything is fixed-length. It's this delta here. I always copy the tuple into whatever my next position is. If the tuple matches, then when I come back around, the position has been advanced by one, so I copy into the next position in the output buffer. If the predicate evaluates to false, meaning the delta is zero, then when I come back around, I just overwrite the last thing I copied in. That avoids having any false positives in my output buffer. And again, I have that little bit of extra logic outside the for loop to ask, was my last one true or false? If not, ignore it. So this seems counterintuitive, right? As a human, you think, I'm copying things over and over again for no reason. But it turns out, assuming everything fits in memory as in this example, that this is actually better, because it's better for the CPU. And I'll point out that the thing they're measuring here is CPU cycles per tuple. The branchless version is roughly just under four cycles per tuple; in the worst case, the branching version is up at almost 10 or 11. So almost a 2x performance difference. Again, this is a good example of designing the database system not the way a human would naturally write it: if we understand what the CPU wants from us, what the best-case scenario for it is, we can implement better algorithms. Now, in terms of reducing the number of instructions, we'll talk more about this when we get to query compilation and specialization. The basic idea is that systems like Postgres and MySQL are designed to be general purpose, so their predicate evaluation code deals with values in an abstract way. When they actually want to apply a predicate, or do whatever operation they want on a value, they have these giant switch statements that say: if the left side of my predicate is an integer and the right side is an integer, do this; if my value is this other data type, do that. You see these giant switch statements dealing with all the different data types in these systems, because the expression is basically being interpreted. There's metadata that says, I have this data type, here's the memory size, the offset, here's the type, and then there's something that knows how to take that data and, say, add two of them together. The problem is that now we have many more branches because of these giant switch statements, and the CPU is not always able to easily predict which path you're going to take through them. Maybe the prediction part is less of an issue, but there's still the overhead of setting up the switch statement, doing the lookup, jumping, and so forth.
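To make that concrete, here's a hedged sketch of what this interpreted evaluation path looks like. The TypeId tags and Value layout are invented for illustration, not lifted from Postgres or MySQL:

```cpp
#include <cstdint>

// A generic runtime value, the way an interpreter-style engine sees it.
enum class TypeId { INT32, INT64, FLOAT64 };

struct Value {
  TypeId type;
  union { int32_t i32; int64_t i64; double f64; };
};

// Adding two generic values requires runtime dispatch on both type tags:
// branches the CPU must resolve again for every single tuple.
Value AddValues(const Value& lhs, const Value& rhs) {
  Value out = {};
  switch (lhs.type) {
    case TypeId::INT32:
      switch (rhs.type) {
        case TypeId::INT32:
          out.type = TypeId::INT32;
          out.i32 = lhs.i32 + rhs.i32;
          break;
        case TypeId::INT64:
          out.type = TypeId::INT64;
          out.i64 = lhs.i32 + rhs.i64;
          break;
        default: /* ...every other combination... */ break;
      }
      break;
    default: /* ...cases for INT64, FLOAT64, and so on... */ break;
  }
  return out;
}
```

A specialized engine would instead emit or select a code path that just does `a + b` on two int32 columns, with no tag dispatch at all.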
And that's just for native data types, like the IEEE 754 standard, where the hardware actually has instructions to do native operations on 32-bit integers. If you start talking about more complex types that are implemented inside the database system itself, like the Postgres NUMERIC type, then you're just screwed: it's this giant function with all these if clauses, and the CPU is going to have a terrible time trying to predict what path you're going to go down. And this is just to add two numerics together. So again, we want to design the database system so that when we execute queries, ideally we reduce this generality, this general-purpose nature, and have exactly the code path you need if you're adding two integers together or comparing two integers. Query compilation is one way to do this; VectorWise does sort of pre-compiled predicates. We'll see how to do this later on. All right, so now we want to talk about how we're actually going to execute a query plan. Again, this is at a high level, at a logical level: not so much what's inside the physical operator instance, how we implement the operator, but how we move data from one operator to the next. If you took the intro class, we covered this, but we'll go into a bit more detail on these approaches. The first one is the iterator model. This is the most basic one, the one most databases implement. The materialization model is, in my opinion, better for OLTP systems, though there are some OLAP systems that have implemented it. And then the vectorized batch model is what we're going to assume going forward as the processing model we want in our database system. Of course, there are benefits and disadvantages to these different approaches, but we'll describe them in the context of OLAP. So, the iterator model, sometimes called the pipeline model or the Volcano model. This is basically how the first database systems were implemented in the 1970s. The Volcano paper came out in the early 90s, and it's the first one that described the high-level idea of what this approach was doing; it gave us the taxonomy and the nomenclature for describing it. But just because the paper came out in '92 doesn't mean people weren't doing it before. The way it works is that every operator in the query plan has this next function, and on every invocation, the operator is responsible for returning the next tuple in the stream of data it's processing, or, if it reaches the end, returning null to indicate it has no more data. So the idea is that you start at the root of the query plan, call next going down, and you're moving data up from one operator to the next. It doesn't necessarily have to be from the top going down; we'll see how to go the other direction, but for our purposes here, we'll assume it's from the top to the bottom. So let's look at an example. Say I have a query where I want to join R and S, and then I have a where clause on S.value. All these operators here are implemented as little bits of pseudocode. I normally don't like to show code in lectures, but these are simple enough that you guys can get it, and it's kind of hard not to do this here.
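To go along with the slide pseudocode, here's a hedged C++ sketch of the same contract. The class names and the tuple type are mine, not from any particular system; std::nullopt plays the role of the null that signals end-of-stream:

```cpp
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // stand-in for a real tuple representation

// Iterator (Volcano) model: every operator exposes Next().
struct Operator {
  virtual ~Operator() = default;
  virtual std::optional<Tuple> Next() = 0;
};

struct Filter : Operator {
  Operator* child;
  bool (*predicate)(const Tuple&);

  std::optional<Tuple> Next() override {
    // Pull from the child one tuple at a time until one passes.
    while (auto t = child->Next()) {
      if (predicate(*t)) return t;
    }
    return std::nullopt;  // child is exhausted
  }
};
```

The root just loops on Next() until it sees nullopt; every tuple costs at least one virtual function call per operator it passes through, which is exactly the per-tuple overhead the vectorized model will amortize later.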
So all of these are the next functions for the different operators, all right? To start executing this query, assuming we're going from the top to the bottom, we start at the root, call next on the top operator, and you see that inside its for loop, the first thing it does is call next on its child. So that goes down and invokes this function here. Oops, sorry. I don't like not having a clicker; I should buy a new one. Right, so then this calls down to this operator here for the join. Now you see we have two for loops: on the left child we call next to build the hash table, and on the right child we call next to probe the hash table. So the first thing we do is call next on the left child. That comes down here, and inside this we have a for loop that iterates over the table, and for each invocation of next, we send back a single tuple. We then build our hash table with it, and come back and do it again. And once the next call for this leaf operator returns null, the parent operator at step two knows there's no more data this child is going to send, so it stops calling next on it. Then we drop down to the next for loop, on the right child, call next on that, which calls next on its child, which scans the table, and we move data up into the for loop, probe the hash table, and then push it up to the parent up here. Pretty straightforward, but the key thing is that we're moving one tuple at a time. As I said, this is what most database systems implement, especially in the OLTP world; modern OLAP systems, less so. The nice thing about it is that it allows for pipelining, which matters in a disk-based system, meaning I want to be able to bring a tuple in from disk into memory and do as much work as I can on it before I go back and try to get the next tuple. So going back to this side of the query plan: for every invocation of this next function, I pass up a single tuple, I immediately evaluate the predicate on it, I immediately see whether it matches in the hash table, and I can then do the projection. I didn't have to pause and put it into an intermediate buffer; I ride it all the way up to the top. It keeps going for as long as it can go, and ideally produces the output for the result of the query. What's also nice is that output control is really easy with this approach: if I have a limit clause, I can have that be the root of the query plan, and if it's a LIMIT 10, once it gets all 10 tuples, it doesn't call next anymore and just stops execution. So you can control how many tuples you're producing without having to materialize everything all at once. We'll see problems with the next approach because of that. Now, some operators have to block until their children emit all of their tuples. These are called pipeline breakers. This just means I can't probe the hash table until I've built the hash table, as I said before, or I can't do a limit on sorted data until I've sorted the data. So again, pretty much every single database system implements this approach. These are just the major ones, off the top of my head, though I wouldn't call all of them major anymore.
NuoDB got bought, and they only have one customer. But SQLite, Oracle, MySQL, Postgres, DB2, SQL Server: this is what they do. Yes. The question is, when I say pipelining, do I mean pipelining where all the stages are going in parallel? No. Going back to my example in the beginning, I just mean that there's a sequence of operators. And the question of whether, within a pipeline, you could have one thread doing one operator and another thread doing another: that's intra-query parallelism, and that comes later; we're not even there yet. This is just about how you move data from one operator to the next. We'll get to how you actually implement it when you have multiple threads. Yeah, there's a language issue here: I say pipeline, and the CPU people mean something different. I say latch, and the OS people say lock. Okay, all right, so the next processing model is the materialization model. You can think of the iterator model and the materialization model as the two extremes of the design space. The iterator model gives you back one tuple for every single next call. In the materialization model, for every output call you get all the tuples that the operator is ever going to emit, and then you never go back and ask it for more. So obviously, there are benefits and problems with this. One problem is, as I said before, you can't control the output. If you only want 10 tuples, but you call the scan operator and it produces a billion tuples, you would have to push down or inline the limit so that the operator knows not to produce more than it actually needs. The advantage is that there's less overhead in executing queries, in terms of function calls and jumps and branches and things like that, because it's one invocation of the operator and you get all the results. Whereas in the iterator model, if I have a billion tuples coming out of the table, I've got to call that next function a billion times. And it doesn't sound like much: a function call, how expensive could that be? But even if everything's in memory, already fetched in, that starts to add up; that starts to be expensive. Now, the output of all these approaches could be either whole tuples or a subset of columns. For an OLAP system that's a column store, it typically depends on whether you're doing early or late materialization. So this is the same query we had before, but now our functions all define an output buffer. They do whatever it is the operator wants to do inside the for loop, keep adding things to the output buffer, and then return it. So when I call the output function at the top here, the top operator blocks because control switches over to its child. Then I call my left child, get all of its output to build my hash table, and again, I block until I get all the tuples; then I iterate over them and build my output. Then I go down the right side and do the same thing: this operator evaluates the predicate, but it has to go get all the tuples from its child first, and then everything gets pushed up like this.
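A minimal sketch of that, with the same made-up tuple stand-in as before: each operator's Output() blocks and returns everything at once.

```cpp
#include <vector>

using Tuple = std::vector<int>;  // same stand-in tuple type as before

// Materialization model: one call, all the results. No per-tuple calls,
// no iterator state, but also no way to stop a child early.
struct MatOperator {
  virtual ~MatOperator() = default;
  virtual std::vector<Tuple> Output() = 0;
};

struct MatFilter : MatOperator {
  MatOperator* child;
  bool (*predicate)(const Tuple&);

  std::vector<Tuple> Output() override {
    std::vector<Tuple> buffer;
    for (const Tuple& t : child->Output()) {  // blocks until the child is fully done
      if (predicate(t)) buffer.push_back(t);
    }
    return buffer;  // the operator's entire result, materialized at once
  }
};
```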
On the slide, the blue lines are the control flow, the parent operator telling the child operator, hey, do something for me; and the red lines indicate the data flowing back up. Control flow versus data flow. Now, in this case it's actually really inefficient to do it this way, because all this operator is doing is taking all the tuples out of S and putting them into an output buffer, which the next operator is just going to iterate over again. So an obvious optimization that you see in materialization systems is to inline the predicate directly into the scan operator. You can do the same for other things as well. If the data needs to be sorted, don't materialize it into an output buffer and then pass it into a sort operator; inline that too. So I say this every year: the materialization model is good for OLTP, bad for OLAP. But there are two OLAP systems that implemented it, and one of them still does. HYRISE was an academic system out of Germany that implemented the materialization model; they rewrote it and switched to the vectorized model, which comes next. MonetDB, one of the first academic column stores, still uses the materialization model. And this is kind of sad, but the guy who invented MonetDB passed away, I think last year. There would be people in India who watched these videos, and I would say, yeah, they shouldn't have done it this way, and then they would email Martin and complain that Andy said this, and he would have to respond to them, and then he complained to me. But no, it's not a joke: he was steadfast about the materialization model in MonetDB. As far as I know, no other OLAP system does this, because it seems insane; you have to do heavy inlining of the operators to avoid excessive copying from one operator to the next. Again, this is great for OLTP, because if everything's in memory and you know you're passing around a small number of tuples from one operator to the next (like, go get Andy's bank record: it's a single account record, only one tuple getting passed along), you get better performance knowing you're only passing a small amount of data, because it's one call to get everything and you never go back. Whereas with next, you have to set up the iterator in some memory context; it's a bit more involved. All right, so again, two extremes. The iterator model is one tuple at a time; the materialization model is all the tuples at once. The solution we're going to want for OLAP is somewhere in the middle, and this is what the vectorization model is. It's like the iterator model in that every operator implements the next function, assuming you're going top down (I'll explain what that means in a second), but instead of returning back a single tuple, you return back a batch of tuples. And again, it could be entire materialized tuples, or just the columns you actually need going forward, depending on whether you're doing late or early materialization. The loop that calls next now gets a batch of tuples at a time to process.
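Here's the same filter reworked for the vectorized model, as a hedged sketch. Real systems pass column vectors plus a selection vector rather than rows of tuples, and the batch size here is an arbitrary choice of mine, but the control flow is the point:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;
constexpr size_t kVectorSize = 1024;  // arbitrary batch size for this sketch

// Vectorized model: same Next() contract as the iterator model, but the
// unit of exchange is a batch of tuples instead of a single tuple.
struct VecOperator {
  virtual ~VecOperator() = default;
  virtual std::optional<std::vector<Tuple>> Next() = 0;
};

struct VecFilter : VecOperator {
  VecOperator* child;
  bool (*predicate)(const Tuple&);

  std::optional<std::vector<Tuple>> Next() override {
    auto batch = child->Next();
    if (!batch) return std::nullopt;  // child exhausted
    std::vector<Tuple> out;
    out.reserve(batch->size());
    // One tight loop over the whole batch: this is the shape that lets a
    // real engine evaluate the predicate with SIMD over column data.
    for (const Tuple& t : *batch) {
      if (predicate(t)) out.push_back(t);
    }
    return out;  // possibly fewer than kVectorSize tuples, maybe even empty
  }
};
```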
And this is what allows us to take advantage of SIMD vectorization, because now we have multiple tuples available at the same time, and we can use vectorized instructions to apply our predicates, or whatever else we want, on those batches, instead of calling next, looking at a single tuple, and not being able to take advantage of SIMD at all. So again, same query, same idea, but now in our next functions, as we go down, they pass up a vector. In our for loop, when the vector reaches our threshold size, we send it back up, and then the parent can call next on us again and get more. Same thing going down the other side. And we could do the same inlining optimization we had before; it doesn't matter. All right, so this is ideal for OLAP, because it's the best of both worlds. It gets us batches of tuples, but not all of the tuples at once, so we reduce the number of next calls we have to make, and we avoid the overhead of calling next over and over again. And this is what pretty much every modern OLAP system does. This is just a small sampling, but these are the major ones; every modern OLAP system implements things this way. Surprisingly, CockroachDB, even though they're not selling themselves as an OLAP system, has a blog article that describes how they use a vectorized model even though they're targeting transactional workloads. They do support analytical queries, and they're not a column store, but they still do a vectorized model. So going forward, this is the right way to do this. All right, any questions about the processing models? All right, so I already said this as a spoiler, but we're going to go through it in more detail now. In all the examples I showed before, the way we executed the query, in every processing model, was always to start at the root, call next (or whatever the invocation function is), and have that root operator call next on its children, percolating down the query plan. Essentially, you're pulling data up from the leaves of the tree into the root, because that's where you produce the output for the client. This is how most database systems implement query execution, but it's not the only way. The approach I showed before is called top-to-bottom, also called the pull-based approach, where the idea is that you're pulling data up from the children to the top. In this case, you always have to pass the data from one operator to the next through function calls. Each operator is implemented as some abstract plan node or operator type: I just know what my child pointers are, I call next on them, and they bring me back some tuples. The idea with relational algebra is that you don't care what the operators below you actually are; you just know the contract of the API, that they're going to return tuples. This is why you implement it with these functions. The alternative is to go bottom-to-top, and the idea here is that you start the query by initiating execution at the leaf nodes themselves. They generate output that they push to their parents, and once the leaf nodes are done producing their output, you never have to go back and invoke them again.
You'll see this in an upcoming paper: next class is on query scheduling, and after that is query compilation. The query compilation paper you're going to read is from HyPer. The main focus of that paper is query compilation, but he also describes how to do push-based execution in it. That's how awesome the Germans are. He wrote a paper by himself, mind you, with three kids, on here's the way to do query compilation with LLVM, and it's a seminal paper in this area. And then he also says, oh, by the way, here's this push-based thing you should be doing too. It's insane. We did a variant of this in our own system, which we'll cover later on. But HyPer was the system that kicked off this new wave of people using push-based execution. DuckDB used to do pull-based execution, but there's a GitHub issue from Mark where he describes why they should switch to push-based, and they did that in the last two years. He describes all the reasons why you want to do this. Again, it doesn't come for free; there are other things you have to worry about. And Snowflake uses the push-based approach as well. So let's see how we do bottom-to-top, push-based execution. Again, we think about the query plan in terms of pipelines. The first pipeline is: scan R, and then, assuming we're doing a hash join, build the hash table on the build side of the join. The second pipeline is: scan S, apply our predicate, probe the hash table, and then do our projection. So the first pipeline, in a push-based approach, is just a for loop in which you build the hash table. Nothing fancy there. All you have to do is invoke this function (this here is the operator), just execute it. Then, once the system is notified that this operator is done, the next pipeline is just this other for loop, where you iterate over every tuple in S and do all the things going up the query plan that I was talking about before: you try to ride the tuple up the query plan through every single operator, as long as it keeps evaluating to true. So the first thing you do is evaluate the predicate. If that evaluates to true, then you probe the hash table. If there's a match, then you go ahead and do the projection. Now, in this example I'm assuming there's going to be at most one tuple matching in the hash table. Obviously, if the probe can give back multiple matches, this needs to be a for loop on that side, but we can ignore that. So what's the potential benefit of this? Well, one obvious benefit I've already said: like the materialization model, there are no function calls for next, next, next. I invoke this pipeline, it completes; then I invoke this one, and it's done. And I'm showing it here tuple-at-a-time, iterator-model style. There's a paper showing you can actually do this in batches, in a vectorized approach, to blend the two. It requires some extra work with staging buffers, but the HyPer paper you guys read will show you how they do this.
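Here's a hedged sketch of those two pipelines as plain C++ functions. The schemas and the projection are made up, and it assumes at most one match per key, just like the lecture does:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Made-up schemas for R and S; the projection below is arbitrary too.
struct TupleR { int32_t id; int32_t value; };
struct TupleS { int32_t id; int32_t value; };

// Pipeline 1: scan R and build the hash table. This is the pipeline
// breaker -- nothing downstream runs until this loop finishes.
std::unordered_map<int32_t, int32_t> Pipeline1(const std::vector<TupleR>& R) {
  std::unordered_map<int32_t, int32_t> ht;
  for (const TupleR& r : R) {
    ht.emplace(r.id, r.value);
  }
  return ht;
}

// Pipeline 2: scan S and ride each tuple up the plan as far as it goes:
// filter -> probe -> project, all in one loop body, with no next() calls.
std::vector<int32_t> Pipeline2(const std::vector<TupleS>& S,
                               const std::unordered_map<int32_t, int32_t>& ht,
                               int32_t threshold) {
  std::vector<int32_t> out;
  for (const TupleS& s : S) {
    if (!(s.value > threshold)) continue;  // the predicate on S
    auto it = ht.find(s.id);               // probe the build side
    if (it == ht.end()) continue;          // no join partner; tuple dies here
    out.push_back(it->second + s.value);   // projection (at most one match assumed)
  }
  return out;
}
```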
And one of the big benefits they talk about is not just trying to keep data in the CPU caches, but keeping data in CPU registers as you go from one stage inside the for loop to the next. And that's really fast, way faster than L1, of course. Right, so he says that in this case, pipeline two has to wait for pipeline one. Yes, because you can't probe the hash table until you finish building it. But that was true regardless of whether it was push-based or pull-based. There's no magic that gets you out of building the hash table; you have to build it, because it's a dependency. Yes, so the observation is that this basically breaks the query plan into multiple pipelines, and the question is, how do you break it up? I'll cover this more next class, but the database system knows where the dependencies are. That's what he was saying: I have to finish building the hash table before I can probe it. We call those pipeline breakers. I can't take a tuple from R and ride it all the way up, because on this side of the join I'm building the hash table; I build the hash table, and then I'm done. That's the pipeline breaker. Once this pipeline is done, then I can execute the next pipeline. The database system knows what it's doing, and that's the beauty of SQL and declarative query languages. The system knows what the SQL query is, knows what the query plan is, and knows exactly what you're trying to do. It's not some arbitrary C function you've sent it that it's never seen before. It knows exactly what each stage is and how data moves from one stage to the next, so it knows where the pipeline breakers are, and it can orchestrate these things appropriately. Yes? The question is whether, between a pull-based and a push-based approach, there's a difference in terms of parallelization within a pipeline itself or across the whole plan. We'll get to this in a second. I'm not showing anything about parallelization here; this is just high level. Here's this pipeline, it's a for loop. I can easily have multiple cores scan the table, different chunks of it in parallel, and produce output. So you can do intra-query parallelism; that's horizontal parallelism. In terms of vertical parallelism, you can't do that in this example, because I can't execute this pipeline before that one executes. In some cases I may be able to; we'll see in a second. And the follow-up question is whether one of these approaches is better for parallelization than the other. Next slide. I don't think it matters for that; the real differences are elsewhere. In a push-based approach, it's hard to do a sort-merge join, because you need two iterators over sorted tuples that you call next on, and you can't easily do that push-based. In a pull-based approach, you can. In a pull-based approach, it's also easy to implement limit, because, as I said before, if I don't need any more tuples at the root, I stop calling next and the query's done.
In a push-based approach, you can't do that as easily: a pipeline may have to produce all of its results before it can figure out, okay, I don't need anything else. It depends on which pipeline the limit falls in. Going back here (now it works): in this case, the limit clause doesn't help you on the build side, because you have to populate the hash table entirely before moving to the next pipeline. On the probe side, this example is simple enough that if you had a limit clause in here, you could stop early. But once you have multiple pipelines, it's more tricky. So again, assuming everything's in memory, there's still branching inside the pipeline itself, but I don't have that next call over and over again, because in the push-based model I just kick off the pipeline, and when it's done, it's done. We'll see this when we talk about query compilation. This is basically an example of code specialization, because assuming I can generate this code, I just execute it. I don't have to do a virtual function table lookup. Assuming it's implemented in C++, where I'm calling next on an operator that's an abstract type implementing the next function, I'd have to do a function table lookup to find the actual implementation of the function I want to execute. There's none of that here; it's just boom, boom, boom, do exactly what I want to do. I didn't want to bring in too much about code compilation, because you can do this without compilation, of course, but you can get amazing performance benefits, because you can easily imagine a system that could codegen this. Okay, so that's processing models; we spent most of our time on those. Let's finish up and talk about parallel query execution, because this shows up in a bunch of the things we'll be talking about. Again, this is another crash course on the basic ideas of parallel execution, and when we talk about different algorithms later, especially for joins, we'll see how they run in parallel, and you'll understand what context we're describing: within a single operator, within a single query. We'll see how it all fits together. So, it's obvious that the database system wants to execute multiple tasks at the same time to improve hardware utilization. If I have eight cores and eight tasks, and seven of those tasks finish and there's one last one still running, I don't want those seven cores sitting idle waiting for the straggler; I want them doing something. That's the goal here. And the tasks we want to execute don't have to be from the same query. Some systems do only execute one query at a time, but most systems don't do that. So there are two types of parallelism: inter-query parallelism and intra-query parallelism. And within intra-query parallelism, there are different types of operator parallelism. We'll go through each of these one by one, and again, this is independent of the processing model, and of whether the system is top-to-bottom or bottom-to-top. Inter-query parallelism is sort of obvious: I have multiple queries running at the same time. Most systems do this first come, first served: whatever query shows up first gets processed first, with higher priority than the ones that come later.
We'll see when we talk about Redshift that they actually have multiple queues, a fast query queue and a slow query queue, because you want to finish the fast queries as quickly as possible. The challenge with OLAP queries is that they have parallelizable phases and non-parallelizable phases. A parallelizable phase would be those scans I was describing: I can run them in parallel and populate some hash table. But at some point I may have to coalesce the results, obviously at the root of the query plan, because I have to send back a single answer to the client. So something has to wait until all the results arrive, combine everything together, and produce the output. Again, we want to keep the system active at all times. We'll talk about how to multiplex queries across different cores, and how to schedule them, next class, using coroutines and other techniques. So that's what we'll focus on next class, and I don't want to say too much about it right now. Instead, let's talk about how we parallelize things within a single query. The two approaches are intra-operator parallelism, which is horizontal parallelism: scaling out a single operator into multiple operator instances. And then inter-operator parallelism, vertical parallelism, which is what the earlier question was referring to: could the database system execute different parts, different pipelines, of the same query in parallel at the same time? And you can combine these; they're not mutually exclusive. Again, because we have the query plan inside the database system, we know exactly what the operators are trying to do, and what data they require as input and output. So the database system is responsible for figuring out, and can figure out, where it can exploit parallelism. Intra-operator parallelism is the most common one. The idea is that for a single operator, we make multiple instances of it that perform the same steps, the same operations, on different portions of the table. Think back to how we talked about breaking a database table up into a bunch of separate Parquet files or whatever: I can have a separate operator instance be responsible for processing the data within one of those files. They're all from the same table, but different threads are scanning different files. Then we introduce this exchange operator into the query plan, a sort of synthetic barrier to indicate where the end of the pipeline is. It's a meeting point for the different threads or tasks that are running: we can't proceed in the query plan until this exchange operator gets all the results it requires. The exchange operator comes from the Volcano paper; this is what they introduced to support parallel execution, horizontal intra-operator parallelism, in query plans. You'll also see this technique in the morsels paper you're reading for next class, from the HyPer guys, where they break the data into blocks they call morsels, and threads get assigned tasks, operator instances running over these different morsels. Here's a rough sketch of that idea.
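A hedged sketch of morsel-style intra-operator parallelism: the morsel size, the filter, and the single mutex-protected hash table are all simplifications of mine (HyPer uses smarter partitioned structures), but it shows how workers claim chunks of the table and how joining the threads plays the role of the exchange operator.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { int32_t key; int32_t value; };
constexpr size_t kMorselSize = 10000;  // tuples per morsel (arbitrary choice)

// Pipeline: scan -> filter -> build, run by several operator instances.
void ParallelBuild(const std::vector<Tuple>& table,
                   std::unordered_map<int32_t, int32_t>& ht,
                   int num_threads) {
  std::atomic<size_t> next_morsel{0};
  std::mutex ht_mutex;  // simplification: real systems partition instead

  auto worker = [&] {
    while (true) {
      // Claim the next unprocessed morsel of the table.
      size_t start = next_morsel.fetch_add(kMorselSize);
      if (start >= table.size()) return;  // no work left
      size_t end = std::min(start + kMorselSize, table.size());
      for (size_t i = start; i < end; i++) {
        if (table[i].value > 100) {  // the pipeline's filter (arbitrary predicate)
          std::lock_guard<std::mutex> guard(ht_mutex);
          ht.emplace(table[i].key, table[i].value);
        }
      }
    }
  };

  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; i++) workers.emplace_back(worker);
  // Joining here acts as the exchange operator: the probe pipeline can't
  // be scheduled until every instance of this pipeline is done.
  for (std::thread& w : workers) w.join();
}
```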
All right, so we have a simple query here: join A and B with some WHERE clause. Say we start on this side of the query plan. We'll have a bunch of operator instances of the scan on A, and we'll assign them to separate workers; think of cores, CPUs, or nodes, it doesn't matter for this discussion. The next step in the pipeline is the filter, so the data from the scan on A feeds immediately into the filter. And again, I'm not saying whether this is push or pull, iterator or Volcano; this is just the pipeline, this is what we're doing. Then we can push down the projection. This is optional, but the database system could recognize that the tuples in A are really wide, say a thousand or ten thousand columns, and rather than passing around ten thousand columns under early materialization, it can do the projection inside this pipeline. Then the instances build the hash table, and above that we introduce the exchange operator, because we need to know that the hash table has been fully built by all the operator instances for the scan on A before we tell the system to begin executing the next pipeline. So the pipeline is: scan A, then filter, then build the hash table, and each of these would be a separate task, a separate invocation of the pipeline. The exchange operator is what eventually steps in and says, okay, my hash table is done, now I'm ready to do the join. And note that I'm not saying here whether it's a single shared hash table or a partitioned hash table; that comes later when we talk about hash joins. For our purposes right now it doesn't matter.

Same thing over here on the other side: we scan B, do the filter with the projection pushed down, and then probe the hash table. Again, these are all separate invocations that can run on separate threads simultaneously. Up above we have another exchange operator, because this is the output we need to produce for the client. It waits to hear that all three operator instances, all three pipeline invocations, have completed and there's no more output coming; then it can coalesce the results and produce the final answer for the client. And we'll cover next class how to decide, for the scan on A, how many cores or how many instances to use. It depends on the size of the data, how selective the predicates are, and what the hardware can support; these are things the database system can decide at runtime.

The question was how you would do sorting. Sorting would go above the exchange operator, because you have to wait until you have all the results before you can start sorting them. If you have a parallel sort, you can do that, but then, depending on what's above you, you may need another exchange operator on top of it. Sorting is an example of a pipeline-breaking operation. A toy version of this two-pipeline build-then-probe plan is sketched below.
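Here's a toy C++ sketch of that plan, assuming none of this comes from a real system: pipeline one (scan A, filter, build) runs across several workers; joining those threads plays the role of the exchange operator, and only then does pipeline two (scan B, probe) start. A real engine would partition the hash table or use a latch-free one rather than a single mutex.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<int> A = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> B = {2, 4, 6, 8, 10};

    std::unordered_map<int, int> hash_table;  // key -> payload
    std::mutex table_latch;

    // Pipeline 1: each worker builds from its slice of A
    // (pushed-down predicate: even keys only).
    auto build = [&](size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i) {
            if (A[i] % 2 != 0) continue;
            std::lock_guard<std::mutex> g(table_latch);
            hash_table[A[i]] = A[i] * 100;
        }
    };
    std::thread b1(build, 0, A.size() / 2), b2(build, A.size() / 2, A.size());
    b1.join(); b2.join();   // "exchange": build must finish before probing

    // Pipeline 2: probe workers scan slices of B against the finished table.
    std::mutex out_latch;
    auto probe = [&](size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i) {
            auto it = hash_table.find(B[i]);
            if (it == hash_table.end()) continue;
            std::lock_guard<std::mutex> g(out_latch);
            std::printf("join: %d -> %d\n", it->first, it->second);
        }
    };
    std::thread p1(probe, 0, B.size() / 2), p2(probe, B.size() / 2, B.size());
    p1.join(); p2.join();   // top exchange: coalesce, return to client
}
```

The two join() pairs are the barriers: nothing probes until every build invocation reports it's done, and nothing goes back to the client until every probe invocation reports it's done.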
All right, so the next type of parallelism is inter-operator parallelism, or vertical parallelism. Here we have different operators in different parts of the query plan that we can execute at the same time, because we don't necessarily need to materialize results immediately as part of the output. You can think of it as a producer-consumer model: some pipeline is running over here producing tuples as its output, and it doesn't care where that output is going, while at the same time another thread is running as the consumer of that output, doing whatever it needs to do. This is called pipeline parallelism, because you're trying to parallelize the execution of different pipelines in the system: not different invocations of the same pipeline, but completely separate pipelines.

Here's a contrived example: I'm going to join A, B, C, and D together without a WHERE clause. Say I have a bushy query plan where I join A with B, join C with D, and then take the outputs of those and join them together. You can imagine a plan where each of these is its own separate pipeline that's executing. I do the A-B join (I'm not saying whether that join is itself parallelized internally; just say it's executing on one thread), and at the same time another thread is doing the C-D join. The outputs could then be partitioned, say hash partitioned, so that some of the tuples joined on each side go to one downstream instance and the rest go to the other. Actually, let me take that back: there's no WHERE clause, so this is a Cartesian product, matching everybody with everybody, and that's why you can do this completely in parallel. You send, or duplicate, the data to the different sides and everything gets joined together.

Again, this is a contrived example; you wouldn't want to do this in a real system. But this is how you could implement it, because there are no data dependencies, other than in the sense that the downstream operators are spinning, waiting for output. You typically see this in streaming systems with continuous queries: there's a constant stream of data coming in, some predicates or evaluation being applied to it, and it's always producing output; the query never ends, because as more data arrives, it produces more output. You don't typically see this in the OLAP systems we're talking about here, because the pipeline breakers give you those dependencies. A bare-bones producer-consumer sketch follows below.
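Here's a minimal producer-consumer sketch of that pipeline parallelism, with everything made up for illustration: one thread runs an upstream pipeline and emits tuples into a shared queue; a second thread runs the downstream pipeline, consuming as tuples arrive. A real engine would use lock-free queues and hand over batches of tuples rather than one at a time.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A tiny tuple queue connecting two concurrently running pipelines.
class TupleQueue {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(int t) {
        { std::lock_guard<std::mutex> g(m_); q_.push(t); }
        cv_.notify_one();
    }
    void close() {                        // producer signals end of stream
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_one();
    }
    std::optional<int> pop() {            // blocks until a tuple or close()
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return std::nullopt;
        int t = q_.front(); q_.pop();
        return t;
    }
};

int main() {
    TupleQueue pipe;
    std::thread producer([&] {            // upstream pipeline (e.g. A join B)
        for (int t = 0; t < 5; ++t) pipe.push(t * 10);
        pipe.close();
    });
    std::thread consumer([&] {            // downstream pipeline
        while (auto t = pipe.pop()) std::printf("consumed %d\n", *t);
    });
    producer.join(); consumer.join();
}
```

The producer never cares who reads its output, and the consumer spins waiting for input, which is exactly the dependency structure described above and why pipeline breakers kill this pattern in OLAP plans.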
Okay, all right, so to finish up. As I said before, the main takeaway from the MonetDB paper you read is that the easiest way to implement something may not be the best way to implement it for modern CPUs. The paper focused on superscalar CPUs, and that is still the defining architecture of today's CPUs. There are now performance cores and efficiency cores, and the techniques still matter for those different kinds of cores; it's just more complicated because there's more variation among them. But the idea is the same, and that branchless trick I showed before will still work on today's hardware. Going forward, the vectorized, bottom-up execution approach is probably, in most cases, the better way to execute OLAP queries. The HyPer guys make very strong claims that this is the case. We'll see that the Snowflake paper does this, and DuckDB does it now, though I believe the paper you read predates that. We'll see this again throughout the entire semester.

And for the algorithms we'll cover when we do joins, and for the parallel execution material, it's not going to matter whether execution is bottom-up or top-down for the things we care about in this class. But we'll see in the HyPer paper that they make a big deal about keeping things in CPU registers. Okay?

All right, so next class we'll talk about query task scheduling: how to decide, given a bunch of queries, which ones to run; and then, within a query, if I want to do these tasks now, which ones to run and where to run them. And we'll talk about coroutines, which let one thread process multiple tasks at the same time. Not literally at the same time, but interleaved: if one task can't get any data, it yields, and the next task runs. That avoids having to go back to the OS to reschedule things; I'll leave a small sketch of the idea below. Okay? And then apparently you have already completed project number one.
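As a teaser for that discussion, here's a bare-bones sketch of the cooperative-yield idea. It's a hand-rolled round-robin loop rather than real coroutines (an actual engine would use something like C++20 coroutines or a proper task scheduler), and all names are made up for illustration: one thread cycles over tasks, and a task whose input isn't ready returns Yield instead of blocking, so the thread moves on without a trip to the OS scheduler.

```cpp
#include <cstdio>
#include <deque>
#include <functional>

enum class Status { Yield, Done };

int main() {
    int ticks = 0;                                 // stand-in for "data arriving"
    std::deque<std::function<Status()>> tasks;

    tasks.push_back([&]() {                        // task 1: needs 3 units of data
        if (ticks < 3) return Status::Yield;       // input not ready: yield
        std::printf("task 1 ran\n");
        return Status::Done;
    });
    tasks.push_back([&]() {                        // task 2: always makes progress
        std::printf("task 2 ran, producing data\n");
        ++ticks;                                   // makes task 1 runnable later
        return ticks >= 3 ? Status::Done : Status::Yield;
    });

    while (!tasks.empty()) {                       // the single scheduling thread
        auto task = std::move(tasks.front());
        tasks.pop_front();
        if (task() == Status::Yield)
            tasks.push_back(std::move(task));      // requeue; try again later
    }
}
```

The key point is that the yield-and-requeue happens entirely in user space, which is the overhead the coroutine techniques next class are designed to avoid.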