We're now picking up where we left off before spring break and the midterm. So again, the overview of where we're at in the semester: we talked about the networking layer, and then we're starting from the bottom and going up, although we've already covered the storage manager. So right now we're at the part where query plans show up through the networking layer. We'll talk about how SQL shows up through the networking layer, and how the optimizer and the planner generate a query plan. Now, assuming we already have a query plan, we need to start executing the operators in that query plan to produce results for our query. So we're at this point here, but then we're still going to go up the stack and talk about how we can compile plans into efficient machine code to execute them very efficiently; that's the DSL stuff that came up last time. Then we'll go up even further and talk about how we actually generate an efficient query plan in our optimizer. So again, we're here, we're going up. The next couple of lectures are really going to focus on operator execution: how we can take some component or operator in our query plan and execute it efficiently. This is going to go way deeper than we ever covered in the introduction class, because now we're going to actually start caring about what the hardware is going to do: where the data we want to operate on sits in memory, what's in our CPU caches, what additional instructions the CPU provides that can do things in parallel or more efficiently. So basically, we're going to have a big bag of tricks we can use to run these queries super fast. Query plan processing, what we're talking about today, is basically how we're going to organize the overall system to have it execute the operators in the query plan. But we're not going to talk about what those operators actually are just yet. That's when we start getting into things like inlining application logic for user-defined functions; that'll be next Monday. Then parallel join algorithms, vectorized operators using SIMD, and then query compilation. So again, all of these together are the things we're going to use to execute operators. The high-level thing we're talking about today is how we organize the system to execute these operators: where do they send their data? Then everything else will be, now within a single operator, what crazy stuff do we do to make that run really fast? Some of these things are easier to do because we're in memory; we're not worrying about disk. So again, the goal of what we're talking about today and going forward is: we have some query plan that the optimizer generated from SQL (it doesn't have to be SQL, but in our world it is). Inside of this query plan, we're going to have operators. We'll make a distinction between physical operators and logical operators later on. But for our purposes, we have a physical operator that says, I want to do X on some data that I'm getting from either a table, an index, or a child operator, it doesn't matter. It's going to consume that and produce some output. Then, since we want to be able to run this in parallel, we'll call each running copy of that operator an operator instance.
So we could take one operator and deploy it on multiple cores, all of them running in parallel; each of those would be an instance. Then a task or pipeline will be a sequence of these operator instances running within a single core. So for our purposes here, we're focusing at a high level on how we take these tasks or pipelines and have them run on different cores. Where are they getting the data from, and where do they send their data to? That's the goal of today, but we need to understand why we want to organize things a certain way. As I already said, if you get rid of the disk, which is the whole point of this semester, and everything fits in main memory, then the goal for how we're going to speed up queries isn't as obvious as it was before. In a disk-based system, the disk is always the slowest thing. The network also sucks, but sometimes you can't avoid that. The disk sucks, it's slow. So we want to do things like heavyweight compression to minimize the amount of I/O we're doing, or minimize the time it takes to get data into memory. Now the disk is gone (we still need it for logging, that's unavoidable), but the thing we want to target to speed up our system is not as obvious as it was before. So it's going to be the orchestration of a bunch of different ideas together: query compilation is going to be a huge win, vectorization also matters, parallelization also matters. What I'm trying to say is there's not one thing I can point to and say, if you're building a new database from scratch, do this one thing first. You actually want to do all of them if you're building a modern system; none of them clearly stands out above the others. So what are our goals now? If we assume everything's in memory and the disk is no longer the bottleneck, what kinds of things do we need to start caring about? Well, it's the low-level parts of what the CPU is actually going to do. There are three different optimization goals we could have in trying to speed things up. The first is that we can reduce the instruction count of the execution of the query. That means we essentially want to do less: execute fewer instructions but still produce the same results. The compiler can help us in some ways, but many times it's also going to be us specializing our code or organizing it in such a way that we reduce the amount of work we have to do to execute a query, or one of these operators. The next goal is to reduce the number of cycles per instruction. The high-level idea here is, assuming the set of instructions is fixed (we've already done the first goal of reducing the instruction count), we now want to reduce the time it takes to execute them, and the metric we're going to use is cycles per instruction. So what does this mean? What's the main bottleneck that causes an instruction to take multiple cycles? Yes? Memory stores and loads? Memory stores and loads, or cache misses, is another way to think about this, right?
It's still a good answer. This is us trying to maximize the locality of the data that we have in our caches and reuse it as much as possible. And we'll see when we talk about the processing models, like the Volcano or iterator processing model, that those techniques are written in a way that's actually bad for locality: you operate on one tuple, and then rather than riding along with that tuple and doing as many things as possible with it, you go back and get the next tuple, right? The last goal is obviously to parallelize execution. We barely need to say this: Moore's Law is ending, so we're not going to get way faster clock speeds. We're hitting the physical limit of how small the die size, or really the transistor size, can actually be. So we're not going to get a huge performance boost just because a single CPU core got better. Instead, Intel is going to give us a bunch of cores, and we're going to make sure we implement operators so that we can execute them in parallel efficiently. We'll talk a little today about how we organize the overall query plan to parallelize operations when we execute it; then in the next couple of weeks we'll talk about how to take a single join algorithm and run it efficiently in parallel, okay? All right, so today our agenda has three parts. We'll begin with the VectorWise paper, the MonetDB/X100 paper that you guys read, and talk about their analysis of what modern CPUs look like and why existing implementations of database systems at the time, in the mid-2000s, were inefficient for that architecture, an architecture that still looks a lot like, not exactly the same as, what they described in the paper, but some parts are still the same. From that, we'll talk about the processing models: how we actually want to take a query plan, break up its operators into tasks or pipelines, and have them be scheduled within our system, using all the scheduling stuff we talked about before the break. And then we'll finish up talking at a high level about what the different variants of parallel execution look like. Again, think high level for the entire query plan; we'll cover how to execute each individual operator in parallel in subsequent lectures, okay? All right, so the paper you guys read, as I said, is from 2005, so it's almost 15 years old. But I actually love this paper. One, because it's super easy to read. The stuff at the end where they describe their system, maybe not so much, but in that beginning part they really lay out: here are the problems you have with in-memory databases running on modern hardware, and here's why everyone's going to get crappy performance for analytic queries. I'll also say that we're focusing on analytical queries here because this is where we're reading a large amount of data. For the OLTP stuff, there are things you want to do to get better performance and target the hardware that you have, but the bang for the buck is not going to be at the same scale as for analytical queries, because those are just reading way more data. In OLTP there's a bunch of other overhead you can avoid, like lock contention at the logical level, that will slow down the system, whereas analytical workloads don't have to deal with those things because, again, they're just reading data as fast as possible and processing it.
So the first half of this paper provides a really good low-level analysis of the performance issues in in-memory databases, or actually all databases, because they looked at MySQL as well, and Oracle. It looked at how these systems were not designed for the modern CPU landscape as it existed at the time. The way to reason about it is that the systems they looked at were written in a way that was natural for humans to write: from an engineering standpoint, easy for the programmer to reason about. But the way that's easiest for humans turns out to be the worst thing for getting efficient execution on the CPU itself. So based on these findings, they proposed a new system. At the time it was called MonetDB/X100, because Peter Boncz and Marcin Żukowski came out of the MonetDB project at CWI, but they forked this off and it later became VectorWise. VectorWise is actually really good. In other experiments we benchmarked VectorWise against some of the other systems we've talked about so far, and the last public version of it, from around 2014-2015, still beats modern systems. It was really good. So then VectorWise got bought by Actian. Actian is, I won't say a holding company, but they buy old databases. They bought Ingres, they bought some other stuff, and they bought VectorWise. And then they killed it. Well, they renamed it to Vector, then they killed it. If you didn't know what you were looking for when you went to the Actian website, you wouldn't find it. And so every year I would say, yeah, VectorWise is really good, but too bad Actian killed it off, right? Or they changed the name and made a version for Hadoop or some such thing. I kept saying in all of my videos, I love VectorWise, too bad they killed it. It got to the point where somebody at Actian saw it and emailed me: hey, just so you know, Vector is no longer in hiding, it's back on the Actian website. Because previously it wasn't available for download; you had to know where the old link was and find it using archive.org. Anyway, he tells me this has all changed, VectorWise is back from the dead. That was kind of nice, and then they actually had an announcement, what, three or four days ago: Actian has now rebranded it as Actian Avalanche, and they're selling it as a data warehouse in the cloud, sort of like Redshift or Snowflake, right? I can't comment on how good that is, but I'll say again that the single-node VectorWise is actually really good. Now, if I remember correctly, VectorWise is actually based on mmap. As much as I've complained about mmap, for read-only workloads mmap is okay, and they weren't doing transactions; they just load the entire database into memory. For some other background: Peter Boncz and Marcin Żukowski were the two people working on VectorWise. Peter went back to CWI and is building a new system there after he left Actian. Marcin went off to co-found Snowflake. So a lot of the techniques from the VectorWise paper, and things we'll talk about for the rest of the semester, are the kinds of things he went off and built into Snowflake.
That's part of the reason why Snowflake's really good. The database community is very incestuous, I guess. Okay, so there are some parts of this paper, as I said, that are a bit dated. They talk about Itanium, and I imagine some of you have never heard of an Itanium before. Who's never heard of Itanium? Raise your hands, okay. Oh, everyone, all right, perfect. Nobody uses it anymore; HP and Intel killed it off. They also talk about things like the pipeline for the Pentium 4 chip they were looking at, which had 31 stages, whereas a modern Haswell or Broadwell, since 2015 or so, is down to around 14 stages. So we don't have that super long pipeline anymore, but the problems you have with pipelines are still in effect and still matter to us in our database system. The hardware is slightly different, but the main ideas still apply. All right, so we need to understand what our CPU looks like in order to understand how we can design the system to operate efficiently on it. This is everything you need to know about CPUs for databases, in like two slides. So again, CPUs are organized into pipeline stages, and the idea of the pipeline is that because a single instruction may take multiple clock cycles (for example, a load from memory takes a couple of cycles before the data ends up in my CPU caches), the CPU can operate on multiple instructions at the same time, with different parts of the CPU executing a different stage of a different instruction at any given moment. And underneath the covers, you don't write your program assuming it's going to do all this weird out-of-order stuff; the compiler just generates the instructions, and the CPU figures out how to shuffle things around while it's running so that it's crunching instructions at all times, right? Modern CPUs have these pipelines, and if you have multiple pipelines it's called superscalar: multiple pipelines all executing in parallel, and you keep firing off instructions into them if you know they're going to be independent of each other. But as soon as you have things like dependencies or branch mispredictions (we'll talk about those on the next slide), the CPU has to correct itself to make sure it does things in the proper order, because truly out-of-order effects would break the correctness of your application. Flynn's taxonomy will make more sense when we talk about parallel execution and SIMD, but it's a way to classify computer systems: a system is classified by what its instruction stream is and what its data stream is. A single pipeline in our world is single instruction stream, single data stream (SISD): one instruction stream operating on one data stream, and then we can do that in parallel in our different pipelines. So what do we care about for superscalar CPUs? What do we care about in database systems?
Well, at a high level, this is what every application would care about, but it matters a lot in our world, and because we're the ones actually building the database system, we can architect it in a way that tries to avoid some of these problems. The first is that we have to deal with dependencies: if one instruction depends on another instruction, then you can't have them in the same pipeline executing immediately one after the other, because you need the output of the first one before you can run the second one; you need to put that output somewhere and then provide it to the second one, right? Actually, this is the harder one: you can't really avoid these dependencies, but the CPU is not going to get full-speed execution of your instructions if you have them. The one we can program around better is how we handle branches. Refilling the pipeline is expensive: with a 14-stage pipeline we can have 14 instructions in flight, and say I'm executing some instruction whose output, like an if branch, determines what code path I go down. For my remaining pipeline slots, I don't actually know what path I'm going to take, so I can't fill in the instructions that will execute next. So modern CPUs will try to predict what path they're going to go down and fill the pipeline with the instructions you would execute if you went down that particular branch. How they actually figure this out is trade secret, but you can think of it as keeping track of something like: I've seen this if branch before, what's the probability I go down one branch versus the other. At a high level that's essentially what they're doing, but it's more complicated than that. What happens if the branch prediction gets it wrong? Then I have to flush my pipeline, throw away any speculative work I've executed assuming I was going down a particular branch, load in the branch I should have gone down, and start executing that. This is problematic because it's something the hardware is doing; we don't have complete control to say, hey, I'm probably going to go down this branch, so prefetch this one versus the other one. We can't do that; it's all underneath the covers to us. So where is the most obvious place in a database system where this becomes problematic? What's the most common thing we're going to execute in our database system where we would have branches? WHERE clauses, filters, exactly, yes. So think about it: what is a WHERE clause? I'm scanning the table, and I have some predicate like WHERE something greater than something, or something less than something. So I'm taking a value that comes from the tuple itself and comparing it against a constant. For every single tuple I look at, that value from the tuple is going to change, and unless I do some crazy pre-sorting, it's unlikely the branch predictor is going to get this right; it's going to be essentially random. So 50% of the time it's going to choose the wrong path, and we're going to get bad performance. So let's look at this in a little more detail and see how we can actually design a scan operator with a WHERE clause to avoid branches.
This is somewhere we can design our database system's internals to be aware of what the hardware is actually doing and potentially get better performance because of it. So say we wanted to run a simple query like this: SELECT * FROM table WHERE key is greater than some low value AND key is less than or equal to some high value. If you were building a database for the first time, you would implement it like this, the branching implementation. Again, I apologize for showing code in some of these lectures; you just sort of have to. Normally I tell my students don't show code because nobody ever reads it, but this is simple enough that you should be able to follow. For every single tuple in my table, I grab the key I want to compare, then I do my if branch: if the key is greater than or equal to the low value and less than or equal to the high value, then I copy the tuple into my output buffer. Say the output buffer is just a giant array; I increment the offset and write to my buffer every time I add something. So what's going to suck here from a CPU standpoint is this bad boy right here, because we're not going to be able to predict it: it depends on the data, and it's going to be a basically random distribution. To implement this without any branches, you would do this instead. For every single tuple, immediately copy it to my output buffer. I don't check, I just copy it, right? Then when I want to do my comparison of the key to see whether I should include it, I'm just using arithmetic: one comparison gives me a one if the key is greater than or equal to the low value, the other gives me a one if the key is less than or equal to the high value. I combine those two values with a non-short-circuiting AND, and that gives me either zero or one: zero if one or neither of them matched, one if and only if both matched. That one or zero is what I add to my offset in my output array. So if I don't match, it's zero, I come back around, and I just overwrite the last tuple I put in there. Of course, there's an extra piece here: at the end you have to check whether the last thing you added should actually be there or not; if yes, keep it, if no, truncate it. There's an extra step afterwards, but in this loop I assume I'm going to match: I always copy in, and then I check later. Yes? Is the ternary operator there not counted as branching? The question is whether the ternary operator, this question mark here, counts as branching. No, because the compiler can rewrite it as arithmetic, and then the comparison becomes a single instruction. Oh, the compiler just does that for you? Yeah. Okay. Yes? You have to be careful to use a logical AND and not a short-circuiting AND, right? The statement is that you have to be careful to use a logical AND and not a short-circuiting AND. Yeah. Well, wouldn't that implicitly create a branch? In this case, no: because it's not an if clause, it turns into arithmetic instructions, and it will always evaluate both sides, as far as I know. A sketch of both versions is below.
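Here is a minimal sketch of the two scan implementations just described, in C++-style code with illustrative names (assume keys holds the column's values and out is a preallocated buffer of qualifying positions); this is a sketch of the technique, not code from any particular system:

#include <cstddef>
#include <cstdint>
#include <vector>

// Branching version: the if statement's outcome depends on the data, so at
// 50% selectivity the branch predictor is wrong about half the time.
size_t scan_branching(const std::vector<int32_t> &keys,
                      int32_t low, int32_t high, uint32_t *out) {
  size_t n = 0;
  for (uint32_t i = 0; i < keys.size(); i++) {
    if (keys[i] >= low && keys[i] <= high) {
      out[n++] = i;  // copy the qualifying position into the output buffer
    }
  }
  return n;
}

// Branchless version: always write, then advance the output offset by the
// 0/1 result of the two comparisons combined with a non-short-circuiting
// AND. Non-matches get overwritten on the next iteration, so we do extra
// work, but there is no branch to mispredict.
size_t scan_branchless(const std::vector<int32_t> &keys,
                       int32_t low, int32_t high, uint32_t *out) {
  size_t n = 0;
  for (uint32_t i = 0; i < keys.size(); i++) {
    out[n] = i;                                 // unconditional copy
    n += (keys[i] >= low) & (keys[i] <= high);  // 0 or 1, no jump
  }
  return n;
}

Note that in this position-list formulation there is nothing to truncate at the end: the final value of n is already the number of matches; the truncation step mentioned above applies when you copy whole tuples rather than positions.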
So again, the key thing here is that with an if clause, say the predictor does the simplest thing: the last time I hit this if clause I jumped into this branch, so let me pre-execute or prefetch that, because that's what I think is going to happen. But if every odd tuple matches and every even one doesn't, that's the worst-case scenario, because I'm mispredicting all the time. In the branchless case, I'm doing extra work because I'm copying things I don't actually need, but I have no branch mispredictions, so I'm blasting through my pipeline as fast as the clock speed of the CPU. So this obviously depends on what the data looks like. This graph is from a later paper by the VectorWise guys where they're doing a sequential scan with the same branching versus branchless selection operator I just showed on the last slide, and they're varying the selectivity of the WHERE clause along the x-axis: from 0% of tuples matching up to 100% of tuples matching. The red line is the branchless one, and as expected, no matter what the selectivity of your WHERE clause is, you're always doing the same amount of work; that's why it's basically a horizontal line. The arc you see in the blue line is the branching case. When selectivity is below roughly 5%, the branch predictor is correct most of the time, so the benefit of correctly predicting that a tuple isn't going to match outweighs the overhead of the extra work you do in the branchless case. But above that, and obviously at 50% selectivity, which is the peak because the predictor is wrong roughly half the time, branching sucks for the CPU. This is a really good example, and we'll see later that the branchless code has other benefits when we start doing SIMD and vectorized instructions, because SIMD doesn't have if clauses: a branchless approach seems wasteful because you're doing useless work, but it actually turns out to be the better approach. Okay? All right, the other optimization goal is reducing our instruction count. The way to think about this is that in the examples they showed, I think for MySQL in the paper, the WHERE clauses were super expensive because the system is written to be general purpose, able to handle all the different data types within the same code path for each operator. Say I want to do an addition: I'm going to have this giant switch statement that says, if my left operand is this data type and my right operand is that data type, here's the addition instruction I want to run. They talk a little about how VectorWise handles that: they pre-compile those comparison primitives and then figure out on the fly, as the query runs, which pre-compiled comparison function to invoke. The sketch below shows the idea.
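To make that concrete, here is an illustrative contrast (invented code, not from MySQL or VectorWise): a generic interpreter has to branch on the data types for every single value, while a pre-compiled, type-specialized primitive makes that decision once, outside the loop, leaving a tight loop the compiler can optimize:

#include <cstddef>
#include <cstdint>

enum class TypeId { INT32, INT64, FLOAT64 };

struct Value {
  TypeId type;
  union { int32_t i32; int64_t i64; double f64; } v;
};

// Generic path: a type dispatch per value, repeated for every single tuple.
bool gt_generic(const Value &a, const Value &b) {
  switch (a.type) {
    case TypeId::INT32:   return a.v.i32 > b.v.i32;
    case TypeId::INT64:   return a.v.i64 > b.v.i64;
    case TypeId::FLOAT64: return a.v.f64 > b.v.f64;
  }
  return false;
}

// Specialized primitive: the system picks this function once (e.g., via a
// function pointer chosen before the scan starts), so the loop has no type
// branches and can use the branchless selection trick from before.
size_t sel_gt_int32(const int32_t *col, int32_t val, uint32_t *sel, size_t n) {
  size_t k = 0;
  for (size_t i = 0; i < n; i++) {
    sel[k] = static_cast<uint32_t>(i);
    k += (col[i] > val);
  }
  return k;
}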
Pre-compiling specialized primitives like that is one way to do query compilation; we'll see the DSL and LLVM stuff later on, but that's one example of how to avoid this problem. These giant switch statements suck because they're just more if instructions, which means more branch misprediction in the CPU, and there are also the function calls, which jump around in memory and have to set up stack frames. We already saw why this is problematic when we talked about the numeric data type and I showed you the sample code from Postgres. So here's our switch statement, and you see all these different instructions to do something-equals-something or something-greater-than-something. In this case we're doing fixed-width decimal arithmetic, but in a lot of other parts of the system, like the MySQL example they talked about, you're going to see these giant switch statements handling comparisons between different data types as well, because you have to cast things. So the goal for reducing the number of instructions we execute will be to specialize this code to only support the data types we actually know about ahead of time. We're not going to talk about how to do this just yet; it will mostly come up when we talk about compilation. But it's something to keep in the back of our minds: we should be thinking about how to specialize the system, because that's what's best for the CPU as well. Okay, so now we want to talk about the processing model, which I'll define as the high-level approach for how the system executes a query plan. These different processing models have different trade-offs for different types of workloads, so we'll actually talk a little about OLTP versus OLAP here. Everything I've said so far matters in both worlds (reducing instruction count is clearly always a good idea), but we've mostly been in an OLAP context; for this part we'll consider both workloads. The first approach is called the iterator model, also sometimes called the Volcano model or the pipeline model. Volcano was an influential system from the late 1980s and early 1990s; I think Goetz Graefe was in Oregon at the time. Volcano was sort of an optimizer framework; we're also going to talk about Cascades, the successor of Volcano, from the same guy who wrote the B-Tree index locking paper you guys read. Graefe has done a bunch of stuff. The textbook definition, the more formal name, of what we're describing here is the iterator model, but people often refer to it as the Volcano model because that paper first laid out how to do this in parallel. The idea of the iterator model is that every query plan operator has a next function, and a parent node calls next on one of its child nodes to have it send up, or retrieve, the next tuple that it's supposed to process. For every single next invocation, you get back one tuple, right? The parent node basically loops over this next function, calling next, next, next on its child, getting back one tuple at a time, maybe doing some processing before sending it up to its own parent, and it keeps doing this until the child comes back and says, I don't have any more.
Right, and at that point we've processed all the data we need to process on that side of the tree in our query plan. So let's look at an example. Say we're doing a simple join of A and B on A.id = B.id, we have a simple predicate on B.value, and a projection at the top. The way to think about this (and again I apologize for showing code, but we have to) is that each of these operators in our query plan corresponds to a little function, a little snippet of code. We start at the top, and we have this for loop over "for every tuple in child.next()". Every time we call child.next(), it invokes the function below it: for its child, give me the next tuple that you have. So this one loops over child.next(), we call next, and we land down here. This guy loops over its left child because it's doing a join, so it calls left.next(), and that lands us in the access method, the scan operator, where we're doing a sequential scan over A, and we see this emit function. It looks at every single tuple in A, and for every tuple it gets, it calls emit, which sends back a single tuple to the operator above. Every time we call emit, execution pauses here, but we still maintain state, almost like an iterator in Python: if we come back and invoke this thing again, we don't start from the beginning, we pick up where we left off. We make this emit call, and the execution context of the thread running this query plan goes back up to the parent. The parent keeps doing this, getting back single tuples, until the child says, I have no more. Inside the join's for loop, for every single tuple we get, we put it into our hash table, since we're assuming a hash join. Then, when my left child has given me everything it has, I call next on my right child. That goes down, calls next on the filter operator, which calls next on the scan below it, and again single tuples go back all the way up, we do our join, and we emit a tuple going up. We'll talk about pipelines and query compilation more later, but one way to think about this is that the left side is a pipeline, because for every tuple I get, I can immediately build my hash table. But the pipeline ends at the join, because I can't actually produce output for my projection until I've built my entire hash table and start probing it. The top can't proceed until the left side has produced all of its results. That's called pipelining, and the point where you can't keep going, where you have to wait for a child to finish everything, is called a pipeline breaker. This iterator model is used in almost every single database system that you know about, that you've ever heard of. I'm showing a small sampling here: these are the ones I can actually confirm, from looking at the source code or the documentation, are doing the iterator model. A minimal sketch of the interface is below.
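Here is a minimal sketch of what the iterator interface looks like, with a scan and a filter operator (an illustrative interface, not the code of any of the systems just listed):

#include <cstdint>
#include <optional>
#include <vector>

using Tuple = std::vector<int64_t>;  // stand-in for a real tuple format

struct Operator {
  virtual ~Operator() = default;
  // Returns the next tuple, or nullopt when this subtree is exhausted.
  virtual std::optional<Tuple> Next() = 0;
};

// Scan: resumes where it left off on every call, like a Python iterator.
struct Scan : Operator {
  const std::vector<Tuple> &table;
  size_t pos = 0;
  explicit Scan(const std::vector<Tuple> &t) : table(t) {}
  std::optional<Tuple> Next() override {
    if (pos == table.size()) return std::nullopt;
    return table[pos++];
  }
};

// Filter: keeps pulling from its child until a tuple qualifies. Note how
// control bounces between parent and child for every single tuple; that
// per-tuple function-call overhead and poor locality is the complaint
// against this model.
struct Filter : Operator {
  Operator *child;
  bool (*pred)(const Tuple &);
  Filter(Operator *c, bool (*p)(const Tuple &)) : child(c), pred(p) {}
  std::optional<Tuple> Next() override {
    while (auto t = child->Next()) {
      if (pred(*t)) return t;
    }
    return std::nullopt;
  }
};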
Again, any transactional system, any OLTP system, will be doing this. There are one or two exceptions that I'll show on the next slide, but this is just the way everyone implements it, because it's the easiest one to write: you have this nice function abstraction, you call next and get data flowing up. One nice thing about this model is that output control is super easy. To implement something like a LIMIT clause, you just stop calling next once you have all the data you want. You don't need any additional logic to tell your children to stop emitting results, because it's an on-demand request: give me the next tuple. If I have everything I need for my LIMIT clause, I can just stop, all right? The pipeline breakers are these guys here: joins, subqueries, and ORDER BYs. These are the points where you can't continue up the query plan with a tuple until you've gotten all the results from one of your children. So this model is, well, it's not great for OLTP, but it's the most general-purpose one: you can do this for analytics, you can do this for OLTP, it's not particularly good at either. It's a general-purpose thing. The next approach is called the materialization model, and the idea is that instead of getting back a single tuple from your child, you get back all of its tuples at once. There's one function call going down, and the child spits back everything it's ever going to generate, all at once, all right? The reason this is good for OLTP systems is that OLTP queries don't produce a lot of tuples. Rather than paying for a function call per tuple (we'll see it in a second), you push the data going up: you don't call next, next, next, you don't have to set up these stacks of function calls. It's just one invocation per operator to send everything you actually want. But here's where it gets tricky: because it can be a bottom-up approach rather than a top-down one, you do have to pass down hints to avoid scanning more data than you actually need. If up above I know I have a LIMIT clause and only care about one tuple, then down below, the child needs to know that I only need one tuple, because otherwise it dumps all the tuples up and I throw them all away at the top. As I'm describing it, this sounds like a row store where you're getting all the attributes for a single tuple, but it doesn't have to be that way. It could be a column-store approach, where you pass back all of the values of a single column for all the tuples, instead of materializing entire rows. So here's roughly what it looks like. Instead of having next, we now have an output function. At the very top, starting from the top and going down, I call output, and that says, give me all the tuples that my child has. So I go down here, and this operator calls output on its left side: give me all the tuples that you have.
Down here, we create a buffer, we iterate over every single tuple in A, the table we're scanning, we add each one to our buffer, and then we return it. We send that data back up, and it's all the tuples this operator is ever going to produce. So at this point, when we call return, we never go back to this guy, because it has materialized all the data it's ever going to need. Same thing down the other side: we call output on the child, it materializes all of its tuples, and sends them back up, and we go all the way up. I'm showing you this as a top-down approach, but it could also be bottom-up: I could execute the scan first, produce all the tuples it's ever going to generate, put them in some buffer, and then have the parent read from that buffer. In this case I'm going top-down because it looks like the previous model. In my opinion, the bottom-up approach with the materialization model is the right way to do OLTP, because it's fewer function calls and less coordination between the different operators. We don't care about parallel execution of a single query in OLTP, what's called intra-query parallelism, because if I'm going to read one tuple, I can't really parallelize that, right? So this bottom-up materialization model is, I think, the most lightweight approach for OLTP. For OLAP, I think it's a bad idea, and this is actually what MonetDB did; the VectorWise guys said, no, this is wrong, and the right way is their vectorized model. But MonetDB did this, and at least the old version of HyRise, an HTAP system with some roots in the MonetDB project, did this too. These were column stores, so they were materializing an entire column and sending it up. But if I have a billion tuples and my select operator filters out maybe half of them, I now have half a billion tuples; I have to make a buffer of half a billion tuples and shove it from one operator to the next. You get these really large intermediate results, which is bad for cache locality and puts a lot of memory pressure on the system. That's why I think this is a bad idea for analytics, and no system except MonetDB really does this. All right, so the last model is what the VectorWise guys came up with, and it seems obvious now, but at the time this was novel, this was new. It's going to look a lot like the iterator model: we have a next function that the parent calls on the child, but instead of getting back one tuple, you get back a batch of tuples, a vector of tuples. The size of this vector can depend on what the hardware looks like; we'll talk about this later with vectorized execution, but it's some subset of the total data an operator is going to generate, and it's sized so that it fits in the CPU caches. That makes sending the data from one operator to the next a really cheap operation. The sketch below contrasts the two interfaces.
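To make the contrast concrete, here are minimal sketches of the materialization model's one-shot interface next to the vectorized model's batched one (illustrative interfaces; the batch size constant is an assumption, typically tuned so a batch fits in the L1/L2 cache):

#include <cstddef>
#include <cstdint>
#include <vector>

using Tuple = std::vector<int64_t>;

// Materialization model: one call, and the child's ENTIRE result comes back.
struct MatOperator {
  virtual ~MatOperator() = default;
  virtual std::vector<Tuple> Output() = 0;  // full intermediate result
};

// Vectorized model: shaped like the iterator model, but each call fills a
// batch of tuples instead of returning just one.
struct VecOperator {
  static constexpr size_t kBatchSize = 1024;  // assumption: cache-sized
  virtual ~VecOperator() = default;
  // Fills batch with up to kBatchSize tuples; returns false when exhausted.
  virtual bool Next(std::vector<Tuple> &batch) = 0;
};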
When we iterate over our batch of tuples, we can now use SIMD, vectorized instructions, to take a bunch of data, feed it into the CPU, and produce a bunch of answers in parallel, all at once, with one instruction or a small number of cycles. Again, this will make more sense when we cover vectorized execution with SIMD, but if you use this model, you can layer that vectorized operator execution on top of it later. So going back to our example: inside our little code snippets for operators, we generate an output buffer like the materialization model. (I guess that function should be called next, not output, but it doesn't matter.) What we're doing now is iterating over the data from our children into a buffer, and when the buffer reaches a certain size, we emit it as output. Same as before, going top-down, we keep calling next, and these guys keep shoving batches up, right? And think about it: this probe, which was on a single tuple, could now be a vector of tuples probing our hash table in parallel. So we're getting intra-query parallelism across different operator instances running at the same time, and then within each single operator instance we can run instructions in parallel using SIMD, which will be a big, big win. In my opinion, this is the way to build a modern OLAP system today, and a bunch of the newer systems do this. DB2, Oracle, and SQL Server have row stores as their base systems that don't do this, but if you buy their query accelerators, their column-store engines, those will do what I'm describing here; that's why I'm including them. VectorWise, the MonetDB/X100 system you guys read about, now Vector or Avalanche, is where this all came out of, and Snowflake was founded by an ex-VectorWise developer, so a lot of the ideas from VectorWise went into Snowflake. All right, I've already alluded to this, but to say it concretely: in all my examples I was showing you what's called a top-to-bottom, or top-down, approach for query processing, where we have a query plan, we grab the root of the query plan tree first, fire that snippet, and it calls next on its children. Execution starts at the top, goes down, and essentially pulls data up to the operators as needed. But I did mention the bottom-to-top, or bottom-up, approach with the materialization model. That's where we start at the bottom, with our scan operators on tables or indexes, it doesn't matter, and they iterate over their data source and shove data up to their parents, right? At a high level it seems like the same thing, and semantically, yes, it doesn't matter for the query's answer whether it runs top-down or bottom-up. But the bottom-up approach turns out to be better, because you get tighter control of your caches and registers in your pipelines: tighter control of how you move data up the tree. See the sketch below.
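As a sketch of what that bottom-up, push-based style boils down to after code generation (in the spirit of the HyPer approach; the schema and names here are made up): the scan, filter, and projection of one pipeline fuse into a single loop, so the tuple's values can live in registers for the whole pipeline instead of crossing a function-call boundary per operator:

#include <cstdint>
#include <vector>

struct Row { int32_t key; int64_t val; };

void fused_pipeline(const std::vector<Row> &A, int32_t lo, int32_t hi,
                    std::vector<int64_t> &out) {
  for (const Row &r : A) {            // scan
    if (r.key > lo && r.key <= hi) {  // filter: r.key can stay in a register
      out.push_back(r.val);           // projection, materialized into parent
    }
  }
}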
For your pipelines, you can now do things like keeping a tuple and its values not just in cache but in CPU registers, and passing that along from one operator to the next. Because CPU registers are the fastest memory you can have, faster than L1, L2, L3, if I keep things in my registers and have full control over that, I can get even better performance. This idea of pushing from the bottom up and doing codegen: a lot of these ideas came from HyPer. You will read the paper that describes how to do this; this is actually what we did in the old version of Peloton, and what we're doing now in the new system today. So for all of that, I think the vectorized approach plus bottom-up is the right way to go, and you'll see this in the HyPer paper. Yes? Question: why can't you pass data efficiently through registers in the top-down approach? Because if you're calling functions, every function call sets up a new stack frame, and you're not guaranteed that things stay in your registers. In the bottom-up, compiled case, I'm basically unrolling the operators' snippets, like unrolling loops, and executing as many things as I can within one pass of a loop. Follow-up: but couldn't you have some sort of protocol within your functions, a calling convention that says, I'm calling you, put this in a register when you return? The statement is, couldn't you set it up with some kind of stub that says, I'm calling this function, and tell that function to make sure it puts something in a CPU register. I actually don't know what happens when you do a jump, whether you can guarantee that. Let's take this offline; I don't have the answer. I know why you can do it in the bottom-up case: that one makes sense, it's just unrolling. The function-call version, I think, screws things up, but I don't have a good answer. Talk to Prashanth about this, because he's actually looking at it now. Okay, all right. So again, the processing model is, given a query plan, how the different operators get data from their children. Top-down versus bottom-up tells you which way the data flows, and the processing model tells you how much data each operator produces at a time: a single tuple, like the iterator/Volcano model, the entire result, like the materialization model, or some subset in between, like the vectorized model. So now let's talk about how we actually want to execute queries in parallel. We're not talking about how to execute the snippet of a single operator in parallel yet, just how we organize different operator instances running in parallel in our system. And again, Intel is going to give us more cores, not faster clock speeds. So there are a couple of different types of parallelism: inter-query parallelism and intra-query parallelism. We've already sort of talked about inter-query parallelism: it's basically having multiple queries running at the same time. We talked about how the system can schedule this using a task queue and having the different cores take things out of the task queue and run them.
So this is multiple queries, from different transactions, running at the same time, and we have to load balance them. If they're doing updates, we already know how to do concurrency control; if they're updating indexes, we already know how to make that thread-safe. So we already know how to handle this, both when queries are read-only and when they're updating the database; there's not much more to say about it. The one thing I will say is that, to the best of my knowledge, no matter what processing model you use, materialization, iterator, or vectorized, or how you organize your workers, whether it's a single worker per process or multi-threaded, no one concurrency control scheme is easier or harder to implement based on these decisions. It's all sort of the same. The one that's more interesting to us is intra-query parallelism: how do we take a single query and have multiple operator instances running in parallel? We can do this either horizontally or vertically: that's intra-operator parallelism versus inter-operator parallelism. Within a single query, how can we execute multiple operators for that query in parallel? I'll say that horizontal and vertical parallelism aren't either-or. If you do horizontal parallelism, that doesn't mean you can't also do vertical; you can actually do both. There's no reason you can't, right? And every operator you can have in a relational database can be implemented either way. Some obviously benefit more than others; that's the pipeline breaker stuff. If my operator can't proceed until it gets the full output of its children, then vertical parallelism doesn't help that much. But for the most part, everything can be executed with horizontal parallelism. So, the first one: horizontal parallelism. The basic idea is that we take a single operator in our query plan and sort of replicate it into multiple instances, and each of these instances operates on a different subset of our target data. The target data could come from a table or from the output of another operator, right? Another way to think about this is the morsels we talked about last time: each operator instance could be operating on a different morsel of a table. The way we're going to organize the system is through a new sort of synthetic or virtual operator in the query plan called the exchange operator. This comes from the Volcano paper; the notion of an exchange operator for parallel execution is what's defined in the Volcano approach, and that's why people call it the Volcano processing model. The basic idea is that the exchange acts as a barrier that prevents operators above us in the query plan from starting to execute until we get all the results from our children. This is another example of a pipeline breaker, right? It's a way to say, I know I need the output of everybody, so wait until they're all finished. Now, that's not to say the results are all stored in the same location inside the exchange operator; it's just bookkeeping to know that everyone has generated all the data they're going to generate.
Once the exchange gets everything it needs, it fires off the things above it in the query plan. So let's see a really simple query, the same join of A and B as before, but now with a predicate on A and a predicate on B. The first thing we do in this query plan is a scan on A, and say we break A up into three different morsels. We have three different threads (whether they're on the same socket, different sockets, or different cores, it doesn't matter), and these three threads operate on the three different morsels. And now we can incorporate more parts of the query plan into our task, because we want a better pipelining effect. Rather than having the scan on A feed its output into another task or operator that does the filter, we'll say this thread, within a single pipeline, grabs a tuple from A and then runs it through the filter. And we can also include the projection up here: if A is a really wide table with a lot of attributes and we're not a column store, we don't want to pass the entire tuple from one operator to the next, so maybe we push down the projection as well. So now this represents a single task in our pipeline: scan A, filter, then project down to just the data we need to do the join. Then they're going to build our hash table, because we're doing a hash join; there's no separate relational operator for building a hash table, it's just part of the join operator, but all of these instances are going to do it. And now they feed into the exchange operator. Think about it: we're going down the left side of the tree here. We can't start computing the join until we have our entire hash table, because otherwise we'd get false negatives: something that should have matched does a lookup in the hash table and sees nothing, because we haven't finished building it yet. The exchange operator is saying, I can't proceed any further up the query plan, I can't start producing the output of my join, until all of these guys have processed all of their data. Again, I'm not saying these are separate hash tables, or that it's all written into a single hash table that gets passed through the exchange operator; the exchange is just the bookkeeping to know that I have to wait until these guys finish. Then on the right side of the join, again, we break B up into three morsels that do the filter and the same kind of projection, but now they probe into the hash table. These guys won't get fired off until I know the hash table build has completed. That's where the internal bookkeeping comes in: these tasks depend on the exchange, so they can't execute until the exchange operator has said that everything is done. But then we can run the join probes in parallel; think of a bunch of other threads now. Actually, notice there's no exchange operator in the middle of this side, because this is actually a single pipeline here.
From one thread, it does the scan on B, the filter, the projection, then probes the hash table to see whether there's a match, and if there's output, it produces it into some output buffer. Now I have these three threads running in parallel, and their output buffers need to be coalesced by the exchange operator up top. I'm not showing this too well: these build threads basically stop here, then new threads (or the same threads) start running these probe tasks, and they stop at the exchange operator up here. Once the exchange operator has all the results from the different threads, it can produce the output for the rest of the query, or in this case, for the client. So hopefully this is clear. In different code bases this operator won't be called exchange, but on the academic side, this basic idea of coalescing results and blocking other threads from proceeding in the query plan until you have all the results is called an exchange. A sketch of this horizontal hash-join parallelism appears below, after we cover the vertical case. All right, the other type of parallelism is inter-operator parallelism. This is where, within a single query plan, we can still be executing different operator instances in parallel, but we're now also executing different parts of the query plan in parallel on different threads, right? The idea is that one thread is spinning in that for loop from the operator snippet I showed, waiting for output from a child. When the child emits something, it hands it off to a queue for its parent, and then it goes back and processes more data in its own for loop, while another thread processes the thing it was given. It'll make more sense when I show it at a high level. I'd say this approach is not widely used, as far as I know, in traditional relational database systems. Where you do see it is in stream processing systems like Spark, Kafka, and Flink, because in those worlds you have these source and sink concepts: source operators generate data that gets passed to sinks, which process it and pass it on to other operators. Each of those could be running on a different thread or on other machines that you farm out to and scale horizontally. But within the query plan, the parallelism is vertical. So, same query plan as before, but now say this join operator is a really simple nested loop join: for every single tuple on the outer side and every single tuple on the inner side, emit the output of the join. I could be spinning this thing: every time I get a tuple from one side and a tuple from the other, I do my join and produce output that gets sent up to another thread, which is just waiting for output to come out of this guy so it can do the final projection and produce more output, right? Again, it's not going to be spinning all the time; this will be a blocking call. But think of it as one worker thread dedicated to running this snippet and another worker thread dedicated to running that one, right?
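Coming back to the horizontal hash-join example from a moment ago, here is a minimal sketch of the idea (illustrative only: a real system would use morsel-driven scheduling and a partitioned, latch-free hash table rather than one mutex-protected set). Joining the build threads plays the role of the exchange operator: nothing probes until every build worker has finished.

#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

using Key = int64_t;

void parallel_hash_join(const std::vector<Key> &A, const std::vector<Key> &B,
                        int nworkers, std::vector<Key> &out) {
  std::unordered_set<Key> ht;
  std::mutex ht_mu, out_mu;

  // Build phase: each worker scans its own strided "morsel" of A.
  std::vector<std::thread> build;
  for (int w = 0; w < nworkers; w++) {
    build.emplace_back([&, w] {
      for (size_t i = w; i < A.size(); i += nworkers) {
        std::lock_guard<std::mutex> g(ht_mu);  // toy synchronization
        ht.insert(A[i]);
      }
    });
  }
  // The "exchange": wait until the hash table is completely built.
  for (auto &t : build) t.join();

  // Probe phase: workers scan morsels of B and probe the finished table
  // (read-only now, so no latch is needed on ht).
  std::vector<std::thread> probe;
  for (int w = 0; w < nworkers; w++) {
    probe.emplace_back([&, w] {
      for (size_t i = w; i < B.size(); i += nworkers) {
        if (ht.count(B[i])) {
          std::lock_guard<std::mutex> g(out_mu);
          out.push_back(B[i]);  // coalesce the workers' outputs
        }
      }
    });
  }
  for (auto &t : probe) t.join();
}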
And again, I can scale these out horizontally too, with multiple instances of these operators running on different threads, so that I'm doing this join with as many threads as possible. The key point is that every time I emit here, it's a non-blocking call: this thread goes back and gets the next tuple, because some other worker is going to crunch on what it just emitted. And on the consuming side it's a stalling call, so you're not burning the CPU while you wait.

All right, some things to finish up. Coming up with the right number of workers you actually want for a query plan is a hard problem. This is the scheduling stuff we talked about before: we had this notion of different thread pools that can steal data or tasks from other thread pools, and how you organize that is highly dependent on the implementation of the database system. Usually you base it on the number of cores in the CPU, minus some number reserved for background or maintenance tasks. It's something you usually tune; the default is typically, I have this many cores, so I want this many workers.

How you actually allocate the workers is the scheduling stuff we talked about before. The easiest approach is to dedicate one worker to a single core, and this is where you use taskset or sched_setaffinity to make sure nothing else runs on your core. You never context switch off your core; you just spin through and do as much work as fast as possible. The alternative is to have multiple workers per core, and this ties into the thread pool stuff we talked about before: if one thread gets blocked because it has to acquire some resource, whether it's disk, memory, or the network, other threads can keep processing and still make forward progress. There's no one approach that's clearly better than the other. For OLTP, at least in H-Store and VoltDB, they used one worker per core, and that's what HyPer does as well, but a bunch of other systems do the multi-worker approach. I don't have a strong opinion that one is better than the other.

How we're actually going to assign tasks to workers is, again, either push or pull, and this is the stuff we read about last class for scheduling. Do my threads go to a single queue and pull tasks out to execute, or does a centralized coordinator push work to them? Again, no one approach is clearly better, and I don't have a strong opinion here either. I'm happy to say do what HyPer does, which is pull, but I don't know whether that's actually better.
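As a rough, Linux-specific illustration of the one-worker-per-core, pull-based setup: the workers below pull tasks from a single shared queue, and each thread is pinned to its own core with pthread_setaffinity_np, which is the programmatic equivalent of running under taskset. The queue and task types are made up for this sketch; a real scheduler would use per-core queues with work stealing:

    #include <pthread.h>   // pthread_setaffinity_np (Linux-specific)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Task = std::function<void()>;

    std::queue<Task> task_queue;  // single shared queue; workers pull from it
    std::mutex queue_mutex;

    void worker_loop() {
        for (;;) {
            Task t;
            {
                std::lock_guard<std::mutex> g(queue_mutex);
                if (task_queue.empty()) return;  // no more work for this query
                t = std::move(task_queue.front());
                task_queue.pop();
            }
            t();  // run one pipeline task (e.g., one morsel's worth of work)
        }
    }

    int main() {
        // Enqueue the tasks before the workers start pulling.
        for (int i = 0; i < 12; ++i)
            task_queue.push([] { /* one task's pipeline work goes here */ });

        unsigned n = std::thread::hardware_concurrency();  // workers ~= cores
        if (n == 0) n = 1;
        std::vector<std::thread> workers;
        for (unsigned core = 0; core < n; ++core) {
            workers.emplace_back(worker_loop);
            cpu_set_t set;              // pin this worker to its own core,
            CPU_ZERO(&set);             // so it never gets context-switched
            CPU_SET(core, &set);        // away by another worker
            pthread_setaffinity_np(workers.back().native_handle(),
                                   sizeof(set), &set);
        }
        for (auto& w : workers) w.join();
    }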
All right, so to finish up. The main takeaway here, and this will be a recurring theme for the rest of the semester, is that the way we as humans naturally organize our code and our systems is often the worst way to organize them for performance. To overcome this, we have to be aware of what our CPU actually looks like: what the pipelines are, how much cache we have, how the different sockets are laid out. That's not always going to be easy, but they pay us a lot of money to build these database systems, so that's fine. And then there's one thing I do have a strong opinion about, though it's not a controversial opinion, because the science clearly shows it's the case: the vectorized, bottom-up execution approach is the right way to do OLAP execution, and for OLTP queries it's actually not a bad approach either, because OLTP query plans are usually quite short, they're not super big. This is what we did in Peloton, and it's what we're doing now in our new system. Okay, any questions about this?

All right, so next class. The paper for next class, the Froid paper from Microsoft, is I think actually one of the most important papers to come out of the database research community in the last couple of years. I don't want to spoil it for you, but it's actually now in production in SQL Server 2019. It was so good that the turnaround time at Microsoft, from "oh s**t, you guys wrote this paper" to actually shipping it in the full system, was about a year. It was that good. And nobody else does this; I'm sure there are patents all over it, so no one else is going to do it for a while.

The high-level thing we're talking about next class is pushing more complex logic that the application wants to run down into the database system itself. User-defined functions and stored procedures are ways to do things that you can't really express in SQL: you write them in a procedural form and have the database system execute them efficiently. And what you're going to see is that Froid says you don't even need a procedural language for a large percentage of the user-defined functions that people write: you can convert them into relational algebra, inline that into the query plan, throw it at the optimizer, and not even bother with function calls and s**t like that. It's awesome. So I highly recommend reading it thoroughly, but we'll cover it in detail on Monday. All right, any questions about this? All right guys, have a good weekend.