Today's lecture is going to be about query execution. This is the first of two parts: we're going to cover the basics in today's class, and then next class we'll look at parallel query execution. Before we get started, a quick rundown of the stuff you have pending. Project number two is due on the Sunday following the midterm, and the midterm exam is next week, Wednesday, October 13th, in class during the regular class time. We went through all of this in the last class, and it's also covered in the exam guide. There was a question about whether there would be a practice exam: we will put out a practice exam from one of the previous years by tonight, so you'll have a chance to look at it while preparing for next week's midterm. So remember that you have to budget your time between studying for the midterm and the project, which is due shortly thereafter. Are there any questions about administrative stuff before we jump into the material?

Okay, a quick recap, since we've talked about this in the past. A query plan is a collection of the operators we have in our DBMS, arranged in a tree. Technically it's a directed acyclic graph, but for our purposes everything we're going to talk about is a tree structure. The way the DBMS executes one of these SQL queries is by first translating the SQL into some kind of logical plan; the logical plan corresponding to the SQL query shown on the slide is right below it. The logical plan is then translated again into a physical plan, where we take each of the operators and decide what the physical implementation of that operator is going to be. For example, for the join operator here that joins R and S together, we need to decide how we're going to do that: are we going to use a nested loop join, a hash join, a sort-merge join, that kind of stuff. So we start with this high-level logical plan and translate it into the actual physical implementations of the operators that will execute the plan. The way to think about this is that the data in our database flows from the leaves of the tree: at the bottom we see a table scan on R and one on S, and the data flows from the leaves all the way up towards the root. Once you get to the root, the root materializes the final query output we're expecting; in this case, we get back the results from the SELECT clause, the R.id and the S.cdate that we requested in the query. So that's what we're doing at a high level. We've been talking about the different implementations of operators — how to implement a join, how to implement sorting, how to implement an aggregation, all that stuff — but we've been discussing those in isolation so far. In this class, hopefully, you're going to see how everything fits together to go from the individual operator implementations all the way up to the final query result that we get by executing this SQL query.
So just a high-level overview of what we're going to cover in today's class. We're going to start out by talking about the different processing models we have available to us: how we actually implement the passing of results between query operators as we stitch them together. Then we're going to revisit access methods. We talked a little bit about some of the different options — table scans, index scans, that kind of stuff — in previous lectures, but today we're going to revisit those and expand on them. Then modification queries: we typically think about read queries, how we retrieve data from the database, but it's equally important to figure out how we're going to execute inserts, updates, and deletes, and those differ from read queries a little bit. And finally, we're going to talk about expression evaluation. You can have arbitrarily complex expressions or logic in your WHERE clause, your projection list, wherever: you could be adding things together, modifying values from a column, doing multiplications, all these different things, plus the comparisons that need to happen — and we need a way to evaluate these arbitrarily complex expressions with arbitrary conditions in our evaluation logic.

So let's start off with the processing model. A DBMS's processing model tells us how the system is going to execute a particular query plan. Maybe that sounds a little broad, but the high-level idea is that there are different ways to implement the passing of results between query operators: we have each operator implemented in isolation, and now we need to move intermediate results between them. We're going to look at three different approaches here, and they each have different trade-offs depending primarily on the workload, but also on the overall system design. So we're going to talk about the cases where one approach might be preferable over another; each has different advantages and disadvantages, and we want to be mindful in designing the system about which approach we choose.

Probably the most common and most famous approach is called the iterator model. Sometimes it's also called Volcano-style query processing, after an influential academic system that popularized this model. The approach existed prior to the Volcano system, but Volcano popularized it and formalized the aspects of the approach that became commonplace in system design, so you may see it referred to as Volcano-style processing. You may also see it referred to as pipelined query processing, and we'll see why in a second. The basic idea is that every query plan operator we implement — whether it's a scan or a join or a sort or whatever — also needs to implement a next function. On each invocation of next, the operator returns either a single tuple or, if it's run out of tuples, some indication that there are no more tuples to process. Say it's a table scan: you start at the beginning, you go along emitting tuples, and then you get to the end.
At that point, you want to return some kind of null marker to let whoever is calling your iterator know that you're out of tuples. So each operator basically executes a loop that keeps calling next on its children in the query plan. Remember we have that tree, and each node can potentially have multiple children. You keep calling next on your children until they return that null marker signifying that they're done, and then your operator can return null to any parent node that's calling next on you.

So let's see a visual example of how this works. We have the same query here and the visual query plan. If you imagine what's going on inside each of these operators, at a very high level we have a loop that essentially implements the next function for that operator. If we look at the top, which is performing the projection, it's essentially saying: for every tuple t produced by my child node's iterator — in this case, the child is the join operator — emit the projection function applied to t. Here the projection function just returns the R.id and S.cdate values. You can see how these are all chained together: the projection calls the join. Now, recall the hash join has two parts, the build phase and the probe phase. The first for loop calls the iterator of the child on the outer side of the join, the one we're building the hash table for: for every tuple produced by the left side of the join, we're building up this hash table. Once that's done and we've materialized our full hash table, we start calling the iterator for the right side of the join, and that starts probing the hash table. And of course, if a probe succeeds, we emit the joined tuple. Continuing further down, the selection operator on the right side is just filtering out all of the S tuples that don't match the predicate S.value > 100. And the leaf nodes in the query plan are just basic table scans: for every single tuple in R, emit that tuple; for every single tuple in S, emit that tuple. Those then flow up through the tree to the parent operators. Again, each of these loops can be thought of as the implementation of the next function for that operator.

So let's run through this really quickly to see how a single tuple flows through the process. We start at the root node, the projection. The projection says: for every tuple t produced by my child node, we need to call next. So we come down here and call next on the join operator. The join operator says: okay, for every tuple on the left side, call that next function. As soon as the table scan of R produces a single tuple, it flows back up to the join operator. This keeps going until all of the tuples in R have been produced; then the scan passes back the null marker, which lets that first for loop exit. The first for loop exits, and our hash table is built.
And then we do the same thing on the right side: we call the next function of the selection, and the selection calls the next function of the base table scan, and again a single tuple gets emitted up to each of the parent operators here. Of course, in the case of the selection, we're only returning tuples that match. So for example, say S produces several tuples that don't satisfy the predicate — they're not greater than 100 — then the operator at step 4 isn't going to return anything; it only emits a tuple when there's a match. And then every time we probe the hash table and there's a match, we emit a tuple back to the parent operator, the root node, and eventually the query plan is done. Are there any questions about how specifically this functions?

So, to summarize the iterator approach: it's used in pretty much every mainstream DBMS. It allows for what's called tuple pipelining. If you look at how these arrows flow on the right side, every single tuple we emit is pulled up by the parent operator all the way through the query plan: as long as the tuple t that we emit at step 5 needs to end up in our result set, it gets pulled through every next call all the way up the entire plan. What that allows us to do is maximize the locality of the data we're working on. Remember, going and getting a page, getting a tuple from the page, extracting it, decoding it — all that stuff is really expensive. So we want to maximize the amount of work we can perform on a tuple once we've gone and gotten it from disk: we get the tuple and propagate it, pipeline it, as far as possible through all of the operators in our query plan. That's why you'll sometimes see this iterator model referred to as a pipelined query execution model.

You also might have noticed that some operators need to block until their children emit all of their tuples. In this particular example, if I step back a few slides, we need to wait to execute the right side of the query, the probe phase of the join, until the left side, the build phase, has emitted all of its tuples: we need to go over all the tuples in R and build up the hash table, and only then can we start executing the probe phase. So there are certain operators for which we need to wait until the children have returned all of their tuples.

Finally, the other thing that works really nicely in this approach is output control. Things like a LIMIT clause are really easy: you can essentially stick it on top and only call the next function of the root node as many times as you need. Say I have LIMIT 10: I only have to call next 10 times, and by that point we know we're done and don't need to return any more tuples. This allows us to do early stopping: if we don't need more results, we don't have to complete the entire probe phase of the join, because we know we only want 10 results from this query.

So here are a few examples of the many, many systems that use this iterator-based approach — there are certainly many more. As I said, it's a very common paradigm because it lets you break the execution of a query down into modular pieces: you only have to worry about implementing each operator individually, and then you can stitch them together arbitrarily. For example, say you wanted to add a new operator to the available operations in your system — you only had a sort-merge join and you want a hash join. All you need to do is implement a hash join operator that conforms to this next iterator API, and once you have that, you can plug it into your query execution engine and use it just the same as any other operator. That's the nice abstraction of the iterator-based approach: you can think about each operator individually as a modular unit and then stitch them all together into an arbitrarily complex query.
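To make that next contract concrete, here's a minimal sketch in Python. This isn't from any particular system — the class and method names are illustrative — but it shows the shape of the API: every operator implements next, returning one tuple per call or None when exhausted.

```python
# Minimal sketch of the iterator (Volcano) model. Illustrative names,
# not from any real system. Each operator returns one tuple per call
# to next(), or None once it is exhausted.

class TableScan:
    def __init__(self, table):
        self.it = iter(table)                # cursor over the table

    def next(self):
        return next(self.it, None)           # one tuple, or None when done

class Selection:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):            # only emit matching tuples
                return t
        return None                          # propagate exhaustion upward

class Projection:
    def __init__(self, child, project):
        self.child, self.project = child, project

    def next(self):
        t = self.child.next()
        return None if t is None else self.project(t)

# Stitch operators together, then pull tuples from the root:
plan = Projection(
    Selection(TableScan([(1, 50), (2, 200)]), lambda t: t[1] > 100),
    lambda t: (t[0],))
while (t := plan.next()) is not None:
    print(t)                                 # prints (2,)
```

A hash join would follow the same API: the first call to its next would drain the build-side child completely to construct the hash table, and every call after that would pull from the probe-side child until it finds a match to return.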
So the iterator approach was the first processing model; the second one is called the materialization model. What the materialization model does is have each operator process all of its input at once and then emit its entire output at once. You may alternatively see this referred to as the operator-at-a-time model, for sort of obvious reasons: the operator does all of its processing, materializes one single result — the entire result set that operator is going to produce — and then that operator is done and we never need to look at it again.

The tricky thing here is that there's no good way in this model to push down, for example, a LIMIT. In the iterator version we could just call next 10 times, get 10 tuples, and be done. In this case, each operator needs to materialize its entire output — the table scan would have to materialize its whole output before we move on to the join — so there's no easy way to push the limit down. If you want to implement some kind of early stopping like a LIMIT, you need to pass hints down through the query plan tree to make sure the leaf nodes know they can stop early. And again, there's a difference here between NSM and DSM storage: you can either materialize full rows, or, if your operator is working on an individual column, you can materialize just a single column. We'll see some examples of that later.

Visually, we still have the same operator breakdown, but rather than next function calls that return a single tuple at a time, each operator calls an output function on its child — please give me all of your output — and then iterates over all the tuples in that materialized output. So the root node calls the join, and the join calls its left child. What happens in the table scan now, instead of returning one tuple at a time and pipelining it through all the operators in the query tree, is that it writes all of its tuples into an output buffer called out, and once it's done, it returns out to the parent. The join gets all the tuples back at once and builds the hash table, and then the same thing happens on the right side: that table scan produces all of its output, returns it to the selection, and so forth all the way up the query tree. So this is the extreme opposite of tuple-at-a-time processing: operator-at-a-time processing. We perform the entire operation on every tuple for one operator, then we're done with it, and then we move on to the next operator.
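Here's the same toy setup as the iterator sketch above, reworked into a minimal operator-at-a-time sketch (again, the names are illustrative): each operator's output call consumes its children completely and hands back its whole result in one shot.

```python
# Minimal sketch of the materialization (operator-at-a-time) model.
# Each operator consumes its children entirely and returns its whole
# output at once. Illustrative names, not from any real system.

class TableScan:
    def __init__(self, table):
        self.table = table

    def output(self):
        return list(self.table)              # materialize every tuple

class Selection:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def output(self):
        return [t for t in self.child.output() if self.predicate(t)]

class HashJoin:
    def __init__(self, left, right, lkey, rkey):
        self.left, self.right = left, right
        self.lkey, self.rkey = lkey, rkey

    def output(self):
        # Build phase: hash the entire left input in one go.
        ht = {}
        for t in self.left.output():
            ht.setdefault(self.lkey(t), []).append(t)
        # Probe phase: match the entire right input and return every
        # join result as one materialized intermediate.
        return [l + r
                for r in self.right.output()
                for l in ht.get(self.rkey(r), [])]
```

Notice there's no way to stop early here: by the time the root sees anything, every operator below it has already produced its complete result.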
So the question is: when would you want to do this? I already said why tuple-at-a-time iterator processing is good. The answer is that in OLTP workloads, because we're only accessing a small number of tuples at a time, this materialization approach can be beneficial. If we know we're only going to touch a small number of tuples, we want to get all the way down to the scan, grab everything we could possibly need from the table while we're accessing it, and push it up to the parent operator in one go. You're optimizing the accesses and the operations you perform on each tuple: it's much more streamlined than recursively calling all of these next functions and traversing the plan for every single tuple. That's the advantage in a high-performance system like VoltDB — Andy actually worked on its academic predecessor, H-Store, when he was in graduate school, and it was then commercialized as VoltDB — this is one way to strip out a lot of the overhead of all those next function calls. VoltDB is an in-memory system, so the overhead of all these function calls adds up. We've talked a little about the differences between in-memory and disk-based processing; for this class we're focusing exclusively on disk-based processing, so these function call overheads don't really matter for us — they're masked by disk I/O — but when you're in memory, those overheads start to add up.

On the other side, this model is less good for OLAP queries, because they typically have really large intermediate results: you have to compute an entire join and materialize it, and that can grow quite large. There are some systems that do it anyway — for example, MonetDB has an operator-at-a-time processing engine. It's a column store, a DSM store, so it operates on an entire column at once, produces that output, and then that column becomes the input to another operator.

Yes? So the question is: is the materialization model easier to work with if you want to aggressively pre-fetch pages? I think it probably makes pre-fetching easier when everything is a sequential scan, but the drawback is that it depends on how much room you have in your buffer pool to handle things. If I have to materialize essentially the entire output of a table scan, and my table is bigger than memory, then I'm going to end up spilling and rematerializing the output. You'd probably write those intermediates out sequentially, so you'd still get some kind of streaming access, but imagine a join: if I have to materialize the entire join output, that's the whole build-side hash table, plus on the probe side every match I have to emit — that can get pretty large. So I think you lose whatever benefits you'd get from pre-fetching by having to do much more disk I/O once you're larger than memory. If everything fits in memory, the question changes — that's why systems like MonetDB can have this operator-at-a-time processing model — but if you're larger than memory, I can't think of a good way to do this that doesn't sabotage whatever advantages you'd get there.
Are there any other questions about materialization? Okay, so the last one we're going to talk about is called the vectorization model. It's similar to the iterator model in that each operator implements a next function, but instead of returning a single tuple per call, the operator emits a batch of tuples. So rather than just one, we'll have multiple tuples — maybe we want to return a hundred tuples at a time, a thousand, whatever — some number greater than one, batched together and returned to the parent that called our next function. The internal loop for each operator implementing next batches up results, does whatever operation it needs to do on them, and then returns the batch as the output of next. The exact batch size will vary based on your hardware, your query properties, things like that. Again, there are different considerations: if you're in memory, for example, you might want your batch sizes to align with cache lines or with memory pages. But the basic idea is that we batch several tuples together rather than emitting just one.

Can anyone think of what some of the advantages of doing that might be? Yes, exactly: you avoid the overhead of function calls. You get to amortize the cost of a next function call over several tuples rather than paying it for every single one. Now, if you're in a situation where you're extremely disk-bound or I/O-bound, the function call overhead probably doesn't matter very much; but if you're partially or fully in memory, this amortization of the next call over batches of tuples can be a potentially big win. So this is another way to get around the function call overhead we saw with the basic tuple-at-a-time iterator approach.

Visually, this looks like a hybrid between the tuple-at-a-time iterator approach and the materialization approach. Again we have these output buffers, but rather than holding the entire materialized output, they're some fixed size — say 100 or 1000 tuples. We keep calling the child's next the same way we did in the iterator version, but now each call to next gets back a batch of tuples rather than a single tuple. So we're filling up each of these output buffers all the way down to the table scan.
What's happening at step 3 is that the table scan operator on R fills up its output buffer until it's full; when it's full, we emit the whole buffer and return that batch to the parent join operator. So we do everything in these batches, and the same thing happens on the right side. Once we're done building the hash table during the build phase of our hash join — that first for loop in operator 2 — we move on to the second for loop. For each probe, we don't emit a tuple the moment we find a match; we stick it in the output buffer until the buffer fills up, and then we return it. Same for the selection at step 4: when a tuple passes the predicate, we don't emit it right away; we put it in the output buffer until it's full. And your table size might not be evenly divisible by your batch size, so if you end up with a partially full buffer at the end, you have to flush it through the pipeline. Yes?

So the question is: can you run into problems where these partially utilized batches sit around in different operators? The answer is no, based on the way these are implemented. Think about what's happening here: for simplicity, say each output buffer is the same size as a page in our buffer pool, and we allocate one page per operator. So the selection operator at step 4 allocates one page, and the table scan operator at step 5 allocates one page, which it can fill just by reading a page in from disk. Now that buffer is full, so it gets passed, as a single page, to the parent operator, the selection. Say the selection filters out 50%: you take whatever 50% matched and copy those into the selection's buffer, so the selection now has a half-full buffer — that's the one page it was allotted — and you can throw away the tuples that didn't match. Then you get a new page in from the table scan, another 50% match, and you copy those over to fill up your buffer. So then your buffer is full — yes?

So the question is: what if the first page gets you 50% full, and then from the second page everything matches? Let's work through what happens. You copy the first half of that second page into your buffer, which fills it, and now you don't do any more work: you pass your buffer up to your parent, your parent does its work all the way up to the root, the root returns its result, and now you have your buffer page back. You wipe it out, and now you can continue copying over the second half of that page that matched 100%.
So you never wind up with these half-full pages hanging around: you're always packing each batch, or vector, as full as possible, and only then passing it along. You won't have collections of half-full vectors or output buffers sitting around — each operator has exactly one output buffer that it fills all the way up. In this example, if the selection is halfway through consuming the vector supplied by the table scan when its own buffer gets full, it stops, passes its output buffer to its parent, and once that completes all the way up the tree, we come back down to the selection and it resumes with the remaining 50% of its input. Does that make sense?

So the question is: if you request a batch and get a single tuple back, will you just propagate that rather than requesting multiple batches? The way it works is that the next function call blocks until the batch fills up. Think about what's happening at the table scan in operator 5: it's producing a bunch of tuples, and there's no selection there, so it always produces a full batch, and that batch is sent to its parent operator. When you get to the selection, the selection has some output buffer — say it starts off empty. As it fills up, the selection keeps calling next on the table scan until its own batch is full, and only then does it return from its next call, because each operator's next blocks until the child can return a full buffer — or until you get the null marker that says you're done, in which case the very last call can hand back a partially full buffer. You don't return multiple batches from one next call, only one, but it has to be filled by the child before it can be returned to the parent. So you always get back a full batch, which is why you don't end up with these partial pages hanging around. Does that make sense? Yes. Great.

Yes — so the question is: do you flush out the partially full buffer at the end? Yes: when you get to the end, after the for loop — it's not shown here for simplicity — if you weren't able to fill the buffer all the way, you need to flush it out to get whatever tuples remain up to the parent. Are there any other questions?

Okay, so when is the vectorization model good? We talked about the other two. The vectorization model is ideal for OLAP queries in particular — those analytics queries doing a lot of ad hoc joins and aggregations over a lot of data — because it reduces the number of next invocations per tuple: we amortize the cost of the next function calls over whole batches of tuples. If you're doing transactional workloads, as I said, you want to start working on individual tuples as soon as possible, so waiting around to fill up these buffers adds some overhead that might not be ideal for those cases.
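Here's the same toy setup once more, reworked as a minimal vectorized sketch (illustrative names and batch size): next now returns a list of tuples, an empty list means exhausted, and each operator keeps exactly one output buffer that it fills before returning.

```python
# Minimal sketch of the vectorization model. next() returns a batch
# (a list) of tuples, or an empty list once the operator is exhausted.
# Illustrative names; the batch size is arbitrary.

BATCH_SIZE = 1024

class TableScan:
    def __init__(self, table):
        self.it = iter(table)

    def next(self):
        batch = []
        for t in self.it:
            batch.append(t)
            if len(batch) == BATCH_SIZE:   # emit only full batches...
                break
        return batch                       # ...except the final one

class Selection:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
        self.buf = []                      # the one output buffer

    def next(self):
        # Block until our buffer is full or the child is exhausted.
        # Leftovers beyond BATCH_SIZE stay in self.buf for the next
        # call -- this is the "resume the remaining 50%" behavior.
        while len(self.buf) < BATCH_SIZE:
            batch = self.child.next()
            if not batch:                  # child exhausted: flush remainder
                break
            self.buf.extend(t for t in batch if self.predicate(t))
        out, self.buf = self.buf[:BATCH_SIZE], self.buf[BATCH_SIZE:]
        return out
```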
The other nice thing about the vectorization approach is that it allows operators to more easily use vectorized, or SIMD, instructions. SIMD stands for Single Instruction, Multiple Data: CPU instructions that let you operate on several data items at once. Imagine you have a bunch of 64-bit integers. If you have a 128-bit or 256-bit SIMD register, you can pack multiple of those 64-bit integers into the register; say you want to add one — you issue a single instruction, and it adds one to all of those values at the same time. So depending on the SIMD register width and the size of the data items you're working on, you can do two, four, eight, however many of these operations on data items in parallel. Particularly when you have a column store — a DSM storage model — with, say, all the integers of a column stored together, you can operate on them much more effectively using things like SIMD.

Here are a bunch of systems that use the vectorization model in varying ways, primarily on their OLAP or analytics side; if they're multi-purpose, this is used on the OLAP side of the engine rather than the transactional side. Are there any questions about the vectorization model, or about the other two, iterator or materialization, before we move on to the next piece?

So the question is: right here, will the child block until the parent calls next again? Yes. For example, the join operator calls next on its child; the selection fills up a full buffer and returns it; the parent now has a full buffer to work on, and the child does not start filling up another buffer until the parent — operator 2 there — calls next again. Are there any other questions?

Okay, there's one really quick point I want to make. All of the examples I showed for the three query processing models had this top-down query processing paradigm: you start at the top of the query plan, with the root node, and that operator calls the next function — the iterator function — on each of its children to produce results. You start at the top and recursively call next all the way down the tree until you get to the leaves, and then the data flows back up to the root. This is usually referred to as a pull-based model, because each parent pulls the data up from its children: data is always being pulled up with the function calls. The alternative approach — a lot less common, because it's a little more difficult to reason about, but there are some advantages, and I'll mention a few here — is a bottom-to-top approach. You start at the bottom nodes, and rather than the parents pulling single tuples or vectors of tuples from the children, the children push the results they compute up to their parents. Again, this is a little counterintuitive to think about, but there are advantages: for example, you get tighter control over caching and register usage during query processing, and you can get better pipelining of the data, because rather than all those interactions through function calls, you just push the data up through the query plan through each of the operators.
I just wanted to point this push-based model out in case you ever see it anywhere, but for the most part, pretty much everyone implements the top-down, pull-based iterator approach.

So the next thing we need to talk about is access methods. We've seen access methods a little bit before, particularly when talking about the trade-offs between table scans and index scans, but this is going to be really important for overall query execution. An access method is the way the DBMS accesses the data stored in a table. It isn't defined in the logical plan or the relational algebra; it's something we need to consider when implementing the physical operators that are actually responsible for executing the different pieces of the query. There are three basic approaches we're going to talk about. The first is the sequential scan of a table, which is pretty straightforward. The second is revisiting how to use indexes — we talked a little bit before about how to leverage indexes for query processing, so we'll look at that again and expand on it. And the final piece is: if you have multiple indexes hanging around, how can you leverage more than one index to answer a query?

So we start with the sequential scan of the two base tables in our query: this is how we access the data stored in those tables. The sequential scan is pretty straightforward: for each page in the table, we go through the buffer pool to get it, and then we iterate over each tuple and check whether we should include it. You can push the selection down into this table scan: in the previous examples, every time we called the selection, it then had to call the next function of the table scan, which always just returns a tuple; and in the vectorized or materialized versions, it doesn't make sense to repeat the work of fetching the tuple, so we might as well push the selection predicate down into the table scan itself. So here, for each page in the table, for each tuple in each page, we evaluate the predicate, and if it matches, we do something with it: if it's tuple-at-a-time processing, we emit it right away; if it's vectorized processing, we fill up an output buffer and then return that. The DBMS implements this as an internal cursor that tracks the last page and slot it examined — it maintains its own internal iterator to keep track of where it is in the scan. A minimal sketch of this is below.
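This sketch shows a sequential scan with the predicate pushed down, under assumed interfaces: table.page_ids, buffer_pool.get_page, page.num_tuples, and page.get_tuple are stand-ins for whatever page and slot API the system actually exposes.

```python
# Minimal sketch of a sequential scan with the selection predicate
# pushed down into the scan. The page/buffer-pool interfaces here are
# assumed, not from any particular system.

def seq_scan(table, buffer_pool, predicate):
    # The scan's internal cursor is effectively (page_id, slot): it
    # always knows the last position it examined.
    for page_id in table.page_ids():
        page = buffer_pool.get_page(page_id)     # fetch via the buffer pool
        for slot in range(page.num_tuples()):
            t = page.get_tuple(slot)
            if predicate(t):                     # evaluate in place...
                yield t                          # ...emit only matches
```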
There are all sorts of optimizations we can apply to speed up sequential scans of tables, because depending on the query, a sequential scan is usually not a very efficient way to execute it. For example, if I want just a single tuple and I have an index, it's usually much faster to look it up in the index than to sequentially scan the whole table, find the one tuple I'm looking for, and throw away all the rest — that's a lot of wasted work. But in cases where there's no selection condition, where you need to aggregate over the whole table, or where you have no useful index, you have no choice but to scan, and there are still a bunch of ways to speed that up. A few of these we've talked about already. Pre-fetching: you can pre-request pages from disk that you know you're going to need in the future — if I'm on page i right now, I can have page i+1 ready and waiting in the buffer pool, so I don't have to block on I/O for it. Buffer pool bypass we also talked a little about: rather than going through the usual buffer pool mechanisms, you use a separate buffer area that you fill with intermediate results for your query. We also talked a little about parallelization, but we'll cover that more next class, when I talk about how to parallelize all aspects of a query. The three things I want to focus on here are heap clustering, zone maps, and late materialization.

So let's start with zone maps. You might sometimes see them called small materialized aggregates (SMAs), but as far as I know, everyone colloquially calls them zone maps — I don't know if the name is trademarked by Oracle or something, which might be why it doesn't show up in formal documentation, but everyone calls them zone maps. A zone map is a set of pre-computed aggregates for the attribute values in a page: you look at all the attribute values in a page and pre-compute some aggregates about them, so you can decide whether or not you even want to access the page before you go and get it. So alongside the original data stored in the page, we store this zone map, which keeps statistics like the minimum value, the maximum value, average, sum, count, that kind of stuff. Then if we execute a query like this one, filtering for all the tuples where val is greater than 600, we can look at just the zone map — without accessing the data in the page at all — and see right away that there are no values in this page greater than 600, because the max value in this page is 400. So we don't even have to look at the page.

Typically these are stored in some kind of separate zone-map page; otherwise you've already paid the disk I/O penalty. If the zone map is stored, say, in the header portion of the page — we talked about how disk pages can have a header at the front with metadata about the page — then you've already paid the price of the disk I/O to fetch the page just to look at its zone map. It probably saves you less in that layout; it may still save you from having to iterate over the tuples, decode them if they're variable-length, look in the slot array, all that. But if you instead store zone maps in a dedicated zone-map page or zone-map table — some separate structure that keeps all the zone maps together — then you can skip the I/O fetches entirely for pages in the table scan that you know you don't need to look at.
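Here's a small sketch of that pruning logic for a predicate like val > 600, assuming the zone maps live in a separate structure keyed by page ID and keep just a (min, max) pair per page.

```python
# Minimal sketch of zone-map pruning. zone_maps maps page_id to the
# precomputed (min, max) of the val attribute on that page; the
# layout and names are illustrative.

def scan_with_zone_maps(table, buffer_pool, zone_maps, lo):
    for page_id in table.page_ids():
        zmin, zmax = zone_maps[page_id]
        if zmax <= lo:                   # no tuple on this page can satisfy
            continue                     # val > lo: skip the disk I/O entirely
        page = buffer_pool.get_page(page_id)
        for slot in range(page.num_tuples()):
            t = page.get_tuple(slot)
            if t.val > lo:               # still filter within the page
                yield t
```

The flip side, which comes up in the question next, is that every write to a page has to keep that (min, max) entry in sync.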
As I said, there are several systems that use this. I think only Oracle officially calls them zone maps, but all of these systems are doing the same thing: storing materialized aggregates somewhere so they can prune out, or exclude, pages they don't even need to read. Are there any questions about how zone maps function? Can anyone tell me one big problem with zone maps? Yes, correct. The statement was: every time you update the page — you update a value in a tuple, you add a new tuple to the page, you delete a tuple — you have to go and update the zone map too. That's easier if the zone map is stored in the page, because you already have the page there to work on; but if it's stored in a separate zone-map page, you have to go and update it somewhere else. So zone maps let you prune out pages early, which is a big win, but they also add complexity during writes, since you have to keep the zone map in sync with the original data. There's that trade-off again: fast writes — inserts, updates, deletes — versus fast reads.

Another optimization we can apply is late materialization. DSM DBMSs can delay stitching tuples together until the upper parts of the query plan. Remember, DSM is the column-store model: all the values of an individual column are stored together, so when we access a page, we're getting the values of one particular column for a whole bunch of different tuples. In an NSM, or row-based, storage model, we'd get the A, B, and C values all together; but in this query plan, all we need for the final output is C. So in the DSM model, for the selection, we grab just the column it needs — column A — and push it up to the selection operator; and then instead of the actual values, we pass up the offsets of the rows that matched. Say rows 0 and 2 match: we pass those offsets up to the parent rather than the values themselves. Then for the join on the B column, we know we only need to look at offsets 0 and 2, and we pass whatever offsets survive the join up to the aggregation, the average. Finally, we grab just the C column, at whatever offsets matched in the earlier operators, to produce the final result. Does this make sense? Any questions about this?
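Before moving on, here's that offsets-instead-of-values idea in miniature, with the columns as plain Python lists — purely illustrative; a real DSM system would be working over compressed column pages.

```python
# Minimal sketch of late materialization in a column store: operators
# pass row *offsets* upward instead of stitched-together tuples, and
# only the final operator touches the column it actually needs.

A = [10, 200, 300, 40]       # column A of the table
C = [5.0, 7.0, 9.0, 2.0]     # column C of the same table

# Selection on A: emit matching offsets, not tuples.
offsets = [i for i, a in enumerate(A) if a > 100]     # -> [1, 2]

# (A join on column B would likewise consume and emit offsets.)

# Final operator: only now fetch column C, and only at the surviving
# offsets -- A and B never get stitched into full rows.
avg_c = sum(C[i] for i in offsets) / len(offsets)     # -> 8.0
```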
Okay, index scans. We've talked a little bit about this before, so I don't want to spend too much time on it here, but the basic idea is that the DBMS picks an index to find the tuples the query needs. Rather than scanning the whole table, if you have an index sitting around, you can use it to get directly to the tuples you want, based on the selection conditions specified in the query. Which index to use is a really hard problem. It depends on all sorts of things: what attributes the index contains and what attributes the query references — those are sort of obvious; if you don't have an index on an attribute the query needs, it doesn't help you — plus the attribute values themselves, how the predicate is composed, and whether the index is unique or non-unique. So it's a really complex decision, and we'll talk about how to actually choose among all the potential options in lecture 13, when we cover query optimization. Here we're just going to focus on how you would implement an index scan.

In particular, take this example query. Imagine a single table with 100 tuples, and we have two indexes: index #1 on age and index #2 on department. We want to execute this query — SELECT * FROM students with all those predicates — and we have these two indexes to choose from. Which one do we want to use? Yes: whichever one is most selective. That is correct. What that means is that it depends on the actual data: if age < 30 is more selective, we make one decision; if department = 'CS' is more selective, we make a different one. In scenario #1, if there are 99 people under the age of 30 but only two of them in the CS department, we'd want to use the index on department: we use index #2 to get back only those two people in CS, and then filter out anyone over 30. Now suppose the other scenario: there are 99 people in the CS department but only two people under the age of 30. Then obviously we want to use index #1 on age to get just the people under 30, and then perform the remaining check to filter out anyone not in the CS department.

So that forces us to choose between the two indexes we have. Another option is what's called a multi-index scan — you may see this referred to as a bitmap scan; I think that's what it's called in Postgres. The idea is that if there are multiple indexes the DBMS can use for a query, we can use all of them. The way we do that is to compute a set of record IDs using each matching index, and then combine the sets based on the predicates: if it's an AND, a conjunction, we take the intersection of the record ID sets; if it's an OR, a disjunction, we take the union. At the end, we retrieve the surviving records and apply any remaining predicates or operations from the query plan. So when you use multiple indexes, you get the record IDs you need from each of them independently, and then combine the results together.

As an example, using the previous query: we have an index on age and an index on department, and we want to use both. We retrieve the record IDs satisfying age < 30 from the age index; we independently retrieve the record IDs for department = 'CS' from the department index; and then we take the intersection of the two, because we have an AND here. The last step: we don't have an index on country, so we fetch the surviving records and check the final predicate, country = 'US', directly.
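Here's one way that could look, with assumed index interfaces — range_scan, lookup, and heap.fetch are stand-ins, not any particular system's API:

```python
# Minimal sketch of a multi-index scan for:
#   SELECT * FROM students
#    WHERE age < 30 AND dept = 'CS' AND country = 'US'
# Each index probe yields record IDs; AND becomes set intersection
# (an OR would become set union). Index/heap interfaces are assumed.

def multi_index_scan(age_index, dept_index, heap):
    rids_age  = set(age_index.range_scan(upper=30))   # rids with age < 30
    rids_dept = set(dept_index.lookup('CS'))          # rids with dept = 'CS'
    candidates = rids_age & rids_dept                 # AND -> intersection
    # No index on country: fetch the candidate records and apply the
    # remaining predicate the ordinary way.
    return [rec for rec in (heap.fetch(rid) for rid in candidates)
            if rec.country == 'US']
```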
There are many different ways to implement this. You can do it with bitmaps, where each record ID's bit is set if the record might belong in the result; you can do it with hash tables; you can do it with Bloom filters — and since Bloom filters are probabilistic, you then need a final check that verifies each record really should be included. But the basic idea is the same: you produce the record IDs from each index independently, take the intersection, fetch the records at that intersection, and perform the final selection step, country = 'US', on them. Are there any questions about this? Yes.

So the question is: how do you know which index is more selective without querying it? The way the DBMS knows this is by maintaining statistics about the tables and attributes stored in the database. The catalog, which contains all the information about your tables and columns, will also maintain statistics about the data distributions. For age, for example, you might build a histogram over the ages, stored in your catalog; for departments, you might store the distinct departments — CS, biology, English, whatever — with a count associated with each. These histograms give you statistics, and they're going to be really important during query optimization: when you're deciding between plan A and plan B, both equivalent, both giving the same result, knowing something about the selectivity of the different predicates may lead you to execute one over the other. In the high-level introduction I said there are all these factors in the decision, and we'll talk more about them in the query optimization lecture, but the key idea is that the DBMS maintains statistics about all these things. They may not be exactly accurate — they can get outdated — so there are usually commands, like ANALYZE in Postgres, that you can issue to tell the DBMS to go update the statistics stored in its catalog. It keeps these rough counts and histograms to figure out join ordering, index selection, what order to evaluate predicates in — all the sorts of problems that come up in query optimization. Does that answer your question? Great.

Okay, so we've spoken mostly about read queries; equally important are modification queries. These are the operators that implement the INSERT, UPDATE, and DELETE functionality of SQL. The important thing to remember is that each of these operators is responsible for checking any constraints we have. For example, if we have a primary key constraint, all of those values need to be unique; we can't let non-unique values get inserted, so the insert operator needs to implement a check for that case.
Similarly, if you have other constraints in your database — say everyone's salary needs to be greater than zero, we can't have negative salaries — then whoever performs a write operation needs to make sure that constraint still holds. The other thing these operators need to do is update indexes. We talked a few slides ago about how you have to maintain the aggregate values stored in zone maps and keep them synchronized with the base data in the tables; it's the same for indexes. The indexes need to match what's stored in the tables: we can't have something in the table that's not in the index, and likewise we can't have something in the index that isn't in the table. These operators are responsible for maintaining that synchronization.

Update and delete fit together: their child operators pass up the record IDs of the target tuples. They fit into the query plan just like other operators — like a projection, for example — with a stream of tuples passed to them, whether that's a single tuple at a time, a vector at a time, whatever. But it's important for them to keep track of previously seen tuples, and we'll see an example of why in a minute.

For insert, you have two choices for how to implement it. Choice #1: materialize the tuples inside the operator itself — I want to insert tuples with these values, so they get constructed inside the operator and then pushed up to whatever index updates or constraint checking needs to happen. Choice #2: the operator inserts any tuple that's passed in from its child operators. Decoupling it this way lets you implement something like a SELECT ... INTO — selecting all the values in one table that match some predicate and inserting them into another table. Choice #2 allows you to do that, so it's a bit more flexible than the first option.

Now, I mentioned that update queries have problems. Imagine this situation: we have an index on salary for the people in our people table, and we want to update people and give everyone a $100 raise if they currently make less than $1100. The scan at the bottom is: for every tuple t in people, emit t up to the parent; and then, for every t the update operator gets back, it removes the tuple from the index, updates the tuple, and puts it back in the index. So the parent calls the child operator, and we have this index on people.salary; our range is everyone under $1100, so we start scanning at the lowest salary and work upward until we hit $1100. Say the first value here is Andy, with a salary of $999. We return that from our scan up to the parent, the update operator, and the update operator performs the operation we want.
In this case, it gives him a $100 raise, so now he's making $1,099, and in the final step he gets inserted back into the index. So he's back in the index. As we proceed through the index, what we might find is that we get to this tuple — Andy, who is still under $1100 — and he gets read in again, returned by our index scan, and we give him another raise, which frankly I don't think he deserves. So now he's making $1,199, when he already got his $100 raise.

This is a well-known problem in update queries called the Halloween problem. If you ask why it's called the Halloween problem: it has nothing to do with the specifics of the problem. The researchers at IBM who noticed it discovered it on Halloween day in 1976 — I think it was a Friday — and they said, wow, that's a really interesting problem; and then they said, okay, it's Friday, let's solve it next week. That's why it's called the Halloween problem: nothing exciting about the specifics, just the day on which it was discovered. To summarize, it's an anomaly where an update operation changes the physical location of a tuple — in our example, updating the salary repositions the tuple in the index — which causes the scan operator to visit that tuple multiple times, so you can end up updating the same tuple multiple times. In the original formulation, I think they wanted to give a raise to everyone making less than $25,000, with no raise once you reached $25,000. They ran the query, and the end result was that everyone in the table was making exactly $25,000, because people kept getting returned to the update operator as long as they were under $25,000, and so kept getting raises until they all capped out at the maximum. Are there any questions about the Halloween problem, or updates and writes in general?
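Here's a tiny sketch of the anomaly and the usual fix — tracking the record IDs you've already updated so a repositioned tuple can't be processed twice. The index interface is assumed, purely for illustration.

```python
# Minimal sketch of avoiding the Halloween problem. We scan a salary
# index in ascending order; giving a raise moves a tuple forward in
# the index, so a naive scan could see it again. Tracking already-
# updated record IDs prevents the double update. Index API is assumed.

def give_raises(salary_index, cap=1100, raise_amt=100):
    updated = set()                       # record IDs we've already raised
    for rid, salary in salary_index.scan_ascending(upper=cap):
        if rid in updated:
            continue                      # repositioned tuple: skip it
        updated.add(rid)
        salary_index.delete(rid, salary)             # remove old entry
        salary_index.insert(rid, salary + raise_amt) # re-insert ahead of us
```

Another common fix is to fully materialize the matching record IDs from the scan before applying any updates, so the scan never observes its own writes.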
Okay, we have one more thing to cover: expression evaluation, which is basically how the DBMS represents the WHERE clause as an expression tree. We have this query here, and we're referring specifically to this piece: the join condition R.id = S.id is in the WHERE clause, along with S.value > 100. As I mentioned, we can represent arbitrarily complex expressions: comparisons; conjunctions and disjunctions; arithmetic operators like plus, minus, and times; constant values, like the constant 100 here; and tuple attribute references, like S.value. The way we represent all of this is as what's called an expression tree. We take this clause and turn it into an AND, because we want both conditions to be true: R.id must equal S.id, and S.value must be greater than 100. On the left side of the tree is the equals node that compares R.id with S.id, on the right side is the greater-than node, and they're combined by the AND operator at the root, which is the conjunction of the two.

The way to think about evaluating this is that we need an execution context. Say we have some kind of prepared statement that lets us pass in arbitrary parameter values; we need a context that tells us how to evaluate the different operators at the different nodes of the tree. We start at the equals node and go down the left side, where we need to retrieve the attribute S.value. The context gives us the table schema, which tells us what an S tuple looks like so we can decode it, and it holds the current tuple we're working on; say its value is 1,000. Then we traverse down the right side of the tree: we fetch parameter zero, which is, say, 999, and the constant value 1. Working our way back up, we evaluate parameter zero plus the constant: 999 plus 1 gives us 1,000. Back at the top we say okay, 1,000 equals 1,000, so this evaluates to true.

This is really flexible because it lets us evaluate arbitrary expressions, but it's really slow: the DBMS has to traverse this tree for every single tuple, and at every single node it has to figure out what the operator at that node needs to do. And this was just a simple example where the comparison is effectively "one equals one," which is obviously true. So there are ways to avoid the overhead of evaluating these expression trees every time. Often you can simplify the tree: in this case, during the query optimization and rewriting phase, we can say, okay, we know one equals one, so let's just replace that subtree with true. These are the sorts of things a compiler might do when compiling your code: if the expression says one plus one, we know that evaluates to two, so we can replace the plus expression with just the constant two. An alternative that some higher-performance systems use is to evaluate expressions directly: they use some kind of JIT compilation to generate exactly the code that matches the expression tree, so instead of traversing the tree and evaluating each node, they just emit the code they want to execute for the equivalent expression. And as I said, if the expression is one equals one, we know it's true, so we can eliminate it from the query plan entirely. There are small sketches of the interpreted tree and the simplification rewrite below. Are there any questions about expression evaluation? Then we'll just wrap up really quickly.
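To make the interpreted approach concrete, here's a runnable sketch in Python that mirrors the walkthrough above. The node classes are invented for illustration; a real system would also track types and schemas in the context.

```python
# Hypothetical interpreted expression tree: every node evaluates itself
# against a context holding the current tuple and the query parameters.

class Constant:
    def __init__(self, value): self.value = value
    def evaluate(self, ctx): return self.value

class ColumnRef:
    def __init__(self, name): self.name = name
    def evaluate(self, ctx): return ctx["tuple"][self.name]

class Parameter:
    def __init__(self, index): self.index = index
    def evaluate(self, ctx): return ctx["params"][self.index]

class BinaryOp:
    OPS = {
        "+":   lambda a, b: a + b,
        "=":   lambda a, b: a == b,
        ">":   lambda a, b: a > b,
        "and": lambda a, b: a and b,
    }
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def evaluate(self, ctx):
        # The tree is re-walked for every tuple; this per-node dispatch
        # is exactly the interpretation overhead described above.
        return self.OPS[self.op](self.left.evaluate(ctx),
                                 self.right.evaluate(ctx))

# WHERE S.value = $0 + 1, with $0 = 999 and the current tuple's value = 1000
expr = BinaryOp("=", ColumnRef("value"),
                     BinaryOp("+", Parameter(0), Constant(1)))
ctx = {"tuple": {"value": 1000}, "params": [999]}
print(expr.evaluate(ctx))  # True
```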
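And here's a rough sketch of the simplification idea, reusing the hypothetical node classes from the previous snippet: any subtree whose operands are already constants gets folded into a single Constant at plan time, so it never has to be re-evaluated per tuple.

```python
# Hypothetical constant-folding pass over the same expression nodes.

def fold(node):
    if isinstance(node, BinaryOp):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Constant) and isinstance(right, Constant):
            # Evaluate once at plan time instead of once per tuple.
            return Constant(BinaryOp.OPS[node.op](left.value, right.value))
        return BinaryOp(node.op, left, right)
    return node  # columns and parameters can't be folded ahead of time

folded = fold(BinaryOp("+", Constant(1), Constant(1)))
print(folded.value)  # 2: the plus node is gone before execution starts
```

JIT compilation goes a step further: rather than folding pieces of the tree, the system generates the code for the whole expression so there's no per-node dispatch at all.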
Okay, so to conclude: the same query plan can be executed in multiple different ways, and this is going to be a big deal when we talk about query optimization, where we have multiple equivalent query plans that give us the same answer and need to decide which one is more efficient to execute. We've already seen that a little bit in deciding the join order, whether one table should be the outer table versus another, but it's going to become even more complex when we're looking at all of the operators in a query plan. Most DBMSs, as I said, will want to use some type of index scan as much as possible rather than a full table scan, because for a large class of queries it lets you get down to just the tuples you want instead of having to scan through a full table and throw away most of the tuples you don't want. And finally, expression trees, which we just talked about here at the end, are very flexible, since you can evaluate an arbitrary predicate or expression, but they're very slow to evaluate. Next class we're going to continue talking about query execution, specifically about how to parallelize the different operators that we've seen here. So I will see you next class.