Okay, let's get started. Again, give it up for DJ Drop Tables. Thank you. Today's going to be awesome, you know why? We're talking about databases.

All right, so quickly before we get started on today's material, again, just a reminder of what's due or upcoming for you in the course. Homework 3 is due two days from now, on Wednesday at midnight. We'll have the final exam, sorry, midterm exam, in class next week on the 16th. That'll begin during the regular class time. I'll do a quick midterm review at the end of next lecture on Wednesday, and I'll post a study guide on the website later in the week with all the information you need to know to prepare for the midterm. Then Project 2 will be due after the midterm, on Sunday, October 20th, and immediately after that, Project 3 will go out. Any questions about any of these expectations for you as a student? Yes? The question is whether we will provide a previous exam. Yes, we will, with solutions as well. When? Probably by this weekend. Yes, I'll cover this on Wednesday: you'll have one handwritten sheet of notes. No copies of the slides, no shrunk-down version of the textbook; it has to be handwritten. Okay. Again, we'll cover this in more detail, but the midterm will cover everything up to and including Wednesday's class. So everything from this class and Wednesday's class will be on the midterm next week, just not next Monday's class. Yes. Okay.

So we've been talking about query execution, but we've mostly been focusing so far on how to implement the individual operators in our query plan. How do we do sorting? How do we do hash joins, and so forth? Today we're going to start putting this all together: how do we execute an end-to-end query and produce a final result for the application? Again, these are all the operators in our query plan, and we know how to do a join, the sort-merge join, hash join, nested-loop join, and if we're doing a sort, we know the different ways to do sorts and aggregations. So now we're trying to put this all together and say: if we're given a query plan, how do we actually go ahead and execute it?

There are three things we're going to discuss today. First, we'll talk about how we actually process the query plan: how do we organize the execution flow, the flow of data between these operators, in such a way that we produce the correct result? We'll see how to do this in different ways for different environments or different storage models underneath us. Then we'll talk about the access methods. We've already covered this in bits and pieces throughout the semester, how to do an index scan, how to do a sequential scan, so now we'll try to understand them in a bit more detail. Then we'll finish by talking about how to evaluate the predicates and expressions in our query plan: if we have a WHERE clause, how do we apply that WHERE clause to a tuple, whether we're doing a sequential scan or an index scan? Again, the high-level idea of what we've done so far is that we have a bunch of bits and pieces, how to do sorting, how to do joins, how to do index probes, and now we're starting to put it all together to actually be able to execute a query.
Okay. All right, so the first thing we have to discuss is the processing model. A database system's processing model specifies how it's going to execute a query plan. In essence, it specifies whether you're going from the top to the bottom or from the bottom to the top, and what you're actually passing between each operator from one to the next. There are three main approaches we can take, and they all have different trade-offs and different performance implications for different workloads and different operating environments. We'll go through each of these one by one and show examples.

The most common one is going to be the iterator model. This is what pretty much every single database system you know about uses; for the most part, this is how they execute queries. The materialization model is a specialized version of this that is primarily used only in in-memory systems. And the vectorized model is, again, based on the iterator model, but you're sending batches of things, vectors of tuples, and this will be more useful for analytical workloads. Okay.

All right, so as I already said, the iterator model is the most common one. I think the textbook refers to it as the iterator model; sometimes I see it referred to as the Volcano model or the pipeline model. The reason it's called the Volcano model is that there was an influential academic system in the late 1980s and early 1990s called Volcano that described, at a high level, exactly this approach. People were doing this before, but this work laid out the exact way to do it, including in parallel, which is what we'll talk about on Wednesday. The Volcano system was built by the same guy, Goetz Graefe, who wrote that B+tree book we talked about before. He's also the guy behind the Volcano query optimization model, which we'll talk about next week. That dude is pretty prolific.

The way this basically works is that for every single operator we have in our database system, so if we want to support joins we have a join operator, for sorting we have a sort operator, all of these are connected.
They're all going to implement a next function. What happens is that the parent node will call next on its child node, and that child node will then produce, as the return result of that next function, the next tuple that the parent needs to process. You can start to see how this cascades down: I call next on the root, it calls next on its child, which calls next on its child, and we keep going down until we hit the leaf nodes of our query plan. Then we start emitting tuples up the query plan and processing them one by one.

The reason this is also called the pipeline model is that it allows a single tuple to ride as far up the query plan as possible, being processed in one operator after another, before we go back and get the next one. This is important in a disk-based system, because if we can only have one page in memory, for example, then for every page we fetch, we want to do as much work as we can with each tuple while it's in memory, before we go back and get the next tuple or the next page. That series of work, the tasks you can do in the query plan for a given tuple, is going to be called a pipeline.

So let's look at an overview of how this works. Again, this is that same join we've been looking at: R join S on r.id = s.id, with a simple predicate where s.value is greater than 100. Normally I don't like to show code in lectures (SQL doesn't count, because it's beautiful), but for this we have no choice. So this is some pseudocode showing the different next functions for these operators. Essentially, they're just for loops iterating over the output of their child operator.

So say we start at the very root: we call next on the root node. This is just a projection, and it has a for loop that says: for every single tuple I get back from calling next on my child, do the projection. At the very beginning, when we call next, we enter this for loop. Think of it like an iterator, where I can keep calling next, and if my operator is called next again, it knows how to pick up where it left off. So at the very beginning, we call next on the root; it has no tuples, so the very first thing it does is call next on its child, which is the join operator. The join operator is composed of two parts or two phases, as we talked about before. Say we're doing a hash join: we have the build phase, where we build the hash table, and the probe phase, where we probe it. So at the very beginning, when next is called on my hash join operator, I call next on the left child, because that's what I use to populate the hash table. That in turn invokes next on its child operator, and now we get down to the leaf nodes.
These leaf nodes are our access methods; this is how we access and retrieve tuples from either the index or the table. So this one just has its own little for loop that iterates over the relation R and emits tuples up. For every emit call, we pass up a single tuple as the return result for one invocation of next. And we keep doing this: the parent join operator keeps calling next on its child over and over again, this thing keeps emitting tuples up, and inside the for loop we're building our hash table. At some point we reach the end, we've processed all the tuples, and then we just pass up a null pointer. Now the parent knows it has gotten everything it's ever going to get out of its child, and it never needs to go back to it again.

All right, so again, it's called the iterator model because these are just iterators, or cursors, looping through and getting all the tuples one by one. After we finish the left side, we fall down into the operator on the right side. Same thing: we call next on its child, which goes down to the leaf, we emit tuples up one by one, and now we do the probe. For any tuple that matches in our hash table while doing the join, we emit it up to the parent. Is this clear?

The reason this is sometimes called the pipeline model is that, say we're back on the right side of the tree: we call next on this operator, it calls next on its child, and now a tuple is emitted up. We brought this tuple into memory, and we want to do as much work as we can to process the query while that tuple is in memory. So rather than calling next and getting the next one, we pass it up and do the join after we do the filter, and if it matches, we can pass it even farther up the pipeline to our parent, which can then produce it as an output tuple. Only then do we call next all the way back down and do it all over again. This series of operators that can process a single tuple is called a pipeline, and it's just like we talked about before with joins: when we bring something into memory, we want to do as much work as possible while it's in memory before we go off to the next thing, because the disk I/Os are so expensive for us.

So again, the iterator model is the most common one. It's used in pretty much every single database system you can think of. The reason is that, from a human standpoint, it's easy to reason about the program flow, and it's easy to reason about what's going to be in memory versus not in memory.
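To make the control flow concrete, here is a minimal Python sketch of the iterator model, not the actual pseudocode from the slides. Each operator's next function is modeled as a generator, the operator functions and the little R and S tables are made up for illustration, and the hash join drains its entire left child to build the hash table before streaming the right child:

```python
# Minimal sketch of the iterator (Volcano) model using Python generators.
# Each generator plays the role of an operator's next() function, yielding
# one tuple at a time. Hypothetical data and operator names for illustration.

def seq_scan(table):
    for tup in table:                 # emit tuples one at a time
        yield tup

def selection(child, predicate):
    for tup in child:                 # pull from child, filter, pass up
        if predicate(tup):
            yield tup

def hash_join(left, right, left_key, right_key):
    # Build phase: drain the ENTIRE left child first (a pipeline breaker).
    ht = {}
    for tup in left:
        ht.setdefault(left_key(tup), []).append(tup)
    # Probe phase: stream the right child one tuple at a time.
    for tup in right:
        for match in ht.get(right_key(tup), []):
            yield match + tup

def projection(child, cols):
    for tup in child:
        yield tuple(tup[c] for c in cols)

# SELECT R.id, S.cdate FROM R JOIN S ON R.id = S.id WHERE S.value > 100
R = [(1, "a"), (2, "b"), (3, "c")]                   # (id, data)
S = [(1, "2024-01-01", 50), (2, "2024-02-02", 200)]  # (id, cdate, value)

plan = projection(
    hash_join(seq_scan(R),
              selection(seq_scan(S), lambda t: t[2] > 100),
              left_key=lambda t: t[0],
              right_key=lambda t: t[0]),
    cols=[0, 3])  # R.id and S.cdate within the joined tuple

for row in plan:  # the root pulls; each pull cascades down the tree
    print(row)
```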
Now, some operators cannot, or are not going to be able to, let us pipeline all the way up. These are called pipeline breakers: the operators that need to get all the data from their children before they can go on to the next thing. My join example here is a perfect choice: if I'm on the left side, I go get a tuple from the scan, and I have to build the hash table with it. I can't process it further up the tree, because I don't know yet whether it's going to match anything coming from the other side. So this would be a pipeline breaker within this pipeline here: I do the scan, I build my hash table, and now I have to go back and get the next tuple, because I can't continue up the tree with this one. You see the same thing with ORDER BYs or subqueries: anything that requires having more tuples before moving up is a pipeline breaker, and they're unavoidable; there's no way to get around them.

Another nice thing about the iterator model is that output control, like LIMIT clauses, is super easy to do, because the parent can just say: I only want 10 tuples. If I call next 10 times and get back 10 tuples, I'm done; I don't need to keep calling. So output control works great with this as well. Parallel queries also work really well, and we'll cover next class how to run these operators on different threads at the same time, or on different machines at the same time. Okay.

The next approach is the materialization model. The idea here is that instead of having a next function that spits out a single tuple, each operator dumps out all of its tuples, everything it's ever going to emit, all at once whenever it's invoked. You don't keep going back and calling next, next, next. Of course, the issue now is that if we maybe only want 10 tuples at the root because we have a LIMIT clause, then unless we propagate information down the query plan about the number of tuples we actually need at the top, we'll end up passing along way more data than we actually needed. So there's some extra work you have to do here that you don't have to do in the iterator model, to make sure you don't pass along more data than you need.

The output can be either materialized rows or a single column. In the iterator model, it's usually the entire tuple, an entire record, but in the materialization model, some systems actually pass along a single column. It's still all the tuples, meaning all the values of that single column for all the tuples.

All right, so going back to our example: we no longer have a next function; instead, the return value is an output buffer, which is like a list of all the tuples. At the very beginning, we call output on our root, which invokes the operator below us. Same as before, we go down and get all the tuples on the left side to build our hash table, and the result is, again, all the tuples. We take them, process everything in our for loop, build our hash table, then go down the right side and percolate forward the data that we need going up, and again, it's all the tuples at once.

Yes? What is "child"? So it's a tree, right: this is the parent, this is its child, and you either have one on the left side and one on the right side, or just one. In this example, this is the projection operator; it has one child, the join operator.
But because everybody is implementing the same standard API, there's nothing in this code that knows it's dealing with the join operator specifically. Everything always gives you back the tuples you need, either one tuple, or all the tuples, or a vector of tuples in the vectorized model. The main takeaway is that if I implement this standard API, where operators have either an output function or a next function, I can compose these operators in any way I want, and it's not a major rewrite every single time. I could first ship my system without a join operator, and if I have the standard API, at some later point I could add a join operator, and it just fits in and works with all the other operators. It's a standardized API that allows me to compose query plans by plopping in operators based on what the query actually wants to do. Okay.

So again, the main difference between the iterator model and the materialization model is one versus everything: the iterator model passes one tuple at a time; the materialization model passes everything. Once I call output on an operator and it spits back its buffer of all the tuples, I never go back and ask for more; I have everything I need. Yes? Is it one materialized tuple or one block of tuples? It's everything. So his question is: wouldn't it be more efficient to pass up a block of tuples instead of one tuple? Yes, that's the vectorized model; that's the next one. Yes, okay.

So the materialization model is fantastic for OLTP workloads, because in OLTP workloads, what are you doing? You're selecting one record at a time, or a small number of records at a time. The overhead of calling next, next, next and percolating that down, those function calls actually start to matter if everything is in memory, because each one is a branch on the CPU, a jump in the address space, and those are expensive. With the materialization model, if we're only shoving up one tuple at a time anyway, who cares? That's awesome, that's super fast.

This is actually how we implemented it in VoltDB. VoltDB is based on a system I helped build in grad school called H-Store; H-Store uses the materialization model, and as far as I know VoltDB still does, because it's an in-memory database designed for really fast transactions. It doesn't do analytics well, but for this mode it's fantastic. MonetDB is an in-memory column store out of the Netherlands. It's an awesome system too, but it's not for transactions, sorry, it's for analytics. I don't know why they took this approach, but they did. It had a bunch of other problems that we don't want to get into, but I think this is a bad idea for an analytics system; I think it's the right choice for VoltDB. Because, again, if I'm scanning a billion tuples, then unless my filter is super selective, I could be shoving a billion tuples from one operator to the next, only to find higher up in my query plan that I didn't need a billion tuples, maybe only 10% of that. So you end up moving a lot more data than you actually need with this approach.
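Here is a minimal Python sketch of the same plan under the materialization model, again with made-up operator names and data rather than anything from the slides. Each operator exposes a single output-style call that returns all of its result tuples at once:

```python
# Minimal sketch of the materialization model: each operator returns ALL
# of its result tuples in one buffer per invocation. Hypothetical functions
# and data, for illustration only.

def seq_scan_all(table):
    return list(table)                       # materialize the whole scan

def selection_all(child_output, predicate):
    return [t for t in child_output if predicate(t)]

def hash_join_all(left_output, right_output, lkey, rkey):
    ht = {}
    for t in left_output:                    # build from the full left result
        ht.setdefault(lkey(t), []).append(t)
    out = []
    for t in right_output:                   # probe with the full right result
        for m in ht.get(rkey(t), []):
            out.append(m + t)
    return out                               # one buffer with everything

R = [(1, "a"), (2, "b")]
S = [(1, "x", 50), (2, "y", 200)]

result = hash_join_all(
    seq_scan_all(R),
    selection_all(seq_scan_all(S), lambda t: t[2] > 100),
    lkey=lambda t: t[0], rkey=lambda t: t[0])
print(result)  # the parent never calls back down for more tuples
```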
All right, so now his suggestion from before, which is exactly what we're talking about here: instead of passing a single tuple for every invocation of next, I pass along a batch of tuples, a vector of tuples. Wouldn't that be more efficient? Yes, and this is what the vectorized model does. For every invocation of next, we get back a batch of tuples instead of a single tuple, and we design our system such that the inner loops, the kernels of these operator implementations, are rewritten or designed to operate on a vector of tuples rather than a single tuple.

The size of the batch you spit out for every invocation of next can depend on what the hardware looks like. It can be based on what the actual storage devices are and how fast they are, but it can also vary with what your CPUs look like. We're not going to get into this here, because we said in this class we weren't going to worry about CPU registers and cache lines, but if everything fits in memory, modern CPUs have instructions that allow you to do vectorized execution, vectorized operators, on a chunk of data at a time. This is called SIMD, if you're taking the parallel computing class, 15-418/618. If you have a batch of data that fits in a SIMD register, then with a single instruction you can apply a predicate or evaluate some operator on that data very, very efficiently. This is why the vectorized approach is actually a good idea.

So again, this looks basically like the iterator model, but now, inside the kernel functions of our operators, we have this output buffer, and we just check whether the output buffer, as we go, is larger than the size we want to emit; if so, we shove back a tuple batch. It's still the same sort of model, where each invocation of next goes down to our child, the child does some kind of processing, maybe goes to its child to get some data, and once we have everything we need in our batch, we shove it up and let the parent process it. Is this clear?

Okay, so the vectorized model is ideal for analytical queries, because what are they doing? Analytical queries do long scans over large portions of the tables. So rather than getting one tuple for every invocation of next, I'm getting a batch of tuples, and the size of the batch can vary depending on where the data is coming from and how you're going to process it. This is the same SIMD stuff I talked about before. Every major data warehouse built in the last 10 years or so uses the vectorized model, because the performance impact is quite significant.
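A minimal sketch of the vectorized version follows, with a made-up batch size; the comments note where a real engine would use SIMD instructions rather than a Python loop:

```python
# Minimal sketch of the vectorized model: next() returns a BATCH of tuples.
# BATCH_SIZE is an illustrative knob; real systems tune it to the hardware.
BATCH_SIZE = 1024

def seq_scan_batched(table):
    batch = []
    for tup in table:
        batch.append(tup)
        if len(batch) >= BATCH_SIZE:
            yield batch            # emit a full batch, not one tuple
            batch = []
    if batch:
        yield batch                # final partial batch

def selection_batched(child, predicate):
    out = []
    for batch in child:
        # The inner loop works on a whole vector at a time; this is where
        # a real engine would apply SIMD instructions instead of Python.
        out.extend(t for t in batch if predicate(t))
        while len(out) >= BATCH_SIZE:
            yield out[:BATCH_SIZE]
            out = out[BATCH_SIZE:]
    if out:
        yield out

table = [(i, i * 10) for i in range(5000)]
for batch in selection_batched(seq_scan_batched(table), lambda t: t[1] > 40000):
    print(len(batch), batch[0])   # -> 999 (4001, 40010)
```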
And this is actually what we use in the system we're building here. So, is this clear? In general, we're going to focus on the iterator model, but there are other ways to do this: the vectorized model is sort of easy to see as an extension of the iterator model, and the materialization model really only comes up in specialized systems.

Yes? The question is: with the materialization model, if everything is in memory, then even for OLAP, wouldn't there still be a benefit from not having to call next over and over again, if we just get everything back at once? Yes, but if you're reading large amounts of data and you're not able to push down the predicates, if your predicates aren't selective enough, meaning the amount of data you're outputting is large, say my table is 100 gigabytes and now I'm shoving up 90 gigabytes of data, then you're copying a lot of data and it sucks. You can be smarter about it if you're doing the late materialization stuff we'll talk about in a second. And in the vectorized model, to be fair, if you're calling next again and again and it goes to a different section of the disk every time, there's a cost there too. But we're mixing environments here: I'm saying the materialization model works for in-memory systems; I think it is not good for disk-based systems, and I don't know of any disk-based system that actually does this. For disk-based systems, the vectorized model is still better, because the amount of memory I need to keep intermediate results is smaller, and I don't want to have to spill to disk as I'm working through my pipeline. So the vectorized model is preferable for OLAP systems. You're shaking your head as if you're confused. Are you good? Okay.

All right, the last thing I want to mention here, and I just want you to be aware of it rather than focus on it too much, is that all the processing model examples I showed, iterator, materialization, and vectorized, used a top-down approach. Meaning, we start with the root, call next or output on it, that percolates down into the tree, and we start pulling data up to the root. That's the most common way people implement query execution and these query processing models. But you can go in the other direction: you can start at the bottom and actually push data up. This is a bit rarer. We did this in VoltDB, we're doing it in the new system we're building today, and HyPer out of Germany does this as well. The reason you'd want to do this is that you can be more crafty and more careful, as you move data up, about making sure the tuple and the state of your processing stay in the CPU caches and registers. If you're very careful about your memory placement and your memory allocation, this way is preferable, but it assumes everything fits in memory. For this reason, for disk-based systems, the top-down approach is better. The push-based approach is also a bit harder to reason about as humans, because you have to write programs in a way that may not be natural for how we normally write code, whereas the top-down one, again, from a human standpoint, is easy to understand: I call next on my child,
and it gives me some data. The push-based one is carefully crafting the program in a way that's better for the CPU but more difficult for humans to reason about. Again, I don't want to go into too much detail on this; it's something we'll cover if you take the advanced class in the spring. The main takeaway for you is that everyone implements the top-down approach, for the most part. Okay.

All right, so now let's talk about what's going on in the leaf nodes of our query plan. Again, these are the access methods: this is how we actually retrieve data from our tables in the database system and pass it along to the next operators. We've basically covered these already; I just want to spend a little more time going over them again, and then we'll talk about how to do this with multiple indexes. In general, there are two approaches: you're either reading the data from an index, or reading it from the table with a sequential scan. The index is not always, but can be, preferable based on what query you're doing, and the fallback solution is always the sequential scan. If I don't have an index that can do what I want to do, or that has the right attributes for my query, I just fall back to a sequential scan. The multi-index scan is just an extension of this: instead of accessing one index, we can access multiple indexes and combine their results together. Okay.

So the sequential scan. Again, we've covered this many, many times; it really is just a bunch of for loops inside our operators. The scan operator will iterate over each page in our table, and for every single page, we iterate over the tuples inside that page. Then we can do whatever work we want on them and emit them up to the next operator as needed. The way this is implemented is like a cursor. If you're familiar with Python, it has a yield keyword for iterators; it's basically that: I run my for loop, and when I get the next page, I can iterate over every single tuple and then call some other function, or pass control back to another part of the system. If someone comes back and says, give me the next tuple, I know how to pick up where I left off. So operators have to maintain this state about where the iterator left off every single time it goes off and returns a tuple, so that when we come back, we can resume. The typical way this is referred to in systems is as cursors.
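Here is a minimal sketch of such a cursor, with a hypothetical page layout (a table as a list of pages, each page a list of tuples); the point is only to show the saved (page, slot) position that lets next() resume where it left off:

```python
# Minimal sketch of a sequential-scan cursor that remembers where it left
# off between next() calls: a (page index, slot index) pair. Hypothetical
# page layout, for illustration only.

class SeqScanCursor:
    def __init__(self, pages):
        self.pages = pages
        self.page_idx = 0      # which page we are on
        self.slot_idx = 0      # which slot within that page

    def next(self):
        while self.page_idx < len(self.pages):
            page = self.pages[self.page_idx]   # fetched via the buffer pool
            if self.slot_idx < len(page):
                tup = page[self.slot_idx]
                self.slot_idx += 1             # save position for next call
                return tup
            self.page_idx += 1                 # page exhausted; advance
            self.slot_idx = 0
        return None                            # no more tuples

pages = [[("a", 1), ("b", 2)], [("c", 3)]]
cursor = SeqScanCursor(pages)
while (t := cursor.next()) is not None:
    print(t)
```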
So there are a bunch of optimizations we can do to make our sequential scans go faster, and we've covered a couple of these already. The sequential scan, again, is the fallback option when we don't have an index or a better way to process the query, but it's slow, especially in a disk-based system where every single page could be out on disk. We talked about prefetching, the double-buffer optimization for doing joins; we talked about buffer pool bypass, where instead of polluting the buffer pool cache, we keep a little side buffer just for our thread or our query; and we'll talk about how to run sequential scans in parallel next class. So I want to focus on the three remaining ones. Again, these are just ways to make sequential scans go faster. The idea is that there's nothing we can do to magically make a sequential scan itself go faster; we're limited by the speed of reading data from disk into memory. But if there are ways to figure out how to do less work, that's gold; that's what we want, and that's what these optimizations try to let us do.

Zone maps are actually what somebody brought up a few lectures ago, about precomputing some information about pages to let us figure out whether we actually need to access them or not. The basic idea of a zone map is that for every single page in our table, we keep some additional metadata, computed or derived from that page, that tells us what the values inside the page look like for a given attribute. Say we have a really simple table with one column, and within a single page we only have five tuples. The zone map for this page could have precomputed aggregate information about the values in this page for this column: the min, max, average, sum, and count. Now say my query comes along and looks something like this: SELECT * FROM table WHERE val > 600. Without the zone map, what do I have to do? A sequential scan: I grab the page, iterate through every single tuple, and evaluate my predicate to see whether I find a match. But if I have a zone map, I can say: I think I need to access this page, but let me check the zone map first. I'm looking for values greater than 600, and the max value for this column in this page is 400, so I know there's not going to be any tuple with a value greater than 400, and therefore nothing in this page will ever match my predicate. So I skip looking at the entire page entirely.

Right, so now you ask: where is the zone map actually stored? Some systems pack it inside the page itself, so you still have to bring the page in, but at least you're not iterating over every single tuple. Other systems store these zone maps in separate pages, a zone map block or zone map page holding a bunch of zone maps for different pages, which may stick around in memory more often. Then I just jump to that page, check the zone map, and decide whether I want to go access the underlying page. I use the term zone maps because that's sort of the common term; I forget whether it was Netezza or Oracle that invented it. If you google "oracle zone maps," there's a bunch of documentation for this; I don't know whether it's a trademarked term, because other systems don't use that name. But it's found in pretty much every single major data warehouse today, because the performance benefit is quite significant, since disk is so, so slow. You can imagine a bunch of different types of queries that benefit just from these kinds of statistics.
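A minimal sketch of the zone map idea follows, with hypothetical single-column pages; the min/max per page lets the scan skip pages whose max rules out the predicate val > 600:

```python
# Minimal sketch of zone maps: per-page min/max for one column, used to
# skip pages that cannot possibly satisfy "val > 600". Illustrative only.

def build_zone_maps(pages, col):
    return [{"min": min(t[col] for t in p),
             "max": max(t[col] for t in p)} for p in pages]

def scan_with_zone_maps(pages, zmaps, col, lower_bound):
    for page, zm in zip(pages, zmaps):
        if zm["max"] <= lower_bound:
            continue                  # whole page skipped: no tuple can match
        for t in page:                # only now do we pay to read the page
            if t[col] > lower_bound:
                yield t

pages = [[(100,), (400,)], [(650,), (700,)], [(280,), (500,)]]
zmaps = build_zone_maps(pages, col=0)
print(list(scan_with_zone_maps(pages, zmaps, col=0, lower_bound=600)))
# -> [(650,), (700,)]; pages 0 and 2 were never inspected tuple by tuple
```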
What's one issue we have with zone maps, though? You have to maintain them, exactly right. Any time I update something in the page, I have to make sure the zone map stays in sync, because I don't want a false negative: I don't want to look at my zone map and conclude there's no match when there actually is one. The systems I'm showing up here all use these for the analytical stuff; you would not want to use a zone map for OLTP, because you'd have to maintain it all the time, and that's super expensive. But for analytics, which is usually write once, read a lot, zone maps are a big win, and this is why all the major systems support them.

All right, the next optimization is something we talked about before; I want to make sure everyone understands it in more detail. This is late materialization. In a column store system, we can delay, or avoid entirely, having to propagate data from one operator to the next; instead, we pass along offsets or record IDs that allow us to go get the actual data we need later on. In a row store system, the output of an operator is typically the entire tuple, the entire record: I've already gone to disk to get the page that has that tuple, so I might as well pass the entire tuple up the query plan tree and never have to go back and get more data. But in a column store, getting all the data for a single tuple means a bunch of separate reads, because the data is broken up across different columns, so I want to avoid or delay that for as long as possible.

Say we're doing a join of two tables, foo and bar, and the foo table has three columns: a, b, and c. In my query plan, in my pipeline over here, as I'm processing the left side of my join, the first operator, the filter, only needs the a column. So I just pass along this iterator here, which passes along blocks or pages from the a column, and I do my processing. But I know that in my query plan I never need the a column again: I'm doing the join on b, and I'm doing my aggregation on c, so I don't need to pass along a. Instead, I'll just pass along the offsets of the tuples that passed the predicate. Now, inside the join operator, I can use those offsets to go get the b column and complete the join. Same thing again: I know I don't need b up above, so I just pass along the offsets, and when I compute the average, I go off to disk, get c, and then pass along the final result. The idea here is that we can be smart about identifying what data we need at different parts of the query plan, and pass along only offsets that let us go back and get the rest of the data we need at a later point.
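A minimal sketch of this pattern in a hypothetical columnar layout follows; the stand-in "join" filter on column b is invented purely to show shrinking offset sets flowing between steps:

```python
# Minimal sketch of late materialization in a column store: the filter on
# column a passes OFFSETS up the plan, and later operators fetch only the
# columns they need at those offsets. Hypothetical columnar data.

col_a = [10, 25, 7, 42, 31]        # foo.a (filtered first)
col_b = [1, 2, 3, 4, 5]            # foo.b (join key, fetched later)
col_c = [100, 200, 300, 400, 500]  # foo.c (aggregated even later)

# Filter on a: emit offsets of qualifying rows, not whole tuples.
offsets = [i for i, a in enumerate(col_a) if a > 20]   # -> [1, 3, 4]

# Join on b: fetch only column b, and only at the surviving offsets.
b_values = [(i, col_b[i]) for i in offsets]

# Aggregate on c: fetch column c just for rows that survived the "join"
# (a stand-in predicate here, since the other table is omitted).
surviving = [i for i, b in b_values if b % 2 == 0]     # -> [1, 3]
avg_c = sum(col_c[i] for i in surviving) / len(surviving)
print(avg_c)  # 300.0; we touched a, then b, then c, for shrinking row sets
```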
All right, the last one is heap clustering, or clustered indexes. Again, we've already talked about this before: it just allows us to scan along the leaf nodes of the index and fetch the data in sequential order, because we know the tuples are ordered in the pages the same way they're ordered in the index. There's not much else to say about that one; we've covered it already.

So let's talk now about how we're going to do an index scan. The basic idea of an index scan is that we want to identify which index on our table is going to let us quickly find the data we need, and limit the amount of useless or wasted work we have to do. Again, it's all about selectivity: reducing the amount of data we're processing, so that we pass along less data from one operator to the next. Picking which index is super hard, because it depends on a bunch of different things: what attributes we have in our index, what attributes we're referencing in our query, what the values of those attributes actually look like and whether they're super selective for our query or not, what our predicates look like (less than, greater than, equals, not equals, which determines whether we can use an index at all), and, of course, whether it's a unique or a non-unique index. I'm going to go through a really high-level example of how we pick an index, but we'll go into this in way more detail next week, because this is what the query optimizer does for us. I write a SELECT statement; I don't specify what index I want to use. I want the optimizer inside the database system to say: these are my choices of indexes, and here's the best one based on what your query is trying to do.

So, again, we'll cover this in more detail next week, but let's look at a really simple example. Say we have this query, where we want to get all the students from the students table that are under the age of 30, in the CS department, and from the US. And say for this particular database we only have two indexes: an index on age and an index on department. For this particular query, which index we want to use depends on what the values of the data actually look like in our table. In the first case, say we have 100 tuples, and 99 of the students are under the age of 30, which is probably true for this university, but only two people are in the CS department, which is not true for us, but assume it is. So what's the best index to use for this: age or department? Department, right, because I'm only going to get two matching tuples. If I go with the age index, my index scan is essentially useless: I'm just going to find all the records I would have found anyway through a sequential scan, and now I've paid the penalty of traversing the index when I didn't need to. If we reverse this and say there are 99 people in the CS department and only two people under the age of 30, then the age index is better for us, because that's more selective.
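A minimal sketch of that decision logic follows. The match counts stand in for the optimizer statistics we'll cover next week, and the 0.9 sequential-scan cutoff is an invented illustrative threshold, not any real system's rule:

```python
# Minimal sketch of choosing between two indexes by estimated selectivity:
# pick the index whose predicate is expected to match the fewest tuples.
# The counts stand in for optimizer statistics; the cutoff is illustrative.

def pick_index(total_tuples, est_matches_by_index):
    # est_matches_by_index maps index name -> estimated matching tuples.
    best = min(est_matches_by_index, key=est_matches_by_index.get)
    # If even the best index matches almost everything, a sequential scan
    # avoids paying the index-traversal cost for nothing.
    if est_matches_by_index[best] / total_tuples > 0.9:  # made-up threshold
        return "seq_scan"
    return best

print(pick_index(100, {"age": 99, "dept": 2}))   # -> 'dept'
print(pick_index(100, {"age": 2, "dept": 99}))   # -> 'age'
print(pick_index(100, {"age": 99, "dept": 95}))  # -> 'seq_scan'
```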
So at a high level, this is what we're trying to do with an index scan: avoid reading data we don't necessarily need, and that includes the cost of actually probing the index. If the database system can recognize that an index isn't going to be very selective for this query, and that it would pay the penalty of traversing that index, or doing a lookup in the hash table for that index, then it's just better to do a sequential scan. Again, we'll reason about how to actually make this decision next week.

What I do want to focus on, though, is what happens if both indexes are a good idea. We could be stupid and flip a coin: roll the dice and pick the age one and hope it's good enough. But if the database system can recognize that both of them are going to help a lot and be very selective, then I want to do probes on both of them, get back the results, combine them in a certain way based on what the query is actually trying to do, and then use that combined result to figure out what data actually matches, to find the data I'm looking for.

Yes? The question is how we know both of those counts. In the previous example, for each scenario, how did we decide which index to pick without computing both numbers, the 99 and the 2? Wouldn't computing both be an issue? We'll cover this next week: the system keeps an approximation of these statistics. It's not always accurate, and that can lead to problems, but we know something about the data. In this extreme example it's super obvious: we would know that only two people are in the CS department, that this one is an outlier, or the other way around, that this one is super common, so we would choose, or not choose, it accordingly.

Yes? The question is whether an index scan is always better than a sequential scan. No, but we'll cover that in a few more slides. Yes? The question is: if it's not a clustered index, we could have random I/O; how do we solve that? Give me two or three slides.

Yes? All right, so the question is: in this case here, would I have to probe the index a hundred times? No. Think of the leaf node of my query plan for this query: I call next at the leaf node to access the actual table, but I'm accessing it through an index. The index does a probe to find all the students in the CS department; if it's a B+tree, it finds the position in a leaf node where the students in the CS department begin. I emit the first tuple I find, that goes up to my parent, the parent comes back and calls next again, my iterator jumps to the next entry in the leaf node, that's another student in the CS department, I emit that up, come back again, iterate to the next one, and oh, now it's someone in the biology department. That doesn't match my predicate.
So I've returned everything: I probe the index once, and I scan along the leaf nodes until I find everything I'm looking for.

Now, the other question was whether an index scan is always preferable to a sequential scan. No, because if I'm matching nearly everyone, like the case of 99 people in the CS department, I'm paying the penalty of traversing down the index, and that's just wasted I/O and wasted lookups; it would have been better to jump to the beginning of the table and scan it down sequentially. Where that cutoff point is: in this case it's obvious, 99% of the tuples match, so a sequential scan is the right way to go, but where the threshold lies depends on the cost model inside the query optimizer, and that's going to vary between different systems. I feel like I'm always saying "next class we'll discuss this," and I'm trying to keep each lecture focused on one idea, but next week we will show how we derive this information and then use it to make an approximation of which plan is better: not only which index is the best one to use, but whether a sequential scan would still beat it. For now, we're focusing on what the execution actually looks like.

All right, so again, the multi-index scan does multiple lookups on different indexes and then combines the results based on what the predicate is. If it's a conjunction, an AND clause, we take the intersection; if it's a disjunction, an OR clause, we take the union. We're just combining these sets together. Then, for any records that match after our combination, we can go do a lookup to find them and do additional processing. All the major database systems support this, and they call it different things. If you use Postgres, you'll sometimes see this in the EXPLAIN output called a bitmap scan. That's essentially what they're doing: they build a bitmap where every bit corresponds to a record at some location, and then they combine the bitmaps using bit-manipulation operators. But different systems do different things.

Yes? On the leaf nodes, yes, it depends on the predicate. If it's an IN clause, that could be multiple probes; if it's some kind of range clause, you can do a scan along the leaf nodes, and I shouldn't use the term sequential scan, a range scan is what it's called on the index leaf nodes. Yes, and in that case, again, you do multiple probes.

Yeah, so again, when you write your SQL query, you don't write it thinking, "I know I have a hash table index, so I should write this kind of query." You don't know and you don't care; at a high level, the database system can figure it out. It sees it has a hash index on this attribute and a B+tree index on the same attribute, and depending on what my predicate is, it knows which one to pick, and it does that all for you. That's the beauty of SQL, the beauty of a declarative language: you don't have to worry about how things are actually physically stored. So I could drop the index and the query still works;
it falls back to a sequential scan, but it still works.

Okay, so again, here's that same query we had before, and this time we have an index on both age and department. What we'll do is first retrieve all the record IDs that match age < 30 from the age index; then we'll do the same thing with a lookup on the department index. It could be a different thread running at the same time, or the same thread; it doesn't matter. Then we take the intersection of the outputs of these two indexes, because it's an AND clause, a conjunction. And for any records that remain after the intersection, we do a lookup to actually fetch the tuple and apply the final predicate, country = 'US'.

So visually it looks like this: do a lookup for age < 30 on this index, department = 'CS' on that one, and whichever record IDs land in the middle are the ones we know match both indexes, and therefore the ones we actually want to look at, because they match our conjunction. Then we go retrieve them and check them against country = 'US'. What we're actually generating here could be a bitmap, as in Postgres; it could be a hash table that we can potentially combine with another hash table or use for a join; it could be that Bloom filter we talked about before. It doesn't matter; different systems do different things. Ideally, you want this structure to be as succinct as possible, because you don't want a really large hash table for every single tuple that matches each side just to combine them; you want an efficient way to do the intersection, and then you go back and get the rest of the data you need.
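Here is a minimal sketch of that multi-index scan, with invented index contents and a tiny heap; record-ID sets are intersected for the AND clause, and only the survivors are fetched and checked against the leftover predicate:

```python
# Minimal sketch of a multi-index scan: probe each index for record IDs,
# intersect the ID sets (AND clause), then fetch only the survivors and
# apply the remaining predicate. Hypothetical index and heap contents.

age_index  = {25: [1, 4], 28: [7], 45: [2]}   # age -> record ids
dept_index = {"CS": [4, 7, 9], "BIO": [1]}    # dept -> record ids
heap = {1: ("alice", 25, "BIO", "US"),
        2: ("bob",   45, "CS",  "US"),
        4: ("carol", 25, "CS",  "US"),
        7: ("dave",  28, "CS",  "UK"),
        9: ("erin",  31, "CS",  "US")}

# Record IDs matching age < 30.
rids_age = {rid for age, rids in age_index.items() if age < 30 for rid in rids}
# Record IDs matching dept = 'CS'.
rids_dept = set(dept_index["CS"])

# AND clause => set intersection (an OR clause would be a union).
candidates = rids_age & rids_dept              # -> {4, 7}

# Fetch only the candidates; apply the leftover predicate country = 'US'.
result = [heap[rid] for rid in sorted(candidates) if heap[rid][3] == "US"]
print(result)  # -> [('carol', 25, 'CS', 'US')]
```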
Okay, the last thing we'll talk about for access methods is what was asked about before, and I think it has come up a couple of times this semester. If I have an unclustered index and I'm scanning along its leaf nodes, I'm potentially jumping around at random to record IDs on different pages; the tuples are not sorted the same way the leaf entries are sorted in the index. So in the worst-case scenario, say I only have room for a single page in my buffer pool; then every single time, as I'm scanning along the leaf nodes, if the page I'm looking at is not the same as the last one I just retrieved, that's another disk I/O to go fetch it.

But depending on what my query is, if the output does not need to be sorted on the key the index is built on, then before I actually do any lookups on the data, I scan along and collect all the record IDs, and then I sort them based on their page IDs. Now, for every single page, the lookups are grouped together: it's one I/O to get the page, and I process all the tuples I need inside that page before I move on to the next one.

All right, this is the beauty of the relational model: the relational model is unordered, so we're allowed to do this any way we want. It doesn't matter that the output could be different from one day to the next, because if you cared in your application about things being ordered a certain way, you would have to write an explicit ORDER BY clause. So one day, two tuples that are close together in the stored order might live on the same page; the next day, after some compaction or garbage-collection process, they're on different pages, so I could end up getting a different result ordering for the same query, and that's okay; we're allowed to do that. There's a bunch of stuff we can do as we're processing data that doesn't change the high-level correctness but can change the exact output ordering.
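A minimal sketch of that record-ID sorting trick follows, with a record ID modeled as a hypothetical (page_id, slot) pair and a counter standing in for disk I/Os:

```python
# Minimal sketch of batching unclustered-index lookups: collect all record
# IDs first, sort them by page ID, then read each page exactly once.
# A record ID here is a hypothetical (page_id, slot) pair.

rids = [(7, 2), (3, 0), (7, 0), (3, 4), (9, 1)]  # as emitted by the index

fetched_pages = 0
def fetch_page(page_id):
    global fetched_pages
    fetched_pages += 1           # stand-in for one disk I/O
    return {"page_id": page_id}  # stand-in for the page contents

# Sort by page ID so lookups on the same page are adjacent.
last_page, page = None, None
for page_id, slot in sorted(rids):
    if page_id != last_page:
        page = fetch_page(page_id)   # one I/O per distinct page
        last_page = page_id
    # ... read the tuple at `slot` inside `page` ...

print(fetched_pages)  # 3 page reads instead of up to 5 in index order
```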
All right, so any questions about this? Yes? The question is: after you do the sorting, do you have to walk through the whole page? So you sort based on page: say there are two tuples I need in page 101. I know what page they're in, page 101, and I know the slot numbers, so as I bring the page in, I jump to slot 1, I jump to slot 5, and I find the tuples I want. You're not iterating, you're not scanning through every single tuple; you're only looking for the tuples that this thing matched up above. Yeah, yes.

Correct, so the question, and I should have covered this last week, is how you actually maintain a clustered index. If this page is full and I insert something whose sort position falls in here, does that mean I have to reshuffle everything? Yes, unless you store everything the way MySQL's InnoDB does, where the leaf nodes of the index are themselves where the tuples are actually stored; then, as you do splits and merges and move key values from one page to the next, that's the same operation. If there's a disconnect, if it's an index-organized table, sorry, a clustered index that's not connected directly to the underlying data pages, then yes, you have to do a bunch of extra work. This is why, in most systems, index-organized tables or clustered indexes are not the default. Yes?

So the question is: what if it's a sort with LIMIT 10, an ORDER BY with a LIMIT clause; you can't do this trick, correct? Right. But again, the database system knows this; we have rules in our query optimizer to recognize it. That's an example of a pipeline breaker: to do an ORDER BY, I have to see all the tuples to know what the global sort order is, and I can't go on to the LIMIT clause above me in my query plan until I finish that sorting. Yes? Okay.

Yeah, so let me give an example; let's just go back to the iterator model. Pretend that instead of the root being a projection, it's an ORDER BY, and then make the one above it the LIMIT. The ORDER BY is a pipeline breaker: I don't know the complete sort order of all the records until I've seen all the records. If I try to start sorting having seen only partial data, say I see 9, 10, 11 and sort them, but I haven't called next enough times to see the value 1 yet, then my sort ordering is incorrect. A pipeline breaker says you cannot proceed up the query plan until you get all the tuples you need. So the ORDER BY here would be a pipeline breaker, and if I have an ORDER BY with a LIMIT clause that wants to pick the top 10 tuples after sorting, I can't produce that top 10 until I have everything ordered. Make sense? So ORDER BYs, subqueries, joins, LIMIT clauses with ORDER BYs would be pipeline breakers. Other examples: MINs, MAXes, things like that.

So how does this relate to an index scan? The index scans are down here at the leaves. For S and R, I'm not saying how we access R: it could be a sequential scan, it could be an index scan. If it's a sequential scan, the data is unordered, so my ORDER BY up above has to do a bunch of extra work. But if I am doing an index scan, a range scan on my index, and the sort order of the index is the same as my ORDER BY clause up above, then I know these tuples are going to come out in sorted order, so I don't need to do my ORDER BY up above at all. And again, we know all of this ahead of time as we're generating the query plan, because the query is declarative and we know exactly what the plan is. There's no magic function whose behavior we don't know until we actually run it. That's not entirely true, but for now, that's the case. Okay.

All right, cool. So the last thing I want to talk about is how we actually evaluate these predicates. We have these WHERE clauses, we have these join clauses; how do we actually make sense of them? The way we're going to do this is to represent the WHERE clause as an expression tree, and all the nodes in our expression tree are going to represent the different types of expressions we can have in our predicates: comparisons, conjunctions, disjunctions, arithmetic operators, function calls, lookups of tuple attributes, constant values; all of these are now represented in a tree. For this particular example, I have my ON clause, r.id = s.id, and then a WHERE clause, s.value > 100. I can represent this entire predicate, the combination of the join clause and the WHERE clause, as a tree. At the root I have the conjunction, the AND operator; on one side I have the equality predicate matching r.id and s.id; on the other side I have the greater-than operator for the attribute value,
sorry, for the value of the value attribute of the tuple, compared against the constant 100.

So just like before, in a query plan, we go top to bottom; we'll do the same thing here to determine whether a tuple matches. Let's look at a simpler example. Here we have a query, SELECT * FROM S, and our WHERE clause is WHERE b.value = ? + 1. I don't think we've talked about prepared statements yet, but a prepared statement is a way to declare a query template. You tell the database system ahead of time: hey, I'm going to execute this query over and over again, and here's a placeholder where you can fill in a value at runtime. Then you invoke the query almost like a function, and you pass in a value that gets substituted at runtime. We'll cover that, I think, maybe next week or after the midterm. But is it clear what's going on here? It's just a placeholder. Yes?

Yeah, but why the plus one? Why do that; what's the point of it? So I'll replace the question mark, yes, and then say, okay, add one. Why not just have the plus one be done on the client side? Sure, so that's a philosophical question. Think about it this way: say I always have to add one. Let's say I have a big application: I have a desktop version, I have a web-based version, and I have a phone application, and they all use this query. That means I have to make sure that all three different programs do the plus one. But if that's logic that should always happen, logic I want to hide away from the application, then I can write it in my query, and all of them get to use it. The plus one is a trivial example; think of other, more complex operations. This is just to show you what happens in the tree. Yes, we will cover that; we'll have a whole lecture on server-side program logic, server-side execution, after the midterm. And it is a trade-off, because now the logic of the application is split between the database system and the application code itself. This is what I mean by philosophical: some people say the database system should know everything, because then I can reuse that logic across all my different applications; other people say that's a bad idea, because now, as a programmer, I'm writing some crappy Python code and the database is doing a bunch of stuff I can't see, so it's better for me to put everything in my application code, where the programmer can see it. That's a long answer for why the plus one is in here, but trust me, there's some usefulness to it. Okay.

So the expression tree for this WHERE clause, b.value = ? + 1, would look like this. In order to invoke and evaluate this expression, we need some contextual information about what's going on in our query as we invoke it. We need to know the current tuple we're processing. Again, think of evaluate being called inside the for loop of one of our operators: for every single tuple we get, we want to evaluate this tree.
We also need to know the input parameters for this query, which is what the client passed in when invoking the prepared statement, to substitute for the question mark; in this case, it's 999. And we need some information about the schema of the tuple we're processing: here's the record, the first attribute is called id and it's an integer, the second is called value and it's also an integer. It's just more context about what the tuple actually looks like.

The way this works is that we start at the root, call evaluate, and go down in depth-first order to each of our leaf nodes, moving values up. So we start here and go down the left side, and this expression node says: retrieve the value attribute from the current tuple. That produces 1000, because the current tuple we're looking at has 1000 for its value attribute. Then I go back up and, continuing depth-first, oh, sorry, I come down this side, and this node says: give me the parameter at offset zero. I look at my context; at offset zero it's 999, so this node produces 999. I jump up here, go down the other side, and this node says: give me the constant value 1, which produces the 1 that gets shoved up here. Now I evaluate 999 + 1, which is 1000. I shove that up to my equality predicate, and it asks: does 1000 equal 1000? Yes, the result is true, so this tuple matches this particular predicate. Is that clear? And again, we're doing this for every single tuple we look at inside the for loop of one of our operators.

So: is this good or bad? Is this fast or slow? He points out that we're computing the plus one every single time, unnecessarily. Yes.
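Here is a minimal sketch of that per-tuple evaluation, with hypothetical node classes and the context (current tuple, parameters, schema) passed into evaluate:

```python
# Minimal sketch of an expression tree for "WHERE b.value = ? + 1",
# evaluated once per tuple with an execution context (current tuple,
# query parameters, schema). Node classes are hypothetical.

class Const:
    def __init__(self, v): self.v = v
    def evaluate(self, ctx): return self.v

class ColumnRef:
    def __init__(self, name): self.name = name
    def evaluate(self, ctx):
        return ctx["tuple"][ctx["schema"].index(self.name)]

class Param:
    def __init__(self, idx): self.idx = idx
    def evaluate(self, ctx): return ctx["params"][self.idx]

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def evaluate(self, ctx):
        l, r = self.left.evaluate(ctx), self.right.evaluate(ctx)
        return {"=": l == r, "+": l + r}[self.op]

# b.value = ? + 1
expr = BinOp("=", ColumnRef("value"), BinOp("+", Param(0), Const(1)))

ctx = {"tuple": (123, 1000),       # the current (id, value) tuple
       "params": (999,),           # the value bound to '?'
       "schema": ["id", "value"]}
print(expr.evaluate(ctx))          # 1000 == 999 + 1 -> True
```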
So again, we'll cover this when we do query optimization, but one obvious thing he pointed out is that this subtree is also effectively a constant: ? + 1 with 999 bound to it is always going to be the same, because the parameter was passed in with the query. So we could rewrite this subtree to just be the constant value 1000. But you're still walking down the tree, and that actually sucks.

What I'm describing here is what every single database system does the first time it's implemented: this is how they implement their expressions. But it's going to be slow, because if I have a billion tuples, I'm calling that evaluate function and traversing that expression tree a billion times. The high-end systems, any of the super-optimized systems, don't do this. What they do instead is, sorry, just-in-time compilation. Say I have a stupid predicate like WHERE 1 = 1. Now, I realize you can optimize this away, because it's always true, but assume you had a crappy system that always kept this tree and always had to traverse it. What you instead want to do is compile exactly the predicate you want to evaluate for a given tuple. Now I can express "constant 1 equals constant 1" as one instruction on the CPU, and that's way faster than traversing this tree, doing lookups to see what kind of expression node I'm at, figuring out what kind of output I need to copy back, and running giant switch statements asking, what operator am I evaluating, is it an equals clause, is it a less-than clause. If I can strip it down to exactly what the predicate does, that's going to be way faster.

Again, this is what the high-end systems do. Postgres 12 just added this, and it's quite limited at this point; MySQL doesn't do it, as far as I know. But the high-end systems and the better open-source systems can do this. And, spoiler, in the system we're building here at CMU, not only do we compile the predicates down into simple instructions, we compile the entire query plan into a pipeline of instructions. Then you no longer have indirection, you don't have jumps; it's one giant function that is exactly your query plan, almost as if someone wrote code that does exactly your query and then compiled it, and we can do this on the fly.
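A minimal sketch of both optimizations follows, reusing the hypothetical Const/ColumnRef/Param/BinOp classes and the expr variable from the previous sketch. Real systems emit machine code (for example via LLVM); the lambda here only illustrates the idea of replacing a tree walk with one flat function:

```python
# Minimal sketch of the two ideas above: (1) fold "? + 1" into the constant
# 1000 once the parameters are bound, and (2) "compile" the tree into one
# flat function so per-tuple work is a single comparison, not a tree walk.
# Reuses Const/ColumnRef/Param/BinOp/expr from the previous sketch.

def fold_constants(node, params):
    # Collapse any subtree that does not depend on the current tuple.
    if isinstance(node, BinOp):
        node.left = fold_constants(node.left, params)
        node.right = fold_constants(node.right, params)
        if isinstance(node.left, Const) and isinstance(node.right, Const):
            return Const(node.evaluate({"params": params}))
    if isinstance(node, Param):
        return Const(params[node.idx])
    return node

def compile_predicate(node, schema):
    # "col = <const>" becomes a closure with no tree traversal at all.
    assert isinstance(node, BinOp) and node.op == "="
    col = schema.index(node.left.name)
    const = node.right.v
    return lambda tup: tup[col] == const

folded = fold_constants(expr, params=(999,))  # right side is now Const(1000)
pred = compile_predicate(folded, ["id", "value"])
print(pred((123, 1000)))  # True, with one indexed compare per tuple
```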
Okay, so the main takeaway is that the same query plan can be executed in multiple different ways in different systems. It depends on the environment, whether it's a column store or a row store, and on the workload, whether it's an OLAP or an OLTP system. But as I said, the iterator model with the top-down approach is the most common one, and that's what you'll see the most in the wild. In most cases, the database is going to prefer an index scan over a sequential scan as much as possible, though there will be some cases where that doesn't actually work out. And expression trees are nice because, as humans, they make it easy to reason about and understand the correctness of the predicates we're evaluating, but in practice they're going to be super slow, so instead you want to compile them down.

All right, any questions about query execution so far? Hopefully you've seen that what we've done this semester is talk about all these different parts of the system, how to do the scans, how to build the indexes, how to have a buffer pool, and now we're starting to put all these things together and actually build a full system.

All right, so next class we will continue on query execution, but we're going to focus on how to actually execute queries in parallel. The distinction I'll make between distributed and parallel systems is that for parallel execution, we're talking about running a single query on a single database instance on a single box; distributed execution is running queries on multiple machines. But a lot of the ideas we'll talk about in that lecture will be applicable to the distributed environment as well. Okay. All right guys, enjoy the weather, and I'll see you on Wednesday. Yes?