I'm not feeling so well, but let's get through this. I just sent a post on Piazza: I'm going to cancel my office hours. If I pass out here while I'm lecturing, please don't steal my wallet, because someone did that last year and I did not appreciate it. Okay? All right.

Some administrative things. Project two, as everyone is aware, is already out, and the first checkpoint is next week. Someone posted on Piazza about making the checkpoint two tests available; we'll try to take care of that this week as well. We're also going to try to expose the project one test cases so you can go back and test yourselves. Tanya, are we doing that already or no? We'll take care of it. The midterm exam, believe it or not, it's already October 1st, so let's talk about the midterm. The midterm exam will be here in class two weeks from now, on October 17th. It'll be an hour-and-twenty-minute exam, and it'll be here. It'll cover everything up to the lecture before it; that Monday's lecture will not be covered on the Wednesday exam, just everything before then. Okay? And as I said, they just sent the email saying we have our final exam on Sunday morning at 8:30 a.m. It is what it is. Okay.

This week we have another visitor coming: SQream. They're another GPU database out of Israel. SQream, MapD, and Kinetica are sort of the three most well-funded players in that space. They're coming on Thursday; there'll be pizza in the CIC while they talk about their system. Okay? I'll post a reminder about this on Piazza.

All right. So as I said last class, we're now at the point where we want to start talking about actually executing queries. We're building up the layers in the system: we know how to store pages, we know how to build a buffer pool manager, we know how to build indexes on top of that. Now we're going to talk about how we actually run queries. That's sort of the whole point of a database system, right? I want to be able to put data in and then ask questions about it.

When we talked about relational algebra before, I said in the beginning that the way it works is that you give the system SQL and it converts the SQL into a query plan comprised of relational operators. Initially these relational operators are just logical equivalents of what we learned in relational algebra, like projections and filters and joins. At this point in the course we start talking about how to turn these logical operators into physical operators that actually execute the thing they represent. How are we actually going to execute a join? What algorithm are we going to use?

The way to think about a query plan is that it's just a tree, and the operators have edges that show you where the data is flowing inside the plan. So for this particular query, doing a select and joining A and B together, at the leaves of the tree we have the scan operators, the things that actually access the data, whether through an index or the heap of the table itself. They feed that data up into the next operator, which computes whatever it wants to do, and then passes its output on to the next one. What I'm showing and describing here is how it works at a logical level; what we'll see in today's class is that physically this is not always the way you're going to execute things, right?
You may not be passing a single tuple; you may be passing a block of tuples, or the entire output of an operator. At the very top is always the root node, and it implicitly has an arrow coming out of it: that's the result that goes back to whoever asked for the query. So data goes from here to here, here to here, and then this thing says, all right, whoever invoked this query, here's your result. That could be you at a terminal, it could be an internal query, it could be a stored procedure; it doesn't matter, it just goes out wherever it needs to go.

Today's class covers three important things. First we'll talk about how the database system processes the query plan, how it organizes the execution of the operators and moves data around. Then we'll talk about access methods, how the system actually gets to the data, whether it's a sequential scan or an index scan. And then we'll finish up talking about how you actually evaluate the predicates in your WHERE clause and other expressions. Again, we're just building one layer on top of another.

A database system's processing model specifies, essentially, how the system is architected or organized to execute a query. At a high level, the approach is either top-to-bottom or bottom-to-top: how are you actually going to invoke these operators? This is at the engineering level, the physical level of the system. The processing model also specifies what kind of data gets moved from one operator to the next. Is it a single tuple? A batch of tuples? A single column? Or all the data the operator is ever going to look at, shoved up to the next guy all at once? The reason there are different processing models is that, as we'll see today, there are advantages and disadvantages to each of them. Systems that focus on OLAP queries usually use the vectorized batch model. But if you're doing OLTP workloads, transactional workloads that don't have to touch a lot of data per operator, then the materialization model might be a better approach. Again, there are tradeoffs in how we engineer the system to execute queries, and one approach is better than another for different classes of workloads.

The jack-of-all-trades, the generic or basic approach to processing queries, is called the iterator model. This is another example where different people call the same thing different names. I think our textbook calls it the iterator model; it might also call it the volcano model. Volcano was a very influential academic system from the late 1980s / early 1990s that described how to do the iterator model in parallel. Sometimes it's also called the pipeline model. No matter what you call it, it's all essentially the same thing.

The basic idea is that each physical operator in a query plan is going to implement or expose in its API this next function. Every time you invoke next on an operator, it gives you back the next tuple that it has processed and wants you to process. Think of a for loop: you iterate through, saying next, next, next on an operator, and it keeps giving you tuples one by one.
At some point, when you run out of tuples, say you've scanned the entire table and reached the very end, you just return a null that says: you asked me for more, I don't have any more, so don't ask me again.

This is an example of what's called top-down or top-to-bottom processing, because we start at the root. The root says: I need to compute whatever operator I represent, but I need tuples to do that, and I have children in my query plan, so I go invoke next on my children and ask them to give me the next tuple they have. They might not have anything either, so they need to go down, traversing the tree, until you reach the very bottom and hit your access methods, the things that are actually scanning the data.

So let's look at a high-level example. Normally I don't like to show code in a lecture or talk, but for this I think it's the only way, because it conveys the idea pretty clearly. Again, this is the query we had before: we're joining A and B, and we have a simple filter where value is greater than 100. This would be the query plan the database system generates for that SQL. The way to think about it in the iterator model is that each of these operators is essentially a for loop. The for loop iterates over its children and calls next on them to get the next tuple it wants to process.

Starting at the root, this guy loops through. It has a single child, the join operator, so it calls next on that operator, because it wants to do the projection but it doesn't have any tuples yet; it depends on its child. That call goes down to the join operator here. Same thing: it has two children, left and right. So it calls next on its left child, which goes down here, and now we're at the scan on A. We just scan over A, and there's this emit function that works like a yield statement in Python's iterator protocol: you have a for loop, and you say, all right, someone's invoking me, here's the next thing you asked me for. So it emits that tuple, and it gets shoved up to the next guy above it.

The operator above, operator two doing the join, keeps doing this on its left side until that child comes back and says: I don't have anything. Then it switches to the for loop over its right child. Same thing: it invokes next on the right child, which then invokes next on its, sorry, its child, and now we're doing the scan on table B. So again, going from the top we call next, next, next going down until we reach the very bottom, and at some point we reach something that knows how to get tuples, either from an index or from a table, and it shoves them up to the next operator. The next operator can do whatever processing it wants to do and then maybe emit the tuple back up to the operator above it.
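To make the next/emit idea concrete, here is a minimal Python sketch of the iterator model. This is not the code of any real system; the tables, the plan shape, and the key functions are made up for illustration, and a real engine would read pages through the buffer pool rather than from Python lists.

```python
# Minimal sketch of the iterator / volcano model using Python generators.
# Each operator's "next" is a generator step; "emit" is the yield.

def seq_scan(table):
    for tuple_ in table:          # access method: produce tuples one at a time
        yield tuple_              # emit one tuple to whoever called next

def filter_op(child, predicate):
    for tuple_ in child:          # call next on the child operator
        if predicate(tuple_):
            yield tuple_          # emit only tuples that pass the predicate

def hash_join(left, right, left_key, right_key):
    # Pipeline breaker: we must see all of the left side to build the hash table.
    table = {}
    for l in left:
        table.setdefault(left_key(l), []).append(l)
    for r in right:               # the right side stays pipelined
        for l in table.get(right_key(r), []):
            yield l + r           # emit the joined tuple

def projection(child, cols):
    for tuple_ in child:
        yield tuple(tuple_[c] for c in cols)

# SELECT ... FROM A JOIN B ON A.id = B.id WHERE B.value > 100
A = [(1, 'x'), (2, 'y')]
B = [(1, 250), (2, 50)]
plan = projection(
    hash_join(seq_scan(A),
              filter_op(seq_scan(B), lambda t: t[1] > 100),
              left_key=lambda t: t[0], right_key=lambda t: t[0]),
    cols=[0, 3])
for row in plan:                  # the root pulls results out one by one
    print(row)                    # -> (1, 250)
```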
The key thing to point out here, and probably the reason it's called the pipeline approach, is that for this path in the query plan we have a pipeline, meaning we can take one tuple and pass it through multiple operators before we have to go back and get the next one. In this case, the loop over the right child calls next, we go down, that operator does an iteration over its child and calls next, and at that point, as we're scanning B, it emits a tuple. We can immediately evaluate the predicate, and if it passes we call emit and it gets passed up here. So there's no, context switching isn't quite the right word, but you get the basic idea: we're not switching to the next tuple; when we get a tuple here, we do whatever processing we need to do on it and then pass it up to the next guy.

This is why the approach is widely used in disk-based systems: if I have a tuple in memory, I want to do as much work as I can while it's in memory before I go back and get the next one, because I may not have enough memory to keep all my intermediate results around and I may have to spill to disk. The worst-case scenario would be: I need a tuple, I fetch it from disk, I emit it up, I process it, but before I pass it any further I go back and get the next one, and the page holding the tuples I just processed gets swapped out to disk. Now when I want to continue back up the plan I have to go out to disk again to get all the ones that passed; I had them in memory, but I swapped them out. So the pipeline approach is: do as much work on a single tuple for as long as you can while it's in memory.

As I said, this is the most widely used processing model in almost every database system. This is just a small smattering of the systems that are out there; it's pretty much everyone. These are the ones I can verify based on their documentation and what I know about how they work.

Again, the idea is to generate query plans that allow you to have long pipelines, keeping things in memory as much as possible. But at some point you hit what are called pipeline breakers, where you can't keep going up the query plan with the tuple you're processing; you have to go back and get more. These occur any time you need to see all the data on one side of the tree before you can move on. The most obvious one is a join: in this case I need to see all the tuples from A and build my hash table before I can push anything up. On the other side, once you find a join match you can push it up, but on this side of the tree you have to see everything before you go down the other side. So joins, subqueries, ORDER BY, these are places where you have to see everything before you can make a final decision about what you'll emit to your parent.

The other nice thing about this model is that it makes it really easy to handle LIMITs and control the size of your output: if I call next and I only want ten tuples, once I've gotten ten tuples I'm done; I can just stop at whatever point in the tree knows there's a limit, and I don't have to worry about processing more data than I actually need. We'll also see in a few weeks, when we talk about parallelization, that this approach is very easy to parallelize as well.
That's what the volcano model originally proposed: a way to have these next calls run in separate threads and then coalesce the results at some later point in the tree. But we'll worry about that later.

The next processing model is called the materialization model, and this is a bottom-up approach. Instead of having a next function that a parent calls on a child, you start at the bottom, and each operator does whatever computation it wants on the data it's reading. Only when it's done everything, when it has computed the entire answer it's going to generate, does it shove its data up to the next operator, and you never go back and invoke that operator again, because at that point it has done all the work it needs to do. We'll see this on the next slide.

The tricky thing is that, as I said before, in the iterator model it's really easy to do limits, because you just stop calling next once you've got all the data you need. Here, the limit may be in the upper part of the tree, and the lower operator may not know that: I'm about to scan a billion tuples, but I only really need ten. If you materialize the entire billion tuples and shove that up the tree, you may not find out until much later that there was a limit saying you only needed ten of them. The way to get around this is to embed logic in the lower operators, do a scan but only give me ten things, sort of inlining one operator inside another.

All right, so it's the same query we had before, and what's different now is that inside each operator implementation we have these little output buffers. We start at the very bottom with the scan on A: we scan through it, look at every single tuple, and put each one in our output buffer, and when we're done we shove the whole thing up to our parent. Then we come back down and do the same thing on B: scan through all of it, materialize the entire output into our output buffer, shove that into operator three, which does the predicate evaluation, shove that up into operator four, where we finally do the join, and so on all the way to the top.

I sort of already said this in the beginning, but just to make sure you understand what's going on: what kind of workload would this be bad for, OLTP or OLAP, transactional or analytical? What do the queries look like in transactional workloads? A transactional workload is like reading a single tuple: you log into Amazon, it's my Amazon account, you just look at that one tuple, or all my orders; if I put something in my cart, I'm adding one record to my cart. You're not touching a lot of data per query. In analytical workloads I want to do something like compute the average stock price of Microsoft over a ten-week window, so I'm scanning a lot of data and I have to move a lot of data from one operator to the next. So the materialization approach is better for transactional workloads: because they're not touching a lot of data, the size of the output buffer going from one operator to the next is going to be small. In the case of OLAP queries, unless you push as much logic as you can down into the lower parts of the plan, you're going to end up moving more data around than you may need to.
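As a contrast with the iterator sketch above, here is a minimal sketch of the same bottom-up materialization idea. Again, this is illustrative Python rather than any particular system's code: each operator runs to completion, fills an output buffer, and hands the whole buffer to its parent in one shot.

```python
# Minimal sketch of the materialization model: each operator consumes its
# children's entire output, materializes its result in an output buffer,
# and returns that buffer exactly once.

def seq_scan(table):
    return list(table)                       # materialize every tuple

def filter_op(child_output, predicate):
    return [t for t in child_output if predicate(t)]

def hash_join(left_output, right_output, left_key, right_key):
    table = {}
    for l in left_output:
        table.setdefault(left_key(l), []).append(l)
    return [l + r for r in right_output for l in table.get(right_key(r), [])]

# Same query as before, but evaluated strictly bottom-up.
A = [(1, 'x'), (2, 'y')]
B = [(1, 250), (2, 50)]
a_out = seq_scan(A)                                      # operator finishes completely...
b_out = filter_op(seq_scan(B), lambda t: t[1] > 100)     # ...then the next one runs...
result = hash_join(a_out, b_out,
                   left_key=lambda t: t[0], right_key=lambda t: t[0])
print(result)                                            # ...and only then the join
```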
VoltDB does this. VoltDB is the commercial version of an academic system I wrote called H-Store, and H-Store uses the materialization model. As far as I know from a year or two ago, the last time I looked, they still do it the same way we did it originally when we wrote the system at the university, because this is the right way to do it in an OLTP environment: when everything's in memory, calling next actually becomes expensive, and you want to avoid that as much as possible. MonetDB is an academic OLAP system, but they still use the materialization model; I haven't looked at the documentation to see why. And HYRISE was another academic OLAP system that I think was doing the materialization model, but then they threw away all the code and started rewriting it, and I haven't looked to see what they're doing now. So you see this tradeoff between how much data you move around and how much overhead you have from calling next, going top-down versus bottom-up.

The last model we'll talk about is the vectorization model, which is sort of a hybrid of the two. Like the iterator model, we still have a next function, so we call next, next, next going down the tree, but instead of passing a single tuple in our emit function, we pass a vector of tuples; think of it as a mini-batch as we go along. Then the internal loop knows how to process that mini-batch in an efficient manner using vectorization, SIMD, things like that.

So let's see what this looks like. It's sort of a combination of the materialization model and the iterator model: I now have my output buffers, but I check whether an output buffer has reached a certain size, which usually depends on what the hardware looks like, and then I emit it up to whoever needs it. Implicitly, when I call emit on my output buffer I'm clearing it out, so the next time I come back around the loop I start with a fresh batch. And it goes down the tree like this.

This approach is ideal for OLAP queries. It seems obvious, but systems didn't really start doing it until much later, in the late 2000s; VectorWise was sort of the first one. This is the approach we were originally using in Peloton; we're still working on the new system, so I'm not sure what we're doing there. Presto from Facebook does this, and then SQL Server, Oracle, and IBM all have these column-based or vectorized accelerators for in-memory data sets, and they all use the vectorized approach. Okay, yes, question? This one here, for disk-based systems, is probably the best; for in-memory systems we'll cover another approach next semester, but this is the most common one. Okay.
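Here is a minimal sketch of the vectorized idea, again just illustrative Python rather than any system's actual code: operators still expose next, but each call emits a batch of tuples, and the batch size shown here is an arbitrary assumption rather than a tuned value.

```python
# Minimal sketch of the vectorized model: operators still pull from their
# children, but each step emits a batch (vector) of tuples instead of one.

BATCH_SIZE = 1024   # in practice tuned to the hardware (e.g. cache sizes)

def seq_scan(table):
    batch = []
    for tuple_ in table:
        batch.append(tuple_)
        if len(batch) == BATCH_SIZE:   # output buffer reached its target size
            yield batch                # emit the batch and start a fresh one
            batch = []
    if batch:
        yield batch                    # emit whatever is left at the end

def filter_op(child, predicate):
    for batch in child:
        # The inner loop works on a whole batch at a time; a real engine would
        # use SIMD here instead of a Python list comprehension.
        out = [t for t in batch if predicate(t)]
        if out:
            yield out

# SELECT * FROM B WHERE value > 100, processed one vector at a time.
B = [(i, i * 10) for i in range(5000)]
for batch in filter_op(seq_scan(B), lambda t: t[1] > 100):
    pass  # each batch would flow to the parent operator here
```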
Again, this is just a summary of the different models we talked about. Iterator / volcano is basically the general-purpose approach; it's top-down, and people use it for OLTP and OLAP together. Vectorized is also a top-down approach, and it's good for OLAP queries. And then materialization is a bottom-up approach. Okay.

So as I said before, in our query plan the nodes at the bottom, the operators at the bottom, are access methods. These are the ways the system actually accesses the data stored in a table. There's no equivalent in the relational algebra for an access method; it's just something you have to do in order to get data out of a table or out of an index. At a high level there are basically two types of access methods: you can have a sequential scan and you can have an index scan, and then there are ways to combine multiple indexes, which we'll talk about in a second; those are sometimes called multi-index access methods. Different systems have a bunch of different names for access methods, like heap scan or rowid scan, but to my knowledge these all boil down to the same thing: it's either one or the other. So again, we're talking about these operators down here at the bottom.

The sequential scan is pretty straightforward to understand: you iterate over every single page in a table, retrieve it from the buffer pool, and then iterate over the tuples in that page to see whether you should include them in your output. It doesn't matter whether you're using the materialization processing model or the volcano model; it's the same approach. Internally, the database system maintains a cursor that tracks the last page and slot it examined, so that if you're doing this in an iterative manner you know where to pick up where you left off.

All right, so, yeah, I'm not feeling so good here. Let's keep going; I may have to cut it off, but let's get through this real quick. Sequential scans in general are the worst way to execute queries, right? It's the slowest thing: you have to look at every single tuple on every single page. But there are some optimizations we can do to speed this up. We've already talked about prefetching; we'll talk about parallelization later, and we'll talk about buffer pool bypass. But I want to cover these three things here: zone maps, late materialization, and heap clustering.

So, zone maps. Zone maps are basically pre-computed aggregations that we maintain for the attributes in a single page, and the idea is that when the database system wants to execute a query against a page, it checks the zone map first to see whether it should bother accessing that page at all. So here we have the original data, this column with a bunch of values, and in our zone map we pre-compute some standard aggregations like the min, max, and average. When our query comes along, in this case looking for values greater than 600, we can look in the zone map and say: well, we know the max value in this page is 400, so there's no value greater than 600, and we don't need to look at it.

Yeah, guys, I am not feeling good; I think I'm going to pass out, so we're going to stop. Okay, sorry.

Let's just say that I haven't been that sick since I ate at Taco Bell when I was in high school. I've been sleeping the last 48 hours, but now I'm fine; that's how the stomach flu is, it comes and goes, right? And I like how I profess how much I love the flu shot, and yet it doesn't protect you from terrible diarrhea. We're going to pick up where we left off last time, and we'll go a little quickly so that we can get into what we should have talked about today. So again, reminders: project two is due on Monday, the midterm exam will be on the 17th, I'll send an update about that and there will be a review session for it, and then tomorrow the SQream guys are coming to give a talk in the CIC.

So where we left off, before I almost passed out, was that we were talking about processing models. We said these are the different ways we can organize or architect the system to read tuples and process the query. The volcano model, or the iterator model, is the most common approach, and most database systems use it:
you have this next function, and you grab single tuples at a time from your access methods at the bottom. Then we said one alternative is the vectorized approach, which looks like the volcano model, but instead of getting a single tuple you get a batch or vector of tuples, and this is better for OLAP queries because you can do vectorized operations on batches of tuples. And the other approach is the materialization model, where instead of going from the top down you start at the bottom and go up, and each operator shoves its entire output, all the tuples it processed, up to its parent operator; you don't move on to the next operator until the one below it finishes. Right? This is all a bit of a blur, so we'll just go over the access methods again from the very beginning.

The access methods are how we're actually going to access the data in our database; they're the bottom part of the query plan. There's no equivalent operator for an access method in the relational algebra, because the relational algebra was just an abstraction over how we would execute queries. Now we need to talk about how we actually implement these access methods, how we actually get the data we need from the database in order to pass it up into our query plan and do additional operations on it.

At a high level there are really only two approaches: you either do an index scan or you do a sequential scan. Those are really the only two ways to get at data. There are things like materialized views, but those are usually stored as just tables, and if you use multiple indexes it's still an index scan. So at a high level there are essentially two classes of approaches, but we'll also look at what happens when you have multiple indexes, what Postgres calls a bitmap scan, because this is kind of confusing: when you run EXPLAIN and look at the query plan and Postgres says, oh, I'm doing a bitmap scan, you may not know exactly what that is. It's essentially just an index scan with one of the optimizations we can apply.

As we've said many times throughout the semester, the sequential scan is the fallback option for the database system. If we don't have an index, if we don't have any fancy thing we can do to derive the answer, we can always just scan the underlying tables. It's essentially just nested for loops: for every single page in our table, we iterate over every single tuple in that page and do whatever it is that we need to do; we can compute an aggregation, apply a predicate, do filtering, we're just going over it one by one.

The way this is typically implemented, especially in a volcano or iterator model, is that the database system maintains an internal cursor for your query that keeps track of where it left off in the scan. If you're familiar with writing iterators in Python, you have that yield keyword, which is just syntactic sugar: underneath the covers Python is maintaining its own cursor so that when you ask for the next thing it knows where to jump back to. We do this on a per-query basis. In the materialization model we don't need to do this, because we just read everything and shove it up, but if you're doing the vectorized or volcano model, when you call next and emit a tuple back up, the next time you call next you need to know where you left off, and in the case of a sequential scan that cursor is just a page ID and slot offset.
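Here is a minimal sketch of what that cursor looks like; the buffer-pool interface (fetch_page, page.tuples) is hypothetical and just stands in for whatever the system's real page API is.

```python
# Minimal sketch of a sequential scan that keeps a cursor (page index, slot)
# so that each call to next() resumes where the previous one left off.

class SeqScan:
    def __init__(self, buffer_pool, table_page_ids):
        self.buffer_pool = buffer_pool   # hypothetical buffer pool object
        self.page_ids = table_page_ids
        self.page_idx = 0                # cursor: which page we are on
        self.slot = 0                    # cursor: which slot within that page

    def next(self):
        while self.page_idx < len(self.page_ids):
            page = self.buffer_pool.fetch_page(self.page_ids[self.page_idx])
            if self.slot < len(page.tuples):
                tuple_ = page.tuples[self.slot]
                self.slot += 1           # advance the cursor before returning
                return tuple_
            self.page_idx += 1           # exhausted this page, move to the next one
            self.slot = 0
        return None                      # no more tuples: tell the caller to stop
```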
There's not really any magical way to make sequential scans run faster, because again, we're bound by how fast we can read pages from disk; if we need to read a thousand pages, we have to read a thousand pages. But there are some optimizations we can do to try to eliminate or reduce the number of pages we have to read, to minimize the amount of I/O.

We've already talked about prefetching: knowing that I'm going to scan a bunch of pages, so let me go ahead and fetch a bunch of them in order and bring them into my buffer pool so that when I actually need them they're there for me. We'll talk about parallel sequential scans in a few more lectures when we cover parallel query execution. And we also talked about buffer pool bypass, in Informix and Postgres, where instead of polluting your buffer pool with the pages you're reading during the sequential scan, you maintain a little side buffer that only your query uses; that way you don't get sequential flooding and you're not blowing away all the locality information you have in your cache.

So now I want to talk about three other optimizations, things we can apply to potentially reduce the amount of work we have to do during a sequential scan.

Zone maps are a way to pre-compute aggregations on individual pages and then use those pre-computed aggregations to decide which pages you can skip during a sequential scan. The way to think about this is: I have a bunch of pages, and a bunch of columns in those pages, and I'm going to pre-compute the standard aggregation functions from the SQL standard, count, min, max, sum, average, things like that. We store this zone map in a separate page, different from the original data page, and the zone map itself is pretty small, so we can store the zone maps for multiple data pages in a single page. Now if a sequential scan operator comes along with something like SELECT * FROM table WHERE value > 600, without a zone map we'd have to fetch the page and scan linearly inside it to see whether any values match. With a zone map, we can instead examine the predicate and say: it's looking for values greater than 600, and in my zone map I know the max value for this column on this page is 400, so there's never going to be a tuple that matches my predicate, and therefore I don't need to bother looking at this page; I just skip it. And again, the idea is that the data pages will be much larger than the zone map pages, so a single page fetch to get a zone map may cover dozens of data pages, and I can use that to figure out which ones I actually need to go look at.

This idea is not new; it's been around for a while. Oracle calls them zone maps, and for better or worse, in the database community when you say zone map we just mean the Oracle brand name for it, but it's not specific to Oracle; it's used in a bunch of different systems. IBM has it in DB2 BLU, Cloudera uses it in Impala, Netezza has it, Vertica has it, and it's in the Parquet file format used by the cloud systems we'll talk about later in the semester. So this approach is used all over the place. MemSQL has it too; sometimes these are called pre-computed aggregates, but if you say zone maps people mean essentially the same thing.
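To make the page-skipping logic concrete, here is a minimal sketch; the zone map contents and the fetch_page callback are illustrative placeholders, not any system's real catalog format.

```python
# Minimal sketch of using zone maps to skip pages during a sequential scan.

zone_maps = {
    # page_id -> pre-computed aggregates for the "val" column on that page
    1: {"min": 100, "max": 400, "count": 50, "sum": 12000},
    2: {"min": 550, "max": 900, "count": 50, "sum": 36000},
}

def scan_with_zone_maps(page_ids, lower_bound, fetch_page):
    """SELECT * FROM tbl WHERE val > lower_bound, skipping pages whose
    zone map proves they cannot contain a match."""
    results = []
    for pid in page_ids:
        zm = zone_maps.get(pid)
        if zm is not None and zm["max"] <= lower_bound:
            continue                     # no tuple on this page can qualify
        for tuple_ in fetch_page(pid):   # only now do we pay for the page I/O
            if tuple_["val"] > lower_bound:
                results.append(tuple_)
    return results
```

With the predicate val > 600 from the example, page 1 (max 400) is skipped without ever being fetched, while page 2 (max 900) still has to be read and filtered.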
The next optimization is called late materialization, for column store systems. Remember, in a column store you can think of each column or attribute of a table as being stored in pages by itself, and as I'm processing a query I only fetch the pages for the columns I actually need. Late materialization is a logical extension of this: as I go up the tree of my query plan, I only fetch the pages I need for that particular operator; I don't try to stitch the entire tuple back together.

Say we have a query like this: select the average of c from foo joined with bar, with a join predicate, and then a WHERE clause where a is greater than 100. Say our query plan looks like this, and we focus on the right side of the tree, the scan on foo. As we do the sequential scan on foo and apply the predicate on a, the only thing we need at that point to evaluate the predicate is a; the only pages we need to read are for attribute a. And when we're done scanning, we can actually throw away any values of a that we've brought into memory and pass up only offsets, because we know, since SQL is declarative and we know the whole query plan, that there's no other point higher up in the tree where we'll ever need to look at a again. So we don't even need to pass it around; we throw it away and pass up offsets. Same thing when we go up and do the join: we only need to look at b, or just the subset of b that matches our offsets, and then we pass more offsets up to the aggregation. At that point we have to materialize, because we can't pass offsets back to the user; that's an internal representation that means nothing to the outside world. So at that point we materialize the tuple, meaning for those offsets we go grab the data we actually need.

This is useful when the offsets are much smaller than the actual tuple values. Say c was a varchar(100); that would be up to 100 characters I'd have to pass around from one operator to the next, but with late materialization I just pass around a 64-bit integer per tuple, and only when I have to give the result back to the user do I go fetch the pages I need for c. Again, you can do this in a decomposition storage model, a column store, because the columns are broken up; you couldn't easily do it in a row store, because when you go fetch a tuple you're fetching all of its attributes anyway.
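Here is a minimal sketch of that offsets-instead-of-values idea; the columns are just Python lists indexed by tuple offset, and the names follow the AVG(foo.c) / foo.a > 100 example but are otherwise made up.

```python
# Minimal sketch of late materialization in a column store.

foo_a = [50, 150, 300, 90]        # column a of foo, stored in its own pages
foo_c = [1.0, 2.0, 3.0, 4.0]      # column c of foo, fetched only at the end

# Scan on foo: evaluate the predicate on column a, then throw the values away
# and pass up only the offsets of the qualifying tuples.
offsets = [i for i, a in enumerate(foo_a) if a > 100]   # -> [1, 2]

# ... the join and other operators would keep passing these offsets around ...

# Final materialization: only now do we touch column c, for just those offsets.
avg_c = sum(foo_c[i] for i in offsets) / len(offsets)
print(avg_c)                      # 2.5
```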
The last optimization is called heap clustering. We've sort of already talked about this before, when we talked about index-organized tables, but the idea is that we have an index that tells us how we should store tuples in our pages. Say we have an index like this, with a sort direction along the leaf nodes going from left to right in ascending order; the pointers from the index to the tuples inside our pages will then be sorted in that order. That means in some cases, instead of fetching pages one by one, I can maybe jump straight to one location and just fetch those table pages, because I know they're going to be in the order I want.

As I said, we talked about this before: some database systems, like MySQL for example, don't actually have separate table pages; they store the tuples themselves inside the leaf pages of the index, so you essentially get a clustering index for free. In other systems, like SQL Server and Oracle, you can tell the system that you want an index to be a clustering index, and it will maintain the tuples sorted in the pages like this. In systems like Postgres you can declare that you want a clustering index; they have a CLUSTER command and it will do this sorting for you, but it doesn't actually maintain it, so once you update the table after running CLUSTER, there's no guarantee the tuples are still in sorted order. This will make more sense when we talk about how they do concurrency control and multi-versioning, but in general Postgres isn't maintaining it because that would be extra work; for something like MySQL, because the tuples are stored in the leaf pages themselves, you get it for free. Again, this lets you do things like jump ahead to where you think the tuple you need will be, instead of having to look at more pages.

The other access method is the index scan, and we talked about this before when we discussed how, in the case of a covering index, we can process the entire query using only what's stored in the index. The index scan basically lets us figure out which pages we have to read to process the query, avoiding, again, a full sequential scan. The question of which index to pick depends on a lot of different things: what attributes are in the index and which indexes you have available, what attributes your query actually references in the WHERE clause or other parts of the query, the value domain of the index (if I have an index on whether a student's gender is male or female, that's not going to be that useful, because it's going to be roughly 50-50), what the WHERE clause looks like, whether you have conjunctions or disjunctions, ANDs or ORs, and whether the keys are unique or non-unique. Basically, you write your SELECT statement in SQL, and it's the database system's job to figure out which index it potentially wants to use for an index scan. That particular question we'll focus on later, in lecture 17 after the midterm, when we talk about query optimization; it's a pretty hard problem and there's a lot of literature on it. For our purposes we'll look at it at a high level, and then we'll see the different design or implementation decisions you can make for executing a query with an index scan.

Let's look at a simple example. Say we have a single table of students with a hundred tuples, and we have two indexes: one on the student's age and one on the student's department. We're going to run this query: select star from students where age is less than 30 and department equals CS and country equals US. We don't have an index on country, just department and age. In the first scenario, there are 99 people in the table under the age of 30, which is usually accurate for universities, but there are only two people in the CS department. So which of these two indexes would we actually want to pick? We have to pick one. We have an index on age, but 99 people out of 100 are under the age of 30, and we're looking for students under the age of 30, so is that index going to be
really helpful there? No, because it's going to match 99% of them anyway, right? But there are only two people in the CS department out of 100, so if we use the index on department we can jump exactly to the two tuples we want. The idea here is that we want to pick the index that is the most selective, the one that filters out the most things as soon as possible, as low as possible in the query plan, because we want to minimize the amount of data we have to transfer from one operator to the next. That's what late materialization was trying to do too: minimize how much data you move around. So again: filter things out as early as possible by picking the most selective index. If you reverse the scenario, say there are 99 people in the CS department but only two people under the age of 30, then we'd want to pick the index on age, because it finds the two people we want right away.

There's nothing really special about the index scan itself: you know what the predicate in your WHERE clause is, you know what attributes are in your index, you follow the tree or do a hash lookup, you find the things you're looking for, and you pass them up to the next operator. It's pretty straightforward. Where things get interesting is when you use multiple indexes together. These are called multi-index scans; Postgres, again, calls them bitmap scans. The idea is that the database system recognizes: for my particular query on this table, I have two indexes that both look pretty good, and instead of picking one versus the other I actually want to use both of them, combine their result sets, and use that as the result set for my operator.

So we take all the indexes we want to use and compute, for each one, the set of record IDs that match the predicate we're evaluating on that index. Then we combine the record ID sets from the different indexes, using either a union or an intersection: intersection if it's an AND, union if it's an OR. Then, for all the record IDs in the combined set, we go do the lookup to find the original tuples in their pages, and we pass them up to the next operator in the tree.

Going back to our example, again we have the two indexes on age and department. First we retrieve all the record IDs in the age index that match age less than 30, then we retrieve all the record IDs that match department equals CS in the second index, take the intersection of the two, because it's an AND in our predicate, and then fetch those tuples. Then we check whether the country is US. Visually it looks like this: we do a lookup on age less than 30 in the first index and get the record IDs that match, do a lookup on the department index and get that set of record IDs, and then we take the intersection of the two; those are the ones we know match both age less than 30 and department equals CS. You can implement this with a bitmap, which is what Postgres does, or you can use a Bloom filter, hash sets, hash tables; it doesn't really matter, as long as you have set membership and a way to combine the sets together.
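Here is a minimal sketch of that combine step using plain Python sets; the index contents are made up for illustration, and a real system would use a bitmap or similar structure rather than Python sets.

```python
# Minimal sketch of a multi-index ("bitmap") scan: evaluate each predicate
# against its own index, collect matching record ids as sets, and combine
# them with intersection because the predicates are AND-ed together.

age_index = {25: [1, 4, 7], 29: [2, 9], 45: [3]}          # age  -> record ids
dept_index = {"CS": [4, 9], "History": [1, 2, 3, 7]}      # dept -> record ids

def index_lookup_lt(index, bound):
    return {rid for key, rids in index.items() if key < bound for rid in rids}

def index_lookup_eq(index, key):
    return set(index.get(key, []))

# WHERE age < 30 AND dept = 'CS' AND country = 'US'
matching_rids = index_lookup_lt(age_index, 30) & index_lookup_eq(dept_index, "CS")

# Fetch only those tuples, then apply the remaining predicate (country = 'US')
# that no index covers.
print(sorted(matching_rids))   # [4, 9]
```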
The one potential problem we're going to have, and we'll see this come up again when we talk about sorting, is that if an index is not ordered on the key we want to look up, we may end up with a really inefficient index scan, and it may actually be better to just do a straight sequential scan. So again, here's our index; the scan direction goes this way based on some key, and now if we follow along the leaf nodes and start reading the tuples by following the pointers down to our pages, we'll see that they go all over the place, because it's an unclustered index: the index just points to where the tuples are, it says nothing about how they're physically sorted in the actual table pages. So if we follow along the leaf nodes and scan that way, what ends up happening is that we basically thrash our buffer pool, because we fetch a page in and then immediately throw it away to go fetch another page.

A really simple example: say I only have space for one page in my buffer pool. If I do my scan over all these pages following the leaf node order, then all these boxes represent every single time I had to go fetch a page, because I'm not being mindful of the locality of where the data is actually stored. I'm just blindly saying, oh, I want page 102, let me go get it; now I need page 103, so I have to evict 102 and bring in 103. I'm ping-ponging pages in and out, and that makes things go really slowly.

A simple optimization is to do the index scan first without actually fetching any tuples, just figure out which pages I'm going to need in order to access the records I want, and then sort the record IDs by page ID. Now I can say: I'll fetch page 101 first, do all the reads I need from it, then fetch page 102 and do all the reads for that, and I never go back to page 101. So what was maybe over ten page reads, ten page I/Os, before we sorted, ends up being four, which is the bare minimum we actually need to execute this scan. Even though you have an index, it doesn't mean everything is going to be super fast for you; you want to be a little smarter about how you schedule your disk I/O operations. And again, this is another example of the beauty of the relational model: we don't guarantee that the output comes back in any particular order unless you tell us what you want with an ORDER BY. If we just took your query and executed it almost exactly as you expressed it, we might end up in this slow situation; instead we can be a little smarter and say, all right, if I sort things ahead of time, I'll basically have sequential access and good locality, because for every page I read in, I'm reading as much data as I can out of it before I move on to the next one. This is a standard technique that a lot of systems use.
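Here is a minimal sketch of that "sort before you fetch" optimization; the record IDs are hypothetical (page_id, slot) pairs and fetch_page stands in for the buffer pool call.

```python
# Minimal sketch: gather the record ids from the index scan first, group and
# sort them by page id, then fetch each page exactly once.

from collections import defaultdict

def fetch_sorted(record_ids, fetch_page):
    # Group record ids by page so every page is read from disk only once.
    by_page = defaultdict(list)
    for page_id, slot in record_ids:
        by_page[page_id].append(slot)

    results = []
    for page_id in sorted(by_page):       # one ordered pass over the pages
        page = fetch_page(page_id)        # single I/O per page
        for slot in by_page[page_id]:
            results.append(page[slot])
    return results

# Following the leaf nodes directly might visit pages in the order
# 102, 101, 104, 101, 103, 102, ...; after sorting we read 101-104 once each.
```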
All right, so the last thing I want to talk about is how we actually evaluate predicates. We have these WHERE clauses, we have these expressions; how are they actually represented, and how do we implement them? The way to think about it, as we talked about before, is that the query plan itself is a tree, and inside each operator in our tree we can have another tree, called the expression tree, that represents the WHERE clause or whatever expression that particular operator has to evaluate. The nodes in the tree represent all the different types of expressions you can have: comparisons, arithmetic expressions, conjunctions, disjunctions, things like that. Essentially you just evaluate them, and they return true or false or whatever type of value they produce, and you use that to figure out whether your tuple matched the predicate, or what output you want for that tuple in the query.

Take our simple query from before: in our WHERE clause we have a.id = b.id and b.val greater than 100. We'd represent that in a tree like this: at the root we have the conjunction, which returns true or false, and then on one side we have the equals comparison and on the other side the greater-than comparison. At a high level, this is how every database system represents these expressions.

Now let's see how you actually evaluate them. Say we have a query like this, and it's a prepared statement, so the WHERE clause has a placeholder for a value that gets filled in at runtime. As we're doing our sequential scan or index scan, for every single tuple that we're going to evaluate this expression on, we maintain some execution context about what we're examining at that point in time. If I have a thousand tuples, then for every single tuple I build this execution context and traverse the tree to see whether its value is true or not. At the top here you see we have the current tuple that we're looking at in our scan, the query parameters that get passed in at runtime to fill in that placeholder, and the schema for the tuple we're evaluating, because when the expression tree says it wants to look at attribute val, we need to know where that is in the actual tuple.

Then we basically just walk through the tree. We start at the root, the equals sign, and go down to the left: it's an attribute reference, so we know it refers to the current tuple, and the attribute we want is val. We do a lookup in the table schema to find what offset in the tuple corresponds to the name val, then we look at the current tuple and get the value we want, a thousand. We traverse down the other side: we have the parameter expression that tells us we want 999, because we want the first entry in our query parameters; then we go to the constant, which is just one, so that outputs one. Everything gets fed back up the tree until the root produces true or false, and if it's true, we know the tuple we're examining satisfied our predicate and we can produce it in our intermediate output. All right, pretty straightforward.

What I'll say is that, as I said, this is pretty much how everyone does it, but it's actually a slow way to do it. The current trend, which we'll cover in the advanced class, is to use code generation or just-in-time compilation: instead of traversing this tree, you generate a program, compile it, and run exactly that in machine code. Instead of walking a tree to figure out whether something equals 999 plus 1, I just generate machine code that does the exact same thing, and it's way faster. I think Postgres just added this in Postgres 11, which comes out later this year, and a lot of the major commercial systems do this.
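To make the tree-walking idea concrete, here is a minimal sketch of an expression tree for a predicate like "B.val = ? + 1" being evaluated against an execution context; the class names and context fields are illustrative, not any real system's API.

```python
# Minimal sketch of expression-tree evaluation against an execution context
# (current tuple, query parameters, tuple schema).

class Comparison:                       # e.g. the "=" node at the root
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, ctx):
        return self.left.evaluate(ctx) == self.right.evaluate(ctx)

class Arithmetic:                       # e.g. the "+" node
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, ctx):
        return self.left.evaluate(ctx) + self.right.evaluate(ctx)

class ColumnRef:                        # "B.val": look up the offset via the schema
    def __init__(self, name):
        self.name = name
    def evaluate(self, ctx):
        return ctx["tuple"][ctx["schema"][self.name]]

class Parameter:                        # "?": filled in from the query parameters
    def __init__(self, index):
        self.index = index
    def evaluate(self, ctx):
        return ctx["params"][self.index]

class Constant:
    def __init__(self, value):
        self.value = value
    def evaluate(self, ctx):
        return self.value

# WHERE B.val = ? + 1, executed with parameter 999 against one tuple.
expr = Comparison(ColumnRef("val"), Arithmetic(Parameter(0), Constant(1)))
ctx = {"tuple": (42, 1000), "schema": {"id": 0, "val": 1}, "params": [999]}
print(expr.evaluate(ctx))               # True: 1000 == 999 + 1
```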
All right, so the main takeaway from all of this is that the same query plan can be executed in very different ways, and one approach versus another will have different advantages or disadvantages depending on the kind of workload you're trying to deal with. Most systems will try to use an index scan as much as possible, because it's almost always going to be faster. There are some systems, like Vertica for example, that have no indexes at all, won't even let you declare them, but they pre-sort everything, which is essentially the same thing as an index. And the expression trees, again, show you how these expressions are going to get evaluated: the way I'm showing you here is super flexible, any kind of expression you'd want can be represented in these trees, but it's going to be slow if you traverse it manually. So, any questions about this?