Last class, on hash tables, remember I showed those hash functions and we talked about that sawtooth pattern you saw in CityHash and FarmHash from Google, and we said the reason we were seeing that behavior was SIMD instructions. Well, it turns out I was incorrect, and the way I found out was the most unlikely possible place to get corrected and have someone actually say something insightful: YouTube. If you've ever seen YouTube comments, they're notoriously trashy or just very puerile. So we knew we'd get comments like these. These are people commenting on the lecture videos we posted on YouTube: he's awful, I get headaches, just non-stop diatribes against me and my lectures. But somebody actually posted a good one. This person points out that the reason we're seeing that sawtooth pattern in the CityHash function is that they're maintaining a buffer, normally 64 bytes, and when you use as much of that buffer as possible you get the best performance. So they're not using SIMD at all, because they want these hash functions to run on a variety of devices that may not have a full-fledged Intel Xeon heavyweight processor. If you're running on something like a cell phone with an ARM processor, the newer ones have SIMD, the older ones do not. So again, for our purposes in this course we don't care so much how the hash functions are actually implemented, but to be correct: it's not actually using SIMD. And we're not really going to discuss SIMD in this class; that will be covered in the advanced class in the spring, and if you're taking 418 or 618, I think they cover it pretty heavily as well. So, the overview of where we're at in the course: again, we're going from the bottom to the top, and now we're in this middle part where we're going to talk about how the database system executes queries, and how those queries access our data, our tables and our indexes. So we're sort of in between operator execution and access methods. Last week we talked about hash tables and indexes, and before that about database storage, heaps, and things like that: how the database system can organize tuples and indexes so that something in the system can access them efficiently. Now, in this class, we're going to talk about that something. There's a whole bunch of other operators we'll cover in future lectures, how to perform joins, how to perform aggregations, and so on, but in today's class we're really going to talk about how the operators in our query plan access our underlying data structures, our access methods, to retrieve data and prepare it to be passed along to other operators in the query plan. So today's class is sort of both of these, and then starting on Wednesday we'll cover the operator execution after you've accessed the data from the table. The first thing we have to understand, though, is what a query plan looks like in a relational database management system. When I talked about relational algebra before, I talked about how the system takes a SQL query, runs it through a parser, generates an abstract syntax tree, and then converts that syntax tree into relational operators.
And you end up with a tree data structure like this. We take our SELECT query, and then we have a bunch of access methods that do filters, or do a join, then do projections. The way to think about this is that logically the data flows from the leaf nodes, the bottom nodes in the query plan, up into the operators, and the final result of the query is whatever the root of the tree spits out. Typically these operators are binary, meaning they can have a left and a right child, or just one child. Some systems can handle more than two children, but for our purposes we'll stick with two. So our discussion in this class is really about how we're actually going to take this query plan and execute it. And as we'll see when I start talking about the processing models, it seems like the way you'd want to implement this is to start at the bottom and push things up, but there are different tradeoffs and different implications of whether you go from the top down or the bottom up. At this point this is just a logical query plan. We haven't specified anything about how we're actually going to perform these operators to execute them. I'm not saying what hash table or what join algorithm I'm going to use; we're not dealing with things at that level yet. So in today's class we need to discuss three things. The first thing we'll discuss is how the database system processes that query plan. Then we'll talk about how it accesses the data it needs in the query plan using its access methods: getting it from the table, getting it from the index. And then we'll finish up with how the database system takes the predicates in our SQL query and evaluates them on tuples. That is, how do you take the WHERE clause and, as you're reading the data, apply the predicate to the tuple you're looking at to see whether it's true or false and whether you should include it in the output. And in the remaining time I'll quickly go over what's expected of you in project two, okay? All right, so the first thing to talk about is the database system's processing model. The processing model essentially specifies how the database takes that relational query plan and executes it. As we go along, we'll see there are different tradeoffs for different classes of applications or workloads, where one processing model may end up being better than another. And there's this whole other discussion that we're not going to have now, but will have later, about parallelization, distributed queries, and distributed databases. For our purposes, we're going to assume the query plan runs in a single thread and just focus on that. But these different processing models have different ways you'd actually implement them if you had multiple threads running the same query, and some are better than others, right? So as we go along, we'll discuss the advantages and disadvantages of each. At a high level there are essentially three approaches: the iterator model, the materialization model, and the vectorized or batch model. There's sort of a fourth model that's a hybrid of all these, called the push-based model, but that mostly appears in in-memory databases doing query compilation, which, again, we'll push off until the advanced class.
So let's go through each of these one by one. The most common processing model, used in pretty much every single database system, is called the iterator model. One confusing thing about databases is that different names can be given to the same concept: sometimes you'll see this called the volcano model, because there was a famous paper in the early 1990s on a system called Volcano that laid out what we're describing here, although the idea existed before then. It also sometimes appears in other textbooks or in the literature as the pipeline model. They're all essentially the same thing. The way to think about the iterator model is that every operator in the system that your query plan can invoke has to implement a standard API with a next function, or next method. Every single time you call next on an operator, the operator either returns the next tuple it needs to produce or emit, or it returns a null marker that says: I don't have any more tuples for you, and therefore you should not invoke this operator anymore. Internally, when you call next, there's just a for loop inside the operator implementation that scans over whatever data it's trying to access, and for every tuple it finds, it returns that for the invocation of next. That means you have to maintain some metadata, like a cursor, inside the operator so that when you call next again, you pick up where you left off and don't return duplicate tuples. For operators that have children, when you call next on the parent operator, it then calls next on the child operators, and those children do whatever they need to do to produce a tuple. So this is really easy for humans to understand, and it actually makes it very easy to parallelize, which we'll talk about later. But let's go through an example so you can understand this better. Say I have a simple join query here, and this is that same relational query plan we saw before. Normally I don't like to show code in my lectures or in any presentation I give, but I think it's kind of unavoidable here, and hopefully this makes it more obvious. For each of these relational operators, there's an implementation of its next method, right? And you'll see they have this emit function, which is basically like a yield in Python. We're saying: for this current iteration, here's the element you want, here's the next tuple. The way this gets executed is that the database system starts at the root and invokes its next method. Internally, this projection has a for loop that calls next on its child; it only has one child. And for every tuple the child produces, it can spit that out as the output of the query. So this thing iterates through and calls next, which then calls this function inside of our join. Now this one's a bit different, because in order to do the join, you actually have to get all the data from your children, right? So it first has a for loop where it calls next on its left child. In this case, that's the access method, the scan operator.
It just keeps looping over this and keeps emitting every tuple it has in table A, and the join operator takes that and builds the hash table that's used for hash joins, which we'll talk about next week. Then, when it has everything on the left side, it goes to the right side and does the same thing: it calls next on the filter operator, which calls next on the scan operator, right? And then you're just taking tuples one at a time and pushing them up the query plan. So this is pretty straightforward to understand, right? It's a nice programming paradigm, a nice API. It makes it really easy to plop in different operators: as long as they implement the next API, you don't care how they're actually generating their answers. You chain together these different operators in different ways to compute any possible SQL query you could want, right? As I said, this is the most widely used processing model; pretty much every database system you can think of is likely using it. These are the ones I can actually confirm from their documentation. Another aspect that can make this really efficient is that it supports what's called pipelining. Meaning, I can take the output of a child operator that I got by calling next, do some processing on it, pass it up to the next guy, and then to the next guy after that. So you can have this pipeline of operators that take the same tuple and do as much work with it as you possibly can before you go back and get the next tuple, right? And this is efficient if you think about it from the perspective of a disk-based database system: I can take one tuple, fetch the page it's in into memory, and do as much with it as I can before I go back and fetch the next page for the next tuple. Because otherwise you can get this ping-pong effect where you fetch a page to get one tuple, then immediately go fetch another page to get the next tuple, and later on your buffer pool gets full and the pages you brought in earlier get swapped out to disk, so when you get to the next operator, you have to fetch those pages again, right? This is also why it's sometimes called the pipeline model. That pipelining optimization, though, can't actually be done for all the operators in your query plan. I showed in the previous example that for the join, I have to get all the tuples I need from the left side, and then and only then can I compute the join. Or at least you have to do that for the left side; you don't have to do it for the right side. You get the same sort of blocking when you do subqueries, or if you have an ORDER BY: you don't know how to sort things unless you have all the elements you want to sort. So there are some operators that aren't going to be able to do that pipelining; they're pipeline breakers. One nice aspect of the iterator model is that it's really easy to compose complex queries by just chaining together these different operators. But there's also this natural flow control for operators where you want to limit the amount of output, right?
In the case where you have a LIMIT clause, if you attach that to a scan operator, the limit keeps track of how many tuples it has so far. If I only need a hundred and I have a hundred tuples now, I stop calling next, because I have everything I need. So you have this built-in mechanism for flow control or output control, just based on how many times you call next, okay? I would describe the iterator model as the jack of all trades: a general-purpose scheme that's good enough for OLTP queries, good enough for OLAP queries, good enough for parallel queries and embedded database systems, right? It's what pretty much everyone does, and it's good enough for most things. Now I want to talk about the materialization model and the vectorized model. These are optimizations, schemes that are actually better for one particular class of workload, whereas the iterator model is more general purpose. In the materialization model, instead of going from the top down as you do in the iterator model, you go from the bottom up. Every time you invoke an operator, it does whatever computation it needs to do on all of its input. It assumes it has all the input it's going to need, all at once. It does that processing, ships the complete output it generates up to its parent, and then you never invoke that operator again, right? It's called materialization because you materialize all the output for that particular operator. This is tricky for some things, because now we don't have that natural flow control for a LIMIT clause, where you can stop calling next so you don't produce more than you need. The operator processes all of its data at once, and it may not know anything about limits, since the LIMIT clause would be further up in the query plan. So to make this work without scanning more data than you actually need, you push down hints into the scan operators and the lower-level things at the bottom of the query plan, so they know when to stop or how to filter things out early on. Let's go back to our example. Now you see that in every single operator, I have an output buffer that my operator fills, and when the operator finishes, that buffer gets shipped up. Here at the bottom of my query plan, I have the scans on A and B. I invoke this guy first and it ships its output buffer up to its parent. Then I invoke this operator, which does the same thing and ships it up to the next guy, who does the filtering and ships it up from there. So now, at this point, the join has everything it needs: it has the right side, it has the left side. The join fills out its output buffer and shoves it up to the parent. Right, and implicit in this code is a return call that sends the output up the tree. So take a guess: what particular workload would this be good for? What would it be bad for? Take a guess, yes.
Right, so he says you want to be careful that your output can fit in memory, because if, say, you're scanning the entire table, you're basically making another copy of it and shoving it to the next operator. So where could this be bad? Say that in this case you can recognize that there's no filter, you're just scanning the table, so maybe you can just pass a pointer to the table up to the next guy, right? But in this other case, you have to do some filtering, so you have to materialize the tuples, because otherwise the operator above might be joining things that shouldn't be there, right? So what kind of database would this be bad for? OLAP, right. If you have a really large database, say I have a billion tuples and my filter only filters out half of them, then I have 500 million tuples that I need to materialize in an output buffer and shove to the next operator. So this is typically bad for OLAP queries, because the intermediate results can be really large. For OLTP, this is fine, because you're usually only grabbing a single tuple, right? For this particular query you'd actually maybe want to flip the join order, but we'll talk about that later. In an OLTP query you're typically only going to access maybe a couple dozen tuples, so that output buffer is small. And then this is actually better than the iterator model, because I'm not calling next, next, next over and over again, right? I just run my operator, it generates its answer, and I shove it to the next guy. This is actually how we implemented the query processing model in H-Store, which was the academic predecessor of VoltDB, and last time I checked, they still do it the way we did it back in the early days. The advantage you get is lower execution and coordination overhead, because you don't need to maintain an internal cursor for every single operator to figure out where you left off the last time next was called. You just take the data, crunch on it, and spit it up to the next guy. Now, I say it's more difficult to parallelize, and what I mean by that is, well, maybe it's the same as the iterator model, but you basically have to figure out ahead of time how to partition your table in such a way that one thread reads one piece, another thread reads another piece, and then you have to combine them together at some point in your query plan. So when I say more difficult to parallelize, it doesn't mean impossible; it's just some more things you have to deal with. What's interesting is that MonetDB is probably the other most famous system using the materialization model, and MonetDB is actually an OLAP system. I'd need to look at their literature again; they explain why, but I forget the reason. They had a bunch of papers that came out later that basically did parallelization to break things up, so that although each operator materializes its entire output, each operator is only processing a portion of the data. And you have to be very aggressive about pushing down predicates and other things to limit the size of the output buffer. So the last model we'll talk about is the vectorization model, sometimes called the batch model.
It's essentially like the iterator model, where you have this next function that you keep calling as you go down the tree. But the difference is that when you call next, instead of getting back a single tuple, you get back a batch of tuples. This seems obvious, but as far as I know, no system actually implemented it until the mid-2000s. So again, in that internal loop inside your operator that iterates over all the tuples, instead of immediately returning the result to the next operator after processing one tuple, the operator maintains a buffer and fills it up, and when that buffer is large enough, it finally emits it to the parent, right? The tricky thing is that the size of the batch is hard to get right: it depends on what the hardware looks like, what your data looks like, and what the query is actually trying to do. There's been some work in this area on how to tune these batch sizes properly. So back to our example. Now it basically looks like the materialization model, where I'm filling up a buffer, but now I'm checking whether the buffer has reached my batch size, and if it has, I shove it up. And implicit here is that when you shove it up, you then clear the buffer out, right? So again, we start from the top and go down: it calls next, gets the output buffer from its children, then gets the output buffer from here, and so forth on the other side. So, can anyone take a guess what this would be good for? What kind of workload or queries would this be ideal for? So he says there'd be no improvement for operators such as joins. His statement is that for the join operator, I have to wait for all the results anyway, so it's not that big of a win. Well, except that you're not going to call next as many times, right? If I have a million tuples and my batch size is 100,000, I only call next ten times. So what kind of queries would this be good for? Correct, analytical queries, right? Because these types of queries are often doing large scans over the entire table. And that means that down in the low-level access methods, the operators actually accessing the data, you can now use SIMD and other optimizations to speed those things up, right? The big thing is that, again, you reduce the number of next calls. This is not so much a big deal for a disk-based database system, because going out to disk is always the most expensive thing; doing a lookup in the virtual function table, or following a function pointer to execute the predicate or invoke next, is not that big of a deal. But if you have really large databases and a lot of it can fit in memory, those things start to add up. Again, we won't cover SIMD in this class, but if you're processing a batch of tuples, you can vectorize a bunch of the comparisons and filters you want to apply to that data, because you operate on batches. Now, most of the systems that actually use SIMD in their query plans only really do it in the low-level scan operators, because that's where you get most of the benefit. It's hard to do SIMD in a join algorithm, and it actually doesn't help you that much, because the cache misses always kill you.
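Before we look at which systems use which model, here's a minimal, runnable sketch of the idea, in Python, with toy Scan and Filter operators that I'm making up for illustration; this is not any real system's API. The nice thing is that one next() interface can express all three models just by varying the batch size: a batch size of one is the iterator model, a batch as big as the table is the materialization model, and anything in between is the vectorized model.

```python
# Toy operators to illustrate the three processing models.
# Not a real system's API; just a sketch of the next() protocol.

class Scan:
    """Leaf operator: reads tuples out of an in-memory 'table'."""
    def __init__(self, table, batch_size):
        self.table = table
        self.batch_size = batch_size
        self.cursor = 0  # internal cursor so next() resumes where it left off

    def next(self):
        if self.cursor >= len(self.table):
            return None  # null marker: no more tuples
        batch = self.table[self.cursor:self.cursor + self.batch_size]
        self.cursor += self.batch_size
        return batch

class Filter:
    """Applies a predicate to each tuple its child produces."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            batch = self.child.next()
            if batch is None:
                return None          # propagate the null marker upward
            out = [t for t in batch if self.predicate(t)]
            if out:                  # keep pulling until we have output
                return out

table = [(i, i * 100) for i in range(10)]

# batch_size=1  -> iterator model (one tuple per next() call)
# batch_size=10 -> materialization model (everything at once)
# batch_size=4  -> vectorized model (tuple batches)
plan = Filter(Scan(table, batch_size=4), lambda t: t[1] > 500)
while (batch := plan.next()) is not None:
    print(batch)
```

The takeaway from the sketch is just the protocol: the root pulls, children produce, and the unit of exchange (tuple, batch, or full result) is what distinguishes the models.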
VectorWise was, as far as I know, the first system that defined or implemented the vectorized model. We actually use this in Peloton, the system we're building here, and it's also used in Facebook's Presto. And then Microsoft, Oracle, and IBM all have specialized in-memory column-store engines that follow the vectorized model, whereas their original engines use the iterator model. You get better speedup in these engines because, again, you're processing batches of tuples and you can apply SIMD predicates to them, okay? So, to summarize everything I said: we have three query processing models, iterator, materialization, and vectorized. The volcano, or iterator, model and the vectorized model go top down; materialization goes bottom up. The iterator model sends up a single tuple, the materialization model sends up the entire result set for that operator, and the vectorized one produces a tuple batch. And the iterator model is general purpose, materialization is for OLTP, and vectorized is for OLAP. Yes? So the question is: are these query processing models mutually exclusive, or could you have a system that implements all of them, materialization for OLTP and vectorization for OLAP? There are some hybrid systems; I think HyRise can do materialization and vectorization. In the examples I gave before, SQL Server has an engine called Apollo, Oracle has this thing called the In-Memory Column Store, and IBM has DB2 BLU. These are extra engines you can plug into your database system that target these classes of workloads. The upper-level parts of the system are all still the same, but internally these different engines implement the different models. I think it would actually be difficult to build a single database engine that could do all of these, because you'd basically have to implement all your operators twice, right? There's no technical reason you couldn't do that, but from a software engineering standpoint it would be quite hard. So typically you have these different engines that you can plug in, that the upper-level parts of the system can route queries to, and they apply the different query processing models based on what they're targeted for. In the case of Peloton, the vectorization model is essentially the same thing, in some ways, as the iterator model: if you make your batch size one, it's identical, right? The original version of Peloton does the vectorization model. The newer one does this query compilation stuff, where you're pushing things up from the bottom; that's advanced, we don't need to discuss it here, okay? That's a good question. Anything else? All right, so now we can talk about what these bottom operators actually do. In all my examples, I talked about how there's a for loop at the bottom of the query plan inside these operators, iterating over something, the table, and emitting tuples. Now we need to talk about how we actually implement these things. In a database system, these operators are usually called access paths or access methods.
This is how the database system accesses the underlying data of a table. And there's no corresponding operation in relational algebra for these access methods, right? When you define a relational algebra query, you just say: here's my relation, and then you apply a filter or whatever else you want to it. Now we're talking about how the database system decides what algorithm it's going to use and what physical data structure it wants to target to get data out and shove it up to the other operators in our query plan. There are three basic approaches we'll talk about. Essentially there are really only two: the sequential scan and the index scan. I'm including the multi-index scan because it's slightly different from how you do a regular single index scan. If you look at, say, Oracle's documentation, they have all these different access methods or access paths, heap scan, table access by row ID scan, but at the end of the day they're really going to be one of these two, right? Either scan every single page and look at every single tuple inside it, maybe with a nice way to jump around so you only look at the pages you actually need, or scan the index. So at a high level there are only really two we care about, but I'll break it up into three. The sequential scan, as I said, is basically what you do when you have no index you can use to speed up your query. The operator jumps to the very first page of the table, fetches it, scans every single tuple one by one, and when it's done, goes to the next page, right? And depending on whether you're doing the iterator model or the materialization model, you may read one tuple and send it up to the next operator, or keep reading all the tuples in that page. As I said, the database system maintains an internal cursor inside the operator that keeps track of the last page and slot it examined as it does this processing. You can essentially implement a sequential scan as a nested for loop: for a given table, iterate over all its pages, and for every single page, iterate over all its tuples; then you evaluate some predicate and do something with the tuple if it satisfies it. So the sequential scan is, how do I say this, the fallback method for the database system to execute a query. There's no index, or it can't figure out anything better to do. It's the dumbest thing to do, but it's guaranteed to always produce the correct answer. The problem is that it's really, really slow, because if every single page of my table is out on disk, I have to go fetch each page, and I'm blocking every single time I need to fetch something. There are some optimizations we can do to speed this up, but in general the sequential scan is almost always the worst case. Some joins can be pretty crappy too, but this is usually bad. A lot of times when you look on Stack Overflow or other message boards and see people saying my database is slow, how do I speed it up, what people often ask is: how many sequential scans are you doing, and can you avoid them by adding indexes?
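To make that nested loop concrete, here's a rough, runnable sketch in Python; the Page and Table classes are toys I made up for illustration, not a real buffer pool API.

```python
# A toy paged table and a sequential scan over it.
# In a real system the pages live on disk and come through the buffer pool.

class Page:
    def __init__(self, tuples):
        self.tuples = tuples

class Table:
    def __init__(self, pages):
        self.pages = pages

def sequential_scan(table, predicate):
    # The (page_id, slot) pair is the cursor the operator tracks internally
    # so that a next() call can resume exactly where the last one stopped.
    for page_id, page in enumerate(table.pages):    # every page...
        for slot, tup in enumerate(page.tuples):    # ...every tuple in it
            if predicate(tup):
                yield (page_id, slot), tup

table = Table([Page([1, 42, 7]), Page([99, 3]), Page([500, 8])])
for rid, tup in sequential_scan(table, lambda t: t > 10):
    print(rid, tup)   # -> (0, 1) 42, (1, 0) 99, (2, 0) 500
```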
Remember from when we talked about buffer pool replacement policies: when you're sequentially scanning something, if the system's not careful, you're basically trashing your buffer pool, right? You're throwing out pages that may be needed very soon in favor of pages you're going to read once and throw away. We talked about some optimizations to speed this up before. We talked about prefetching, where the database system can recognize that you're doing a sequential scan, so when you read the first page and it's returned to your thread to process, the system fetches the next couple of pages in the background. That way, while you're doing your processing, that data is being brought in, and when you come back, it's already there for you. We'll also talk about how to do parallel sequential scans in a few weeks. But I want to talk about three additional optimizations you can apply to speed this up. Again, at the end of the day, the sequential scan is always going to be super slow; these tricks are not going to make your sequential scan match the speed of an index scan, but they help. So we'll talk about these three. A zone map: the way to think about a zone map is that it's a pre-computed aggregate for the data within a page, stored in a separate page inside the database system. Zone map is the Oracle term. Other systems may call them other things, for trademark reasons, but in general when people talk about the concept I'm describing here, they typically use the term zone map. So say you have your original data: say we have a table with a single column and five tuples with these five different values. What the system can do is pre-compute five different aggregates over all the data in that page, right? And say this was actually a thousand tuples: I'd still only have those five entries in my zone map, because they're an aggregation of all the tuples in that page. Now my query comes along, say something like this, where I want to find all the values greater than 600. If I don't have a zone map, I essentially have to do a sequential scan: bring the page in and look at every single value. But with a zone map, I can just look and say: my predicate is checking for values greater than 600, but the max value of all the data in this page for this particular attribute is 400, and 400 is less than 600, so there's no tuple I could ever find inside this page that would satisfy my query. So I skip this page entirely, right? And because these zone maps are much smaller than the original data, you can try to keep them in memory, or you can pack a lot of them into a single page, so when you fetch one page of zone map data, you bring in zone maps for a lot of different pages. This is sort of like a materialized view, which we haven't talked about yet; we'll talk about that next week.
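Here's a minimal sketch of that skipping logic, with made-up in-memory pages standing in for real disk pages; a real zone map would store more aggregates than just min and max.

```python
# Toy zone maps: per-page min/max lets the scan skip pages that
# cannot possibly contain a matching value.

pages = [
    [100, 200, 300, 400],   # zone map: min=100, max=400
    [700, 800, 650, 900],   # zone map: min=650, max=900
]
zone_maps = [{"min": min(p), "max": max(p)} for p in pages]

def scan_gt(pages, zone_maps, threshold):
    for page, zm in zip(pages, zone_maps):
        if zm["max"] <= threshold:
            continue              # skip: no value here can exceed threshold
        for val in page:          # only now pay the cost of reading the page
            if val > threshold:
                yield val

print(list(scan_gt(pages, zone_maps, 600)))  # first page is skipped entirely
```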
But again, the database system will automatically keep the zone map data in sync with the original data. So if I insert a tuple into this page with a value of 600, the database system knows it has to go to my zone map and update it to reflect the new data. The zone map cannot have any false positives or false negatives, right? Some systems allow that, but that's not what we're talking about here; let's assume the zone map is always strongly consistent, always synchronized with the real data. This is in a bunch of systems, as I said. Oracle does this; Netezza, a specialized system that got bought by IBM, does this; DB2 does this; Impala does this inside Parquet files in the Hadoop file system; and Vertica does this as well. All of these are OLAP systems. This doesn't really help OLTP, because OLTP workloads usually have indexes to go find the single thing you need; this really only helps sequential scans. The next optimization is what we'll call buffer pool bypass. The way to think about this is that, as you implemented in your first project and as we talked about in the lectures, going into the buffer pool to get a page is actually kind of an expensive operation, right? You take a latch on the page table, you update some pin information, maybe update some reference counters, update the LRU chain. All of that adds up if you're scanning a lot of pages. With buffer pool bypass, you still fetch the page from disk, you can't avoid that, but rather than putting it into the buffer pool with all the other pages in your database, you store it in memory that's local to your thread, do whatever processing you need on it, and when you're done with it, just throw it away. This is obviously tricky, because there's extra work you have to do to make sure nobody's trying to modify the same page at the same time you're reading it. But if your workload is read-only or read-mostly, this can actually work out. It works really well for sequential scans where you need to read a bunch of data that you know is contiguous on disk, right? You have these 4K pages all lined up on the same platter in the same order, so you can do a single sequential read on disk and fetch them all into your memory. As far as I know, the only system that actually supports this feature is Informix, which calls it light scans. Informix is a system that came out of the 1980s and got bought by IBM; it's one of these legacy enterprise systems, and it's very expensive. All right, the last thing is what's called heap clustering. Before, when I talked about table heaps, I said the database system stores tuples in unsorted order in the heap pages. And I alluded a little, when we talked about indexes, to how systems like MySQL and SQLite can actually pack the tuples into the leaf node pages of your B plus tree, which keeps them in sorted order. This technique has a name: clustering indexes. Anytime you see people talk about clustering in the context of database storage, don't think of a distributed cluster; think of what I'm talking about here, where you're clustering the tuples in the heap pages in the order specified by an index.
All right, so now, when you want to do a scan and you have some key as a starting point, you don't have to start from the beginning and go in sequential order. You can do a binary search, or jump to some offset where you know your starting value is. The way to think about this is: if you scan across all the leaf nodes here, as you scan across and look at every single record ID pointing to some page down below, you're going to get them in the order specified by the index. So this can speed up queries when you're doing a lookup on the thing the data is sorted by. As I said before, some systems do this automatically for the primary key, and if you don't specify a primary key, they synthetically define one for you, like in SQLite, where it's the row ID, which they then use to figure out how to cluster your tuples. This will come up later on, but some systems don't support this, or they support it only halfway. With a true clustering index, any time I insert, update, or delete a tuple, the database system does work to make sure those tuples are always maintained in sorted order. Postgres can't do that automatic maintenance. They have a CLUSTER operation you can invoke that takes all your table pages and sorts them all at once based on some key or some attributes, but it doesn't then maintain them in that sorted order. So you can sort them once, the operation finishes, but if you come back and update things or delete things or change things, it doesn't maintain that sort order. They do this because of how they implement their concurrency control scheme, which we'll talk about in a few weeks, right? All the commercial systems can do clustered indexes, and I think I mentioned last time how in Oracle and SQL Server you can specify that you want some tables to use clustering indexes and others not; it's an option you can turn on in some systems. All right, so related to this is how to do an index scan. The basic idea of an index scan is that the database system picks some index, one or more, does a lookup to find a starting location in that index, and then scans in sorted order to find all the tuples it needs. For our discussion, we'll assume we're dealing with a B plus tree, but nothing I'm talking about couldn't also apply to a radix tree or a skip list. Hash indexes can't do this, because the keys aren't stored in sorted order. The first thing the database system has to do for an index scan is, obviously, pick what index to scan. To do this, it has to look at a bunch of different things all at the same time: what attributes the index is built on; what attributes the query references in its WHERE clause; what the values in those attributes look like, is it just two values, male and female, or a billion different things; what the predicate is actually doing, greater than, less than, equals; and lastly, whether the index uses unique or non-unique keys. All of these things are what the database system uses to figure out the best index for my query.
What's the best set of indexes to use for my query? We're going to punt on this entirely for now, because it falls under the domain of what's called query optimization or query planning. This is where, again, the database system takes your SQL query and has to generate an efficient query plan, and as part of that has to figure out what index to use. We'll talk about this a few weeks from now; for now, we'll just assume the database system knows how to pick indexes. To show you how this can affect you, say we have a really simple database: a students table with 100 students and two indexes, one on age and one on department. My query wants to find all the students under the age of 30 that are in the CS department and live in the US. Assume we want to pick one index to do our index scan. Basically, we'd pick that index, extract the constant value we're doing our lookup on from our WHERE clause, traverse the index to land at some leaf node, and then, based on whether we're doing a less-than or an equals, either grab the single tuple or scan along the leaf nodes to get all the tuples we need. So the first scenario: say that of the 100 people in your students table, 99 of them are under the age of 30, but there are only two people in the CS department. The database system looks at this and says: I have these two indexes, which one should I choose? Anybody want to take a guess? What's that? Right, department, because you only have two people. And it's a non-unique index, so you land exactly at the data you need. But say it was the reverse of that: 99 people in the CS department, but only two people under the age of 30. Then you'd want to pick the other index. Now, this is a bad example in some ways, because depending on the number of departments you have, the fact that there are only two people in the CS department might make that a bad choice of index to build in the first place, right? But the database system looks at each scenario, looks at what the data looks like and what the query looks like, and chooses one of them to scan. Yes? Say that again, I'm sorry. So the question is: say my data really looks like the first scenario, and then over time it changes and looks like the second one. Does how the database system decides to index things change? The answer is no. All database systems make you tell them how you want to index things, right? If I just define my table and don't specify anything as unique, or don't specify a primary key, the database system is not going to make any indexes for me. That's sort of what a DBA's job is: they have to come in and say, my data looks like this, my queries look like this, I want to have this index or that index. And as the data evolves over time, the assumptions you made for the first scenario may no longer hold for the second, but the database system is not going to know how to modify itself. Another way to think about this: there are really two problems. What indexes do I want to have?
And then, how do I actually want to store those indexes: a radix tree, a B plus tree, a hash table? In some cases, having only two people in the CS department might mean a hash table is better for that. In theory, the database system could change the index data structure on the fly as the workload or the database changes; as far as I know, no system actually does that. So, without trying to toot my own horn, this is essentially what we're building now at CMU: a database system that can look at your database, look at your workload, and make these choices for you. Right now, humans have to do this. There are a bunch of tools the commercial database vendors provide where you feed in a bunch of query traces and your database, and it spits out: hey, I think you should have these indexes. But that's a manual process, and a human has to make the final value judgment: do I want to add this index or not? Oracle announced their autonomous database system yesterday. As far as I know, they're not picking indexes yet; their idea of autonomous means it patches itself for security upgrades automatically. Which is actually a big deal, so that's actually pretty cool. I won't knock Larry; he might be watching these videos. Okay, so the world is not as black and white as I just showed here, right? In the previous example, it was obvious what index to use in either scenario, but it's usually not cleanly divided like that. So some database systems support multiple index scans, where you probe the different indexes, coalesce the results, and then do the final filtering you need to produce the answer. The way this basically works is that you compute a set of record IDs for each index and then combine those sets together, right? If it's a conjunction, an AND, you take the intersection; if it's a disjunction, an OR, you take the union. Once you have the set of record IDs that satisfy your predicates across all those indexes, you go retrieve the actual underlying tuples and do whatever final filtering you have to do. So this is sort of a variation of the index scan, a combination of the index scan and the sequential scan, because you may have to go back and scan pages after you probe all the indexes. Postgres calls this a bitmap scan, because they generate bitmaps for each index and combine them together, but you could implement this with other data structures. And all the commercial systems, I think, can do this. So let's look at our example here: we have an index on age and an index on department. First we go to the index on age and compute its set of record IDs, then go get the record IDs for department equals computer science. Then we take the intersection of the two, because it's an AND clause: any bit set in the first record set that's also set in the other record set stays set, and we know those are the tuples that satisfy both predicates. We go fetch them, and then we apply our last predicate, on the country. Because we don't have an index on that, we still have to look at the underlying data. So visually it looks like this, right?
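Here's a minimal sketch of that record-ID set logic, using plain Python sets where Postgres would use bitmaps; the record IDs and tuples are made up for illustration.

```python
# Multi-index ("bitmap") scan sketch: combine per-index record-ID sets,
# then fetch only the surviving tuples and apply the unindexed predicate.

# Hypothetical results of probing each index
rids_age_lt_30 = {1, 2, 5, 7, 9}    # from the index on age
rids_dept_cs   = {2, 3, 7, 8}       # from the index on department

# AND clause -> intersection; an OR clause would be a union instead
candidates = rids_age_lt_30 & rids_dept_cs   # {2, 7}

# Fetch only those tuples, then apply the remaining country predicate,
# which has no index and must be checked against the underlying data.
heap = {2: ("alice", 25, "CS", "US"), 7: ("bob", 28, "CS", "DE")}
result = [t for rid, t in heap.items()
          if rid in candidates and t[3] == "US"]
print(result)   # [('alice', 25, 'CS', 'US')]
```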
So as I said, in Postgres they implement the record sets as bitmaps, but you could also do hash tables, or a Bloom filter, right? Bloom filters can give you false positives but never false negatives, and that's okay here, because you re-check the tuples when you fetch them. So we do our scan on age and get a set of record IDs, do a scan on department and get another set of record IDs, and then it's this middle part here where we take the intersection: we know those tuples satisfy both. And then we do a scan on the pages to go find those particular tuples. Okay, so the last thing to talk about for index scans: if you're doing an index scan on an unclustered index, meaning the tuples are not sorted in the order specified by that index, then you have this problem where, if you scan along the leaf nodes and look at the tuples one by one as you find them, that might actually be the worst possible thing you could do when you need to retrieve data from disk. So again, say our scan direction is going this way. As my cursor bops along the leaf nodes and looks at each record, I do my lookup to say: what page is this tuple in? Go fetch the page, find the tuple at the offset, then go to the next tuple, which might be in a different page. So as I do my scan like this, you can see I'm doing lookups on all these different pages in essentially random order. When you map out which pages you're accessing to do the scan, these blocks represent every single time I had to go fetch a page. Now, if you have enough memory, maybe this isn't a big deal, because the working set for this query also fits in memory. But if you have a really large database and you're scanning a really large index, you can have this problem where you do a separate fetch every single time you hit a new page. The way to solve this is to scan the index first, figure out all the record IDs you're going to need to access, sort them by page ID, and then do the accesses together, so you combine all the lookups for a given page. That way, where before I had all these random page fetches, now I only have to fetch four, right? Yes? What is a clustering index? As I said before, a clustering index is when you have an index that specifies the sort order of the tuples in the database pages, the table pages. When we talked about the heap before, I said it was unordered. With a clustering index, the database system enforces the sort order of the tuples across all the pages. Think of it like one giant file broken up into pages, where the tuples in those pages are kept in the order specified by the index. So if you have a clustering index, you don't have this problem, because as you scan the leaf nodes, the tuples you scan across will always be in either the current page or the next page, and when you go to the next page, you never go back to a previous one. So I do my one fetch, get the page I need, find all the tuples I need in it, throw it away, and go on to the next one. Whereas if you don't have a clustering index, the page lookups are random, which is really inefficient.
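Here's a tiny sketch of that sort-the-record-IDs trick, with made-up (page ID, slot) pairs; the exact counts differ from the slide, but the effect is the same.

```python
# Unclustered-index trick: collect record IDs first, sort by page ID,
# then fetch each page exactly once.

# Record IDs from the index scan, as (page_id, slot), in index key order
rids = [(3, 0), (0, 2), (3, 1), (1, 4), (0, 0), (1, 1)]

# Naive order would fetch pages 3, 0, 3, 1, 0, 1: six fetches for
# three distinct pages. Sorting groups the lookups per page:
fetch_order = sorted(rids)            # tuples sort by page_id first
pages_fetched = []
for page_id, slot in fetch_order:
    if not pages_fetched or pages_fetched[-1] != page_id:
        pages_fetched.append(page_id)  # fetch_page(page_id) would go here
print(pages_fetched)                   # [0, 1, 3]: one fetch per page
```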
So you want to find all your record IDs ahead of time, sort them, then scan them and shove them out. Yes? So the question is: for this particular example, the query has a WHERE clause that contains the attribute the index is built on? Correct, yes. We'll sort of get to this next, but basically, think of it like this: if your WHERE clause is, say, where age is greater than 30, you just find the first record with age equal to 30, or maybe one less than that, and as you scan across from there, you're guaranteed that every person you find will be greater than 30. Now say you have something like where age is greater than 30 and less than 50: the database system knows it has this less-than-50 bound, and that's its stopping point, right? And this is what I was saying before about pushing down hints in some of these processing models, because if you have age greater than 30 with LIMIT 20, you don't want to scan across all the leaves; you want to stop after you've found 20 things, right? Okay, so now we want to talk about how we actually evaluate these expressions, these predicates. In the same way that a query plan is represented as a tree data structure, our predicates are represented in what's called an expression tree. Say we have a really simple query like this: the WHERE clause will be represented with a data structure that more or less looks like this. You have different nodes in your expression tree representing the different things you're trying to do in your predicate: comparisons, equals, greater than, less than, all those things; conjunctions, ANDs and ORs; arithmetic operators; constant values; and tuple attribute references. Any expression you can have in your WHERE clause, and I'm focusing on the WHERE clause because it's the easiest to understand, but the same idea applies to HAVINGs, projections, and other things, can always be represented as a tree like this. I'm going to avoid subqueries for now; they complicate this slightly. But in general, this is how every database system initially represents your expressions. So now the question is: how do we actually evaluate this? Say you run a simple query like this one, where the WHERE clause is: b.val equals some dollar-sign placeholder, a parameter for an input value the application gives you, plus one. We represent it as a tree like this. And just like in the iterator model, you start from the top and go down. When you traverse this expression to evaluate it, at every node you figure out what the node is trying to do, look at your left and right children, traverse down to them, evaluate them, and push things up to your parent as needed. To do an evaluation, you typically have some context information about the running query that you apply the predicate with. You store things like: what's the current tuple I'm examining (whether you're doing a sequential scan or an index scan, it's the same); what are the input parameters for this query invocation that fill in my placeholders; and what's the table schema, meaning how do I interpret any tuple that's referenced, how do I interpret the attributes stored inside it?
So in this case, I start with the equals sign. I go to my left child and see that it's a reference to an attribute in the tuple, the attribute val. I look it up in my table schema and see that val is the second attribute for this tuple, so I know to jump to the second location in the current tuple I'm processing, and I get my value, 1,000. Then I go back up to the equals sign, go down the right side, and do the same thing. I hit the plus sign, go down to its left, and it's a reference to an input parameter. I look in the parameter array that's passed in my context, and it spits out 99. Go to the right side: it's the constant value 1, which is just 1. Now I push everything up, do the addition, do the comparison, and I end up with true, right? This is pretty straightforward to understand, and it's how every single database represents its expressions. It's not how all of them actually evaluate them, though. So what's an obvious problem with this? Correct: the statement is that if you have a really deep tree, you lose a lot of efficiency. Now, some systems will be clever about certain things, which we'll talk about when we do query optimization. If you have transitive closures or other things, where you know A equals 1 and A equals B, you can strip out one of those expressions. But let's not even worry about a really deep tree. Let's just say you have a billion tuples. For every single tuple, you have to traverse this tree and do all the things I just walked through, and apply them one by one. And it's even more onerous than I'm letting on, because every single time you want to look at, say, an attribute of a tuple, you have to look in the schema and say: well, it's an integer, it's 32 bits, here's the offset of where it is in the tuple. And maybe the thing I'm adding it to is an 8-bit integer, so maybe I need to cast that up to 32 bits. There's all this type checking for these different values, based on the schema, to make sure you can actually apply the operations you want to apply. That's how, if someone passes in a constant value as the string '123', the system knows how to convert that to an integer. So this is really, really slow. But as I said, this is what most database systems actually do. And it's not just the type checking; the very process of traversing that tree is really bad. Say I have a really simple query with a WHERE clause of 1 equals 1. As we'll talk about later on, the database system can be smart and recognize that this always evaluates to true, so it can optimize it away. But if you couldn't do that, then again, you're starting with the equals sign, going down the left, going down the right, and then applying the comparison. What you really want to do in your database system, and think of this as like just-in-time compilation, is evaluate the expression directly. This lets you use significantly fewer instructions, and you don't chase pointers. So instead of having this tree that I interpret, I can take 1 equals 1, write the machine instructions that actually do that comparison, and just execute that.
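To see the difference, here's a minimal sketch of the tree-walking interpreter for a predicate like WHERE val = $1 + 1, with toy node classes I made up (a real system would also type-check at every node), followed by a stand-in for the compiled version; the eval'd lambda is just a pedagogical proxy for emitting machine code.

```python
import operator

# Toy expression-tree nodes, each with an evaluate(ctx) method.

class Const:
    def __init__(self, v): self.v = v
    def evaluate(self, ctx): return self.v

class Param:                       # the $n placeholder parameter
    def __init__(self, i): self.i = i
    def evaluate(self, ctx): return ctx["params"][self.i]

class Attr:                        # tuple attribute reference, by name
    def __init__(self, name): self.name = name
    def evaluate(self, ctx):
        # schema lookup: find the attribute's position in the tuple
        return ctx["tuple"][ctx["schema"].index(self.name)]

class Op:
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right
    def evaluate(self, ctx):       # recurse into both children, combine
        return self.fn(self.left.evaluate(ctx), self.right.evaluate(ctx))

# WHERE val = $0 + 1, as a tree
expr = Op(operator.eq, Attr("val"),
          Op(operator.add, Param(0), Const(1)))
ctx = {"tuple": (123, 100), "schema": ["id", "val"], "params": [99]}
print(expr.evaluate(ctx))            # True: 100 == 99 + 1

# The "compiled" alternative: the same predicate with no tree walk.
compiled = eval("lambda t, params: t[1] == params[0] + 1")
print(compiled((123, 100), [99]))    # True, far fewer steps per tuple
```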
There's still some work you have to do to make sure you're type-safe and all that good stuff, but this is significantly faster, especially when you're doing these large scans over a billion tuples in your table. And this problem applies not only to sequential scans but also to index scans: even though the index lets you jump to some location among your pages much more quickly, as you go down through the index you're still traversing this expression tree, right? For every single key, you do your evaluation on it. So a lot of the commercial systems, and I think newer versions of Postgres as well, can do this kind of expression evaluation with more or less compilation. Think of it as the difference between interpreting Python versus compiling something like C++.

Yes? So the question is: instead of doing just-in-time compilation on the fly as you evaluate, couldn't you compile these things before you even start running the query? You're correct; that is what systems actually do, and it's actually what we do in our system. The speedup you get from not having to traverse this tree is quite significant for OLAP queries. When I say compilation, think of it this way: you can generate bytecode, but then you still have to interpret it; if you compile to machine code, it's way, way faster. And again, not to keep bringing up the advanced course in the spring, but in some systems not only do they compile the expressions into machine code, they compile the entire query plan into machine code. In the processing models I showed you, we talked about traversing those operators, and you're essentially doing interpretation: this operator is a scan, what does it want to scan, how do I scan it, right? The system figures those things out as it runs. You can avoid all of that because SQL is declarative and you know what the schema is ahead of time, so you can generate the actual machine instructions to execute the query plan without doing any interpretation, and the speedup is quite significant. The tricky thing, though, is that compilation is not free, right? It takes a little time, so there are some tricks to get around that.

Okay, the main thing I want to say is that, internally, the system will represent the expressions in our where clauses, having clauses, projections, and everything else as these expression trees with these different operators. And the most primitive database systems, if I can use that term, will just traverse this tree in order to do the evaluation. That's what people typically implement the first time they start building a system. But what you really want to end up doing is some kind of compilation, where you can look at the expression directly and apply it to your tuples, okay?

All right, so just to finish up: in this class we talked about the different ways the database system can execute a query plan. And what's actually kind of cool is that everything we talked about here is pretty much independent of all the different things we've covered in the course so far, right? Whether you're using a row store or a column store, all the models we talked about here still apply. Whether you're using a B+tree, a radix tree, a hash table.
All of those, the processing models still apply as well. In general, most database systems are gonna always try to do an index scan as much as possible, because it's almost always much faster. This is definitely the case for OLTP workloads: if your OLTP application is doing sequential scans, it's never gonna work, it's never gonna scale, so you always wanna add indexes. And the reason I say most database systems is because, again, there are the specialized column-store systems that maybe don't want any indexes at all. Vertica, for example: in Vertica you can't call CREATE INDEX, because they pre-sort all their columns, and binary search over a sorted column gets you the same thing as an index. So most systems wanna do index scans as much as possible, and we showed how to do this for both a single index and multiple indexes. And then for the expression trees, what I showed you is sort of the most common and most flexible way to actually implement them, but they're slow because of all that pointer chasing. Okay, any questions?

Okay, so project number two. As I said, it's gonna go out today; the website is not live yet, but it will be later on after class. You will be building a single-threaded B+tree index. The reason I'm saying single-threaded is that we haven't talked about how to do latching and other things, so for now we'll keep this single-threaded, and you'll have to design the layout yourselves. Same rules as we talked about before; the thing to be mindful of is that you should not have to change any file other than the six ones that we specify. Now, there is gonna be an updated source tarball that we're gonna provide for you guys, because there's a bunch of new stuff we've added in order to run this second project. So you'll need to copy over the six files that you submitted for the first project and plop them into the same location for the second project. All these projects are cumulative: the buffer pool manager that you implemented for the first project is what your index will use to go fetch the tree's node pages. And as always, please post any questions you have to Canvas or come to the TAs' office hours. And again, it goes without saying: please don't post any of your code on GitHub, and please don't borrow or steal code from each other. Everything you turn in to us should always be your own work, because we're gonna run the plagiarism checker on Autolab, okay? Any questions? Yeah, we can make a patch, yes, okay, yeah.

For next class, I'm actually gonna be away, and Joy Arulraj, my PhD student here at CMU, will be teaching you guys. Joy has a disorder, sort of a minor Tourette syndrome, so he has this problem where he curses a lot when he speaks. We've asked him to try to limit that, and we'll have somebody in the back who can monitor this so that he doesn't go into tirades of profanity. So if there's more cursing in the next class than in this one, I apologize; Joy's been working on that. He's gonna talk about more query execution: now that we know how to read the data from the tables and the indexes, we're gonna shove it up to the other operators in the query plan and start doing things with it. So he's gonna focus on sorting, and then actually start talking about how to do joins, okay? So please give Joy the same respect that you would give me. All right? I'll see you guys on Monday, but come to class on Wednesday. Thank you.