Okay, let's get started — we actually have a lot to discuss today. First, the administrative stuff: Project 2 is due tomorrow; we've mentioned the extension on Piazza. One thing I need to emphasize, as we noted in our Piazza post, is that we updated our grading scripts, and in particular we now include the formatting and clang-tidy style checks in the grading scripts as well. My colleague Andrew re-ran your old submissions last night, and he mentioned to me that nearly half of the submissions actually failed the formatting and style checks. So if you have already submitted Project 2, you should definitely double-check on Gradescope whether your formatting and style checks are passing. Even if they're not, it's just fixing formatting, so hopefully it's not too bad. But again, it's due tomorrow night. For Project 3, we'll release that today — also postponed a little bit, but we'll keep the due date as November 14th. And lastly, we'll release the homework next week, and it will be due in November as well.

I actually don't know whether we have advertised these talks in this class before, but our database group at CMU also hosts database seminars every week, where we invite outside database practitioners from various companies — big companies and small startups — to talk about the database techniques they are building in the wild. Next week's database seminar (I'll include the link in our slides) is from a startup working on Trino, which is a kind of federated query engine, and they're specifically going to talk about their optimizer and the challenges and solutions they faced while building it. So it's very aligned with the topics of these lectures. If you're interested, check it out to see how our query optimization theory gets applied — or even breaks — in practice. I'm really looking forward to it.

That's the administrative stuff. In terms of content today, we are still going to talk about query optimization and planning techniques in a database system. Again, this is arguably the most difficult part of a database system, and we have a lot to cover. A little recap: last week, we talked about simple heuristics and rules that we can apply to rewrite our logical query operator trees, so that we can generate a preferred query plan as early as possible and reduce the search space we need to explore during plan enumeration, the cost-based search phase. Today, we're going deeper into that second part, the cost-based search phase, so that we can enumerate more complicated query plans and generate better, potentially optimal, alternatives. There are two parts to this cost-based query optimization process. The first part is that we need a cost model, so that we can estimate the potential cost of executing a query plan without actually executing it, obviously. The second part is that we use that cost model as our guide in a search process, so we can compare the costs and benefits of different alternative query plans and pick the better one.
So there's a cost model, and then there's a search process. On today's agenda: last week I gave a heads up on the cost model, and today we'll talk a little more about how to complete it, and then I'll cover the plan enumeration part in the second half of the lecture.

A quick recap on the cost model. We talked about three possible components. First, the physical costs: factors related to the hardware — CPU, disk, memory, et cetera — that affect the cost of executing a plan. These physical factors are primarily considered in commercial systems; open-source systems mostly don't consider them, except for a few important ones like the cost difference between a memory access and a disk access, or between sequential I/O and random I/O. Second, the logical cost, which pretty much everyone needs to consider: essentially the number of tuples each operator in the query plan tree needs to process — both how many tuples are fed into each operator and how many come out of it, i.e., the result size. And lastly, the algorithmic cost: say you want to sort some data, then the algorithmic cost of sorting would just be O(n log n), and similarly for the other operators.

Last week we also looked at Postgres and DB2 as two examples — one open-source system and one commercial system — of how cost models are used. The algorithmic cost is fairly straightforward: the cost of sorting, the cost of building a hash table — we covered that earlier this semester. But the important part we haven't talked about is the second piece, the logical cost: how are we actually going to know how many tuples we're going to process? While executing a query along the query plan tree, you execute different operators, and the number of input and output tuples for each operator in the plan tree can be different. If you don't know how many tuples you need to process, then none of the cost model components we talked about make sense — knowing only the algorithmic complexity doesn't let you estimate the cost.

So today, we'll talk about how to estimate this logical cost, specifically the number of tuples each operator needs to process. For this, we primarily rely on something called statistics inside the database system. In a database system, the term statistics refers specifically to properties of the data stored in the tables: how many tuples are in each table, the number of distinct values in each attribute, et cetera. (The term can mean other things in other areas.) One thing to keep in mind is that because a database may have lots of tables, and each table may have lots of attributes, it can be costly to compute those statistics.
So what database systems typically do, instead of calculating the statistics on the fly before or while you execute a query, is provide special commands you can run to calculate the statistics for all the tables and attributes ahead of time, and then store them in internal tables. The command itself differs between systems — in Postgres or SQLite it's ANALYZE, in SQL Server it's UPDATE STATISTICS, et cetera — but at the end of the day they pretty much do the same thing: precompute the statistics needed to guide the cost model's estimation. Different systems handle the invocation of these commands a little differently. Most of the time, it's actually up to the users or database administrators to issue the command at whatever time they deem suitable; after you invoke it, the system generates the statistics and stores them. In some systems — less often — the system may automatically generate statistics when the workload volume is low. Oracle, for example, I think has an option to automatically generate statistics every night, say after midnight, something like that. And some other systems, like Postgres, may piggyback on other background maintenance tasks to generate the statistics, without you even noticing. But that's less common, okay.

So, to define the different statistics a database system may use for cost estimation, on this slide I'll define the two most basic statistics first. The first: for every relation R, the database system maintains the number of tuples in R, which we denote n_R — very straightforward, just how many tuples are in the table. The second is the number of distinct values of each attribute in the table. Say the table has three attributes A, B, and C; then we have V(A,R), V(B,R), V(C,R), and so on, denoting the number of distinct values of each attribute. Those are the two most basic statistics, and we'll see more advanced statistics as well.

Let me give some high-level examples of how useful these statistics are. The number of tuples — you need that to estimate the cost of a sequential scan, for instance, if you want to read the entire table. As for the number of distinct values: if your query has a SELECT DISTINCT, the number of distinct values directly tells you the output size, assuming there are no other predicates. Beyond that, the number of distinct values is also useful when, say, you have a hash join operation and want to know how big the hash table might be, et cetera. We'll get into more details later; right now I'll just give you a heads up. And the next, derived statistic we're going to talk about is called the selection cardinality.
This is nothing more than dividing the number of tuples n_R in a table by the number of distinct values V(A,R), assuming you are looking at an attribute A. So the statistic represents the average number of records with a given value for an attribute: SC(A,R) = n_R / V(A,R). Makes sense — just a simple division, and it's a derived statistic.

Maybe you've already noticed that when we calculate this average number of records per value this way, we are making a huge assumption: we're essentially assuming the data is uniformly distributed — every single value of this attribute has the same number of records. Of course, in practice, for many data sets — probably most — this is not true; it's very rare that you have an exactly uniform distribution. But to simplify our calculation and the overall query optimization process, we make this assumption, so we can generate decent query plans in time instead of spending forever figuring out complicated relationships between values. A simple example: assume there are 10,000 students at CMU, and assume there are 10 different colleges (the actual number is probably different). Then the selection cardinality for the attribute college in our data set would just be 1,000, assuming each college has the same number of students. In actuality this is probably not true, but that's the assumption we hold in most cases to simplify our lives.

Okay, looking back at what I said earlier: the primary goal of these statistics is to help calculate the logical cost in our database system, so that eventually we can do cost estimation in query optimization. But I should point out that in some simple cases you actually don't need statistics at all. For example, suppose you have a simple select query on the primary key of a table called people: you select everything from the table with an equality predicate on the primary key. In that case you don't need statistics, because it's a primary key — at most one tuple can match — so the estimation is easy. But in most other cases — say you have a range query, or a LIKE predicate where a string value needs to match a wildcard, et cetera — you need statistics to help you decide how many tuples each operator will produce.

To give a slightly more formal definition: we define the fraction of tuples we get out of an operator as the selectivity of its predicate. So if you have an operator that scans a table with a predicate that selects 10 tuples out of a table of 100 records, the selectivity of this operator, or this predicate, is 0.1, because the operation passes 10% of the total data in the table. That's just the term we use.
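To pin down the selection cardinality formula, here's a minimal sketch in Python — the student and college numbers are just the lecture's example, not real data:

```python
# Selection cardinality under the uniform-distribution assumption:
#   SC(A, R) = n_R / V(A, R)
# i.e., the average number of records per distinct value of attribute A.

def selection_cardinality(n_r: int, v_a_r: int) -> float:
    return n_r / v_a_r

# 10,000 students spread across 10 colleges -> ~1,000 students per college.
print(selection_cardinality(10_000, 10))  # 1000.0
```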
Of course, for different predicates and different situations, you'll have a different formula to compute the selectivity, which we'll detail right now. Terminology clear? Okay.

Now, a slightly more concrete example. Assume we have a table with five distinct values of an attribute age, and the total number of tuples in the table is also five. Don't think of the ages as concrete values — think of them as age groups: a younger age group, an older age group, et cetera. So we have a table with five age groups, and exactly five tuples, one per age group. If we apply the query SELECT * FROM table WHERE age = 2, the selectivity is the number of tuples satisfying the predicate divided by the total number of tuples. In this case, since the table has five age groups and each has one tuple, the selectivity of age = 2 is one fifth. Makes sense? All right.

Another example: a slightly more complex query. Instead of looking at exactly age = 2, what if I want to select tuples with age greater than or equal to 2? Then we're looking at these three tuples, and the selectivity of the query is three divided by five. The general formula: if the predicate is A >= a, the selectivity is (A_max − a + 1) / (A_max − A_min + 1). With age groups 0 through 4 and a = 2, that's (4 − 2 + 1)/(4 − 0 + 1) = 3/5, matching what we counted. Again, this only works assuming the data is uniformly distributed.

Lastly, how do we calculate the selectivity of a negation predicate? That's actually straightforward: first calculate the selectivity without the negation, then take one minus that selectivity. So for age != 2, you first calculate the selectivity of age = 2 and then take one minus that, which gives four fifths. All making sense? Okay.

One observation so far: the way we calculate these selectivities is essentially what we learned in basic statistics or probability theory. We're just asking, what is the probability that a given tuple in this table satisfies my predicate? With that observation, we can borrow many concepts from statistics 101 or probability theory to help us calculate selectivities, especially for more complicated predicates.
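Here's a small sketch of these selectivity formulas, assuming uniformly distributed integer values — the age-group numbers mirror the lecture's five-tuple example:

```python
def sel_eq(v_a_r: int) -> float:
    """sel(A = a) = 1 / V(A, R) under the uniform assumption."""
    return 1.0 / v_a_r

def sel_ge(a: int, a_min: int, a_max: int) -> float:
    """sel(A >= a) = (A_max - a + 1) / (A_max - A_min + 1)."""
    return (a_max - a + 1) / (a_max - a_min + 1)

def sel_not(sel_p: float) -> float:
    """sel(NOT P) = 1 - sel(P)."""
    return 1.0 - sel_p

# Five age groups 0..4, one tuple each, as in the lecture example:
print(sel_eq(5))           # 0.2 -> age = 2 matches 1/5 of the tuples
print(sel_ge(2, 0, 4))     # 0.6 -> age >= 2 matches 3/5
print(sel_not(sel_eq(5)))  # 0.8 -> age != 2 matches 4/5
```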
For example, suppose we now have a conjunction predicate: not just a single predicate on one attribute, but P1 AND P2, and we want its selectivity. Assuming the predicates are independent — another assumption we make to simplify our lives — basic probability tells us we can simply multiply the selectivities of the two predicates: sel(P1 AND P2) = sel(P1) × sel(P2). This is a recurring theme: we make assumptions that can break in practice, but to keep the optimization problem tractable and efficient — so you can generate a query plan in time instead of taking forever — we accept the trade-off. For a disjunction, you treat it the way you'd compute an OR of events in probability; since this is a masters-level / upper-level course, I'll just note the equation without dwelling on it: sel(P1 OR P2) = sel(P1) + sel(P2) − sel(P1) × sel(P2), by inclusion–exclusion.

Next: with the ability to calculate the selectivity — the number of tuples — for predicates over a basic scan, what other information do we need to estimate the cost of a query? Well, many queries involve multiple relations, and so contain a join operator. In that case, we need to estimate how many tuples come out of a join. In other words, the question is: assuming we're joining R and S, then for a given tuple in relation R, how many tuples in the other relation S will it match?

Here's another assumption we make: for every key in the inner relation — R in this case — there exists at least one matching key in the outer relation S. Obviously, in many cases this won't hold, but to make the problem tractable and calculable, we make this assumption. And I'd say that in actuality it does hold in many cases, especially when there's a foreign key constraint from one table to the other — that's frequently why you're performing the join in the first place — and in that case the assumption is actually satisfied. But in many other cases it won't be.

So what's the formula? To define it more formally: we assume we're joining relations R and S, and that there is only one join predicate — when we look at the overlap between the columns of R and the columns of S, there is exactly one overlapping column, and that's the join predicate. We also assume the join predicate is not on a primary key column, because that case is much easier: every key is unique, so it's straightforward to know the maximum possible result size. And recall our assumption from a moment ago: every tuple in relation R finds a match in S.
What we do, essentially, is multiply the number of tuples in R — n_R — by the selection cardinality of S. Expanded out, the right-hand side is n_S divided by V(A,S), which is exactly the selection cardinality: the average number of records per value in table S. So the estimate is n_R × n_S / V(A,S).

But that's not the best we can do. If we join R and S, we get the same number of tuples — assuming an inner join — whether we compute it as R join S or S join R; the join is symmetric. So a simple thing we can do to make the estimate a bit more accurate is to flip the order: also estimate the result size assuming we join S to R, which gives the completely symmetric formula, n_S × n_R / V(A,R). Since the two joins produce the same number of tuples, the join cannot produce more tuples than the smaller of these two estimates. So we divide the product by the maximum of the distinct-value counts: |R ⋈ S| ≈ (n_R × n_S) / max(V(A,R), V(A,S)). In other words, we take the minimum of the two estimates as the final estimate for the join. Does this make sense? Any questions about estimating join selectivity? All good? Okay, sounds good.
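As a quick sketch of that join-size estimate (the table sizes and distinct-value counts below are hypothetical):

```python
# Join size estimate under the inclusion assumption, for R join S on a
# single non-primary-key attribute A:
#   |R join S| ~= (n_R * n_S) / max(V(A, R), V(A, S))
# which is exactly the minimum of the two one-sided estimates above.

def est_join_size(n_r: int, n_s: int, v_a_r: int, v_a_s: int) -> float:
    return (n_r * n_s) / max(v_a_r, v_a_s)

# Hypothetical stats: 1,000 tuples in R, 5,000 in S; 100 distinct
# join-key values in R, 500 in S.
print(est_join_size(1_000, 5_000, 100, 500))  # 10000.0
```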
To recap the assumptions we've made so far. First, we assume the data is uniform, so that we can estimate the selection cardinality — the average number of records per value — because otherwise we couldn't estimate that. Second, when we calculate the selectivity of scans with predicates, whether sequential scans or index scans, we assume the different clauses of the predicate are independent, so we can use probability theory and simply multiply their selectivities together. And lastly, when calculating join selectivity, we assume the inclusion principle: every tuple in the inner relation finds a match in the outer relation. Of course, in many cases these assumptions break, like I mentioned. So let me give you two examples of when they break, and how we might potentially fix them — and I say "potentially" because even though there are methods to cope with this, they don't fix it perfectly; it's just challenging, and in most cases we still make these assumptions, okay.

First, the independence assumption. One case where it breaks is correlated attributes. Suppose you have a data set of automobiles, with 10 different makes of cars and 100 models, and you execute a query with the following predicate: find all records where the make of the car is Honda and the model is Accord. If you use the earlier approach and assume independence, you'd multiply the selectivity of make = Honda (1/10) by the selectivity of model = Accord (1/100), giving 0.001. But we as humans know that only Honda produces the Accord — no other major automobile maker has a model called Accord. So the real selectivity of this predicate is just 0.01, because there are 100 models in total and only one of them is the Accord. In this case, the independence assumption is simply broken.

So what are some potential ways to fix this? You can define correlated column statistics over multiple attributes — similar to how we defined the number of distinct values for one attribute, you can track the number of distinct value combinations across two, three, or more attributes. If you have that, you can directly use the multi-column statistics to estimate the selectivity of this predicate. But as you can see, there are many possible attribute combinations in a table — I think it's exponential — so obviously you can't define correlated statistics on every combination. What do database systems do, then? Typically, it's offloaded to the users: the users or database administrators have to tell the system, "I know that in my data set these two or three attributes are correlated, and my workload has queries touching them together, so please maintain multi-column statistics on these combinations." Some systems try to figure this out automatically, but in most cases it's difficult, because there are just too many combinations, okay.

Okay, now the other assumption we made earlier: uniform distribution. Suppose we have this table with 15 values; if every value has the same number of occurrences, our selection cardinality estimate is perfect. But in practice, the number of occurrences of each value is often not uniformly distributed, as in the example here. One simple — naive — thing we can do is, instead of recording one number, the selection cardinality derived from the total number of distinct values, just record the occurrence count of every single value. The obvious problem is that this can become pretty large. With just these 15 keys, using a 32-bit integer for each value's occurrence count, you already need 60 bytes to record all the counts.
Now suppose you have a really big table — in this era, it's not uncommon to have a table with billions of rows. With a billion distinct values at 32 bits per counter, that ends up being four gigabytes of statistics for a single attribute. And a table may have tens or even more than a hundred attributes — that's overhead you can't afford.

So, to be a little smarter: we want to estimate the number of occurrences of each value more accurately than a single average, but without paying this huge overhead. The trick is that instead of tracking the occurrence count of each single value, we track the occurrences of multiple values together, to amortize the cost. The standard term for this kind of data structure is a histogram.

Here's a simple example: suppose we track the occurrences of every three adjacent values together. The standard term for each group of values is a bucket, and here each bucket has a range of three. For each bucket we record the total number of occurrences of the values within it. When we need to estimate the occurrence count of a particular value, we first locate the bucket it falls in, then divide the count recorded for that bucket by the number of values in the bucket. For example, looking at the last bucket: to estimate the number of occurrences of the value 14, we divide the bucket's total count of 14 by the three values in the bucket, giving almost five. This is less costly to store, but of course less accurate, because the occurrence counts of the values within a bucket can still differ. Going back to the example: the estimate for 14 is roughly correct, but for 13 or 15 it's off — 15 actually occurs almost 10 times, but we'd still estimate it at just below five.

An optimization you can do here is, instead of dividing the domain into buckets with an equal number of values each, group the values based on their occurrence counts. Essentially, we divide the values into buckets such that the total number of occurrences in each bucket is roughly the same. The hope is that there's less variation among the values within each bucket, so the estimates get a bit more accurate. So instead of every bucket covering three values — that was the equal-width scheme from before — every bucket now covers roughly 10 occurrences, and different buckets may contain different numbers of values. This kind of histogram is called an equi-depth histogram.
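Here's a minimal sketch of both schemes — equi-width bucketing and the equi-depth variant just introduced. The toy value counts are made up for illustration, not the slide's exact data:

```python
from collections import Counter

# Toy skewed data: value -> occurrences is {0: 2, 1: 1, 2: 4, 3: 2, 4: 1, 5: 3}
data = [0] * 2 + [1] * 1 + [2] * 4 + [3] * 2 + [4] * 1 + [5] * 3

def equi_width_estimate(data, value, width=3):
    """Bucket values into fixed ranges of `width`; estimate a value's
    occurrence count as (bucket total) / (values per bucket)."""
    counts = Counter(data)
    lo = (value // width) * width
    total = sum(counts[v] for v in range(lo, lo + width))
    return total / width

def equi_depth_buckets(data, depth=5):
    """Group values into buckets holding roughly `depth` occurrences
    each, so heavily skewed values distort their neighbors less."""
    buckets, cur_vals, cur_total = [], [], 0
    for value, c in sorted(Counter(data).items()):
        cur_vals.append(value)
        cur_total += c
        if cur_total >= depth:
            buckets.append((cur_vals, cur_total))
            cur_vals, cur_total = [], 0
    if cur_vals:
        buckets.append((cur_vals, cur_total))
    return buckets

print(equi_width_estimate(data, 2))  # bucket [0,1,2] holds 7 -> ~2.33
print(equi_depth_buckets(data))      # [([0, 1, 2], 7), ([3, 4, 5], 6)]
```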
Taking the earlier example: with the equi-depth histogram, if we look at the last bucket, the total number of occurrences in that bucket is 12, but there are only two values in it. So we estimate the occurrence count for both 14 and 15 as six, which is a bit more accurate than before, especially for the value 15. Those are the two refinements we've talked about for dealing with the assumptions that can break. Any questions so far? Yes, please.

[In response to a question:] Yes — the histogram usually doesn't directly affect the number of tuples we calculate for a join. Where histograms are useful is on the predicates: either an equality predicate like I discussed earlier, or a range predicate. For joins, it's typically still the same equation.

Okay, so far I've talked about the most common approaches to maintaining statistics for the attributes of a table: simple ones like the number of tuples and distinct values, and the histogram, which is a bit more complex but still used in many systems — Postgres, as far as I remember, uses histograms. The ones so far are straightforward to compute and easy to maintain. Now I'll mention some less common ones: sketches, and the sampling technique I'll talk about after. These are more complex, cost more to generate, and carry more overhead to maintain; using them can also be more costly than the straightforward formulas from earlier. So these two techniques are less common, but some people use them when they suit their goals.

The first is called sketches. I don't know whether Andrew has covered Bloom filters earlier in the semester, but if he did, this is a very similar idea: a Bloom filter is a probabilistic data structure for testing whether a value exists in your data set. Here, again, you use a probabilistic data structure — some implementations share a similar intuition — but this one is maintained to estimate the occurrence count of individual values. And since it's probabilistic, you can imagine there's a trade-off between how big the structure is and how accurate its estimates are — a trade-off the developers or users have to balance. So it's not as easy to use as the earlier, simpler techniques. The two most common examples are the Count-Min Sketch and HyperLogLog, which I won't go into in detail in this class.
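The lecture doesn't go into the details of these sketches, but just for intuition, here's a toy Count-Min-style sketch — the sizes and data are arbitrary, and real implementations are considerably more careful:

```python
import random

class CountMinSketch:
    """Toy Count-Min Sketch: `depth` rows of `width` counters. Each item
    maps to one counter per row; hash collisions can only inflate counts,
    so the minimum across rows is an (over-)estimate of the true count."""

    def __init__(self, width=64, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.random() for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item):
        for row, col in zip(self.rows, self._cols(item)):
            row[col] += 1

    def estimate(self, item):
        return min(row[col] for row, col in zip(self.rows, self._cols(item)))

cms = CountMinSketch()
for v in [1] * 50 + [2] * 5 + [3] * 1:
    cms.add(v)
print(cms.estimate(1), cms.estimate(2), cms.estimate(3))  # ~50, ~5, ~1
```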
The next option for estimation — again, less common — is sampling. Instead of maintaining histograms, sketches, or other summaries of the data's properties, why not just keep some samples of the data? Then, when a query comes in, you look at the samples you've reserved for the data set, see how many tuples in the sample satisfy a given predicate, and extrapolate from there.

Here's an example. Say we have a query selecting the average age from a table called people, restricted to tuples with age greater than 50. Assume this table is really big — a billion tuples — and I'm showing you just a fraction of it. We keep a certain number of samples from the table; say three here. When estimating the execution cost of this query with our cost model, we look at the sample table and see how many tuples satisfy the predicate — one out of three in this case — and use that as the selectivity estimate in the cost model: we multiply the one-third by a billion to estimate the total number of tuples to process.

One obvious disadvantage: first, you have to generate the sample and keep it somewhere, as a temporary table you can reference. Another obvious overhead is that you essentially have to execute a mini query on the sample table just to figure out how many tuples satisfy your predicate, and only after that can you use the result in the plan enumeration and plan search of the original query. So this is clearly more costly to do; the advantage is that you don't need to maintain all the data structures we talked about earlier.
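A tiny sketch of that idea — the table contents here are randomly generated stand-ins, not real data:

```python
import random

# Hypothetical "people" table: 100,000 rows with a random age column.
table = [{"age": random.randint(0, 99)} for _ in range(100_000)]

def estimate_matching(table, predicate, sample_size=1_000):
    """Estimate how many tuples satisfy `predicate` by evaluating it on
    a random sample and extrapolating to the full table size."""
    sample = random.sample(table, sample_size)
    sel = sum(1 for t in sample if predicate(t)) / sample_size
    return sel * len(table)

# Mirrors SELECT ... FROM people WHERE age > 50; expect roughly 49,000.
print(estimate_matching(table, lambda t: t["age"] > 50))
```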
So now we've covered the techniques for calculating selectivity — the number of tuples each operator needs to process, et cetera — that feed into our cost model to estimate the cost of executing a particular query plan. Next, let's talk about how we actually use this cost model in the plan search process to figure out the better plan for a given query. Before I dive in, any questions about the statistics, tuple selectivity estimation, or the cost model? No questions? Okay, cool.

For the remaining half hour or so, we'll talk about the last piece of query optimization: searching over alternative query plans, using the cost model to estimate their costs, and figuring out the best plan. Generally, we can roughly group this domain into three categories of search. The first is search over queries posed on a single relation. There, the search for alternative plans is rather straightforward: we can use simple heuristics to generate alternative plans and cost them, without an exhaustive, complicated search framework. The second category is multiple relations — queries with joins, whether two-way, three-way, or more. For those, we typically need a more involved, principled search mechanism to compare alternatives, because the number of possible alternatives is just too big; done naively, it can take forever. As I mentioned, the number of possible plans for an n-way join grows on the order of four to the power of n — exponential.

The last category is nested sub-queries, which we're not going to cover today. I'll just mention that for nested queries, what we usually do is what we talked about last class: first see whether we can un-nest or flatten the nested query into a join, and then treat it the same way as the second category. If possible, we can also decompose it into a separate query. In the worst case, we just have to execute the inner query over and over, once for every tuple from the outer query. But that's all we'll say about nested queries today; we'll focus on the single-relation and multi-relation cases.

First, like I mentioned, for a single relation things are not that complex, if you will. Often you can just use simple rules plus statistics to generate alternative query plans. Note that these alternatives — even though I say "simple heuristics" — are different from the heuristics we talked about last time. Last class, in logical query rewriting, the rules directly rewrote the query at the logical level, without considering cost estimates or data properties at all. Here, we generate alternatives using heuristics, but we still cost them with our cost model and compare them, instead of directly rewriting the plan into one specific form.

In the single-relation case, most of the time it's really about picking the access method: whether you use a sequential scan, a binary search if you have a clustered index on the relation, or, if there's an index, an index scan. For a single relation, our search process just generates alternative plans with these different access methods. There are a few other things to play with as well: for example, if you have a conjunctive predicate with multiple clauses, you can reorder the clauses so that you filter out as many tuples as possible as early as possible in the predicate evaluation — see the sketch just below. But that's pretty much the most important thing for a single-relation query. Usually these queries come from OLTP workloads, where you insert a tuple, look up a specific record, or update it, rather than running a complex analytical query over multiple relations. For these simple single-relation queries that show up in OLTP workloads, there's actually a term — sargable, an abbreviation of "search argument-able" — which just means that for these queries, there is usually an index that can accelerate their execution.
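Here's the clause-reordering idea as a tiny sketch — the clause strings and selectivity numbers are purely hypothetical:

```python
# Evaluate the clauses of a conjunctive predicate in ascending order of
# estimated selectivity, so the most selective filter runs first and
# eliminates the most tuples early.

clauses = [
    ("country = 'US'",    0.30),
    ("age = 25",          0.01),  # most selective -> evaluate first
    ("status = 'active'", 0.60),
]

ordered = sorted(clauses, key=lambda c: c[1])
print([name for name, _ in ordered])
# ["age = 25", "country = 'US'", "status = 'active'"]
```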
So, again, take a single-relation query that asks which people in the data set satisfy a particular age predicate. If you've built an index on age, this query can very likely already be executed efficiently, so it would be considered a sargable query, and it's not very difficult to optimize such queries with simple heuristics. Sometimes even a join query is still considered sargable, but usually these are simple join queries of a specific form. For example, a two-way join where there's a foreign key constraint between the two tables: you can just use the foreign key index to accelerate the join, it's very easy to optimize and generate a good plan for, and so it would also be considered a simple, sargable query. And here's the example I mentioned earlier: you query the people table with a specific predicate, in this case on the ID, and you can directly use the primary key index to execute it. Any questions on single-relation queries, these simple sargable queries?

Now we get to the more complicated, advanced stuff: what if you have a multi-relation query? In this case, the most challenging part of figuring out a good plan is deciding what the join order should be — that's where you have the most alternatives, and it's what makes the search space big, exponential like I mentioned. So the first thing we need to do is restrict the search space a little, otherwise the search would take forever.

This goes back to one of the very first implementations of a relational database system, which we talked about last class: System R, where each person took one part of the database system and tried to build it. One important optimization decision they made in System R, to reduce the search space of multi-relation query planning, was to only consider left-deep join trees in the plan enumeration phase — throw away all the right-deep and bushy plans, and only consider left-deep trees. That was just an assumption they made at the time to reduce the search space and make the problem tractable. Interestingly, that assumption still carries over today: I wouldn't say in all systems, but many, many contemporary database systems still use it. And it turns out that in many cases, if you only consider left-deep join trees, the plan you get at the end of the day is actually pretty good, or near-optimal, compared with considering all the other tree shapes. It's a somewhat heuristic way to restrict the search space, but it works reasonably well in practice, and many people still do it today. To be a little more specific, in a left-deep join tree, a join may appear only as the left child of another join node in the tree.
Which means that for every join operator in your plan tree, with a left input and a right input, only the left input is allowed to be the result of another join; everything in the right operand — the right child of a join — has to be a base table itself. So in this figure, the left plan tree satisfies this property, but the other two trees do not. Make sense? Obviously this restricts our search space to something much smaller. Okay.

Besides restricting the search space, there's another interesting advantage, one that mattered even more back in the day: with a left-deep tree, it's much easier to pipeline the query execution. That means that while executing a multi-way join, you don't need to materialize join results into a temporary table or write them back to disk during the join: every time a join produces tuples, that temporary result is immediately joined with the next base table, if there are more joins. Compare that with the earlier example: look at the rightmost picture. No matter which join you perform first there — say C join D, or A join B — after you perform one of them, you have to save its result somewhere, then go perform the other join, and finally merge the two join results back together. This mattered more back when systems didn't have large amounts of memory: if you perform one join and the temporary result for those two relations is big, you have to write it back to disk to save it, perform the other join, then load the first result back from disk and put them together. For modern systems, in many cases this is no longer such a big concern, but when the available memory is low, the left-deep tree still has the advantage that you don't have to materialize temporary join results, and you can pipeline the whole query execution. Make sense?

To be specific about the possible alternatives we need to consider in multi-relation query planning: we've talked a lot about join ordering — that's obviously one class of alternatives, left-deep tree one, two, three, et cetera. Then we also need to think about the physical algorithm for each join operation: hash join, sort-merge join, nested loop join, et cetera — that's what we covered earlier in the semester. Whatever order you choose, each join operation also needs a specific algorithm, and depending on the data properties, different algorithms may be better. And lastly, at the lowest level of the query plan, at the end of the day, you still need to read the data from your tables.
So of course you also need to consider the alternatives for the access methods — the access paths — as well: index scan or sequential scan. In multi-relation query planning, you have to consider the alternatives along all of these dimensions.

To make this search process efficient, we leverage a technique called dynamic programming. Many of you have probably learned dynamic programming in your undergrad or earlier algorithms classes; if you haven't heard the term, another way to look at it is that you perform a memoized search: you avoid enumerating duplicated query plans over and over, reduce the number of alternatives you need to consider as much as possible, and prune the obviously less optimal choices during the search.

Let me give a specific example. Here's a simple join query over three relations, with one join predicate between R and S and another between S and T. In the figure I'm illustrating a subset of the possible nodes in the search process. We start from the left — R, S, T individually; we don't yet know who joins whom first, or with what method — and we're trying to find the best way to reach the right side of the slide, which is the join result of all three relations. One thing to mention: each node here represents the result of all the plans that are equivalent under relational algebra. The rightmost node, R join S join T, is what we ultimately want to compute in the best possible way, but it doesn't have to be produced in the order written — it could be S join R first and then T, et cetera. We use the same node to represent that end result.

Okay, so we start from the left. There are two possible join-order alternatives, because there are two join predicates: you can either join R and S first, or join T and S first. On top of that, there are alternatives in the choice of join algorithm: for each of those two choices, you can use, for example, a hash join or a sort-merge join. (For simplicity I'm ignoring the access paths here, but in actuality there's also the choice of access methods.) Now we use the cost model we talked about earlier, with the statistics to estimate the tuple counts, and cost each of these operations. And at this point, since we're doing dynamic programming, we can already eliminate the obviously less optimal choices.
For these states, to reach, say, the upper intermediate state — R join S first, then join T later — there are different choices, but only one optimal choice according to our cost estimates, and that's the only one we need to keep. After that, it's the same thing: from each intermediate state, there are different ways to reach the final state we're optimizing for. From the first one, you can join the intermediate result with T using different algorithms; from the second, you can join it with R using different algorithms — each with different costs. Again, you save only the cheapest way to reach each state. Once everything is done, you look back from the final state and ask: what's the best way to walk from start to end to execute this query? Here, it would be to first join T and S with a hash join, and then join R with a sort-merge join — and that's the plan you pick. This assumes you have enough budget to enumerate all these plans; if you don't, you cut the search off earlier and report the best plan explored so far. Does that make sense? Yes?

[Student question.] Yes — thanks, it's actually a great question. We are simplifying here. In these diagrams, a state only records which relations have been joined so far and which haven't. In actuality it's much more complex than that — I don't have a slide for it, but essentially each intermediate state can also include physical properties of the intermediate result. For example, for the state "T join S first, then join R later", you could track whether the intermediate data is already sorted or not — those would be two different states, with different costs to reach them and different costs to transition to the final state. Great question, okay?

Here's the same example again, shown a little more concretely, to give you a better understanding — almost a reiteration of the factors we consider in join enumeration. First, the different orderings of the joins. Second, the algorithm choice for each join. And last, the access method you use. Like I said when answering the question, in actuality there is much more to consider: the real search process involves many more elements in the states, and there are other mechanisms to prune less optimal results earlier or to generate good plans faster. Some universities actually have a whole-semester course covering query optimization as an advanced class in grad school.
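To make the memoized search concrete, here's a compact sketch of dynamic programming over left-deep join orders. The table cardinalities and the cost formula are made up, and it ignores which join predicates actually exist — a real optimizer would plug in the cost model and statistics from earlier and skip cross products:

```python
from itertools import combinations

card = {"R": 1_000, "S": 5_000, "T": 200}  # hypothetical table sizes

def join_cost(left_card, right_card):
    # Toy cost: price a join by its input sizes plus an output assumed
    # to be 10% of the cross product (purely illustrative numbers).
    out = 0.1 * left_card * right_card
    return left_card + right_card + out, out

# best maps a frozenset of joined tables -> (total cost, est. cardinality, plan)
best = {frozenset([t]): (0.0, c, t) for t, c in card.items()}

tables = list(card)
for size in range(2, len(tables) + 1):
    for subset in combinations(tables, size):
        s = frozenset(subset)
        for right in subset:              # right input must be a base table
            left = s - {right}
            lcost, lcard, lplan = best[left]
            cost, out = join_cost(lcard, card[right])
            total = lcost + cost
            if s not in best or total < best[s][0]:
                best[s] = (total, out, f"({lplan} JOIN {right})")

print(best[frozenset(tables)])            # cheapest left-deep plan and cost
```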
Here in the lecture I have to simplify things, but either way, here's a more concrete example. Like our other example, we're joining three tables. First, there are the different possible orderings of the relations — and I'm actually showing the Cartesian products in the right column as well, because I'm showing you the full space you could consider. In practice — in System R, for example — they just directly eliminated the Cartesian-product plans from the beginning. In theory you could keep them, but System R would first eliminate those possibilities, and it would also only consider left-deep trees in its join order enumeration phase, okay?

Next, we enumerate the choices of join algorithms. Say we've fixed an order — R join S first, then join T. There are still a lot of choices: nested loop join, hash join, sort-merge join, et cetera. You'd enumerate the same thing for every other plan, eventually picking the best combination for that particular order — say, a hash join in both positions. And lastly, for each join order, combined with each choice of join algorithms, you also need to choose the access paths at the lowest level of the plan tree, for the base tables. So again, fixing the join order with a hash join for each join operation, we still enumerate the different choices — sequential scan, index scan, or a direct binary search if the table is clustered — cost each case with the cost model, and eventually find the best one, all right? Makes sense? It's basically a reiteration of the possible choices you need to consider, okay?

So lastly — I have roughly 10 minutes left — a very interesting example of a query optimizer implementation, if you will, is Postgres. Postgres, as you all know, is a very well-known open-source database system, and for an open-source system it actually has a very decent query optimizer. Commercial database systems definitely have much more advanced techniques for query optimization, but among open-source systems, Postgres' optimizer is pretty good. And interestingly, it has two different query optimizer implementations. The first is the canonical, traditional dynamic programming approach we just talked about: you look at the join orders, the different join algorithms, the access paths, run a search process to enumerate the possibilities, and pick the best one. Postgres has that. But the other interesting thing it has is something called the genetic query optimizer, which it uses when the number of relations is large — and currently, Postgres considers 12 relations to be large.
So under 12 relations, it uses the traditional dynamic programming to enumerate the different alternatives. But if the number of relations in your query is greater than 12, it considers that, hey, there are just too many relations; the search space is just too big. No matter what optimizations we apply in the search algorithm, in the end it is only going to cover a very small portion of the search space, and very likely the plan will not be very good. So in that case, it actually switches to a different alternative for query optimization, which, again, is called the genetic query optimizer, and it utilizes randomized optimization. It's almost like a Monte Carlo-style simulation to find a better query plan instead of enumerating the search space. So let me use this example, right? What it does is that, instead of starting the full dynamic programming over all the different alternatives, it just starts with a few randomized join orders, with randomly picked join algorithms as well. (For the access paths, I think you can probably use heuristics to choose better ones, but for the join order and the join algorithms, it just starts with some randomized choices.) In this case it would be three plans; in practice it would be more, but for illustration purposes we're just going to look at an example with three. Once it has these randomized query plans, it will first, obviously, cost them, right? Because they are randomized, they may have very different estimated costs. Then it will record the best explored plan so far; in this case, that's the last plan, with cost 100. It will remove or throw away the worst-cost plans; in this case it's just the one plan that is worst, but in practice you can throw away however many plans have the highest cost estimations. Then you keep the plans that have a medium or somewhat better cost, as well as the best cost, and you're just going to pick elements from these plans, right? An element could be the join order of some subtree of one of these query plans, or it could be the join algorithm specified for a specific join operation. Essentially, at a high level, it picks elements from the query plan trees of the better plans you have explored so far, and then does a randomized recombination of those elements to generate a second batch of randomized plans. And what it does next is essentially the same thing again: it will cost the second batch of randomized plans, the ones generated from the elements of the better plans in the first batch. Then, again, it records the best explored so far, here the first plan, throws away the worst plans, picks elements from the better plans, and so on and so forth through the third generation, until the planning time budget is exhausted. So why is this called genetic, right? Essentially, what this is doing is mimicking the evolutionary process of genes. In every batch, some of the bad genes, which here would be the bad plans, just get eliminated or thrown away, right? And of course, you record the best gene so far. (There is a small sketch of this loop below.)
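Before finishing the gene analogy, here is a minimal sketch of the flavor of this kind of genetic search over join orders. This is emphatically not Postgres's actual GEQO code: the pairwise cost table is random stand-in data, and the population size, survivor count, mutation rate, and generation budget are all made-up parameters:

```python
import random

random.seed(7)
rels = ["R", "S", "T", "U", "V", "W"]

# Stand-in cost model: a fixed made-up cost for each adjacent join pair.
pair_cost = {(a, b): random.randint(1, 100) for a in rels for b in rels if a != b}

def plan_cost(order):
    # Cost of a left-deep plan = sum of the made-up costs of successive joins.
    return sum(pair_cost[(order[i], order[i + 1])] for i in range(len(order) - 1))

def crossover(p1, p2):
    # Copy a slice of one parent and fill in the rest in the other parent's
    # order, so the child is still a valid permutation of the relations.
    i, j = sorted(random.sample(range(len(p1)), 2))
    middle = p1[i:j]
    rest = [r for r in p2 if r not in middle]
    return rest[:i] + middle + rest[i:]

def mutate(p):
    # Occasionally swap two relations: the extra randomization step.
    p = p[:]
    if random.random() < 0.3:
        i, j = random.sample(range(len(p)), 2)
        p[i], p[j] = p[j], p[i]
    return p

# First batch: a few randomized join orders instead of exhaustive search.
population = [random.sample(rels, len(rels)) for _ in range(8)]
for generation in range(20):           # "generations" until the budget runs out
    population.sort(key=plan_cost)
    survivors = population[:4]         # keep the better plans, drop the worst
    children = [mutate(crossover(random.choice(survivors),
                                 random.choice(survivors)))
                for _ in range(len(population) - len(survivors))]
    population = survivors + children

best = min(population, key=plan_cost)
print(best, plan_cost(best))           # best join order explored so far
```

Because the survivors always include the cheapest plan found so far, the "record the best explored so far" behavior falls out for free.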
Now, back to the analogy. For the better genes, we don't really know, since this is not a formal search process, whether it's the join order that makes a query plan very efficient, or maybe a specific choice of join algorithm that makes it very efficient. We just don't know. We just know, hey, this plan seems good. So what we do is pick the elements, the genes, from the surviving plans, recombine them together, and add a little bit of randomization on top, right? It's kind of like the evolutionary process. And eventually, after batches and batches, or here they're called generations, after generations, hopefully your genes evolve to be better and better, or even optimal, right? So that's how they deal with the case where the search space is too big: they deem that the normal plan enumeration and search process is probably not going to cover a large enough portion of the search space, so they resort to this randomized method to find a potentially better plan. All right? Does that make sense? Okay.

So just to wrap things up, right? Today we covered a lot about query optimization. We first talked about filtering, some of which was actually from the last class, where we always want to push the predicates down the query plan tree as far as possible, so that we can eliminate tuples from the enumeration as early as possible. In today's lecture, I also mentioned that when you look at the predicates and estimate the selectivity of each clause, you can leverage that information to reorder the predicate a little, so that you evaluate the clauses with lower selectivity earlier and filter out more tuples during predicate evaluation. We talked about how to estimate the selectivity of query plans; in particular, we talked about three assumptions we make to keep our lives easier: uniform distribution, independence between different attributes, and, for joins, the inclusion assumption, that the join-key values on one side are contained in the other. We also talked about alternatives to cope with cases where those assumptions don't really hold in practice; for example, there are histograms, and there are also sketches, which I didn't really write down here. I also talked a little bit about how we calculate join selectivity. And we talked about how to use dynamic programming, or this genetic method, to explore the different alternatives of join ordering to eventually generate a better plan as the query optimization result. Again, the last note I want to emphasize is that query optimization is pretty difficult. Like the earlier question in class brought up, there are many more properties involved in plan enumeration, and many more techniques to prune the search space and generate potentially better plans, et cetera; for the sake of time and the scope of this class, that is all we can cover. All right? So next class, we'll switch topics to transactions. If the optimizer is arguably the most difficult part of a database system, then transactions are arguably the second most difficult part. All right? See you next week. Thank you.
Sorry, great question. Oh, "arguably" is based on my opinion, right? The most difficult part of a database system is probably what we just talked about over these two weeks, query optimization, even though there are many topics there we haven't explored. But the second most difficult part is probably just transactions: how to deal with concurrent queries and make sure they can all access the database and get their results correctly.