So, real quick, just again, the things that are due for you guys in the course: homework four is due today at midnight. The midterm exam will be next Wednesday here in class, and then project two will be due the week after the midterm, on Wednesday, October 25th. Any questions about any of these things? Yes? I think his question is, will we grade the homeworks before the exam and provide them back to you ahead of time? Yes. So, everything should be graded already. All the solutions are online up to homework three. The homework four solutions will be put out this weekend, and then my plan is to release all the grades and all the graded material by this weekend. His question is, will we be holding a review session before the exam? No, we will have the review session right now, at the end of this lecture. Okay? Any other questions? All right, awesome. So, what we're going to talk about today is how the database system is actually going to figure out what the query plan should be for the SQL queries that show up, right? The last couple of lectures, we've talked about all these different algorithms and ways to execute queries, and now we need to figure out, you know, when our SQL query shows up, which one of those algorithms it should use. And so that's what we're talking about today, and this is generally called query planning or query optimization, and the reason why we have to do this in a relational database system is because SQL is declarative, right? You don't say, I want to do a hash join, you just say you want to do a join, and it's up to the database system to figure out what the best plan, or what the best algorithm, should be. And as we saw in the last two lectures, there can be quite a difference in the performance of these queries depending on what the plan is, depending on what algorithm it uses, right? In the worst case, for that one simple example, if you did the dumbest nested loop join, it could take 1.3 hours, but if you did a nice in-memory hash join, it could take, you know, half of a second. So this is why this matters, and this is one of the big reasons why the commercial systems that cost a lot of money, the Oracles, SQL Servers, DB2s, in general have much better performance than the open source guys. So the very first query optimizer goes back to IBM System R. Remember, I talked about System R in the beginning of the course. I said that it was this internal secret project at IBM Research in San Jose, where they took Ted Codd's relational model paper and they actually tried to put it into reality, tried to build a real system. And what's really fascinating about the System R story is that they didn't know what they were doing in some ways, because there was no textbook, there was no database course. They were sort of building the first one, and they had to figure out these things as they went along. And basically the way it was organized is that they had seven or eight people who all had PhDs, and this is the very early days of computer science, and not all of them probably had PhDs in computer science, because that wasn't as common, and they sort of carved off different parts of the system for, you know, each person with a PhD to work on. So one particular project that was very, very successful was the query optimizer.
And what they wanted to show was, people were claiming back then that there's no way a database system is going to be able to pick a query plan that can perform better than a query plan written by a human. The same argument was being made at that time that, you know, there's no compiler that could generate machine code or assembly code that's faster, or better, than what a human could write. Of course, now we know that's not true. And so they were trying to prove this, that, yes, you could have a declarative language like SQL and build a query optimizer that could generate good plans. And what's really fascinating about System R, as you'll see as we go along in today's class, is that a lot of the basic concepts from their original query optimizer are still in use today. They made some assumptions that we no longer make, which we'll go through, but the general idea of how they do query planning or query optimization is the same. So there's essentially two ways to do query optimization, two categories of optimizations you can apply. The first one is simple heuristics or rules, and I shouldn't say simple, but these are hand-coded rules that you can write in your query optimizer to identify certain patterns in the query plan and apply some standard optimizations. I'll show some examples in a second, but these don't require the optimizer to know anything about what the actual data looks like. You can just look at what the plan's trying to do, what the query's trying to do, and say, oh, this is kind of stupid, I can rewrite this to make it faster, and I can prune this thing out because the result will be the same. So there's a bunch of those kinds of rules that we'll talk about, but then there's also this cost-based search approach, where the database system can't figure out just by looking at the query what exactly the right plan should be, so instead it does a search over a bunch of different query plans and uses an internal cost model inside the database system to estimate which query plan will have the lowest cost, and that's the one it should pick. And this idea of a cost-based search approach for query optimization was the major contribution of the System R project. So for today, we're going to first start by talking about the heuristic- or rule-based optimizations, right? These are based on relational algebra equivalences. Then we'll talk about how to estimate what a query plan is going to cost when you actually run it, then we'll talk about how to do plan enumeration and join ordering searches, and then we'll finish up with nested subqueries and do our midterm review, okay? All right, so the first approach, as I said, is a set of rules that you can build inside your query optimizer to automatically rewrite the query in a way that will optimize it. Traditionally, this is usually called query rewriting. The basic way to think about this is that it's just a list of things that you check in the query one by one, and you see whether it has a certain pattern, and if it does, then you know you can apply a certain optimization. Again, the big advantage of this is that you can do some pretty good optimizations without ever actually having to go to the cost model inside the system and ask, well, what's this query going to cost, what will be the overhead of executing this query? We're not going to talk about this in this class.
We can talk about it in the advanced class, but this actually becomes problematic from a software engineering standpoint, because now you have all these hard-coded rules in your query optimizer and it becomes really difficult to debug. So Postgres has this; Postgres has a query rewriter before you get to the actual cost-based search optimization, and maybe it's gotten better in the last few years, but one of the lead programmers for Postgres is actually here in Pittsburgh. He's a CMU alum, and we had lunch with him, and we asked him about the rewriter, because we were thinking about taking it and putting it in our system, and he's like, yeah, that's a dark forest, right? Very few people actually know how this works. So again, we won't focus on this in this class. I'll show you what kinds of optimizations you can do, but the software engineering side we'll talk about more in the advanced class in the spring. So let's look at the most basic query rewriting optimization you can do: predicate pushdown, right? Say I have a simple query, I'm joining two tables, and I have my join clause where the student ID from the student table equals the student ID in the enroll table, but then I also have this additional predicate where I want to filter out, or rather I only want to get, the tuples where the student got a certain grade in the class. And so if you take the query as written in SQL and do a direct translation into relational algebra, you would end up with roughly a query plan that looks like this, right? At the bottom, you take the access methods and get the tuples out of the two tables, then you pass that into your join operator, then you do the filter on grade, and then you do the projection to produce just the attributes that you want. So what's the obvious thing we can do here to make this go faster? What's gonna be the slowest part of this query? The join, exactly. And there's nothing about the join that requires us to do the filter afterwards. So instead, we can just push the predicate down so that we filter all the tuples from the enroll table before we do the join. And that means we have to evaluate fewer tuples in the join, and the query's gonna run faster, right? This is pretty simple. You can always do this unless the filter requires data from the other table you're joining with, right? So if the filter is something like grade equals student name plus something else, right, you need values from both tables, and you can't do that predicate pushdown. In this case, it's really simple, you can do it, and it's a huge win. So again, the reason why we can do this is because we can rely on the fact that we're dealing with relational algebra, and we know that moving that filter around in our relational algebra expression doesn't affect the output that the database system will generate for the query, right? So the expression on the top is what I showed at the beginning, when you did a straight translation of the SQL query into relational algebra, and the one on the bottom is when I moved the filter over to be directly on the enroll table, before the join, right? And we can look at this and know the output will be the same. Relational algebra is unordered, so we don't care about ordering, right? We just care that we get the exact same results. In this case we do, and we know they're equivalent.
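As a rough sketch of what this rewrite looks like in SQL, assuming tables named student and enroll and illustrative column names (sid, name, grade are assumptions, not the exact schema from the slide):

```sql
-- As written: the grade filter sits above the join.
SELECT s.name
  FROM student AS s
  JOIN enroll AS e ON s.sid = e.sid
 WHERE e.grade = 'A';

-- Logically equivalent plan after predicate pushdown: enroll is filtered
-- first, so the join sees fewer tuples. Written out by hand it would look
-- something like this.
SELECT s.name
  FROM student AS s
  JOIN (SELECT sid FROM enroll WHERE grade = 'A') AS e
    ON s.sid = e.sid;
```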
So there's a bunch of these kinds of rules that you can apply for different types of queries, right? For selections, we already showed how to do predicate pushdown, right? You want to filter the tuples as early as possible. And if you have complex expressions, then maybe you want to push the pieces down to the different parts of the tree, so that you do the filtering on the one table and the filtering on the other table, right? In general, you always want to try to filter as much as possible before you get to the join, because the join's always going to be the slowest part. You can also do other things like simplifying somewhat complex predicates, again relying on Boolean logic or commutativity and associativity, right? So in this case here, x equals y and y equals three. Well, we know that's equivalent to x equals three and y equals three. And say x and y come from two different tables, then we can split that up into separate filters that we can run before we actually do the join. Or, you know, if we care about performance in an in-memory system, doing the lookup to find the value of x and then finding the value of y in memory, copying them to some buffer and then doing the evaluation on them, that's more expensive than just taking x equals three and y equals three, right? That's a direct comparison against a constant, and that can be done much more efficiently. All right, so there are a bunch of these kinds of rules that we can apply to the where clause. For projections, we essentially want to try to produce or generate smaller tuples as soon as possible in our query plan, so that we're passing less data from one operator to the next. We talked about this before when we talked about the volcano model or the iterator model, right? You're copying tuples as the output of an operator and passing them to the next one. So if you're passing along columns or attributes that are not even needed in your query, then you're wasting time copying things you don't actually need. Now, we don't have to worry about this in a column store, because as I said before, in a column store system the database can be really smart about only grabbing the columns that it actually needs, but in a row store, you go to some location in your page, your entire row is gonna be there, and you'll probably copy all of it up to the next operator. So instead, if you push down the projection, then you copy less, right? So let's go back to that example. Here we see that in our final output, we only need the student name and the course ID for the class they were enrolled in, but we also need the student ID from the enroll table and the grade in order to do all the other parts of our query. So we can't just look at the output of the query and ignore all the stuff that's inside of it. And so we can do projection pushdown, where on the student table we'll only carry along the student ID and the name, and then on the enroll table maybe we'll do a projection to get the student ID and the course ID after we do the filter on grade. Or we could put another projection in there before the filter and copy even less data. So for a small table, this is not really a big deal, and in a column store system you sort of get this naturally because you only grab the data that you actually need.
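A hand-written version of that projection pushdown, again with illustrative table and column names (sid, name, cid, grade are assumptions):

```sql
-- Each side of the join only carries the columns that are actually needed
-- further up the plan; everything else is projected away before the join.
SELECT s.name, e.cid
  FROM (SELECT sid, name FROM student) AS s
  JOIN (SELECT sid, cid FROM enroll WHERE grade = 'A') AS e
    ON s.sid = e.sid;
```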
This actually can make a big difference in distributed databases, because maybe you're going over the network to send data from one node to another as part of some kind of distributed query execution. And if you do the projection early, yes, you're paying a little computational overhead to make an extra copy and prune things out, but now when you send things over the network, you're sending less data. And that's much more expensive than copying things in memory. So we'll see this more when we talk about distributed query execution. But again, the basic way to think about it: you push down projections early so that you end up copying less data up the tree. So I want to go through a couple of fun examples. There's a blog post that somebody wrote, actually only a few weeks ago, where they had a bunch of optimizations you can apply to SQL without requiring a cost model at all. And what's really fascinating, and I encourage you guys to read it, is that they didn't just take Postgres and try it out; they tried Postgres, SQL Server, DB2, Oracle, and MySQL. So they took a bunch of these examples where, based on relational algebra, we know that we can apply certain optimizations, and he went through and showed which database systems support them; some do and some don't. I only have Postgres and MySQL available for a demo, but if you want to see how it works in the other systems, you should take a look at that. So the first kind of optimization we can do is to remove either impossible predicates or unnecessary predicates. In our first example here, we have select star from table where one equals zero. Obviously that's always false, and therefore it can just remove that. You could actually be even smarter too, right? Because you know no tuple is gonna qualify, so it shouldn't even run the scan. There's no point in looking at every single tuple because the predicate is always gonna be false, no tuple is gonna come out of it, and it can just return nothing right away. And we'll test to see whether they actually do that. In the case of the second one, we have select star from table where one equals one. Again, that's always true. So in this case here, we don't need to actually evaluate this predicate for every single tuple. It can just say, all right, I'm doing a sequential scan without any where clause at all, and just emit all the tuples. And this next one here, this one's a bit more nuanced and a bit more tricky. I don't wanna get into transitive closure, but the basic idea here is that for this query, we're doing a self-join, right? We have some primary key ID and we're saying where A1.ID equals A2.ID. It's both the same table, it's the primary key, so you know they have to always match. So this query is really only equivalent to just select star from A without doing a join at all. So the question is, what if you have a non-unique key? Right, so for this example, since it's the primary key, it's unique, right? We can try it with a non-unique key next.
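The three rewrites just described, written out (table A and its primary key column id here are placeholders):

```sql
-- Impossible predicate: always false, so a good optimizer never even scans A.
SELECT * FROM A WHERE 1 = 0;

-- Unnecessary predicate: always true, so it can be dropped from the scan.
SELECT * FROM A WHERE 1 = 1;

-- Redundant self-join on a primary key: equivalent to SELECT * FROM A.
SELECT A1.*
  FROM A AS A1
  JOIN A AS A2 ON A1.id = A2.id;
```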
All right, two other interesting ones. First, ignoring projections. In this first query here, we're doing a select on A, and then we have a where clause that says match any tuple where there exists some tuple with the same ID. But again, we're doing a self-join on A, so A.ID in the first table will always match in the second table, because it has to exist; they're the same table. So in this case here, we're doing again select star from A in the inner query. If you ran this without it being a nested query, what would happen? It would take all the attributes and shove them back to you as the result of the query. But in this case here, the exists predicate says only evaluate to true if there exists at least one tuple that matches. And I don't care what the contents are, I just care that it exists. So the system doesn't actually need to materialize the output of that inner nested query. And furthermore, it should know that it's gonna evaluate to true because I'm doing a self-join. So we don't have to do any copying of data for the inner query, and if we're really clever, we can just get rid of the inner query entirely. And then the last one is merging predicates. So in this case here, I have select star from A, and I have a between clause on some value, and between is just shorthand notation for greater than or equal to and less than or equal to. So I say value is between one and a hundred and value is between 50 and 150. And so I could rewrite that, actually, wait, that should be an or, right? So if it was an or, you could say value is between one and 150. And then now you don't have to evaluate both sides of the expression tree, you can just evaluate one of them. So let's go look now in Postgres and MySQL and see which of these they actually support. You guys see that? Good, okay. So here's the first one, where we had an impossible query, right? And again, explain is gonna spit out the query plan that tells us what actually happens. And this is a sample database that's available online that is modeled after a DVD rental company or a movie rental company. I'm dating myself with DVDs, right? So we'll do this first one. It's an impossible query plan, so nothing should be evaluated. So Postgres says, hmm, it actually looks like it wants to scan it. So I might be misreading that. So let's run, I lost my cursor, awesome. There we go, sorry. Again, we can run explain analyze to show what it actually did when it ran, right? And here it tells you it never executed, right? It was smart enough to recognize you have an impossible predicate, and it said I'm not gonna execute this at all, and it immediately returns nothing. So let's try it now in MySQL, right? So here they declare that they have an impossible where clause. So again, it's smart enough to know that it can't possibly execute this query, so it doesn't even bother with it. So let's look at another example. I would say, as a minor aside, the explain output in Postgres is way better than MySQL's. In Postgres, you can actually see the query plan tree; in MySQL, they don't actually expose that to you. All right, maybe it hung, awesome. Demo time. Okay, sorry. Let's try the one now with one equals one, right? And here Postgres was able to recognize that it always evaluates to true, so it doesn't even bother trying to apply the predicate. In the case of MySQL, right, again it didn't bother applying the filter and it immediately skipped everything. We have to see what the warning is though. All right, so let's look at another example. Let's do the range query one. So again, we have a table where we have an impossible range, right? We're gonna look up all the film IDs from this film table between one and two, and film ID between 190 and 200. This is impossible because, again, they're non-intersecting sets. So Postgres recognizes, or sorry, MySQL recognizes that there will not be any match in this query, so it doesn't even bother actually executing it.
The const table is sort of a placeholder to say, I don't actually need to run this query, I can produce the result right away. And if we go to Postgres with the same range, you see Postgres was not able to combine it; it just rewrote the between clauses into the long-form notation. So it actually looks like it applied everything and actually ran it, right? So again, there's a bunch of different optimizations, not all systems support them all, and the commercial guys are much, much better at this. Let's see, we can do another interesting one, the projection filtering, because I think that's kind of interesting. Right, so we can do something like this. Select, that's gonna fail. Select exists, and if you put anything here, it'll evaluate to true, right? Exists says return true if something actually exists inside it, and that's not gonna work as written. Gotta put a select one in there, right? So this says run the exists clause and return true if anything matches inside of it. Something matches, and so that's why it returns true there. So let's see whether it actually ran the inner select statement, the actual projection, and the way we can do this is with a division by zero. If it throws an error, we know that it tried to actually run the projection. If it doesn't throw an error, then it recognized that it doesn't need to do anything and will just return true. Who says it's gonna fail? Who says it's gonna work? Raise your hand if you say it's gonna fail. This is Postgres. Everybody puts their hand up, right? They say fail. Who says it's gonna work? One, two, okay. It worked, right? And again, what's going on here is that Postgres recognized that it doesn't need to return anything from the inner projection or the inner query. So it just says, yeah, I got something, I'm satisfied, right? I won't get into the details, but InitPlan is what they call it when you do a select without a table. So let's try this in MySQL. Who says it's gonna work? Who says it's not gonna work? Raise your hand if you say yes, it'll work. A few. Raise your hand if you say no. The rest don't know, okay. It worked, but this is actually problematic, because MySQL, at least in this version, doesn't actually throw an error when you do one divided by zero. If you go back to Postgres and do one divided by zero, right, it throws an error. So we need to make it do an impossible thing to test it better. And another way to do that is to take a power that shouldn't actually work, right? But it produced the same answer. So in both cases it worked, all right.
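For reference, the probe from the demo looks roughly like this; the film table is from the sample DVD-rental database, and whether the inner expression ever gets evaluated is exactly what we were testing:

```sql
-- EXISTS only needs to know whether a row comes back, so a smart optimizer
-- never evaluates the inner select list. If it did, the division by zero
-- (or a bogus power() call) would raise an error instead of returning true.
SELECT EXISTS (SELECT 1/0 FROM film);
```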
All right, so again, these are optimizations that the database system can apply without having to go to the cost model at all, right? In the case of joins, we can do the same kind of thing. We can rely on the commutativity and associativity properties of joins in relational algebra to rearrange the join orders. And this is gonna be challenging, because this is where the bulk of the time in our queries is gonna be spent, on the joins. And the performance you get from one join ordering versus another can vary a lot. So we need a way to figure out how to identify the best join ordering for our query. And the problem is that it's a really, really large search space. The number of possible orderings is given by what's called a Catalan number, which is roughly four to the n. There's a Wikipedia article you can look up to learn more about it. But again, the main thing here is that for an n-way join in a query, there's a ton of different join orders you can have. So it would take forever for us to go and try every one to pick the right one. So we're gonna do a bunch of tricks to limit this down. We'll see in a few more slides what kind of basic restrictions we can apply to prune down our search space, and then we'll see how to actually figure out which plan is the best one. All right, so now we sort of know how to do basic query rewriting to generate a query plan, and we have all these different choices about the algorithms we can use to run our joins. We need to figure out how to identify whether one plan is better than another. And again, this relies not only on the join order and what algorithm you're using, but on a whole bunch of other things, like what's in memory, what's on disk, things like that. So the way the system's gonna do this is it uses what's called an internal cost model. And it's not trying to estimate things in terms of absolute numbers in the real world. It's not gonna try to estimate, we think your query is gonna take this number of milliseconds or seconds. It's always based on some internal metric, based on the resources that the query is gonna have to consume. And that's sort of a stand-in, or it's emblematic, of what the real runtime would be. So you could try to do things like measure how much CPU time your algorithms are gonna take. Commercial systems do this; the open source guys don't, because this is pretty tricky to actually figure out. Typically they do things like figure out the number of times you're gonna have to read and write things from disk, the amount of memory you're gonna have to use, and then, if you're in a distributed database, you also estimate how many messages you're gonna have to send over the network. So in order to do this, though, we have to know how many tuples we expect all the operators in our query plan to read and write. And in order to do that, we have to keep some information about what our data looks like, so we can make an approximation, a good estimate, of how many things are gonna come in and out. And so to do this, the database system is gonna maintain internal statistics about all the various components or elements of your database and then use those at runtime to estimate what a query's cost is gonna be. So we'll show this in a second, but you can track obvious things like the number of tuples you have in a table or the number of keys you have in an index, but also other things, like what the distribution of values is for individual attributes or individual columns. Different database systems maintain different metrics or different statistics. And again, this is one of the things that differentiates the open source guys from the commercial guys. The commercial guys are way more robust and have much more accurate statistics than the open source guys. And then there's another question of when you actually go and maintain or collect or update these statistics. You could do it every single time you insert a new tuple, check what the values are you're inserting for the attributes, and then go update some internal histogram or statistics table, but that would be really, really slow.
So typically the way these things work is that the system will periodically run a scan, look at all your tables, and then update the statistics, or if you've made a major change to a table, like you've updated a hundred tuples, then it'll go through and recompute the statistics and other things. In Postgres, actually in every single database system, you can force the system to recompute your statistics. I don't think it's in the SQL standard, but typically everyone has the analyze keyword. MySQL's always gotta be different; you gotta say analyze table. But it basically says, analyze this table and go through and recompute all my internal stats. And then that information can be used by the cost model to figure out what the execution cost will be for a particular query. So the obvious things that the system's gonna maintain about your tables are, again, things like the number of tuples and the number of distinct values for an attribute. And again, real systems actually maintain way more than this, but it's enough to give you a rough idea of what you can do with this information. So just based on these two metrics, the number of tuples and the number of distinct values, we can derive what's called the selection cardinality, which is the average number of tuples within a table that will have a particular value for a given attribute. You just take the number of tuples divided by the number of distinct values, and that's roughly what the cardinality will be. So what's one obvious problem with this approach, or this formula? I'm taking the number of tuples, dividing by the number of distinct values I have, and that'll tell me, for a given value, the number of records that have that value. What do you say? It doesn't account for the distribution at all, right? This makes the huge assumption that the distribution of data is uniform. And we know in reality this is not true. In the School of Computer Science, or rather at CMU, we have roughly about 10,000 students and we have 10 colleges, but SCS makes up a large portion of it, right, and some other college has a smaller number of students. So if we use this formula to try to estimate this, we would end up with possibly incorrect estimations. So I bring this up just to say that for what we'll talk about here, we'll assume that our data is uniform, but in reality it's not, and different systems do different things to get around this issue.
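In symbols, the statistic just described, with N_R for the number of tuples in relation R and V(A,R) for the number of distinct values of attribute A, is:

```latex
SC(A,R) = \frac{N_R}{V(A,R)}
\quad \text{(average number of tuples per distinct value, assuming a uniform distribution)}
```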
All right, so now we can go through and talk about how we can use the selection cardinality to figure out the number of tuples we're gonna have to access when we run certain queries. The easiest one to handle is an equality predicate on a unique key: select star from A where ID equals one-two-three. Assuming ID is the primary key, I only have to read one tuple, right? That's really easy. The tricky thing comes when you have more complex queries, more complex predicates, where it's not exactly obvious what the selectivity is and you can't just rely on that simple formula we had before. So we have to make a bunch of other assumptions as we go along in order to compute the estimated cardinality for these different queries. Again, the goal here is to figure out the number of tuples we're gonna have to access, or read or write, for our query, and then we can use that to figure out the estimated cost of running our entire query plan. So we would know that for reading from table A, in the case of that first query, I'm gonna read one tuple, right? I'll read one page, go fetch it, and then read one tuple out of it. And then if I have another operator in my query plan, like a join, I know that I'm shoving a single tuple up to that next operator. I can build upon that and recursively compute these things to figure out what the total cost will be for my query. So now we're gonna define a concept called the selectivity of our predicates, which relies on the selection cardinality that I talked about before. The basic idea here is that for a given predicate in our query, we wanna estimate the fraction of tuples that are actually gonna satisfy it, or evaluate to true, and therefore be moved up the query plan. And to handle all possible predicates, there's a bunch of different formulas we have to apply for the different cases: equality predicates, range predicates, negation, conjunction, and disjunction. And again, we're gonna make this huge assumption here that our data's uniform, but for our purposes, that's fine. So as I said before, the simplest case is that if you have an equality predicate on a unique key, then you know you read exactly one tuple. But if you have a non-unique key, then you basically run the formula: you take the selection cardinality of the predicate and divide it by the number of unique values, and that gives you the selectivity of a particular equality predicate, attribute equals some constant. So the way to visualize this is, say you have this particular query here, age equals two, and say your data looks like this. The system's gonna maintain some internal histogram that says, for all the distinct values of this attribute, here's the number of times each occurs. So in this case, for age equals two, the selection cardinality is just the bar for age equals two, and I'm missing the y-axis, but the value is one, so we know the selection cardinality for this has to be one. So we divide one by the number of unique values, and that tells you the selectivity of age equals two is one-fifth, right, sort of obvious. There are five unique possible values, we're looking for one of them, so the selectivity will be one-fifth, and then we can multiply that by the number of tuples, and that'll tell us how many tuples we expect to emit. All right, so the more complex thing is a range query. Here we can use this simple formula up above, where you take the min and the max, subtract them from each other, and divide the part you're looking for by that, and you end up with an estimate of what the selectivity of this predicate would be. Now this is a good example because it shows you how these formulas can be inaccurate, because they're simplistic and they assume this uniform distribution, right? So in this case here, the true result for this query when you actually run it would cover three values, but we make the assumption in the formula that we start from two and are inclusive of it. If you had age strictly greater than two, you would run the same formula, and then you'd be counting things that shouldn't be in your query at all. So you can get these inaccurate estimates just based on looking at these simplistic histograms.
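As a rough summary of the two cases just walked through, where SC(P) is the selection cardinality of the predicate and the min and max come from the histogram, the approximations sketched on the slides are roughly:

```latex
\text{equality: } \operatorname{sel}(A = c) \approx \frac{SC(P)}{V(A,R)}
\qquad\qquad
\text{range: } \operatorname{sel}(A \ge c) \approx \frac{A_{max} - c}{A_{max} - A_{min}}
```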
This assumes that your histograms are accurate as well, which is a problem. But again, we're not actually running the query based on these formulas, we're just estimating what the cost is gonna be. So our estimate might be wrong, but we'll always produce the correct answer. For negations, it's pretty easy. It's always one minus the selectivity, right? In this case here, the selection cardinality of age equals two is one, and the negation of this is these two other regions here, and that matches up with four-fifths, right? That's sort of obvious, because the selectivity of age equals two is one-fifth, so one minus that is four-fifths. So this works out fine. And based on this, it's sort of obvious now that the selectivity is roughly equivalent to the probability that a tuple is gonna satisfy our predicate. It's not an exact match, there are some corner cases where this doesn't pan out, but because we're gonna treat it as a probability, all the same principles and tricks you learn in a probability course we can now apply to our predicates here. And this shows up in things like conjunctions. If we assume the predicates are independent, then you multiply the selectivity of one predicate by the selectivity of the other, and that gives you the selectivity of the two predicates combined, right? So if age equals two and name is like some wildcard, then multiplying the two selectivities together gives you this middle region here, right? And this assumes again that the predicates are independent, which they're actually not, not always. Same thing for disjunction: it's essentially just the union of the selectivities of the different predicates, and you run this formula here and that produces the total estimate for the combined predicate. All right, so these are sort of straightforward, and again they're not always gonna be accurate, but the thing we're really gonna end up caring about is the size estimation for joins. What we're trying to do is, for a given join of two tables, we wanna estimate the number of tuples we're gonna emit, because that's gonna tell us how many tuples will then be fed into the next join operator, and we wanna determine whether we maybe wanna flip those joins around so that we're always pruning things out as early as possible. So the problem we're trying to solve is, for a given tuple of R, we wanna estimate how many tuples in S will actually match after we apply our join predicate. I don't wanna go through all the math, but basically you take the number of tuples on the left-hand side and the number of tuples on the right-hand side, and then you divide by the maximum number of unique values for the join attribute, and that roughly tells you the number of tuples that will match. And again, the exact details of the formula don't matter, but this is roughly what people do in a real system.
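Writing out the compound-predicate rules and the join estimate just described (the independence and uniformity assumptions are doing a lot of work here):

```latex
\operatorname{sel}(\lnot P) = 1 - \operatorname{sel}(P), \qquad
\operatorname{sel}(P_1 \wedge P_2) \approx \operatorname{sel}(P_1)\,\operatorname{sel}(P_2), \qquad
\operatorname{sel}(P_1 \vee P_2) \approx \operatorname{sel}(P_1) + \operatorname{sel}(P_2) - \operatorname{sel}(P_1)\,\operatorname{sel}(P_2)

|R \bowtie_A S| \approx \frac{N_R \cdot N_S}{\max\big(V(A,R),\, V(A,S)\big)}
```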
All right, so I mentioned before that the system is gonna build these histograms to try to come up with these numbers, right? You basically look at your histograms, figure out what the values are, and then apply these formulas. But as we said, all these formulas work out nicely when we assume that our values are uniformly distributed, right? So your histogram actually looks like this: along the bottom you have the distinct values for an attribute, and then the number of occurrences of each. But again, real data doesn't look like this, real data usually looks like this. And typically it's more likely to be Zipfian skewed, but for now this is fine. So if we have to maintain a histogram entry for every single distinct value in our table, this is gonna be really expensive to do. Let's say I have a unique key. I don't wanna have to maintain a separate 32-bit integer to count the number of times each value occurs, because it's always gonna be one, right? Or if my table has a billion rows and maybe half a billion distinct values, I don't want to maintain a histogram entry for every single one. So what people end up doing is they use what are called buckets to combine together different values in the histogram, and then they only have to maintain the number of entries for all the values in that particular bucket. So in this first case here, for the range between one and three, I'm gonna keep, say, the count of eight, right? Because that's the sum of all of these guys here. So now what I wanna do is, say I have the value two: how many occurrences do I expect there to be? I divide the count for that bucket by the number of distinct keys that are in it, and that'll tell me roughly what the count is for the value I'm looking up in that bucket, right? So what's the problem with this? Right, so the statement is that if your buckets all cover the same range, all have the same width, then you can end up with really bad estimations, right? So in this case here, this last bucket has 13, 14, 15. The value 15 had a really high count, the value 13 had a low count, but when I put them into the same bucket, I sort of lose that upper bound and lower bound. So the way we can fix this is to use what are called quantiles, or an equi-depth histogram. We still bucket up our values, but we're gonna try to bucket them in such a way that the count per bucket is roughly the same. The number of distinct values per bucket might be different, but the count will be about the same, right? So in this case here, now instead of having five buckets, I'll have four. The first one will have five entries with a count of 12, these others also have a count of 12, and this middle guy here has a count of nine. And so now when I compute my ranges like this, I end up with a closer approximation of what the count actually is for a value in the range.
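To make the bucket estimate concrete with the numbers from the equi-width example above (the exact histogram on the slide may differ slightly):

```latex
\text{equi-width bucket } [1,3]:\ \text{count} = 8,\ 3 \text{ distinct keys}
\;\Rightarrow\;
\widehat{\text{count}}(v = 2) \approx \tfrac{8}{3} \approx 2.7
```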
Yes? It doesn't sound like you can always get numbers that are like this, because... It sounds like what, sorry? So your statement is that this doesn't always work, right? That's correct, yes. But what's the alternative, right? The alternative is to maintain a distinct count for every single value, and that'd be super expensive to do, right? This is an approximation. So his statement is, rather than trying to do this equi-depth thing, or the quantile thing, just randomly generate buckets. But when do you do this? Data's coming in, presumably, right? If your data's read-only, then you can compute a super accurate histogram. But the database is changing all the time, right? And maybe in some systems they can be smart about, oh, I only changed these pages and not these other ones, so maybe I only incrementally update my histogram, things like that. This is a big challenge, and it's actually one of the major problems in database systems. There's no magic solution to make this work, and people have spent a lot of time studying this. It's a very well-known, old problem. So again, most database systems will do something like this. What's an alternative to maintaining histograms? Can anybody think of a really simple approach that could maybe produce the same result without having to maintain these histograms, which can be inaccurate? So it's sort of kind of like a B-plus tree? What do you mean? Oh, sorry, that's not quite what I'm after. Instead of maintaining a data structure that you then derive statistics from, what's an alternative? Sampling, right? So you don't maintain the histograms, and when your query shows up, you go peek ahead and look at the table you're gonna scan, or whatever data you're about to access, randomly pick a small subset of it, examine the data, and figure out what the actual selectivity is for the predicates you're trying to run. So let's say I have a simple table like this, but it's huge, I have a billion tuples, and I wanna estimate the selectivity of age greater than 50. I'll randomly pick some subset of the tuples, make a copy in memory, and then just scan through the sample, or I could also compute a super accurate histogram over it, and find the number of entries that satisfy my predicate. Then I can extrapolate from that: if I know what the selectivity of my predicate is on the sample, I can say that that's the selectivity on the true data. So this is actually something the commercial systems do that the open source guys don't do. I think SQL Server does this, and SQL Server has the best query optimizer. So again, there's a whole bunch of other design questions about how you actually implement this. Do you always keep the sample around, or do you regenerate it for every single query? Typically you keep it around, but then when you go back and resample, there's usually a trigger or a threshold that says my table has been updated by this much, and therefore I wanna go grab another sample. And this is not even bringing in transactions: how do you maintain consistency with the data when you copy your sample, right? All of that we can ignore for now.
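Postgres exposes something in this spirit through TABLESAMPLE, which you could use by hand to eyeball a selectivity; real optimizers keep their own internal samples rather than doing this per query. The people table, the age column, and the one percent sample size here are just placeholders:

```sql
-- Scan roughly 1% of the table's pages and extrapolate: the fraction of
-- sampled rows with age > 50 approximates the predicate's true selectivity.
SELECT (count(*) FILTER (WHERE age > 50))::float / NULLIF(count(*), 0) AS est_selectivity
  FROM people TABLESAMPLE SYSTEM (1);
```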
Okay, so now that we know how to roughly estimate the selectivity of our predicates, the question is, what can we actually do with them? This is where we get into the actual query optimization side of things, where we're gonna use our cost model, based on these selectivities that we can generate, to estimate what the costs will be for different query plans. The steps are essentially: you generate a bunch of alternative plans, then for each of those plans you use your cost model, which uses your statistics or sampling method, to estimate what the cost is for that query plan, and then you try to identify whether one query plan is better than another. Again, the cost is not something in the real world, it's an internal metric to say that this query plan is better than that one, and whatever query plan we find that has the lowest cost, that's the one we'll pick. So for single-relation queries, the name of the game, the problem we're trying to solve, is essentially figuring out what the best access method to use is. And again, if it's a primary key lookup, that's simple, because you know exactly what to go find and where to look it up. In this case, simple heuristics are often good enough. You don't need to do a branch-and-bound search or an exhaustive search to find these. And this is especially easy for OLTP queries, because they're almost always gonna do lookups on indexes. So it really is just figuring out what the best index to use is for a particular query, and that's just whichever one has the best selectivity. Queries that can easily be run using an index scan like this are typically called sargable queries, as in search-argument-able. I don't know if the textbook uses this term, but it's common in other literature. And again, what we're doing here is just looking at our queries, looking at our where clauses, finding what predicates we have, and then picking whichever index is the best one to use. And then we don't even bother running through the cost-model search that we'll talk about later. Postgres does this, MySQL does this, Oracle and DB2 do this. SQL Server, I think, runs everything through a different optimizer design, which we won't talk about here, but we'll cover it in the advanced class. So the tricky thing we've gotta deal with is how to do multi-relation query planning, or again, join ordering. The two things we've gotta figure out are: in what order should we join the tables, and what join algorithm should we use? As I showed before, the number of join orders you can possibly have is huge, right, four to the n. So we can't just sit there and burn cycles forever trying to find the best join order; we have to do something to prune the space. And the way query optimization is typically taught in an undergrad intro class like this is we tell you what System R does, but I'll say that it's just meant to give you the idea of what you can do to prune the search space; not all systems actually make the assumptions or do the pruning that they did back in the 1970s. So we'll go through what System R does, and I'll talk a little bit at the end about what real systems do. So the key thing that System R does to reduce the search space is to only look at what are called left-deep trees. These are trees where you have all the joins on the left-hand side, and the output of one join is then fed into the left-hand side of the next join. So this over here is called a bushy tree, and this is sort of a weird hybrid. So right away, System R would not even consider these other query plans; it would only consider these here. Can anyone take a guess why? Besides just reducing the number of things you have to examine, what's one benefit we could have if we execute query plans using only left-deep trees? Exactly, yes. So he said you only have to keep one join table in memory, right? And this, again, is called pipelining. The output of the first join can be fed immediately into the input of the next join, right? It may not fit entirely in main memory, but I'm not gonna flush everything out and then go run another join. If you go back to the bushy tree here, say I run the C and D join first, right? Then I materialize that output, and then I go over and run the join on A and B, and I'm not using any of the intermediate results that I generated from the first join.
So the database system can just flush that out to disk, or swap it out to disk. And then when I'm finished with the join of A and B, I bring the join of C and D back in and then join with that, right? But in the left-deep tree, I can try to maximize the amount of reuse of the data that I generate as I go up the tree. Now, that's not to say that all left-deep trees can be fully pipelined; again, if the intermediate result is too big, it may get swapped out, but in general this will work. So what we're gonna do now is figure out all the possible things we wanna search over. There are three basic steps. We've gotta enumerate all the different join orders we can have, then we've gotta figure out all the different algorithms we could use for each join, and then we've gotta figure out the access methods, or access paths, we wanna use to access each table, right? And the way System R does this is through dynamic programming, a sort of divide-and-conquer approach, where it's gonna break the search problem up into smaller chunks, build upon what it learns from the previous chunk, and then always move forward to the final answer. So the way to think about this is with a really simple example like this. Say I wanna join tables R, S, and T, and I wanna figure out what the right join order is, so I have a bunch of different paths for how I can get to my final result. In this first step, the edges correspond to the different join algorithms we can use. So I can either join R and S using a sort-merge or a hash join, or I can join T and S using either algorithm as well. And then what I'll do is, for each of these possible paths, I'll compute the cost of executing that query plan using my internal cost model, based on the statistics that I derived before, and then I'll throw away all the ones except the one with the best cost. And then from each of these new positions I'll do the same thing: I'll estimate what the different costs are to do the second join in both of these directions, and at the end, again, I pick the one with the best cost and throw away the others. At this point I have a path from the beginning to the end, but I have multiple paths, and each of these has a total cost, and I just throw away the one that had the higher cost of the two. So when I'm back here, I don't need to examine again, well, maybe I wanna try the hash join and the sort-merge join going along the bottom. At this point I know the hash join was the best way to go, so I only use that when I figure out how to get to the next one. We're not exhaustively searching this, but it's a close approximation that's good enough. So again, the three steps are: we're gonna enumerate all the join orders, we're gonna pick our algorithms, and then we're gonna enumerate our access methods. So let's walk through an example and show all these different steps, but I'll say again, real systems don't actually do exactly this.
So again, we enumerate the join orders. We already know we have all these possible combinations, but we can also consider joins versus cross products, or Cartesian products. In that case, the system immediately throws those out; you almost never wanna do a cross join unless somebody explicitly asks for it, so we'll skip all those. Then we pick one of these join orders and go through and enumerate all the different algorithms we could use, in this case a nested loop join or a hash join, and then we do this for all the other plans that I skipped on the last slide. Then I pick one of these guys, I enumerate all the different access methods I could possibly have, and then I build that giant search graph that I showed, walk through it, and do the search, and I don't bother computing the cost for paths that I know I'm never gonna take. So this is just showing you how you actually build the complete search graph and then estimate the cost at every single step as you go along. Okay, how much time do we have left? Ten minutes, awesome. All right, so to finish up, the next tricky thing to talk about is how we handle subqueries. Again, this is another good example where traditionally the open source guys were much worse than the commercial ones. Nested queries have always caused problems. MySQL was notoriously bad at this, but they've gotten much better lately, and Postgres has actually always been pretty good. So we have this nested query inside our query, and we need to be able to derive what our selectivities are and what the cost is gonna be, but it's tricky, because now the output of the inner query affects the predicate that we're trying to estimate, and we don't know what that output is gonna be until we actually run it, and by then it's too late for us to do cost estimates. So what systems are gonna do is rewrite these queries in two different ways to avoid doing the slowest thing. The two approaches to rewriting your subqueries are: first, rewrite them to decorrelate them, or flatten them, into a single join; or second, break the nested query out, run it first, store its result in a temp table, and then feed that into the outer query using a join. So let's go through a really simple example. Here we have a select with an exists clause, and inside that we have a nested query, and in this case we see that we're essentially just doing a join: select name from sailors, but inside the nested query we're comparing the sailor ID from the reserves table against the sailor ID from the sailors table. So this is essentially just a join, and we can rewrite it to be a flat query that's not nested and now just runs as a single join. The reason why we're doing this is that for the top query, the dumbest thing you could possibly do is, for every single tuple in the outer query, rerun the inner query, get the result, and then check whether the predicate evaluates to true. You laugh, but this is what MySQL used to do, and this is what a lot of early systems would do. But if you understand the semantics of what the query is actually trying to do, then you can rewrite this, and now it's much faster.
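A sketch of that rewrite, using the sailors and reserves tables from the example (sid as the sailor ID column is the usual textbook naming, so treat the exact columns as assumptions):

```sql
-- Correlated form: a naive plan re-runs the inner query once per sailor.
SELECT S.name
  FROM sailors AS S
 WHERE EXISTS (SELECT * FROM reserves AS R WHERE R.sid = S.sid);

-- Flattened form: the same thing expressed as one join. A real optimizer
-- would use a semi-join so duplicates behave exactly like the EXISTS version;
-- DISTINCT here is just a rough stand-in for that.
SELECT DISTINCT S.name
  FROM sailors AS S
  JOIN reserves AS R ON R.sid = S.sid;
```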
So let's look at a more complicated example. Here, in English, what we're doing is: for every sailor with the highest rating over all sailors who has at least two reservations for red boats, we're gonna find their sailor ID and the earliest date on which that sailor had a reservation for a red boat. So we have this inner query here, where we're trying to get whatever the max rating is over all possible sailors, and again, the dumbest thing to do would be to run this query over and over again. So instead, what we wanna do is decompose it and break it up, so that we can store that result somewhere else and then reuse it later on. This is essentially what decomposition does. It's harder to do than flattening, but it actually makes things easier for our query optimizer, because now you only have to optimize one query at a time; you don't have to figure out how these things are related to each other. So again, using our example, we have the nested block here. We can take that out, run it separately first, and then whatever value it generates gets fed into our query here, and then we just execute the outer block. And now it's no longer running that inner thing every single time; it's a constant expression, we actually know how to compute the selectivity of this, and we can use that to estimate what the best query plan is gonna be. If we had time, we could open up Postgres and MySQL, actually Postgres would be better, and you could run nested queries and see whether it rewrites them using decomposition or flattening. I think in Postgres, most of the time I see it doing flattening, trying to rewrite things as joins. All right, so just to finish up, the main things from query optimization are: we always want to try to filter as early as possible, always push down as much computation as we can to the bottom of the query plan, to try to make it do less work and move less data up to the other operators. That's the main thing. And the way we're gonna estimate whether one query plan is better than another is to rely on these internal statistics that the database system maintains about what your data looks like, and then use the predicates that are in your query to estimate what the selectivity will be, and that will tell you how much data each operator is gonna read in and how much data it's gonna write out. We can use dynamic programming for join ordering; again, in the advanced class we'll cover that in much more detail. And then we can do rewriting of nested queries to avoid having to run the same query over and over again. All right, so any questions about this?
All right, so to finish up, the main takeaways from query optimization: we always want to filter as early as possible, pushing as much computation as we can down to the bottom of the query plan, so each operator does less work and moves less data up to the operators above it. The way we estimate whether one query plan is better than another is to rely on the internal statistics the database system maintains about what your data looks like, and then use the predicates in your query to estimate the selectivity; that tells you how much data an operator is going to read in and how much it's going to write out. We can use dynamic programming for join ordering, which we'll cover in much more detail in the advanced class, and we can rewrite nested queries to avoid having to run the same query over and over again. All right, so any questions about this?

Query optimization is very, very hard. I covered it in a single hour, but we could do an entire course on this. I'll also say that this is a very desirable skill in the real world. I had three students build our query optimizer last year, and every single database company that contacted me interviewed all of them, because the only people who have really worked on query optimization are, like, crusty old guys, and there are all these new database companies and new systems that need a good query optimizer, and not a lot of people have experience in this. We'll cover this in much more detail in the advanced class, where we'll talk about what the real systems actually do. But if this is the kind of stuff you're interested in, get in touch with me, because even though it's an old problem, there's a lot of new research I think we could pursue.

Okay, in the remaining fifteen minutes I'm going to pass around the practice exam. One copy has the solutions and one is a clean copy without them, so that if you want to test yourself you have both. This is the exam we gave in the old version of the class from, I think, two or three years ago; I stripped out the things that we didn't cover in this course, and it's a good approximation of everything we'll have talked about. So again, everyone has to come next week, in this room, from 12:00 to 1:00. The things you need to bring are your CMU ID, because I will check it when you turn in your exam; a calculator, because there will be logarithms and things like that; and you're allowed one 8.5-by-11 sheet of handwritten notes, and you can write on both sides. Please do not bring live animals. We had this problem a year ago when someone brought a dog, and that's not good for the other test takers, so please do not do this. The therapy snake, though, actually is real; we'll talk about that later. Okay, so what is the exam going to cover?
It'll cover everything up to and including what we talked about today, though obviously I can't ask you a very complex problem about query optimization since I just taught it to you, so the focus will really be on the other material we've gone through so far. If you have any special accommodation needs, please email me as soon as possible so that we can make arrangements. There's a link on the course website that will take you to the exam guide; it's roughly a summary of everything in these slides: what to bring, where to show up, and what the exam will cover.

So I'll go really quickly now and talk at a high level about the main topics the exam will cover. We started off talking about the relational model and relational algebra. You should understand the basic concepts of these, including the basics of integrity constraints; if you understand all the operators in relational algebra and what they do, that will be enough. We will cover SQL. I didn't teach the basic operations, but you should have figured those out in the first homework assignment, and we talked about the more complex things in class. Obviously I can't have you write a really complex SQL query, because that would be painful to write with pen and paper and also difficult for us to grade, so it should be obvious what I'm trying to say here: we're not going to ask you super hard SQL questions.

Then we'll cover the storage material. You built a buffer pool manager in the first project, so we'll cover the different replacement policies you can use for swapping things in and out, and how you can organize the data on disk: what pages look like, what heap files look like, and how you can organize your heaps. The most important thing is that you understand, at a high level, the trade-offs between these different storage models, of doing pages one way versus another and heaps one way versus another. We also covered LSMs a little bit; again, at a high level, what are the advantages and disadvantages of the different approaches.

Then we spent time on hashing. We talked about extendible hashing and linear hashing, and about open-address hashing, cuckoo hashing, and chained hashing; all of those will be covered on the exam, along with, roughly, the benefits and trade-offs of using a hash table versus an order-preserving tree. For tree indexes, we spent most of our time talking about B+trees, and the project and homework were both based on B+trees, but we also talked a little bit about radix trees and skip lists; again, I care more about you understanding at a high level what these data structures look like and when one is better than another. For sorting, we talked about two-way and general external merge sort, and, like in the current homework assignment, being able to figure out, if you have a certain number of buffers and your data is a certain size, what the cost would be to actually sort it. Then we talked about query processing: the advantages and disadvantages of the iterator model versus the materialization model versus the vectorization model. We also talked about the join algorithms, nested loop, sort-merge, and hashing, and again, like in the
homework assignment, you should be able to figure out, given this many buffers and data with this many pages, what the I/O cost of executing the different join algorithms would be. And then we talked a little bit today about query optimization and planning; again, at a high level, understand what the selectivity of a query is, or what the cardinality is for different predicates. Any questions? On the midterm exam guide page on the website there's a link to the textbook web page from the Yale guys, and they have the solutions to some of the practice questions; I've also listed out exactly which chapters we will cover.

Okay, all right. So what do you need to bring next week? CMU ID, calculator, cheat sheet. What should you not bring? Live animals. Yes, okay, you'd think I wouldn't have to say this, but you have to say it. All right, so on Monday, and again Monday's class will not be included in the exam, we start talking about how to do parallel query execution. Now that we know the algorithms and we know how to pick good query plans, we can talk about what optimizations we can do in the system to take advantage of extra cores. The difference between parallel query execution and distributed execution, the way I define it, is that with parallel query execution we're still going to stay on a single box, a single server, and ask how we can have queries run in parallel using different cores. Okay? All right guys, have a good weekend. I'll see you on Monday.