Mushu has kids. I did not know this. So he's dealing with that. Right, I was surprised too. That's his problem. All right. So the midterm exam that we did before the break — by the way, fall break was awesome, I'm glad they started doing this. It's good for everyone. All right. So the midterm exam: the graded ones are in my office. Everyone should have gotten a notification, I think, on Gradescope, but your score is also posted on Canvas. And as I said in the email, the numbers look as expected compared to previous years. So this is the distribution for this class, and then this is last year. So it's actually better than last year. It looks more correct, in my opinion. More normal. Right? Some of you emailed and said, hey, I didn't do as well on the midterm as I hoped, and you're freaking out. This is a project-based course. So if you understand what you're doing with the projects, that's probably the most important thing. All right. So Checkpoint 2 for Project 2 is due tomorrow. And again, post on Piazza if you have questions about this. And then Project 3 will be out — it's actually out now, but we'll cover it in class on Thursday. OK. Any questions? For Project 3, there will actually be a reference solution. One of the TAs is amazing and compiled BusTub into Wasm so you can run it in your browser. Yes. The question is about the difficulty of Project 3 compared to Project 2. I mean, again, somebody asked this before. It depends: your C++ is getting better, your debugging speed is getting better, and you should understand a little better where we're going with BusTub. I don't know, it's tough to say. It should be easier — because you're getting better, not because the project's made easier. OK. All right. Yes, sorry. The question is, how important is it for Projects 3 and 4 that everything works correctly in Project 2? Project 3 is implementing query executors. So you would need your index to do an index scan and an index nested loop join, but you can still do a sequential scan without your index working correctly. So, yeah, there'll be some things you can do without it. But the buffer pool is probably the more important thing to make sure is working correctly. All right. So we have a lot to cover today, because this condenses what had been two classes on query optimization and query planning into one. Let's see how this goes. I'll say up front also that query optimization is the one part of databases that I know the least about. What's that? Sorry. Because it's hard, right? It's very, very, very hard. We're going to go a bit deeper than what the textbook goes into, because I think there are more things worth discussing. But I'll say again, oftentimes when we invite outside speakers, I try to get them to talk about the query optimizer, because to me this is one of the most interesting parts — it's the black magic of a database system, right? So up until now, we've been saying, hey, a SQL query shows up, we want to parse it, convert it into some kind of plan, a logical plan, and then we want to generate a physical plan that we can execute.
And I've been making some noise about how the database system is in a better position to figure out the optimal query plan than some Joe Schmo programmer writing it themselves. So that's what this class is really all about: how are we actually going to find this optimal plan? For a given SQL query, the query optimizer's job is to find a correct execution plan that has the lowest cost. I've underlined correct, meaning it has to be the answer that they actually want. It doesn't help us to find a really fast plan if it produces the wrong results — people get pissed. The term cost I'm putting in quotes because cost is a relative term that can mean different things to different people. For the purposes of this class, we'll assume cost means the plan that runs the fastest. But obviously you can imagine other scenarios where you want the cost to be reducing the number of network messages, reducing energy usage, and things like that. So we'll talk about how we're going to generate this cost in a bunch of different ways. And as I said, this is going to be the hardest part of a database system. This is what's going to separate the very expensive commercial systems — the Oracles, the SQL Servers, the DB2s, Teradata, Snowflake — from the random stuff you're going to find on GitHub, or even some really good open-source systems like MySQL and Postgres and SQLite. Because those companies have spent decades and millions and millions of dollars trying to make their query optimizer handle all these different situations as intelligently as possible. And that's something the open-source guys just can't compete against. Yes? So the statement is, isn't it the case that because it's open source anybody can contribute, and therefore hypothetically it should be better? How many people out there actually work on query optimization, right? My second line here says if you're good at this, you'll get paid money. So when the database companies email me and say, hey, we want to hire your students, the number one thing they ask is, oh, by the way, do you have anybody who does query optimization? I used to have some students doing this, but not anymore. So I don't have any students that are building query optimizers. It's pretty much the Germans, right? That's it. And then these old white guys from the 60s and 70s. This has been a really hard problem. And up until recently, this is not something that the open-source community has really tackled. It's sort of been patchwork, good enough. And again, Oracle has billions. They have a whole floor in their palace in Redwood City that is just doing query optimization. Postgres and MySQL can't compete with that. OK, so the other thing to point out, as I said, is that finding the optimal plan has been proven to be NP-complete. So we have to figure out a way to quickly find something that's good enough. No optimizer, despite being called an optimizer, is going to find the truly optimal plan for every possible query. Because it's NP-complete, it's impossible. For simple things like select star from table where ID equals something, that's easy — we'll find the optimal plan. But once you start doing joins, then things become a mess. So the technique we're going to use to trim down the number of possible query plans we have to look at is basically using heuristics and other assumptions about what the database system is going to do, to try to get us to something that's good enough.
And this cost, I would say, is also going to be an internal metric. This is not going to be the expected wall-clock time, because that's going to be very difficult to predict. And the cost you would generate for one query plan in Postgres is not comparable against an internal cost from MySQL or Oracle. These are internal values. Yes. All right, David, let's take this offline. The question is, deep learning is a thing — are people considering using deep learning for query optimization? Yes, in limited scenarios; we can come back to that later. That's something we'll cover in the advanced class, not here. You have to walk before you run. Deep learning is great, but if you don't have a cost model, then it doesn't matter. All right. So I've talked about this before — the difference between a logical plan and a physical plan. A logical plan, again, is going to be the high-level description of what we want to do: I want to scan a table. And the physical plan will represent the exact execution algorithm or method we're going to use to do that logical operation. So the logical would be: scan table foo. The physical would be: use an index scan on this particular index for foo. And then there's the query processing model we talked about before — operator execution with the iterator or vectorized model. Those are the manifestations of the physical operators. In some cases, we will have to care about the physical layout or format of the data we're producing, in particular the sort order, because that will determine what we need to do up above in the query plan. Like if we have an order by, and we use a sort-merge join and we're already sorted, then maybe we don't need to do the order by. So this means we're not always going to have a one-to-one mapping between the logical operators and the physical operators. A logical operator can be broken up into multiple physical operators, or multiple logical operators can be condensed down into a single physical operator, right? All right, so here's a high-level overview of what the execution path is going to look like when a query shows up in our database system. We have some application — PHP, Java, Python, doesn't matter what it is. They send us a SQL query. And the first thing it can potentially hit in our system is a SQL rewriter. I say this is optional or rare — not all systems are actually going to do this. This is literally taking the SQL string that shows up over the wire and applying some kind of regular-expression rewrite to convert it into another SQL string. Some systems do this — Vitess, for example, the MySQL clustering system that came out of YouTube, basically does regex matching and some basic rewriting. It's things like: if I see a table name, instead of select from table foo, I'll rename it to foo_1, because internally that means something different for the system. Like I said, this is not that common; most systems don't do this. All right, so now we take our SQL query and throw it into our parser. This is going to spit out an abstract syntax tree from the tokens we see in the SQL string. And then we're going to pass it to what is called the binder. Some systems have different names for this.
But the basic idea is that we're going to take the tokens we have in our abstract syntax tree and actually map them to internal database objects. Right? So we look in the catalog and say: there's a from clause followed by some string — is this string actually a table name? If yes, then we get back the internal ID for it. So whenever you see "table not found" or "column doesn't exist", that's this step in the binder phase. The next step is that you now have a logical plan — basically a conversion of the syntax tree, now annotated with the objects from the catalog — and you can pass this to a tree rewriter. Again, this is optional. This is operating directly on the logical plan. Not all systems do this, but most of them do. BusTub is not going to do this; BusTub goes directly from the binder output into a physical plan, for simplicity reasons. And for this step we can also go back to the catalog and get additional schema information to find things like: what are the primary keys, what are the foreign keys, what is the type information we would have in our query plan. And this produces a new logical plan that we then shove into our optimizer. I'm going to use the term optimizer in this lecture. Old-school database people will call it a compiler — I think Snowflake calls theirs a compiler — because back in the 1970s, when they first proposed taking something like SQL, a declarative query language — this was when C and UNIX were the hot new thing — this idea of "I'm going to build a C compiler" was a far-fetched idea at the time, which is now super commonplace. And the idea was the same: the same way you'd have a C compiler, you'd have a SQL compiler. But I'll use the term optimizer. Sometimes people say planner; it's basically the same thing. So now for this optimizer, we're going to have this thing called a cost model that's going to allow us to predict the estimated execution cost of a particular query plan. The idea is that we can enumerate over a bunch of different possible plans — and we'll see two different ways to do that — and then pick the one that has the lowest expected cost. And then you produce a physical plan that you execute in the engine. Okay? Yes. So the idea is you're doing manipulation of the logical plan into a new logical plan — we'll see examples in a second. Question over here? Okay. So again, this is a high-level overview of how every system that takes SQL in does something with it. And like I said, the SQL rewriting and the tree rewriting — not all systems have these. And then some optimizers, the primitive ones, won't actually have a cost model. They'll just use hard-coded rules, which is basically what BusTub does: if-then-else clauses or pattern matching to say, if I see one operator following another operator within some context, maybe I can rewrite the query in a different way. But at this point in the path, we're manipulating a query plan tree, not the raw SQL. Yes? Does this diagram show a cost-based optimizer? Yeah, that's what this is. The question is whether this is an example of a cost-based optimizer. Yes. If it's a rule-based optimizer, you just get rid of that calculator thing I showed there. Okay.
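Just to tie the stages together, here's a minimal sketch of that pipeline — SQL rewrite, parse, bind, tree rewrite, cost-based optimize, execute. None of these names come from a real system; every stage is just a stub for illustration.

```python
# A minimal sketch, not any real system's API, of the pipeline just described:
# SQL rewrite -> parse -> bind -> logical rewrite -> cost-based optimize -> execute.
# All names here are made up; each stage is a stub standing in for a real component.

def rewrite_sql(sql):      return sql                        # optional/rare: string-level regex rewrites
def parse(sql):            return ("AST", sql)               # tokens -> abstract syntax tree
def bind(ast, catalog):    return ("LOGICAL", ast, catalog)  # names -> catalog objects ("table not found" lives here)
def rewrite_tree(plan):    return plan                       # optional: rule-based logical-plan rewrites
def optimize(plan, cat):   return ("PHYSICAL", plan)         # cost-based search picks physical operators
def execute(plan):         return [plan]                     # run it in the engine

def process_query(sql, catalog):
    ast      = parse(rewrite_sql(sql))
    logical  = rewrite_tree(bind(ast, catalog))
    physical = optimize(logical, catalog)
    return execute(physical)

print(process_query("SELECT * FROM foo", {"foo": 1}))
```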
So that sort of takes us to the high-level thing we're talking about today, which is how we can do query optimization. So there are heuristics and rules, where we can do basic things without having a cost model, without having some logic to evaluate: hey, is this the right thing to do for this query given my data? These are the things you always want to do. You may need to go look at the catalog and say, okay, is this query operating on this kind of table, does it have this particular type, and therefore some rules apply and some rules don't. Right? But at the end of the day, you're not looking at the data. And then the more sophisticated method is going to be a cost-based search. As I said, we're going to use a model to enumerate over a bunch of possible plans that are equivalent — they produce the same correct result — and we pick the one that has the lowest cost. Yes? So the question is, for something like a join, since you always want the smaller table to be the outer table, does that count as a cost-based search or a heuristic? That would be like a rule, right? You go to the catalog and ask what's the expected cardinality of this table — the catalog is a place where you can get some additional information. All right, so we're going to first talk about heuristic- and rule-based optimization. Then we'll talk about query cost models — again, these are all high-level concepts — and then we'll finish up talking about the top-down and bottom-up cost-based optimization searches. All right? So the logical plan stuff we talked about before — I'm repeating myself here — but the idea is that we're going to transform this logical plan into a new logical plan using these pattern-matching rules. You can think of this as being implemented with something like a visitor pattern, where you're traversing down the query plan and trying to identify: what's the operator I'm looking at, what are my children, what's my parent — and then maybe you can rewrite the tree into a new tree, right? The idea here is that some rules we always want to apply. But if we're also doing a cost-based search, we want to be able to guide that search to consider query plans where we know it's probably going to be the right choice, the optimal choice, and not look at stupid things. Because otherwise you have a cost-based search that blindly enumerates over everything brute-force, and it would take a lot longer. So to cut that time down, we can use these logical-plan optimizations to shortcut or throw away things we know are never going to be a good choice for us. Note that we're not going to compare the cost of plans before and after the transformation — these are codified rules that say we know this is the right thing to do, therefore we're always going to do it, assuming the rule matches. So I'm going to go through one quick example in four steps. Again, this is a gross approximation of what will happen in a real database system. So say we have a query here where we want to find all of the artists that appear on my mixtape.
As you recall back from the first lecture, we had this Spotify-style database: we have artists appearing on albums, and appears is the foreign-key relationship between album and artist, so an artist can appear on multiple albums. So in the first step here, we want to split the conjunctive predicates. In our where clause we see that we have a bunch of ANDs — artist.id equals appears.artist_id and so forth. So instead of having that be one giant filter operation in relational algebra, we're going to break it up into separate predicates. We want to be able to do this because we can then manipulate them and move them around as needed. The other thing I'll point out is that our original plan here is basically a literal translation of the SQL query. Since we know we need to do a join, but we don't know what kind of join we want to do yet, we're just doing a Cartesian product — but we know that's going to be stupid, so we'll fix it later. And the order in which we're doing the joins is just the order that the tables appear in the SQL query, because as you parse it, you just see them in that order and start joining them together. This is easy to do for conjunctions; for ORs you basically have to duplicate the expressions. So now we can move things around. The next optimization we can do is predicate pushdown. The idea here is that we want to filter out as much data as early as possible in our query plan. So we push each predicate down to the lowest point where we have enough information to actually apply it. With all the filters up at the top, we're basically doing complete sequential scans of these three tables and then doing the Cartesian product, which is the worst thing to do. So now our rules can examine what's actually in each expression and find the point in the query plan where we have enough information coming in to apply the predicate. For example, this predicate album.name equals "Andy's OG Mix" — I can put it right after the scan here, because it doesn't reference any other column, so this is the lowest point I can put it. And you end up with something like this. In the case of this other one, because it references both the artist and the appears table, I have to apply it after the join. So next we're going to replace the Cartesian products with inner joins, and we're going to attach the join predicates that sit right above each Cartesian product as what we're actually going to join on. It's sort of obvious here: artist.id equals appears.artist_id — well, I can replace this with an inner join and have the on clause of the join be that predicate. I would do the same thing for the other one up there. So the last transformation we can apply is projection pushdown. The idea here is that we want to remove, at the lowest point of the tree, all the attributes being generated by each operator that we don't actually need, so that we're not passing around data we know we don't need for our output. This is assuming that it's a row store and we're doing early materialization, so we're actually copying all the values of the tuple from one operator to the next, even if they don't need them. So we'll push down the projections like this.
It's a bit complicated here, but in the case of artist we know that we need the ID and the name columns, so we'll do that projection here. Same thing for appears, and then after the join we only need the name and the album ID, like that. Right? Yes. More memory usage and more time required to make those? So the question is, if you're very aggressive on the projection pushdown, would that require you to end up storing more intermediate results and therefore blow out your memory? Is that basically what you're saying? Yes. The question was about the value of first materializing a thing that has just what you want from those two tables, versus reading from them directly into the join. Yes — and it's a good point — it may be the case here that if I'm just feeding this directly into this join, then rather than having an intermediate operator where I actually do the projection, materialize that, and then do the join, it might be better to just pass the full tuple in here and skip the projection, because then you don't have an intermediate step. Or just project while reading? Yes — I'm not really showing that here, but the physical implementation could be a scan feeding into the join where the projection is done as part of the scan. So it's not an intermediate result, it's just a fancier way of reading. Again, these are logical operators; this says nothing about how we're actually going to implement it. Logically, we can do this and throw things away. Whether or not the database system decides to do that projection right then, it can determine based on the cost. Yes. So your question is, for the album table, should we do the projection on ID first? The statement is, for album here, on this path of the query plan, I do the filter, then I do the projection on just ID — could I also put a projection here and keep ID and name? Yeah, you can do that. Logically, it's correct. Does it help? It depends on the implementation, right? Yes. Is this assuming early materialization? Because with late materialization, this projection pushdown would be unnecessary. Yes. But again, I'm just trying to introduce these transformations, right? It's one type of transformation. Predicate pushdown is probably more important. There will be a query in Project 3 where you want to do a projection to remove unused aggregates — there are things like that. Yes. So her question is whether predicate pushdown is more important than projection pushdown. Yes. But again, depending on the implementation, this could just be a bunch of if-and-else statements, or it could actually be a rule engine that does a pass over the query plan, does pattern matching, and has a catalog of rules it can apply to do these transformations. Ideally they're idempotent or commutative — you can apply them in any order and it doesn't matter. For some things that's true, not always. In those kinds of systems, you can actually specify priorities for the rules, so that, say, this rule gets fired before projection pushdown. As far as I know, in Postgres it's a bunch of if-and-else statements — it's hard-coded: if this, then that; if this, then that. That's how they check these rules. I think MySQL does the same thing.
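To make the pattern-matching idea concrete, here's a minimal sketch, assuming a toy plan representation rather than BusTub's or Postgres's, of a single pushdown rule: if a filter sits on top of a join and only references one side's columns, move it below the join.

```python
# A minimal sketch of a predicate-pushdown rewrite rule over a toy plan tree.
# The node types and column bookkeeping are made up for illustration.

from collections import namedtuple

Scan   = namedtuple("Scan",   ["table", "cols"])
Filter = namedtuple("Filter", ["child", "pred", "cols"])   # cols = columns the predicate touches
Join   = namedtuple("Join",   ["left", "right", "on"])

def output_cols(plan):
    if isinstance(plan, Scan):   return set(plan.cols)
    if isinstance(plan, Filter): return output_cols(plan.child)
    if isinstance(plan, Join):   return output_cols(plan.left) | output_cols(plan.right)

def push_down(plan):
    """If a Filter sits on top of a Join and only needs one side, move it below."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join, need = plan.child, set(plan.cols)
        if need <= output_cols(join.left):
            return Join(push_down(Filter(join.left, plan.pred, plan.cols)), join.right, join.on)
        if need <= output_cols(join.right):
            return Join(join.left, push_down(Filter(join.right, plan.pred, plan.cols)), join.on)
    return plan

plan = Filter(Join(Scan("album", ("album.id", "album.name")),
                   Scan("appears", ("appears.album_id", "appears.artist_id")),
                   on="album.id = appears.album_id"),
              pred="album.name = 'Andy''s OG Mix'", cols=("album.name",))
print(push_down(plan))   # the filter now sits directly above the album scan
```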
Well, CockroachDB and SQL Server use the type of optimizer we're talking about at the end — they're using a pattern-matching engine, and they have priorities, which I think is the right way to do it. All right. I want to go quickly over how we handle nested subqueries. There are basically two approaches. If you have a correlated subquery, the inner query is referencing something in the outer query; if it's uncorrelated, then the inner query doesn't need anything from the outer query. And so you can treat them in different ways. The two approaches are going to be: either rewrite them — de-correlate them, flatten them, basically rewrite the nested query into a join — or you can extract out the nested query, run it separately, put the result into a temp table, and then either inject the scalar value if that's what's needed, or do a join against that temp table. And yeah, it's injecting a value when it's uncorrelated. All right. So this is an example from the old textbook we used to use, based on sailboats. Basically, it's a reservation system: there are sailboats, you have sailors, and people reserve sailboats. I think a car rental would have been a better example. So in this one here, we're going to find all of the sailors that reserved something on a particular day. The thing to notice here is that it's a correlated subquery — inside the inner query, it references the outer query — and we know this where clause clearly looks like it should be a join, right? So the database system can just rewrite this into a join using a rule. And we can do this at the logical level, because we just say: okay, we know we have this nested query, it's referencing something in the outer query, and it's an equality predicate. So therefore, let me extract it out and rewrite it as a join. Yes. Your statement is that we're treating the query plan as basically a subtree of plans so that we can do rewriting without — sorry, what was the second part? Yeah, I don't really follow your question. I mean, you can have your rules either look at a single operator in isolation or look at a series of operators and understand the context that exists there, right? If you match those rules, then you can transform the tree appropriately. Oh, yes, sorry, I understand. The question is, in the case of my nested query here, is the nested part just more operators in the query plan itself, rather than being a function call or some black box? Yes — you know what the operators are within the nested query, and therefore you can manipulate them as needed. And obviously you can have nested queries inside of nested queries, so you just recurse down. Yes, it's the alternative to having a black box — and with a black box you can't pull anything out of it, because you don't know what's in there, right? So what I just showed is an example of decomposing queries. For queries that are correlated, where you know you can easily extract them out and rewrite them as a join, you basically break the queries up into blocks and tackle them one by one.
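Here's a minimal sketch of that flattening idea, using toy plan nodes and made-up table and column names (the lecture's exact sailors schema isn't shown here): when the correlation is a simple equality, the subquery filter becomes a join.

```python
# A minimal sketch (not BusTub's or any real system's API) of flattening a
# correlated subquery into a join when the correlation is a single equality.

from collections import namedtuple

Scan   = namedtuple("Scan",   ["table"])
Filter = namedtuple("Filter", ["child", "pred"])
Join   = namedtuple("Join",   ["outer", "inner", "on"])
SubqueryFilter = namedtuple("SubqueryFilter", ["child", "inner_plan", "corr_eq"])

def flatten(plan):
    """Rewrite `outer WHERE ... (SELECT ... WHERE inner.x = outer.y)` into a join
    when the correlating predicate is a simple equality; otherwise leave it alone."""
    if isinstance(plan, SubqueryFilter) and plan.corr_eq is not None:
        return Join(outer=plan.child, inner=plan.inner_plan, on=plan.corr_eq)
    return plan

# Sailors/Reserves shape from the lecture; names are illustrative, not the real schema.
nested = SubqueryFilter(
    child=Scan("sailors"),
    inner_plan=Filter(Scan("reserves"), "reserves.day = '2022-10-25'"),
    corr_eq="sailors.sid = reserves.sid",
)
print(flatten(nested))   # Join(outer=Scan('sailors'), inner=Filter(...), on='sailors.sid = reserves.sid')
```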
For subqueries that are more complicated, again, you can run them separately and store the results in a temp table. So this is a more complicated example here. We're trying to find all the sailors that reserved a red boat and had the highest rating amongst all the sailors. So here we have our nested block, and you see that it's uncorrelated, because it's just getting the max rating over all the sailors. We're trying to find whoever has the highest rating — you could have multiple sailors with the same highest rating. So in this case, the stupid thing to do would be to rerun this inner query over and over again for every tuple in the outer query, but obviously that would be wasteful. MySQL used to do this for a long, long time until they recently fixed it. Instead, we'll just run it once and then insert the scalar value here. And in some systems they will actually extract out the nested query, run it first, then go back into the optimizer and say, okay, now I know what value I should put in here. MySQL actually does that, which is somewhat clever. Not all systems do. In most cases, the easiest thing to do is: take this out, store the result in a variable — which you can have in SQL — and inject it there. But if you don't run it beforehand, then when you run the query through the optimizer you just have a placeholder value, like it was a prepared statement, and you make a basic guess of what you think the value is actually going to be. So I extract out the inner block and then do my planning on the outer block. Yes? You can only do this when the inner block is uncorrelated? No — you can be stupid and do it for a correlated query, but if it's correlated you can convert it to a join, and ideally you want to try to do that, but it's hard. So we've talked about applying rules to the logical operators in a query plan, but remember there's also the where clause, the on clause, the having clause — those are expression trees themselves. Something equals something is going to get converted into an expression tree. And there may be cases where we want to apply optimizations on those expressions because we can do something more efficient. So as the logical phase of the optimizer is traversing the query plan, it can also look inside the scans and the joins and whatever else has expressions, start traversing those trees, and figure out whether there are ways to optimize them. I showed the example before where we take the giant where clause and break it up on the conjunctions. As we're doing that, we can also look at what the expressions are and figure out whether there are additional things we can optimize inside them. And this is implemented very similarly to how I said we do logical optimization: you can have a bunch of if-then-else clauses when you evaluate the expression tree, or you can have a pattern-matching rule engine — again with priorities, though priorities may not matter as much here — where you declare these rules and it applies them: if I look for this pattern and it matches, then apply this rewrite. And typically you just do this until you don't have any more rules to match. So let's look at an obvious example. We want to remove unnecessary predicates.
So a query shows up. Someone says where 1 equals 0. This is obviously always going to be false. So we want our query optimizer to be able to identify, hey, this is stupid — let's just rewrite this as false, right? And you say, this is super stupid, why would anyone ever write a query like this? You'd be surprised. They exist. So the statement is, if you rewrite it to false, why can't you just optimize the whole thing to do nothing? Yes, some systems do that, not all of them; we'll see an example in a second. Why wouldn't you do that? Because this one's obvious — 1 equals 0, I don't need deep logic for that. Here's a more complicated one: now() equals null. now() is the current timestamp. It had better not be null. And PostgreSQL gets this right — it rewrites this as false. But SQLite doesn't. Think about it: again, we're doing logical optimizations, and we don't know what this function is actually going to produce at this point. You could know, but assume they didn't implement it that way — you have to have additional metadata that says, I know what this built-in function is; could it ever return null? And if they don't have that metadata, then you don't know. The question is, isn't whether a SQL function can return null known from its definition? Yes — again, it depends on whether the implementation maintains that information in its catalog for its functions. I mean, what do you want me to tell you? Why do companies not do that? That's dumb. Sure, it's dumb, but there are so many other things you've got to do. I don't know. Sorry. So here's another one: random() is null. MySQL gets this one right; Postgres does not. So let's look at a quick example. All right, here's Postgres. We're going to run EXPLAIN ANALYZE — again, on that same table we showed before, the one with 10 million decimal numbers. So for where 1 equals 0, Postgres flags it — I can't highlight, sorry — Postgres flags it as a one-time filter and doesn't actually scan anything. It recognized that 1 cannot equal 0: it evaluated the predicate once, saw that it's false, and therefore no tuple in the table is ever going to evaluate to true. So it doesn't do the scan, and it bails out right away. Let's try now where now() is null. It also applied a one-time filter, saw that now() is not null, but it still set up the plan to actually scan the table. In my first example, 1 equals 0, I don't think it even got to execution — it evaluated the predicate in the optimizer, saw that it equals false, and stopped right there. In this case, it didn't know in the optimizer that now() is never null, so I think it applied the one-time filter at execution, and the plan shows the scan was never actually executed. Again, this is EXPLAIN ANALYZE, so this is what it actually did. If you replace it with just EXPLAIN — yeah, here's plain EXPLAIN — it didn't actually run the query, but it thinks it's going to do the sequential scan, and it knows it needs to check that one-time filter first. All right, last one: where random() is null. Should be exactly the same, right? Nope. Where are we here — yeah, it did a sequential scan on the entire table, applied the filter to every single tuple, and removed 10 million rows, right?
So in this case, it doesn't know that random() is never going to be null, so it evaluated that predicate every single time. The question is, what's stopping you from going to GitHub — I don't know how to use GitHub — and making a pull request to fix it? Nothing. Do it. I think you'll find it's not as easy as you think it is. Yeah, so in the back. Your question is, what if I have an inner query that selects null and it's in the from clause? The question is basically, can you do this short-circuiting for nested queries — can you identify that the inner query is never going to produce a result, so you don't need to run the outer query? Again, for some cases yes, some cases no. All right, let's look at MySQL. So let's do the 1 equals 0. You see right here, it says there's an impossible WHERE. So MySQL recognizes that it doesn't need to run this query at all, because it's never going to produce a result. Let's try random() is null. Of course, MySQL likes to be different — they don't have random(), it's got to be rand(). They also recognize that there's an impossible WHERE. And then for now() is null, they get that one too. So MySQL gets this one right; Postgres does not. All right, yes. So your question is, if I have a where clause and there's no transformation I can do to convert it to false, does that mean I have to apply that predicate to every single tuple? Yes, because otherwise the query will produce an incorrect result. Now, the order in which you apply the predicates — that depends on the optimizer; it can figure out when to do that. All right, you can also do things like merging predicates. So here we have a where clause with a disjunction: value is between 1 and 100, or value is between 50 and 150. And since there's a disjunction here, the range actually expands, so it's really from 1 to 150. So some systems can recognize this, understand what the between clauses are actually asking, understand that these are ranges, and merge them into something like this. Not all systems can do this. Yes. His question is, does SQL have undefined behavior that can be exploited for optimization? This will come up when we talk about transactions. The SQL standard will say, this is what you should do, this is what it means, but depending on the implementation, it may not actually do that, or it might do something else. There are dark corners with undefined behavior. We can ignore that for this class. All right, so far we have seen how to use rules to traverse the logical query plan and transform it into a new query plan. And then for expressions, same thing: we traverse the trees and we can manipulate them. In some cases, we can evaluate them to see whether they're true or false, and we can condense them or do additional rewriting. And there are a bunch of other similar optimizations you can do for nested queries as well. All right, so now let's talk about how we actually use cost-based search to find a better plan. I would say that a lot of systems do a combination of the two, as I said: you do the rule stuff first, clean things up, remove things you know are never going to be useful, and then you apply the cost-based search.
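Before we get into cost-based search, here's a small sketch of those two expression-level rewrites — folding a constant-only predicate like 1 = 0 down to a boolean, and merging overlapping BETWEEN ranges joined by OR. The expression encoding is a toy one, not how any real system represents these.

```python
# A minimal sketch of expression-level rewrites: constant folding and OR-range merging.
# Expressions are toy tuples; a real optimizer works on its own expression tree types.

def fold_constant(expr):
    """expr = (op, lhs, rhs); operands are Python constants or ("func", name)."""
    op, lhs, rhs = expr
    def known(v):  # we can only fold operands whose value we actually know
        return not (isinstance(v, tuple) and v and v[0] == "func")
    if op == "=" and known(lhs) and known(rhs):
        return lhs == rhs          # WHERE 1 = 0  ->  False
    if op == "is_null" and known(lhs):
        return lhs is None         # only foldable if the operand isn't an opaque function
    return expr                    # e.g. now() IS NULL stays as-is without catalog metadata

def merge_or_ranges(r1, r2):
    """(lo, hi) OR (lo, hi) -> a single range if they overlap, else keep both."""
    (a, b), (c, d) = sorted([r1, r2])
    return (a, max(b, d)) if c <= b else [r1, r2]

print(fold_constant(("=", 1, 0)))                         # False
print(fold_constant(("is_null", ("func", "now"), None)))  # unchanged: unknown function
print(merge_or_ranges((1, 100), (50, 150)))               # (1, 150)
```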
And a lot of the time, they'll do the cost-based search just for the join ordering — to figure out which should be the outer table versus the inner table. Some systems, like SQL Server and CockroachDB, do a cost-based search that's more holistic over everything possible, both the logical and the physical transformations. All right, so again, the cost model is going to allow us to predict the behavior of a query plan, given a particular database state. And as I said, this is an internal cost that doesn't map to anything in the real world. There is research on how to predict the real runtime of a query plan based on what the optimizer thinks is going to happen, but most systems won't do this. It's something humans can look at, but internally the system has its own range of values for a query cost. The reason we have to do this is that it's going to be too expensive to run every possible query plan. So we need a way to quickly approximate what a particular query plan is going to do. We want to do this on the order of milliseconds. If I have a billion tuples and the query might take a minute, I don't want to actually try out every single query plan, because it'd be super slow. Now, I will say MongoDB actually does it this way: they generate all the candidate query plans, fire them all off, see what comes back first, and that's basically how their query optimizer works, which is pretty clever. You can do that for simple things. For more complex joins, I don't think that's a good idea. All right. So our cost model can be comprised of a bunch of different things. There are the physical costs: how much CPU we're going to consume, how many disk reads and writes we're going to perform, whether the data we need is going to be in our buffer pool or not, how many network messages we have to send around. And this obviously can depend heavily on the performance of the hardware. If you have a really fast SSD, then maybe spilling to disk for one query plan versus another isn't that big of a deal. But if your disk is over the network and far away, then maybe you want to choose a query plan that reduces the number of disk writes. Then there are logical costs, and these would be things like the number of tuples I'm going to spit out from my operator. This is independent of the actual physical operator I would use: if I do a join between two tables with some predicate, it's always going to produce the same number of tuples regardless of whether it's a sort-merge join or a hash join. And then there are algorithmic costs that you can sort of bake in and say, I know the big-O complexity of this join algorithm versus that one, and you can choose one over another. Typically it's going to be a combination of the first two. I haven't seen anybody outside academia that does number three. So let me quickly tell you what Postgres does, just because the source code is readily available and the documentation is actually pretty good. Their cost model is a combination of the physical costs in terms of the CPU and IO that an operator is going to consume. And what they do is weight these expected cost components — like how much disk I'm going to read and write — by some magic constant factors that you can define as an administrator.
And the idea here is basically to identify whether we're going to be able to process the tuples in an operator entirely in memory or whether we have to get them from disk — and if we're getting them from disk, is it a random read versus a sequential read? By default, they assume that if it's in memory, it's 400 times faster to process a tuple than having to read it from disk. And then they say that sequential IO is 4x faster than random IO. If you go look in the documentation at these different constant weights you can set, they have the default values, and there's this little line that basically says: hey, there's no really good way to figure out the ideal setting for these values. You can assume the defaults are good enough, but they basically recommend you do trial and error to figure out the right way to set these things. Most people don't set them. And again, the different database systems vary — some of the commercial guys like DB2, when you turn the system on, basically run a bunch of micro-benchmarks to figure out how fast your disk is and how fast your CPU is, and they use that to derive the constant factors they use in their cost model. If you use Linux: when Linux boots up, there's that thing called BogoMIPS, where it basically tries to figure out how fast your CPU is. Same idea. So the question is, do systems like Oracle or DB2 do this because they're being sold as appliances, meaning they have exact control of the hardware and therefore know how to set these values? Yes, but they still have these kinds of flags that you can set yourself if you want to. The next question is, why would your own settings be better than theirs? So the micro-benchmark is different, right — that's trying to compute it exactly for the hardware. For general-purpose defaults, they're trying to be the lowest common denominator: settings that are good enough for most people. But obviously if you have a very specific use case — like you're super heavy on writes in a very specific way — maybe you want to set these things differently. The statement — and I want to take this offline — is, do the high-end enterprise databases fab their own hardware and basically sell it as appliances? We saw this with the database machines. Yes, you can buy million-dollar boxes from Oracle and Teradata and IBM. They're not just software companies, although they make most of their money on software. So the way we're going to feed this cost model is through internal statistics that the database system maintains about your tables, indexes, and other things. How the different systems calculate these statistics, and when they actually do it, depends on the implementation. Sometimes they have triggers that say: the size of my table has changed by 20%, or I've updated 20% of the tuples, so rerun ANALYZE and recompute my statistics. Oracle has a cron job — they run this every night at 11 p.m. There are a bunch of different ways to do it. These are the commands if you want to run them manually. And you can think of them as sequential scans over a table that compute some approximation of what's inside it. This slide is a mistake — I should have just skipped it, sorry. So let's do a really, really simple example of how we compute the selectivity of a predicate.
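As a rough sketch of what that weighting looks like, here's a toy cost function using the constants Postgres documents as its defaults (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025). The formula itself is simplified for illustration — Postgres's real costing code is considerably more involved.

```python
# A rough sketch of cost weighting using Postgres's documented default constants.
# The scan_cost formula is a simplification, not Postgres's actual implementation.

SEQ_PAGE_COST     = 1.0      # cost of a sequentially-fetched disk page
RANDOM_PAGE_COST  = 4.0      # a non-sequentially-fetched page: 4x a sequential one
CPU_TUPLE_COST    = 0.01     # cost of processing one tuple
CPU_OPERATOR_COST = 0.0025   # cost of evaluating one operator/function (1/400th of a page read)

def scan_cost(seq_pages, random_pages, ntuples, ops_per_tuple=1):
    return (seq_pages * SEQ_PAGE_COST
            + random_pages * RANDOM_PAGE_COST
            + ntuples * (CPU_TUPLE_COST + ops_per_tuple * CPU_OPERATOR_COST))

# Sequential scan of a 10,000-page table with 1M tuples and one predicate per tuple:
print(scan_cost(seq_pages=10_000, random_pages=0, ntuples=1_000_000))   # 22500.0
```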
And to keep it really simple, assume that it's a "value equals something" equality predicate. We say the selectivity of a predicate p is the fraction of the tuples in our table that will qualify — evaluate to true — for that predicate. So for an equality predicate, it's just the number of occurrences of the value we're looking up, divided by the total number of tuples in the table. So take select star from people where age equals 9, and assume you have some kind of distribution like this, where you have the number of occurrences for each distinct value of the age attribute. To compute the selectivity of age equals 9, we just look up in this histogram where 9 is, get the count, and it's 4 divided by 45, assuming there are 45 tuples. That's basically what the database system is going to do for all the different predicates you can have. This is the best-case scenario: something equals something. If we have this distribution, then we can figure that out. Negation is just the inverse of this. Where things get hard is when you have inequalities — less than, greater than, LIKE clauses — and other more complicated things. But we can ignore all that for now. So there are three big assumptions we're going to make to simplify our calculations of selectivities. And again, if you're building a cost-based optimizer, these are the first assumptions you would make, because they make the math a lot easier and the implementation a lot easier. The more sophisticated systems don't always make these assumptions in their cardinality calculations. So a really simplifying assumption we can make is uniform data: all the values occur the same number of times. Obviously in the real world we know that's not true. But the way you can get around this is to maintain what's called a heavy-hitter list — exact counts for just the 10 most common values in a column — and then assume everybody else occurs equally often. We're also going to assume that our predicates are independent: that we can split up a where clause by its conjunctions, evaluate the predicates individually, and if you multiply their selectivities together, you get the combined selectivity. That's basic statistics. And then the inclusion principle basically says that when we do a join, for every tuple in the inner table, it's guaranteed there will be a matching tuple in the outer table. If you have a foreign key, then that's going to be true. But a lot of times people don't define foreign keys, or they're doing joins on things that aren't foreign keys. So again, if you're building a cost model for the very first time, you would do something like this. But this is obviously not correct in most real-world scenarios, because you can have correlations between attributes. So let's consider a really simple example. We have a table of a bunch of cars, and we keep track of the make and the model. And we have a query: select star from the table where make equals Honda and model equals Accord.
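Written out, the selectivity arithmetic from the age example and the independence assumption is just:

```latex
\mathrm{sel}(A = c) = \frac{\#\,\text{occurrences of } c}{\#\,\text{tuples}}
\qquad\Rightarrow\qquad
\mathrm{sel}(\texttt{age} = 9) = \tfrac{4}{45}
```

```latex
\mathrm{sel}(\lnot p) = 1 - \mathrm{sel}(p),
\qquad
\mathrm{sel}(p_1 \wedge p_2) \approx \mathrm{sel}(p_1)\cdot\mathrm{sel}(p_2)
\quad \text{(independence assumption)}
```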
If we apply our independence and uniformity assumptions, then we would break this predicate up at the conjunction, compute the selectivity of make equals Honda and the selectivity of model equals Accord, multiply the two together, and that's the expected selectivity of the whole predicate. But this is stupid, because there's only one car company that makes an Accord: Honda. These are correlated. If you know the model is Accord, then you know the maker is Honda. So the real selectivity is 1 over 100, not 1 over 10 times 1 over 100. In the high-end systems, you can actually declare that these columns are correlated, and the database system will maintain statistics for the combination of the two together rather than treating them as individual columns. So this is sort of a brain dump for you guys — I'm showing you all the things you have to deal with, and why this can get really hard, so that hopefully you have a better appreciation of how tricky it actually is and why people pay a lot of money for it. So there are three ways we can maintain statistics. Again, we can't scan the entire table to compute the selectivity of a predicate, because if we're doing that, we might as well just run the query anyway, right? Yes? His question is, does the database system maintain these statistics automatically? The answer is yes — except for the correlated statistics, where you have to tell it which columns are correlated. It will maintain these things automatically. We'll see how to reduce the size of them, because you don't actually want to store a count for every single unique value — that would be huge — but there are ways to reduce the size of the metadata or statistics you're storing. The real question is not whether you should maintain them, it's how often you refresh them. You can do this manually, or you can set parameters to say how aggressively to refresh the statistics. Because it's basically like running another query: if you're recomputing your statistics on a one-petabyte database every minute, that's going to interfere with all your other queries and slow you down. So the question is not whether you should do it, it's how often. Another good rule of thumb: if you bulk-load a bunch of data — a lot of people have jobs that insert a bunch of rows at midnight — then immediately after you do that, you want to run ANALYZE and refresh your statistics. All right, I'll go through these quickly. There are basically three ways to store these things: histograms, sketches, and sampling. Histograms we've already seen. The problem is that if we store a count for every single distinct value in a column, it's going to be super expensive. My rinky-dink example here has 15 values, and I'm storing a 32-bit count for each of them — it's only 60 bytes. But if I have a billion distinct values in just one column, then that's four gigabytes just for storing statistics. So no system would actually store an exact count for all of these. Oops, sorry — that last slide should have been skipped. What you instead want to do is bucket up the values, based on some threshold or some rule, and then store the count for all the values within each bucket. It's a simple way to aggregate things and compress them.
An equi-width histogram basically says: define buckets that each cover the same number of distinct values. So now your histogram just stores, for the range of values one to three, here's the count; for four to six, here's the count. Right? Then if you want to ask what's the number of occurrences of key 2, you take that bucket's count divided by the number of keys in the bucket. It'll get it wrong, right? It's not an exact lookup, but in some cases that's good enough. Another one you can do is equi-depth, and the idea here is that the width of each bucket is going to vary so that the total number of occurrences within each bucket is roughly the same across buckets. So I would end up with something like this, because I'm trying to get the count in each bucket to be around 12, and I would store it like that. So there are different statistical trade-offs where one is better than the other in a given scenario. In this case here, we're doing equalities. Sketches are sort of similar to a Bloom filter — these are probabilistic data structures that can give you approximate counts in certain scenarios. And the idea is that, instead of a histogram, you replace some of the histograms with one of these sketches. They're much smaller, and they potentially give you better bounds on your estimations. The count-min sketch is the older of the two. HyperLogLog shows up in a bunch of newer systems, but you can only use it to answer a very specific question — what's the number of distinct elements in my set — and you can't ask what's the number of occurrences of a particular key. The count-min sketch can give you that. All right, the last one is sampling. This one seems sort of obvious, but not every system does it. As you're running ANALYZE, you also maintain a separate internal table on the side that has a subset of the tuples from your table. And then in your optimizer, when you want to know the expected selectivity of a predicate, you actually just run that predicate on the sample table and look at the result, and you assume the sample is representative of the entire data set. So say I have a billion tuples, but I sample three tuples out like this. Now I want to know the selectivity of age greater than 50. I can just scan through the sample — ideally it sits in memory — and get an expected selectivity like that, and I assume that's going to match whatever is in the full table. SQL Server does this, and I think DB2. Most systems do not. Sometimes, in other systems, you can maybe grab a random block or random page from a table, apply the predicate, see what fraction matches, and assume the rest looks like that. There are obviously pros and cons: maybe it's not fully representative, but it's faster and easier to implement. All right, so that's the cost model. Now that we can roughly estimate the selectivity of our predicates, and roughly figure out the expected number of tuples coming out of each operator, what do we actually do with this? Now we get to the cost-based search, right?
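Before the search part, here's a small sketch, with made-up numbers, of those estimation tricks — an equi-width histogram lookup, an equi-depth layout, and sampling-based selectivity. Real systems keep these structures in the catalog; here they're just Python objects.

```python
# A small sketch of the statistics structures just described, with made-up numbers.

# Equi-width: each bucket covers the same number of keys; estimate an equality
# predicate by assuming the count is spread evenly across the bucket's keys.
equi_width = {(1, 3): 8, (4, 6): 19, (7, 9): 15}        # key range -> total count
def est_occurrences(key):
    for (lo, hi), cnt in equi_width.items():
        if lo <= key <= hi:
            return cnt / (hi - lo + 1)
    return 0

# Equi-depth: bucket boundaries move so each bucket holds roughly the same count.
equi_depth = [(1, 5, 14), (6, 8, 12), (9, 15, 13)]       # (lo, hi, total count)

# Sampling: keep a small sample table on the side and just evaluate the predicate on it.
sample = [{"age": 12}, {"age": 57}, {"age": 72}]
def est_selectivity(pred, table=sample):
    return sum(1 for row in table if pred(row)) / len(table)

print(est_occurrences(2))                            # ~2.67 estimated occurrences of key 2
print(est_selectivity(lambda r: r["age"] > 50))      # 2/3, assumed representative of the full table
```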
And so after we do all our rules, we want to start enumerating over a bunch of different plans, estimate what their costs are, and then pick the best plan we've seen within the amount of time we're allowed to search. Because this is NP-complete, we can't do an exhaustive search. So there are a bunch of parameters you can specify in these optimizers to say how long you're allowed to search. If my query is going to run for one millisecond but I'm going to search for 100 milliseconds, that's not a good trade-off. So we'll look at single-relation queries first, which are easy; multiple relations are the harder one. And for nested subqueries, we basically already saw that if we can rewrite them into joins, then they're essentially multiple-relation queries. So single relations are pretty easy. We know we're accessing one table, so we're just going to pick the best access method, and we can use our cost model to compare the expected cost of the different choices. If you have the right index, an index scan is probably always going to be the right choice — it actually depends on the where clause. And then you maybe try to determine what order you want to apply the predicates in, because you've split them up and can move them around as needed. For OLTP queries, these are often very easy to do, especially if they're equality predicates, because you know exactly what the best index is going to be just by looking in the catalog. These are called sargable queries — search-argument-able; it's a term from the 70s. The basic idea is that you know immediately the best index to choose for this query, you pick that, and everything else sort of falls into place. Again, you can use the cost model to estimate the expected number of tuples returned from an index for a given predicate and pick the one that filters out the most. So in this case here, this is the easiest thing possible: I have a table of people, and ID is the primary key. I look in the catalog and, since I know the primary key is unique and it's an equality predicate, I know immediately that the index on the primary key is what I want to use. But I could have multiple indexes giving me different choices — like the ID column plus the value column together — and you use the cost model to identify which one is better. All right, for multiple relations, there are two choices: bottom-up optimization and top-down. Most systems do bottom-up. It's a combination of, again, the rules and rewriting stuff we talked about before, and then the bottom-up optimization just for the joins, to figure out the join ordering. Postgres then also does additional rule-based optimization to clean things up after the bottom-up search. The top-down method is typically trying to be all-inclusive: it does all the transformations at once, which to me seems cleaner, but most people implement the first one. So let's go through these one by one. For bottom-up, again, we're going to use the static rules to perform the initial optimization we talked about before.
And then we use a dynamic programming technique, essentially like divide and conquer, to start with the unoptimized, raw relations, and then iterate through and evaluate different join orderings. Say you're doing three tables: for the first two tables, there are different combinations of those, then we do the three-way join above that with the remaining table, and then we produce our final result. I'll show what I mean by this visually, but this is how they implemented the first query optimizer that did cost-based search back in the 1970s in System R. And this is pretty much what the textbook, I think, describes as well. The high-level idea is still basically the same, but this was the first cost-based optimizer that existed for a database system. It carries over to DB2, and, as I said, most of the open-source systems, if they have a cost-based search, this is what they're going to do. So in System R, they're going to break the query up into blocks, as we talked about before, and then do all the logical operator optimizations that we talked about. And then for each logical operator, they're going to try to figure out what physical operators could possibly implement it. To simplify things, they're only going to consider the join algorithms and access paths; for things like aggregations, you basically use heuristics to pick those. And then they're going to literally construct the left-deep join tree that has the lowest cost. So this is one of the big assumptions they made back in the 1970s, and it still carries over today. Some systems don't actually do this anymore, but they made the choice that a left-deep tree was always the best way to implement a query with joins. Therefore they don't consider bushy trees, where you're joining two tables over here and two tables over here and then joining those together. It's always joining one more table as you go up one side of the tree. And you just throw away any right-deep trees as well, because they're symmetrical to the left-deep ones. So it's an example of a heuristic in the optimizer where you throw away a bunch of plans you're not going to consider, because it reduces the search time. Again, in the 1970s, with super slow computers, this was important for them. So this is the same query we had before. We're going to get all the artists that appear on my mixtape, but now we're going to add an ORDER BY clause at the end: order by artist id. So in step one, you choose the best access path for each table. Again, this is just rules. So we say for artist a sequential scan, for appears a sequential scan, and for album I have an index I can use to look up on name. Then I'm going to enumerate all possible orderings for the tables, so all possible combinations of joining the three tables in any order. And they're going to consider both inner joins as well as Cartesian products, but in actuality they would immediately throw away the Cartesian products because, again, we know we don't want to do that. And then for all these different combinations of join orderings, we do this dynamic programming to figure out what's the join order that produces the query result at the lowest cost. So again, it's a bottom-up approach: the bottom part is where you start.
You just have your three tables at the bottom, and then the top will be the three tables joined together. And the top isn't in any particular order; it just says that they've been joined. So in the first phase, what you're going to do is pick any two tables to join together and then also choose an algorithm for them. So in this case here, I can join artist and appears. This one joins album and appears. And that one over there joins appears and album. It keeps going on for all the other combinations, but I didn't want to draw all of that in PowerPoint. And then the path into this next logical operator here will have the different physical operators. So I can either do a hash join or a sort-merge join, and do the same thing for all of those. And for each of these, I can then use my cost model and say what's the execution cost for this physical operator to do the join on these two tables. And then, going up to the next logical node, I throw away everything except the one that has the lowest cost, right? Then I do the same thing for each of these logical operators, figure out the different physical operators to get me up to the three-way join above, find the one that has the lowest cost, and throw away all the others. And then I go back, I backtrack, and figure out which path down to the bottom has the lowest cost. And I can choose this one here. Is this clear? Yes. So your question is: at this point here, why do I consider these different combinations? Right. So the physical operator says, this one is joining with A2 as the outer and A3 as the inner, and this one has A3 as the outer and A2 as the inner. Yeah, you want the smaller table to be the outer table, but until you look at the cost model, you don't know. The cost model will figure that out. Yeah. So say it again. Should this have been artist and album? I mean the dot, dot, dot, the ellipsis, means it goes off, there's more. But it's all unique combinations, like appears-artist, artist-appears, right? Because of outer versus inner. Okay, so then we have all these different paths, and we just pick the one with the lowest cost. Yes. The question is, why is the left-deep tree considered optimal? It's not optimal. There could be a bushy plan that produces better results. Yeah, I don't have a slide for this. So say I want to join A1, A2, A3. A left-deep join would be A1 join A2, and then A3, right? So I'm going up the left side of the tree. Let me throw in an A4. So if you then had another join here, you just keep going up the left side. That's a left-deep join. I hate chalk. It's so gross. A bushy plan would be: A1 and A2, join those; A3 and A4, join those; then join those two results together. That's a bushy plan, right? So there are some cases where that is actually better. Back in the day in the 70s, since they had such limited memory, they wanted to do as much pipelining as possible. If you only do left-deep trees, then I can have pipelines go up as far as I can and keep things in memory. If I do a bushy plan, I've got to run this join, materialize the results, then run this join, materialize the results, and then possibly bring those back into memory. You've got to go back and get them, right? So back in the day, they wanted to avoid that. Some systems still apply the same trick as System R, but I think for modern hardware it's not needed anymore. All right.
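Here's a minimal sketch of the bottom-up search just described: dynamic programming over subsets of tables, building left-deep plans one table at a time, and at each join keeping only the cheapest physical operator. The table cardinalities, the join selectivity, and the hash join and sort-merge join cost formulas are all toy numbers for illustration; this is not System R's or BusTub's actual code.

```cpp
// A minimal sketch of System R-style bottom-up join ordering: DP over subsets
// of tables, left-deep plans only, keeping the cheapest plan (and cheapest
// join algorithm) per subset. All costs and selectivities are toy values.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

struct Plan { double cost; double rows; std::string desc; };

// Toy per-join costs in "work units": hash join is linear in both inputs,
// sort-merge pays an extra log factor for sorting.
double HashJoinCost(double l, double r) { return l + r; }
double SortMergeCost(double l, double r) {
  return l * std::log2(std::max(l, 2.0)) + r * std::log2(std::max(r, 2.0));
}

int main() {
  std::vector<std::string> names = {"artist", "appears", "album"};
  std::vector<double> card = {10000, 500000, 30000};
  const double kJoinSel = 0.001;  // assumed selectivity of each join predicate
  const int n = static_cast<int>(names.size());
  const double kInf = std::numeric_limits<double>::infinity();

  std::vector<Plan> best(1 << n, {kInf, 0.0, ""});
  for (int i = 0; i < n; i++) best[1 << i] = {0.0, card[i], names[i]};

  // Grow left-deep plans one table at a time, subset by subset.
  for (int s = 1; s < (1 << n); s++) {
    if (best[s].cost == kInf) continue;
    for (int i = 0; i < n; i++) {
      if (s & (1 << i)) continue;
      double hj = HashJoinCost(best[s].rows, card[i]);
      double smj = SortMergeCost(best[s].rows, card[i]);
      const char *algo = hj <= smj ? "HashJoin" : "SortMergeJoin";
      double cost = best[s].cost + std::min(hj, smj);
      double rows = best[s].rows * card[i] * kJoinSel;
      int t = s | (1 << i);
      if (cost < best[t].cost) {
        best[t] = {cost, rows, "(" + best[s].desc + " " + algo + " " + names[i] + ")"};
      }
    }
  }
  const Plan &b = best[(1 << n) - 1];
  std::cout << "best left-deep plan: " << b.desc << "  cost=" << b.cost << "\n";
}
```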
The last thing to point out is that in System R, in the original implementation, they had no notion of the physical properties of the data that each operator produces. So even though my query had an ORDER BY, there was nothing in how they represented the query plan to say, oh yeah, the data is coming out of this operator in this sorted order. The way they would handle that is, in the cost model, you basically have to reintroduce a sort operator somewhere for the ORDER BY, and have the cost model recognize that that is more expensive to do. But there was nothing in the original implementation of System R's optimizer that could keep track of how the data is coming out sorted; only after you produce the best plan do you go patch it up. Yes. The limitation of System R here was that they didn't have a notion of sort order, so they couldn't recognize that a sort-merge join would have been better than a hash join. Newer systems maintain additional metadata; we'll see how they do that in the top-down approach. All right, so the top-down approach is from this guy Goetz Graefe, who keeps showing up: the Volcano model, the B+tree book. He had a system also called Volcano; that's where the Volcano iterator model comes from. He had a series of papers in the 1990s that said, here's how to build a top-down optimizer. Yes? Yeah, so her statement is, and I mentioned this before, if we're doing this NP-complete search, isn't the search going to take longer than the query? If you let it run forever, yes. So there's usually a timeout to say, if I've been searching for more than this many milliseconds, stop. Or if I haven't seen a better plan after so many milliseconds or seconds, stop. You have to cut it off, yes. All right, so I'm going to quickly go through this. It's a different way to think about how to do query optimization. And as I said, this is what SQL Server does. I'll post a link on Slack to a talk, I think from during the pandemic a couple of years ago, that describes their query optimizer. Hands down, it's the best talk I've ever seen on query optimization. And what they do is super, super sophisticated. They have the best query optimizer. I'm not saying this because I like Microsoft or whatever; the research shows that their query optimizer does better than all the other ones. They do the sampling technique I talked about. They do this top-down approach. It's very, very fascinating, and I think it's state-of-the-art. But it's not state-of-the-art just because it's doing top-down. There are other state-of-the-art systems; the DB2 one is pretty good as well. CockroachDB is also top-down. But again, they've been doing this a long time. Microsoft basically hired Goetz Graefe in the 90s and said, rebuild our query optimizer. They paid millions and millions of dollars to do this with a big, big team. And it shows. Their query optimizer is phenomenal. Postgres, as much as I love it, just doesn't have those resources, so it's nowhere near as sophisticated as what SQL Server can do. So let me show you what top-down looks like. With top-down, you start at the top and say what I want my result to be, what I want my query plan to look like. In this case here, we also introduce these property values that say what we want the data to look like. So we want to join artist, appears, and album, and then we want the data to be sorted on artist id.
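To make that sort-order point concrete before going further, here's a tiny sketch of tracking an output ordering as a physical property of each candidate plan, which is what the original System R optimizer lacked: once the required ORDER BY is charged to plans that don't already produce it, a sort-merge join can beat a "cheaper" hash join. The operators, costs, and column name here are illustrative assumptions.

```cpp
// A minimal sketch of a physical property (output sort order) attached to
// candidate plans, so the optimizer can tell that a sort-merge join already
// satisfies an ORDER BY while a hash join needs an extra sort.
#include <iostream>
#include <string>

struct Candidate {
  std::string op;
  double cost;
  std::string sorted_on;  // "" means no useful output order
};

// Add the cost of an explicit sort only if the required order isn't produced.
double CostWithRequiredOrder(const Candidate &c, const std::string &required,
                             double sort_cost) {
  return c.cost + (c.sorted_on == required ? 0.0 : sort_cost);
}

int main() {
  Candidate hash_join{"HashJoin", 1000.0, ""};
  Candidate merge_join{"SortMergeJoin", 1400.0, "artist.id"};
  const double kSortCost = 600.0;  // assumed cost of an explicit sort

  double hj = CostWithRequiredOrder(hash_join, "artist.id", kSortCost);   // 1600
  double mj = CostWithRequiredOrder(merge_join, "artist.id", kSortCost);  // 1400
  std::cout << (mj < hj ? "SortMergeJoin" : "HashJoin")
            << " is cheaper once the ORDER BY is accounted for\n";
}
```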
So now what we're going to do is conversions from logical plans to other logical plans and from logical plans to physical plans. We just enumerate a bunch of different choices down here: here you can do these two joins, and then we can do our scans. So we go one tick below and say, okay, if I convert this join above into a join on these two tables, the two tables I've already joined, and the current cost of the query plan is still less than the best cost I've seen so far, then I'm allowed to keep going down in the tree. It's like a branch-and-bound search. As soon as I have an intermediate result in my query tree that I know is worse than the best plan I've seen so far, I can stop and I don't need to go further down the tree. So in this case here, I'm allowed to go here, and then I look at how to join the result of this join with the scan of the table here. That looks good. So now I go down and say, how am I going to do this join? I consider a hash join, look at the cost there, look at different access methods to get this data in. I go back up and say, all right, let me also consider the sort-merge join. Same thing: I go down here and look at the different ways to scan this data in. Right? So I can do transformations from logical to logical and from logical to physical. They also have this notion of enforcer rules, which require a subtree of my query plan to produce data with certain physical properties that I need at that point in the tree. So in this case here, at the root of the query plan, I need my data to be sorted by artist.id. So as I'm doing the traversal, if I consider a hash join feeding into my root node here, the hash join will scramble the order of the data, right? So the hash join would not be able to satisfy the enforcer rule that says the data must be sorted. So I know I can discard it and don't need to traverse down that part of the tree at all. I could also have a quicksort, a sort operator, here. And again, this shows that I'm converting logical operators into other new logical operators; it's not always going to be a one-to-one mapping. So in this case here, this operator does guarantee that the data is sorted, but if I consider a hash join going into it, I would recognize that, okay, this is already more expensive than the best plan I saw before because I'm doing this additional quicksort. Therefore, I know this is not where I want to go down the tree, and I cut it off. This is a gross, gross simplification. There's obviously a lot of optimization you can do to memoize things and speed things up, but we'll cover this more in the advanced class. I just want you to get the notion of what bottom-up is versus top-down, and what the trade-offs are. Okay? Yes? Is this sort of like greedy search, where you make a decision at each level? The question is whether this is like greedy search. Yes, though there are rules you apply... let's just say yes, it's a greedy search. Keep it simple. Not quite simulated annealing, but there are also ways to jump to a random part of the tree and search from there to get out of a local minimum. We can ignore that. Okay, so that's query optimization in an hour and 20 minutes. And that's it. It's the hardest part of databases.
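And here's a stripped-down sketch of the branch-and-bound flavor of that top-down search: expand alternatives level by level, but stop descending as soon as a partial plan's accumulated cost is already worse than the best complete plan found so far. The "levels" and costs below are just placeholder numbers, not real operators or enforcer rules.

```cpp
// A minimal sketch of branch-and-bound pruning in a top-down plan search:
// recursively expand alternatives, pruning any partial plan whose accumulated
// cost already exceeds the best complete plan seen so far.
#include <iostream>
#include <limits>
#include <vector>

struct Step { std::vector<double> choices; };  // cost of each alternative at this level

void Search(const std::vector<Step> &steps, size_t level, double cost_so_far,
            double &best) {
  if (cost_so_far >= best) return;             // prune: already worse than best plan
  if (level == steps.size()) { best = cost_so_far; return; }  // complete plan
  for (double c : steps[level].choices) {
    Search(steps, level + 1, cost_so_far + c, best);
  }
}

int main() {
  // e.g. level 0 = join algorithm, level 1 = access method, each with toy costs.
  std::vector<Step> steps = {{{40.0, 25.0}}, {{30.0, 10.0, 50.0}}};
  double best = std::numeric_limits<double>::infinity();
  Search(steps, 0, 0.0, best);
  std::cout << "best complete plan cost = " << best << "\n";  // prints 35
}
```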
So again, I just want you to understand that when you call EXPLAIN and you see these numbers, this is basically where they come from, and this is how the physical plan gets generated. We saw how to use static rules and heuristics to optimize the query plan without knowing much about what's actually inside the database. That's not entirely true, because you do need to know basic counts and what the schema looks like, but we're not scanning the actual data and we're not doing any estimation. And then if we want the more advanced optimizations, like join ordering, which is the most important one, we use a cost model. Okay? So today's class was the hardest part of databases. The next class is the second hardest part, which is transactions. This is my favorite topic; it's one I actually do know something about, more than query optimization. So for the next, I think, two or three weeks, we'll cover transactions, and then recovery and logging. And then that's it for the basics of databases, and we move on to distributed databases. Okay? Bye, guys. See ya.