So we are going to talk about, yep, question. It's during office hours, and for the regrade requests, yes. And if you can't make it or something, let us know and we'll figure something out, okay? All right, other questions? Yep, come see us about Gradescope and we can go through the answer keys over there. The latest slides are on the website, yes, and if they're not, let me know; I think they went up this morning, along with the notes and the video, so the lecture from Monday should be up there. Are you not seeing it? Sorry, I'll put this deck up right after the talk. It's a whole new set of slides from what was there last time, so just pay attention and I'll get you through all of it. Other questions? Okay, all right.

So let's get started. We are going to take a look at query optimization. We're going to look at this query over here, which is a very simple select query, and we are going to walk through the motions of optimizing it. To set up the machinery for query optimization, we first need access to this thing called the catalog, which I know we've talked about briefly, but we'll understand far more deeply today. Think of it as metadata for all the data that you have: a mini database, if you wish. It has information like: I've got two tables in the system, the employee (emp) table and the department table. Along with that, it has stats such as: there are 10,000 records in the emp table, and 500 records in the department table spread across 50 pages; basically 10 records per page in each table, so the emp table takes 1,000 pages. Those stats get stored in the catalog along with everything else. I've also drawn little triangles. A dark triangle means there's a clustered index built on that attribute, so there's a clustered index on SSN in the emp table, and unclustered indexes on the emp name and department ID fields. That's the setup.

Now we are going to start optimizing this query. The first query plan we start with is a literal implementation of the way you would evaluate this query following SQL semantics: take everything in the FROM clause, do a Cartesian product (that gives you emp and department paired up), then apply the selections, meaning the join condition followed by the selection on the department name, and finally the projection on the employee name. An optimizer is going to consider a variety of plans, including a plan like this. Now, this one has a Cartesian product, so most optimizers won't actually consider it; this is just to get us going today. What we have to do is figure out the cost of this plan. And we have to figure it out without actually running the plan, because, as we'll see very soon, the number of plans you can produce from a query is very large; it's an NP-hard problem. We can't run the query to figure out whether we should run the query, right? That's oxymoronic. So we have to look at the query, figure out how much it costs, then look at other ways of getting the same answer under SQL semantics and figure out whether there's a better way to do it. What's the cost of this plan? So we're going to start working like an optimizer. We've enumerated our first plan.
The first thing an optimizer does is enumerate plans. We've done the first one. It's a logical plan, as we'll see: just a view of what the query execution would look like. Then we start asking how much it costs. We reach into our catalog, along with cost models we've written in code, and start making assumptions. For this Cartesian product operation, I have to read the two inputs. The first term is 50: I read the 50 pages of the department table as my outer. Then, for each of those 50 pages, I scan the 1,000 pages of the emp table; imagine doing nested loops to compute the Cartesian product. So 50 times 1,000 is the second term, the cost of doing the product itself.

But now I'm going to produce an output, so I have to figure out what that output size is. I'll assume my stats are correct: 10,000 records in one table and 500 in the other. What's going to be the size of the output in terms of number of records? It's a Cartesian product, so every record will pair with every other record: 10,000 times 500 record pairs. And what is the space for each of those records? In the original tables, we could fit 10 records per page. Each output record is twice as big, twice as wide, so we'll fit five on a page. So 10,000 records multiplied by 500 records, which is five million, divided by five, tells me this output is a million pages long. Does that make sense? The output cardinality is just the product of the input cardinalities; that's how Cartesian products work.

As you can see, we are starting to cost these, and these are all guesses, but pretty good guesses based on the stats we have. The stats could be wrong: maybe a page count or record count is not perfect, and we'll get to that in a bit. But assuming what we know, we take the costs we know plus these other estimates and guess the cost of this operation, okay? Have you been with me so far?

We get five tuples per page again? How did I get five tuples per page? Because the Cartesian product puts an emp record and a department record side by side, and I'm projecting all of those columns out. So the output record is twice as wide, twice as many bytes per record as the input, because I'm combining the attributes from both tables. We're just taking all of those columns and passing them along, keeping them all across. Okay, yep.

Ask again, sorry: why is the first entry 50? It's nested loops. I read the 50 pages of my outer block, and then I scan over the other table. So it's just that cost. Of course, I could have implemented the Cartesian product a slightly different way and gotten different numbers for the first two terms; those come from the assumption of how the algorithm is implemented, and we'll get to details of that as we switch algorithms around. But that's a great question: how am I getting all these costs?
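To pin down that arithmetic, here is a minimal Python sketch of the costing step, using the lecture's stats. The function and variable names are illustrative, not from any real optimizer.

```python
# A minimal sketch of the costing step above, assuming the lecture's
# stats. All names here are illustrative.

EMP  = {"records": 10_000, "pages": 1_000}   # 10 records per page
DEPT = {"records": 500,    "pages": 50}      # 10 records per page

def cartesian_product_cost(outer, inner, out_tuples_per_page=5):
    # Nested-loops style: read the outer once, then re-scan the
    # inner once for every outer page.
    read_cost = outer["pages"] + outer["pages"] * inner["pages"]
    # Every record pairs with every other record.
    out_records = outer["records"] * inner["records"]
    # Output records are twice as wide, so only 5 fit per page.
    out_pages = out_records // out_tuples_per_page
    return read_cost, out_records, out_pages

print(cartesian_product_cost(DEPT, EMP))
# (50050, 5000000, 1000000): 50 + 50*1000 reads, 5M pairs, 1M pages written
```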
I'm guessing based on algorithmic properties and input stats, trying to get to a rough number that is not too far off. That's all we care about; it doesn't need to be precise to the last bit over here. Okay? Other questions? All right.

Next: what does the selection applying the join condition, emp.DID = dept.DID, cost? Here, we're going to assume we know there's a foreign key relationship between the two tables. So this operation is going to read in the million pages that were written by the previous one; that's the first term. Then it writes its output. How big is its output? Estimated at 2,000 pages of writes. Why? If the primary key/foreign key relationship holds, then for every record in the emp table there is exactly one matching department, because department ID is a key. That means after I apply this selection, my output has the same number of records as the emp table: 10,000 records. Five of these bigger records we are now working with fit on a page, so the output is 2,000 pages. That's how much I write. Everyone with me so far? Again, I'm looking at the schema and guessing, but I'm making good educated guesses; that's what an optimizer has to do. And as you see, it's getting harder as we go up the tree. The first operator was easy: no guesswork. The second builds on what the first operator output, and if you make an estimation mistake there, the error carries through. But we have to find a way to cost it, and that's what we are doing.

Next stage, we apply the selection department name = 'Toy'. Again, educated guessing. Department ID is a key in the department table, and there are 500 records, so on average the probability that a record's department is named Toy is one over 500. We're just assuming a uniform distribution of employees across all the departments, which may not be true. Maybe every employee works in the toy department, for example, or no one works in the toy department, but we are averaging things out, assuming uniformity with respect to that key relationship. So that operation is going to read, as you might guess, 2,000 pages, and write four. That four is simply 10,000 records divided by 500, divided again by five, because that's how many records fit on a page. Four pages is the output of that operation.

Last, we project out the employee name, just the one column the query wanted, reading the four pages and writing out about one. So we've got a correct plan: we took a logical view of the query, wrote it down here, and came up with a cost for the first way we said we could run this query. And that's about two million I/Os. Add all those numbers up; trust me, it comes out to about two million I/Os.
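The same guesswork, written out as a sketch. The uniformity assumption does all the work here; the names and constants are illustrative.

```python
# Sketch of the selectivity guesses above. Illustrative names only.

def eq_selectivity(distinct_values):
    # For col = constant on a column with V distinct values,
    # assume uniform data: selectivity = 1/V.
    return 1.0 / distinct_values

TUPLES_PER_PAGE = 5            # the wide, post-product records

# Join predicate emp.DID = dept.DID: the key relationship keeps one
# department per emp record, so 10,000 of the 5M pairs survive.
join_out_records = 10_000
join_out_pages = join_out_records // TUPLES_PER_PAGE           # 2,000

# dept.name = 'Toy': 500 departments, uniform => 1/500 survive.
toy_out_records = int(join_out_records * eq_selectivity(500))  # 20
toy_out_pages = toy_out_records // TUPLES_PER_PAGE             # 4

print(join_out_pages, toy_out_pages)  # 2000 4
```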
Can we do better? That's the job of the optimizer. Can I do better? Now we start enumerating: what are other ways I can run this query? Here's another way. So again, this is a logical query plan, and it says: do a join first. Don't do a Cartesian product, that's dumb. Do a join between these two tables, then do the rest as we did before, okay? Now the question is, which join do we use? We're getting a little more detailed in this plan space. Let's do a page nested loops join. Plug in the formulas from the last few lectures, from when we looked at query operators: that's 50 plus 50,000, and then 2,000 pages of writes. The first two terms come from plugging into the nested loops cost formulas we have from before; that's why those formulas are important. The 2,000 is exactly the same as what we had before in that upper part of the tree. All we did is take the two bottom operators and combine them into a join, which is much more efficient. Add the rest of it up, nothing changes there, and you end up with about 54,000 I/Os. Very different from the 2 million I/Os we had before.

Can we do better? And it's actually good you don't have the slides yet, so I want you to think right now; I'll put the slides up immediately after the talk. So let's see: can we do better? Yep. Yeah, so basically: why are we carrying all these department columns all the way across? Just grab the department name and department ID and bring those along. Definitely, we will do that; it's projection pushdown, as we'll see in a little bit. Another idea? Okay, those are indexes on the tables; I already mentioned the dark ones are the clustered indexes. Yep, exactly: we could change the algorithm. Why did we look at all those join algorithms? Because we know some are better. Page nested loops is probably not as good, right? So let's switch the join algorithm to a sort-merge. Let's just say I pick that; how much would it cost? With sort-merge, I also need to know how much buffer pool space I'm going to get, because my cost depends on it: do I do a two-pass algorithm or a four-pass algorithm? That depends on how much space I have. That's why we needed all these numbers. So now we're adding a little more detail. I'm creating a plan, but for this plan to work with a sort-merge join, I'm saying I need 50 pages of buffer pool. Plug in the formulas from class: that comes to 3,150, plus the same 2,000 writes carried over.

Yep, very good; I didn't even put the slide up, that's great, because that's exactly what we're doing next. You're thinking like an optimizer now: how can I do this well? We spent all that time in query execution saying all of these things matter, and now we're asking how to build the plan that makes all of that work. That's exactly right. So if we chase this down, simply switching the algorithm to sort-merge join brings us down to a dramatically smaller number, about 7,000 I/Os.
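For reference, here is a sketch of the cost formulas being plugged in, with M = 50 department pages, N = 1,000 emp pages, and B = 50 buffers. The sort-merge formula below assumes B is large enough that each input sorts in one read-plus-write pass, which is what yields the lecture's 3(M + N) = 3,150.

```python
def page_nested_loops(M, N):
    # Read the outer once, re-scan the inner for each outer page.
    return M + M * N            # 50 + 50*1000 = 50,050

def sort_merge_join(M, N, B):
    # Assumes B*B >= max(M, N): sort each input in one read+write
    # pass, then merge-join both sorted inputs in a final pass.
    assert B * B >= max(M, N), "need roughly sqrt(N) buffers"
    return 3 * (M + N)          # 2(M+N) to sort + (M+N) to merge

print(page_nested_loops(50, 1000))    # 50050
print(sort_merge_join(50, 1000, 50))  # 3150
```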
And exactly as was just suggested, that 7,000 is with materialization. Why are we reading and writing at every point? Could we just pipeline, as we've talked about? Pipelining says: just produce the output and send it across. When we talked about the vectorization model, we said it might do a vector at a time, but regardless, we are not landing all the data in a temp file at the end of each operator and reading it back in the next one. We're doing something far simpler. So that's exactly what we'll do. With pipelining, you cross out all the back-and-forth reads and writes. Assume you can fit everything in memory; in this case the plan is simple, you just needed the 50 buffers for the sort-merge, but imagine you had one join feeding into another join, stuff like that. You'd need to count all the memory you need to make that work. If you have it, you can pipeline everything through, and now we have about 3,000 I/Os. We came down from 2 million to 3,000. So you see why the folks who know how to write good optimizers are often the highest paid people in database companies. And it's not just this; obviously we'll do a lot more, and you'll see it get a lot more complex, but hopefully the size of the difference is coming through.

Can we do better? Yep? Yeah, great question: should we be thinking about worst-case costs as opposed to best-case costs assuming everything fits in memory? Exactly, that's true. Here, for example, at query planning time I'm hoping to get 50 buffer pages. But what if I start running the query and I only get five? So the question is, what do I do? Should I plan for that? Absolutely. The most standard way is to make a plan, start running, and then say: whoops, at runtime, what I thought was going to happen isn't there. Can I adapt the plan on the fly? That's adaptive query processing, adaptive query optimization: interleaving optimization and execution. Take the advanced graduate database class and we talk about stuff like that. Right now we're assuming optimization happens, then processing, but you can interleave them, especially as you realize something bad has happened. You can do other kinds of things too. For example: I'm expecting the input to that second operator to be 2,000 pages; at runtime I see that I've already passed 20,000 pages and it's still going. I'm way off; the rest of the plan is probably bad too, so I need to re-optimize or do something different. So you can keep track of all the calculations we're doing as stats when we create the plan: this is what I expected, and if it's not what happens, maybe I should do something about it. There are other options too, including killing the query: I've just started, I'm way off, go back to the optimizer and ask it to dig a little deeper. That's not as commonly done; the more common method is to adapt what you've already done and change the rest of what you haven't executed yet. Again, those are advanced topics, but I like how you're thinking.
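Here is the pipelining arithmetic as a small sketch, using the lecture's approximate numbers. Each materialized boundary costs a write plus a re-read, and pipelining simply deletes those terms.

```python
# Sketch of what pipelining removes, with the lecture's rough numbers.

join_cost  = 3_150   # sort-merge join of emp and dept (50 buffers)
join_out   = 2_000   # pages the join writes, then the selection re-reads
select_out = 4       # pages the selection writes, then projection re-reads

materialized = join_cost + 2 * join_out + 2 * select_out
pipelined    = join_cost     # tuples stream straight between operators

print(materialized)  # 7158, the "about 7,000 I/Os" figure
print(pipelined)     # 3150, the "about 3,000 I/Os" figure
```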
So we can do a little better, along the lines of what was suggested: take the plan at hand, which cost about 3,000 I/Os, go a little deeper, and flip around the join and the selection. This is called selection pushdown. All I need is the records that are in the toy department; I didn't need to do the sort-merge join with the records from all the departments. And now, if I have an index, as was just mentioned, maybe I can use it. I'm counting that as three reads: say a two-level index, plus one more read to actually fetch the record. There's only one matching record because I'm assuming department names are unique; no two departments called Toy. Then I'll do an index nested loops join, because I've got an index on the other table's join column, and now I have a much simpler plan. Plug in the straightforward formulas: I can do this with 37 I/Os. So we flipped things around, changed the shape of the tree as you saw, and pushed that selection down.

Now, in the grand scheme of things: questions? Yep, yeah, I'll come to that; we have a ton of material to come. I'm trying to get through query optimization in an hour and ten minutes, so wait for it, and if I don't get there, I'll definitely answer that question. It's coming. Other questions about what we just covered? Is that foundation clear? Because otherwise you'll be lost for the rest of the lecture. Okay, all right, let's keep going.

Let's say we decide this is the plan we want to execute. We enumerated a bunch of plans, costed them, and said this is the best we can do. There are many more plans, by the way, that we could enumerate, but we've done this and now we're ready to send it to the scheduler. But, like the question that was asked, I can't send this as-is. This is a logical view of the query; I have to do other things with it. All this red stuff is annotations I add, which say things like: for the first selection, on the department table, the access path (remember, that's the technical term) is the unclustered, non-clustered B+tree; the access path for emp is a file scan; I'm going to pipeline here; I'm going to use an index nested loops join; and so on. All the details about what we want the runtime system to do get put into a data structure that captures the tree along with all the red stuff, and that gets sent to the query scheduler we talked about two lectures ago, which runs the tree bottom-up or top-down, depending on how we've set up the rest of the query execution machinery. More stats get added too, like "I expect this to be 2,000 pages as an input"; I'm not showing that here. These annotated trees are called physical plans, as compared to the logical plan we saw before, which didn't have these details. A physical plan is something you can run: it tells you precisely which algorithm to use, and if you have stats annotations you have choices for dynamic re-optimization, but we won't talk about dynamic re-optimization today. Take the advanced database class if you're interested in that topic. All right: logical plan, we enumerated and costed; once we found something, we converted it to a physical plan; that's what you send to the scheduler, okay?
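As a rough sketch, a physical plan node might carry something like the following. The field names are invented for illustration; every engine has its own plan representation, but this is the kind of information the red annotations add.

```python
# An illustrative sketch of what the annotations turn a logical node
# into. All field names are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class PhysicalOp:
    logical_op: str                  # e.g. "SELECT(dept.name = 'Toy')"
    algorithm: str                   # e.g. "index nested loops join"
    access_path: str = "none"        # e.g. "non-clustered B+tree on name"
    pipelined: bool = True
    expected_pages: int = 0          # stat kept for runtime checks
    children: list["PhysicalOp"] = field(default_factory=list)

dept_select = PhysicalOp("SELECT(dept.name = 'Toy')", "index probe",
                         "non-clustered B+tree on dept.name",
                         expected_pages=3)
root = PhysicalOp("JOIN(emp.DID = dept.DID)", "index nested loops join",
                  "index on emp.DID", expected_pages=4,
                  children=[dept_select])
print(root.algorithm, root.expected_pages)
```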
Now, the NP-hard question that was asked: the space of plans you can search is exponential in the number of joins, and finding the optimal plan has been shown to be NP-hard. If I've got 10 joins, I'm looking at something like 10 factorial orderings, and practically, in the wild, you might sometimes see 100-join queries; this is not unheard of. You would never be able to finish the optimization; it might take years just to find the optimal plan. So the question is, what do you do? You're not going to look at the entire space of all the plans. If the entire space is this yellow cloud, you're going to design your optimizer to look at only a small portion of it, so it's practical. And then the whole art is about defining that portion to be the space in which you'll find the good stuff. The goal of an optimizer is not to find the optimal plan over all possible plans, because that's too much; it would take orders of magnitude longer to optimize than to run something. The goal is to not make very bad decisions. Don't run the Cartesian product. Don't run really bad stuff. It's to find a good plan within whatever time budget you have, and you often have a limited budget, because you can't say "I have a query, I'm going to take an hour to optimize it... oh, and it actually takes a minute to run." You also can't say "I'm just going to take a minute to optimize... oh, and the best plan I came up with takes a day to run." You're going to have to find that balance, and there's a lot of art that goes into it. Regardless, you're going to need smart ways to decide which space to hunt in: when I'm hunting or enumerating all these plans, how can I do it in a smart way? That's what we'll talk about. And there isn't just one way; we'll talk about one way today. That's why optimization is an evergreen field: as workloads change and things get more complex in data platforms, how to do this keeps changing. So if you're an optimizer person, you have a job for life, and probably the highest paid database employee in the company is an optimizer person. But you've got to know the math, and you've got to get comfortable with the algebra, because that's how we'll enumerate these plans; that's what all the papers are, and they write everything in terms of those transformations.
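To see the blowup concretely, here is a quick back-of-the-envelope count of join orderings alone; join algorithms and access paths multiply the space further. The arithmetic is standard, the framing is illustrative.

```python
# Counting join orderings: why exhaustive search is hopeless.
import math

def catalan(n):
    return math.comb(2 * n, n) // (n + 1)

for n in (3, 10):
    left_deep = math.factorial(n)               # left-deep orders only
    bushy = math.factorial(n) * catalan(n - 1)  # every binary tree shape
    print(n, left_deep, bushy)
# n=10: 3,628,800 left-deep orders and roughly 1.8e10 bushy plans.
```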
So this is just reiterating that we have a logical way of representing the query, which says "here's how I'm going to run it," and you can manipulate that: push the selection down, stuff like that. And then there's a physical space in which you decide very specific things. Which join do I run? Which selection implementation? Do I use an index? So that's the physical stuff. As you'll see, there are different styles of optimization, and one style works on both together, but the main style we'll start with works purely in logical space, finds the best logical plan, and then converts that into the physical plan. It's kind of simpler, and it's how the whole optimization game started in the field. So let's dig into it. In the optimization module, we have to enumerate, as we enumerated a few plans for that very simple query, but we also have to search that space and apply costs. Now, this enumeration can be done by applying rules, and I'll show you some rules soon. If I've got A join B, I know that's the same as B join A; joins commute that way. Just keep applying the rules: every time you apply a rule that is safe, you get a new plan; then see if it's cheaper, right? The other option is to do something more intelligent and search those combinations in a cost-driven way. Practical optimizers use a little bit of both. They're not orthogonal to each other; they can actually be combined, okay?

So let's make it a little more practical: what are rule-based ways of enumeration, and what are cost-based ways? We'll start with the rule-based stuff, and you already saw the three most popular rules in the example. The first one is predicate pushdown. In the rule-based method, you simply apply the rule. You don't need to know anything about the data; you just apply it. For cost-based decisions, you do have to know about the data: what does it cost, and should I do it one way or the other? Rule-based decisions are blind. Just from the structure of the tree, without knowing how many records are where, you can apply the rule. The application may result in a worse plan or a better plan; that's not the consideration. It's just a way to enumerate, right? Eventually you may end up costing something, but the rule-based enumeration itself doesn't require knowing anything about the data.

Here's a simple one. Even if I told you nothing about how many records are in emp and department, when you see a selection you can say: you know what, selections are usually more selective than joins. It's probably a good idea, more often than not, to push the selection down. It's just a rule of thumb; I don't care how many records there are. In general, this is going to be good. You could build an optimizer that just applies rules like this; from what I understand, some optimizers are basically rule-based and do simple rules like that. It's simple to build, gets you a whole bunch of optimization benefit, and doesn't require building a far more complex optimizer. It's often where you would stop, okay? Really simple. Hopefully you can see why, in general, this is a good idea.

Now, if you want to be the highest paid database employee, you have to start getting familiar with terms like this. See what I put down over there? That is a relational algebra expression, which I know we've covered before; it's the algebraic representation of the tree. So let's walk through it. Whoops, come back, please. Okay. The first expression over here basically says: look at the inner bracket first, which is department joined with employee. That flows into the selection, which is up over there, and then into the projection. The bracketing gives you, in this mathematical form, exactly what the tree looks like. And when papers talk about a transformation, they'll just say "this expression transforms into that one"; that's predicate pushdown. As you can see, this is the rewrite of predicate pushdown, where the selection moved inward, before that join happens.
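Here is a minimal sketch of what applying that rule mechanically looks like. The dict-based plan representation and all names are made up for illustration; the point is that the rewrite fires on tree structure alone, with no statistics consulted.

```python
# A minimal sketch of a rule-engine step for predicate pushdown.
# Plans are nested dicts; real optimizers use richer representations.

def tables_of(node):
    if node["op"] == "scan":
        return {node["table"]}
    kids = [node[k] for k in ("left", "right", "child") if k in node]
    return set().union(*(tables_of(k) for k in kids))

def push_down_selection(node):
    # Pattern: select over join, where the predicate touches only the
    # join's right input. Rewrite: move the select below the join.
    # Safe by structure alone; no statistics consulted.
    if (node["op"] == "select" and node["child"]["op"] == "join"
            and node["pred_table"] in tables_of(node["child"]["right"])):
        join = node["child"]
        join["right"] = {"op": "select", "pred": node["pred"],
                         "pred_table": node["pred_table"],
                         "child": join["right"]}
        return join
    return node

plan = {"op": "select", "pred": "name = 'Toy'", "pred_table": "dept",
        "child": {"op": "join",
                  "left": {"op": "scan", "table": "emp"},
                  "right": {"op": "scan", "table": "dept"}}}
print(push_down_selection(plan))  # the select now sits below the join
```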
This is just regular mathematical expression manipulation, but you've got to get it into your head if you want to play around with optimizers and understand what they do. Not rocket science, but you just have to get comfortable with relational algebra, okay? Questions?

Okay, the other simple one we also saw: Cartesian products are nearly always going to be a bad idea. The one case where they're not: if emp and department both had one record, you're not going to do much better, so you might as well do a Cartesian product, right? There are cases like that, very rare, but nearly always it's going to be a bad idea. So this rule just says: hey, replace that with a join. And again, you can write that in relational algebra, where the inner bracketed term was the Cartesian product with the selection applied above it, and you replace the pair with a natural join, represented here by the bowtie operator on DID. And since the DID column is common to both, you could even omit the condition, and that just becomes the same as the natural join syntax we've seen in SQL, but now written algebraically, okay? Everyone getting comfortable with algebraic manipulation of trees and logical plans? Questions?

Yep, yeah, great question. So again, these rewrites may not always win. Predicate pushdown may not help if, for example, every department were the toy department; then it's the same cost. Or it might turn out that the join is more selective: if the join between department and employee ended up having zero records, because all the employees had nulls or no one was assigned to any department, then the join would have been more effective at throwing records away, and there's nothing left for the selection to do. Sometimes a join can reduce the data flowing up more than a selection can. So again, these are rules of thumb; based on what data actually sits in the tables, you might get one outcome or the other. It's not guaranteed that this is always better. Both of the rules we've looked at are nearly always the better way to do it, okay? But not always, not always.

Here's a third one, which is projection pushdown. I guess we didn't see this one, and I don't know why this is blanking out on me. All right, there we go. Projection pushdown is effectively saying: I am carrying all the employee information all the way through this pipeline, but all I needed from the employee table was the name and the department ID, right? I need the department ID for the join because it's the join key. Why carry the rest of the columns of the employee table? Push that projection down. And as you can see, you're going to have less data flowing across, so all the costs go down, okay?

A lot of the rule-based stuff will be things like this. I'm not going to read through all of these; you can take a look, and this is not even all the rules. There can be hundreds of rules like that. But let's take a look at one or two that are going to be super important. So look at this one over here, which is join commutativity. If I have R natural-join S, I can flip those around and, logically, I get the same result: the same set of records. So I can flip that around. Why would I? Because maybe that's better. As you remember from nested loops, sort-merge, and the other algorithms, there's an asymmetry between the inner and the outer, and you can exploit that.
Or it may be that I have an index I can leverage on the relation if it's the inner one, and I want to flip them around. Okay, but it's a rule. I can flip it; these rules mean "apply this transformation, the query semantics remain the same, it is safe." Then figure out separately whether it's a good idea or not. In the pure rule-based style, you apply it and assume it's a good idea. You'd have a policy saying: I will always do selection pushdown and projection pushdown and remove Cartesian products, and that's my optimizer. And that's fine; you'll get a pretty decent optimizer with very little effort.

Let's look at the second property, the next one over here, which is join associativity: R join S join T. Bracketing one way, I join S and T first; that's one plan. But I could also switch it around, join R and S first, and then join with T. Why would I do the second one? Because maybe R join S produces very few records, maybe zero, maybe just one, while S join T might produce a billion records. I would rather do the smaller join first, because then I have less work to do; I'm not carrying all of that data across, right? Okay. And by the way, there are very nice parallels between these join properties and matrix multiplication ordering; if you're a numerical optimization person, you'll see they do similar types of things, but with relational algebra we have a richer algebra and a richer set of manipulations. So it's, in my admittedly biased view, the more fun optimization problem.

Okay, now I'll let you look at this list. I don't want you to memorize all of it, but I do need you to know the three rules we talked about and the two join properties; they're fair game for the exam, if you're worried about it from that perspective, okay? I do need you to know what a rule-based method does: it doesn't have to look at the data; it can just apply. It's a good way to build quick wins into an optimizer.

Okay, now as we've talked about, that's not going to be enough, but before I go into cost-based optimization, let's tie together everything we've discussed so far and give you a minute to breathe before we jump into even hairier optimization techniques. What's the grand scheme? The application sends a query. That comes to a parser. The parser checks things like: you have a table in the FROM clause; does that table actually exist in the catalogs, or did you mistype it? It basically does all of that. Then the binder takes the parser's abstract syntax tree and checks the catalogs to figure out whether the query is meaningful: you are referring to column A in table R, but that column doesn't exist. All that type of checking happens over there, until it can say: okay, this query looks legitimate, I'm ready to let it go. Then it goes to the optimizer as a logical plan, and the optimizer goes over that logical plan, enumerates different ways of rewriting it, and figures out which is the best one. Internally, the optimizer can use a heuristic, rule-based method like we just talked about, or the cost-based stuff we're going to talk about next, or a combination of both. Okay, so its job is to do that enumeration in the search space I mentioned and find the best way to do it, with whatever time budget it has and whatever stats it has at its disposal from the catalog.
Okay, and then it'll go and pull up more information about the algorithms: how is each one implemented, what does my hash join implementation cost, right? Those are equations that depend on the implementation. Plug all of that in to get the costs, and eventually you get a physical plan like the one I showed you: the annotated relational algebra tree, a.k.a. the physical plan. Now that can be sent to the scheduler, where it can run. Okay, so that's the scheme. We're going to go deeper into the optimizer. As we said, it can use a combination of these two methods. We just covered what the rule-based stuff looks like: enumerate rules and apply them based on best judgment, perhaps. And now we are going into cost-based search. Okay, so this is the fun stuff. Yep.

And that's a good point: commutativity, as a rule, is one you would generally not apply blindly; you would apply it in the cost-based setting, right? So again, the rules might be there, but how you use them and when you use them depends on that. That's a great point. Commutativity and associativity are at the heart of the cost-based search I'm going to talk about. So one reasonable design, if I'm trying to build an optimizer in a hurry, might be: do selection pushdown, remove Cartesian products, do projection pushdown; I think everyone needs those. And then, for the join ordering, do what we're about to cover in the cost-based search. So you see how, for some part of it, it's rule-based, and the other part is cost-based, and I can get a decent optimizer working. Good question. Other questions?

Ask again? So the question is: will a Cartesian product be better in some cases? Yes, and we just talked about that, right? If both sides had one record and they're just going to pair with each other, that's the same as doing a nested loop; you don't have to set up all the machinery. You might as well do a Cartesian product. So there are some cases where the Cartesian product is better. Usually not, but there are definitely rare corner cases.

Okay, great. Other questions? I think I should just not touch this; something's very flaky today. Okay, great. Just give me one second. Does this mean the recording's not running? I'd hate to have to record this material all over again. Okay, great, everything looks to be in order.

All right, so now we are going to go into the cost-based optimizer and start digging deeper. We'll start with a very specific style of query optimization, based on something called the System R optimizer. System R was one of the first database systems built in the whole field. Back in the day, when Codd had come up with the relational model, IBM first rejected it, but then they started to really dig into it, and System R is the system they built. I know we talked about that in the first class. Its optimization approach is the bedrock of what most systems do even to this day. And as you'll see, it's got a very specific point of view. It is cost-based, right? It does all of this join associativity and commutativity enumeration, but it costs everything. And then, as we will also see, it uses a style called bottom-up, which I'll contrast later with another style I'll introduce, called top-down. So technically it's a cost-based, bottom-up optimizer, if you want to sound smart when you talk to optimizer people, okay? And here's what it's going to do.
It's going to start by saying: what are all the tables I have? For each table, it finds out how to access it: do I access it through a file scan or through an index scan, and what is the cost? So it figures out what is called the single-relation access path first, and then keeps combining relations with joins, using the associativity and commutativity properties. Okay, and then there are nested subqueries, which I'll talk about in a little bit; ignore all of that for now. It'll try to choose the best plan, cost-based, given everything it has seen so far, and it'll prune things away so it keeps only what it needs as it builds up. It's a dynamic-programming-styled method, and those of you who are into dynamic programming will recognize that choice of algorithmic style.

So the first step is to pick the best single-relation access path. If I've got a table, like the department table: one way I can read it is a file scan. It also has a bunch of indexes, and I could scan it through each of them. I'll just cost each of those options, and then I'll keep around the cheapest. System R will keep around not just the cheapest but also plans that deliver tuples in a useful sorted order, but ignore that for now, right? So I'll basically find the best way, or maybe a couple of best ways, to access each table. And then I'll say: now, if I access this table through a file scan and this one through an index, and the next stage is a join, what would that look like? It's going to build up in that way, using simple heuristics, as we talked about, for the cost model. We'll see that in a little more detail.

A little more setup, and then we'll start looking at some animations. We take the query block and break it up into logical operators. All of that says we're going to manipulate in the logical space. And once we get down to a logical operator like a join, we say: oh, I've got several implementations of join algorithms. Now I'm starting to make a physical decision for that operator: what would nested loops cost me here? What would block nested loops cost? What would hash join, sort-merge, and the different flavors of each cost? So we start making those physical decisions too, intertwined with the costing. System R... did you have a question? Okay, just trying to stop the people in the back... I'm just kidding. It looked like you were raising your hand. All right, just stretching.
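Step one can be sketched in a few lines: cost every access path for a table and keep the cheapest. The formulas are the usual rough heuristics, and the numbers mirror the department table from earlier; everything here is illustrative.

```python
# Step one, sketched: choose the best single-relation access path.

def best_access_path(table):
    paths = {"file scan": float(table["pages"])}
    for idx in table["indexes"]:
        if idx["clustered"]:
            # Clustered: traverse the index, then read matching pages.
            cost = idx["levels"] + table["pages"] * idx["selectivity"]
        else:
            # Unclustered: potentially one page fetch per matching record.
            cost = idx["levels"] + table["records"] * idx["selectivity"]
        paths[f"index on {idx['col']}"] = cost
    return min(paths.items(), key=lambda kv: kv[1])

dept = {"pages": 50, "records": 500,
        "indexes": [{"col": "name", "clustered": False,
                     "levels": 2, "selectivity": 1 / 500}]}
print(best_access_path(dept))  # ('index on name', 3.0): the 3-read probe
```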
Now, System R looks only at plans called left-deep trees. A left-deep tree has a shape where all the joins sit along one long spine, and you can think of the base tables hanging off it as the inner relations. So, for example, if you had built hash tables on all the inners, you could just take each record and probe, probe, probe, and you're done. If you have enough memory to build all the hash tables, this goes blazingly fast. So it concentrates on that class of plans. It will not do what is called a bushy plan. A bushy plan is characterized by having a node in the logical plan where both inputs are themselves joins: at some point the tree branches, and both branches are joins. That's the definition of a bushy plan; even if just one node in your plan has that shape, it's bushy.

So System R doesn't look at those, but it later turned out there are many cases where a bushy plan is the better way to optimize a query, so it misses out on that. If you go back to the diagram I had with the cloud, it misses a whole bunch of plans outside its chosen space. It has taken a specific point of view and said: I'm going to look at the space of these left-deep plans (it actually looks at something a little more general called linear plans, but ignore that). I'm not going to look at bushy plans. It also doesn't look at plans with Cartesian products; it deliberately says, I will not spend time there. It's a very specific example of an optimizer choosing its perspective, choosing where it's going to look for plans. All right. This was done by Pat Selinger, who is really famous, an NAE member, and it was quite the breakthrough in query optimization when the paper came out in 1979.

So how does it work? We'll take a slightly more complex query. I stole this from Andy's slides last year; he had some amazing animations, which was awesome. And we're going to choose, as we said, the best single-relation access path first. Here there are three relations: the query does a join between artist, appears, and album, and then selects only the album named "Andy's OG Remix." So I'm going to enumerate different ways of accessing each of those (sequential scans, index lookups, stuff like that, based on what the schema tells me), choose the best, and then start digging into the join ordering. Okay. So I've now finished step one; I know how to access each table by itself.

I'm going to now ask: what are the different ways to join? And I'll start enumerating. I could do artist, appears, and albums, or I could do appears and albums first; I can enumerate all of that. And as you can see, this is going to be exponential in the number of joins; that's where the complexity comes from. I'm also going to make choices: I don't want to look at Cartesian products. Those are choices I make as I go about writing this optimizer. So we look at these specific types of join orderings and now start doing the costing. Let's dig into that.

Yep, question. A question about the enumeration? Yes: you go from single relations to pairs and build up from there, and the next slide will do exactly that. This list is just the universe of possibilities I have to look at, but we examine even that in a very systematic way. So here's how. I eventually need to end up with a logical plan equivalent to what's shown at the top. That's not a physical plan at the top; it just says I need to join all three tables with each other. To start with, I've got the base tables: artist, albums, and appears, and I have to start assembling this thing together. So now pay attention, because here the real dynamic programming magic comes in. First, I start my enumeration. By the way, before this, I've already decided things like: appears should be read with a file scan, this one with an index, and so on. Those decisions came from step one, the single-relation access path optimization. Okay. Then I ask: what are the different ways I could do the first join? And not everything is listed here, right? The slide is kept compact.
The optimizer could pick whatever it wants, but it's going to do things like: join A1 and A3, the first and third tables, using a hash join or a sort-merge join, and it can just start enumerating all of these combinations. As you can see, what's at play over here is the associativity and commutativity: if you think of the top line as the relational algebra, I'm asking which pair I put brackets around first. Okay. The first two options do the join between artist and appears first; the second two do album and appears first; and so on, with a lot more combinations I haven't shown. Okay. And for each pairing, like this one here, both entries give me that same part of the join done, right? And if I had more than two join algorithms, the list would grow. Now you can see there's a lot of space to explore here, right?

Now I start putting all of that together from a cost-based perspective. For each of these individual intermediate results, I just look at this part: to get to here, I've got these two choices. Which one is cheaper? There's no reason to carry the more expensive one forward, because dynamic programming builds on itself; there's no point in carrying the more expensive sub-plan forward. So start pruning now, right? You can see how it is starting to prune, keeping just the cheapest way to produce each intermediate result. Isn't that beautiful? So now it's got this little compact dynamic programming table. It says: those two are better done with hash joins; this one, better with a sort-merge join. And now, as you can see, I just recurse through it and get the rest done. Okay?

To complete the picture: chase this path down. This says I do A3 join A2, which I did before; now I add A1. That's the permutation I picked there. Again, I can use either of the two algorithms, if those are the only two I have, and compute the cost. At this point, I've reached the logical target I needed to be at, the one at the top; that top thing is just the logical goal. And I go and prune among everything at that second step. Now I've got full plans: three full plans have survived. Then I just find the best one, and I'm done. Okay? Did that make sense? This is a straight-up dynamic programming technique, cast in this way, and that's why this paper was so beautiful. Before it, it was like: how are you going to optimize these queries? But it's a really beautiful algorithm, okay? If you love algorithms, this probably gets you excited.
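The whole bottom-up pass fits in a surprisingly small dynamic program. Below is a compact sketch restricted to left-deep plans, with stand-in cost and cardinality functions; System R additionally tracks interesting orders and per-pair join algorithm choices, which this omits.

```python
# A compact Selinger-style dynamic program over left-deep plans.
# Cost and cardinality functions are crude stand-ins.
from itertools import combinations

def best_left_deep(tables, scan_cost, join_cost):
    # best[S] = (cost, plan) for the cheapest left-deep join of set S.
    best = {frozenset([t]): (scan_cost(t), t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for inner in subset:             # candidate innermost table
                prev_cost, prev_plan = best[subset - {inner}]
                cost = prev_cost + join_cost(prev_plan, inner)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, (prev_plan, inner))  # prune rest
    return best[frozenset(tables)]

cards = {"artist": 100, "appears": 1_000, "album": 50}

def card(plan):
    # Stand-in cardinality estimate: product of inputs over 10,
    # mimicking some join selectivity.
    if isinstance(plan, tuple):
        return card(plan[0]) * card(plan[1]) // 10
    return cards[plan]

print(best_left_deep(list(cards),
                     scan_cost=lambda t: cards[t],
                     join_cost=lambda plan, t: card(plan) * cards[t] // 10))
```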
All right. Now, the one part we didn't do, as you might have noticed, is that there's an ORDER BY clause in that query, okay? What we did here was just reach the logical target; we said that's the goal, we started from atomic parts, and we got there, but we didn't do the ORDER BY. System R didn't have a systematic way of dealing with the physical property of ordering as it was building plans; as you'll see, the top-down approach that came later built that into the optimization process. So System R would go through everything and then decide: I have to do an ORDER BY, deferring it to the very end. Here's the best plan; how do I tack the ORDER BY on and adjust it, rather than searching for an optimal plan with the ORDER BY in mind in the first place? In the worst case, it would just put a sort at the end. So that property of ordering was left out, and that was improved by Goetz Graefe, as we'll see with the top-down stuff in a second, okay? And I seem to have misplaced my phone... okay, good, we have a bit of time.

All right, so that is the bottom-up approach. As you can see, the name comes from building the tree a piece at a time, going bottom-up, until you produce the final plan, right? In contrast to that is the top-down approach. The bottom-up approach, by the way, is what all the older systems use; it was the breakthrough in query optimization, so all the database systems listed here use it. It's relatively simple to implement: not as simple as the rule-based stuff, definitely way more complicated, but it takes a lot more machinery still to do the top-down stuff. So what's the top-down stuff? Yep, question.

So is it intrinsically a property of bottom-up that it can't consider the ordering up front? Yeah, I would not say it's intrinsic. You could engineer a bottom-up optimizer in which you try to consider that up front and change the enumeration; that's kind of what was done later. But I don't know if anyone really does that very systematically. There is... I'm digressing, but in the System R paper there was this notion of interesting orders, and they would keep track of sort order for some of these plans. If I get plan A versus plan B at the second level, and one of them is sorted by something that needs to be joined later, I could benefit from keeping it even if it's costlier. So it did keep some sorting information, but only for the purpose of joins, not as an explicit property that gets used for other stuff. I know it's a technical detail, but can you change that? It doesn't seem like it would be impossible. It's not easy either; you'd have to make some changes, but it's not incompatible, okay?

So, top-down was invented by Goetz Graefe, then an assistant professor, with his student Bill McKenna, as part of the Volcano project, which you've seen: the iterator model came from there, the exchange stuff came from there. So obviously it's huge work, a huge contribution he's made to the field. The realization was that there are really two types of rules you apply. There are logical rules, like selection pushdown. Then there are physical rules, related to applying the algorithms or thinking about sort properties. And you can write all of those as transformation rules: some are logical-space manipulations, some are physical-space manipulations. Then, in the top-down approach, you start with a plan, some plan that is correct, and ask: what adjustments can I make by applying the rules? And they can be a mix. At one point I could apply a logical rule and change a join order, or I could apply a physical rule and change the algorithm, and I can play around with all of that. So you don't need to do logical planning first, then physical, in the bottom-up fashion. You start with something that makes sense, and you can control things a lot better. There's still a debate in the community, but top-down is probably the better way to go, because you get a lot more control over the search space and you can be more specific, since you can interleave physical and logical plan manipulation. So how does that work?
In top-down optimization, you're going to apply, as I said, two different types of rules: logical-to-logical and logical-to-physical. Logical-to-logical: join(A, B) versus join(B, A) is just commuting the join. Logical-to-physical: if I've got a join, I could try a hash join and see if it's cheaper. So it can do both. In some sense it's now rule-based, but it's going to cost each application, and the rules are of two types: logical-to-logical transformations or logical-to-physical transformations. Does that make sense? Let me show you an example, and maybe it'll make more sense.

We'll start with the same query we had before, and this time we'll do it top-down, right? So we start by saying: I have this plan. I'm going to do an equi-join between all those relations and an ORDER BY. That's what I have. And let's start by seeing what else we can do with it. So it starts exploring from there and applies a logical-to-physical rule: hey, you know what, join the first two using a sort-merge join. Okay. The plan above is still logical; I don't have the physical plan, but it carries all the properties I need. Three relations need to be joined: that's what the top node says, and it's a logical property. And the output needs to come ordered by artist ID: that's a physical property of the output, right? So these nodes now carry this notion of logical properties and physical properties, okay? Now I change the physical side by applying a sort-merge join to the tree at hand, okay? And that's what it is. So it's starting with something complete and then making it more real, exploring in the opposite direction. Does that make sense? We didn't start with single-relation access plans and build a dynamic program. We said: we'll start with something that makes sense, and now we refine it, right? And how far you refine determines how much of the search space you cover.

So here it does that sort-merge join and asks: okay, now how do I feed the sort-merge join? Because of these three inputs... look: at every point in time, it's keeping a complete plan at hand, right? It'll say: to do the sort-merge join, I can take that input and that input. That's a correct plan, right now, for how it could produce the result, okay? Then it asks: okay, what else can I do? Can I make this node better? Can I give more detail to it? It's logical; can I make it more physical? Yep, I can put in a hash join. I can look at other options, prune stuff out if I don't need it, and complete that hash join: oh yeah, I need to take these two inputs, because that one's taken, right? That's my tree. I'm just refining it and making it real. Now I've got a real plan down this one path. I can apply other transformations to it and say: at this point, I could have done something else. I could have done a sort-merge join instead, and that would have looked like that, okay? And now I have two different alternatives that are complete. I can keep going, keep exploring, and at some point I might say: you know, this one is more expensive, prune it out. So it's just a totally different style.
You're going to start from the top, and you can start to play around with all kinds of interesting local decisions: if I expect this input to be small, then only look at hash joins here; don't look at sort-merge. You can be more local in how you make the choices about applying these rules. Furthermore, it's a very elegant system, where everything you apply is a rule, a transformation, and the physical and logical stuff works in the same framework, okay? It's not "I do the logical optimizations first, then the physical"; it's not a two-step process. You can interleave and blend all of that and make other decisions along the way. And it just continues. I could have gone and stuck in a quicksort because I need the ORDER BY, but in this case the plan was coming out of a sort-merge join, so it was naturally ordered by artist ID. If it weren't, I could put in an explicit sort operator, find that it's too expensive, prune, and so on. Same thing: you cost things as they get physical and prune, but you're starting with something that works and then refining it, okay?

And the nice part about top-down is that you very quickly have a correct, complete plan, so you can say: oh, this used to cost 3 million; very quickly I went to 2 million, very quickly I went to 1,000, and that took me 10 milliseconds. If the next 100 milliseconds will only improve the plan by 10 percent: stop, I can go with what I have. You get choices like that far more easily in the top-down approach, okay? Because you kind of see everything in front of you: you start with something complete and keep making it better, right? So that helps. You don't have to wait for an exponential dynamic programming enumeration to finally reach something complete; you can get something out there really quickly.
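Here is a toy sketch of that top-down loop with branch-and-bound pruning: keep the cheapest complete plan found so far, and abandon any branch whose cost already exceeds it. The rule, plan representation, and costs are all stand-ins.

```python
# Toy top-down search with branch-and-bound pruning. Illustrative only.
import math

def optimize(expr, rules, cost, best=None):
    # best = [cheapest cost seen, its plan], shared down the recursion.
    if best is None:
        best = [math.inf, None]
    if cost(expr) >= best[0]:
        return best                          # prune this whole branch
    if expr.get("physical"):                 # a complete physical plan
        best[0], best[1] = cost(expr), expr
        return best
    for rule in rules:                       # logical->logical or ->physical
        for rewritten in rule(expr):
            optimize(rewritten, rules, cost, best)
    return best

def implement_join(expr):
    # One illustrative logical->physical rule: pick a join algorithm.
    yield {"physical": True, "algo": "hash join"}
    yield {"physical": True, "algo": "sort-merge join"}

algo_cost = {"hash join": 100, "sort-merge join": 140}
cost = lambda e: algo_cost.get(e.get("algo"), 0)

print(optimize({"physical": False}, [implement_join], cost))
# [100, {'physical': True, 'algo': 'hash join'}]; sort-merge was pruned
```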
So here, for example, is a query in which we're looking at sailors, and there's a nested block that looks at the reservations sailors have made; you can see the inner block does a join, looking for all reservations made by sailors on a certain day. That is the same as rewriting the query into a single block, like this. Now, you might argue the SQL programmer could have written the single-block version directly, but queries are often generated: queries are rarely written by humans, they're generated by tools, and tools love to generate nested queries because that's the abstraction they're built on. Regardless, as a database engine we have to be ready for deeply nested queries, and there are techniques like this to flatten them out. This rewriting and flattening is often done through a rule-based infrastructure; these rules are everywhere. Find this pattern; this pattern is safe to rewrite in this way. It's not always possible to flatten, and we won't go into the details here, but take the advanced graduate class and we will. For queries that are harder, there's always a way out, and that way out is to just run one block at a time and get your answer. But sometimes you can also do this thing called decomposing a query. I'll give you an example, which is again a bunch of stuff happening with sailors and their reservations and boats, but here there's a predicate that says: select only those sailors whose rating equals the maximum rating over all sailors. So it's trying to find the highest-rated sailors. That inner query, the SELECT MAX(S2.rating), is effectively a constant for the rest of the block, for the outer block. So you can decompose the query and say: this nested block, I can run it once, and whatever value I get, I just plug that value in here. Some systems will actually run that inner query, get the answer, stick it into the outer query, then run that. It'll optimize that first query, the max rating; it's simple here, but sometimes it's a more complex query, and then it flattens things out. The difference between this query and the earlier one is that the earlier one was a correlated query: the inner block refers to the sailor ID field from the outer block. Those are called correlated queries, and correlated queries tend to be flattenable a fair amount of the time. This one is not correlated; the inner block shares no join key or anything else with the outer query, and such queries are very amenable to this decomposition. You can just pull the inner block out; it feels like a constant, and it is a constant for the rest of the query. So you do techniques like this before you start throwing things at the query optimizer. And sometimes this rewriting is also something the query optimizer itself does, so it gets confusing where the boundary is between the parser, the optimizer, and the execution engine. Those boundaries are artificial, and in fact there's a nice paper arguing that this artificial boundary causes a lot of trouble. Again, if you take the advanced database class, we'll cover that kind of material.
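Here's a minimal sketch of that decomposition idea, again with made-up data (sqlite3, hypothetical schema): run the uncorrelated inner block once, then treat its answer as a constant in the outer block.

```python
# Sketch of decomposing an uncorrelated subquery: evaluate the inner block
# once, then plug its result into the outer block as a constant.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sailors (sid INTEGER, sname TEXT, rating INTEGER);
    INSERT INTO sailors VALUES (1,'Ann',7), (2,'Bob',10), (3,'Cy',10);
""")

# Step 1: the inner block is uncorrelated, so evaluate it once.
(max_rating,) = db.execute("SELECT MAX(rating) FROM sailors").fetchone()

# Step 2: plug the constant into the outer block and run it.
rows = db.execute(
    "SELECT sname FROM sailors WHERE rating = ?", (max_rating,)
).fetchall()
print(rows)   # [('Bob',), ('Cy',)]
```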
Expressions are everywhere, and sometimes you'll get queries that look like this. Before you even optimize the query, maybe somewhere in a block of it someone has written WHERE 1 = 0. You might ask why a user would write that. A user probably wouldn't, but a tool will definitely generate stuff like this all the time, and sometimes users make mistakes and write things like that too. So you'll see all kinds of crazy stuff if you've been around a database that's been in production for a while. And you can rewrite this: it's really the same as WHERE FALSE. I can take that expression and convert it into a simpler truth value, because that makes everything easier; in this case, the query returns nothing. Same thing with predicates like now() = NULL: those can also be folded down. Some systems, when functions like now() and random() are involved, won't fold the predicate to a constant and will actually go evaluate the query, because they can't always assume properties of such functions, and many database systems also let users register external functions, so you can't make assumptions. Databases take different approaches here. The main point is: if you have expressions, you can rewrite and simplify them before you send everything to the optimizer. These are idiomatic, systematic rewrites that are safe to do. Here's a BETWEEN predicate with an OR: something BETWEEN 1 AND 100 OR BETWEEN 50 AND 100 is the same as just BETWEEN 1 AND 100. So simplify the query with safe rewrites and flattening before handing it to the optimizer; that's usually the right order. Questions? Okay. All right, now there's the question of how we calculate the cost of these plans. We said we'll produce these plans, whether bottom-up or top-down, and at some point you have to ask: what does this cost? Do I prune this plan or keep it around? That's where the assumptions of the cost model come in. Essentially: what's the size of the output of that selection? I need that to figure out what the join above it will cost, and I need the join's output size to cost whatever sits above that. Should I use sort-merge versus nested loops? Should I commute that join? I need these numbers before I can make any of those choices. So you need a cost model, and cost estimation is done in a variety of ways; it's a combination of an internal cost model that lets us compare one plan against another, and you already saw this in action with the example we started class with. So let's dig into that. There are usually two components to a cost model. One says: what units am I pricing in? The units might be how many CPU cycles I'll consume and how many IOs I'll do. In the example I showed earlier it was just IO cost, but sometimes the CPU cost is significant too, so you really want a proper cost model. And once you have both CPU and IO costs, you need to decide how many cycles of CPU trade off against one IO.
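Here's a sketch of what such a cost model might look like. All the constants and formulas are invented for illustration, not any real system's defaults; the tuple constant just loosely echoes the kind of per-tuple-versus-per-page ratio discussed next.

```python
# Toy cost model in the Postgres style: price plans in one common unit by
# charging a constant per page read and a much smaller constant per tuple
# of CPU work. The constants here are invented, not any system's defaults.

PAGE_IO_COST   = 1.0      # cost units per page read from disk
CPU_TUPLE_COST = 0.0025   # cost units per tuple processed (1 IO ~ 400 tuples)

def scan_cost(pages, tuples):
    """Sequential scan: read every page, touch every tuple."""
    return pages * PAGE_IO_COST + tuples * CPU_TUPLE_COST

def nested_loop_cost(outer_pages, outer_tuples, inner_pages, inner_tuples):
    """Naive nested loops: rescan the inner table once per outer tuple."""
    io  = outer_pages * PAGE_IO_COST + outer_tuples * inner_pages * PAGE_IO_COST
    cpu = outer_tuples * inner_tuples * CPU_TUPLE_COST
    return io + cpu

# Two very different plans become comparable in the same unit, even though
# one is IO-heavy and the other is CPU-heavy:
print(scan_cost(pages=50, tuples=10_000))
print(nested_loop_cost(50, 10_000, 50, 500))
```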
So you're going to need some constants, and often those are guesses that are input to the optimizer. These cost models are really hard to get right, but we need something reasonable. Then there are the logical costs that depend on the sizes of the outputs, the operators, and the algorithms we've chosen. Ultimately we have to combine all of that to come up with something that looks real amid everything a database system does: CPU, IO activity, network activity, and so on. You still need to say which plan is better, because all kinds of trade-offs are being made. Most systems use a combination of CPU and IO cost, at least those two, and then they have to pick defaults; for example, Postgres assumes that processing a tuple in memory is on the order of 400 times cheaper than reading a tuple from disk. You have to come up with a common unit across all of these factors, so these are additional assumptions we make to get something real, so we can compare two plans and say which one is cheaper. If you look at the Postgres manual, which is very nice because it's open, they essentially say: look, we don't know exactly what we're doing, but this is the best we can do. Pretty much every database system has something like that. There's a famous story at SQL Server that those cost-model constants, like the 400, were based on benchmarks run on a desktop machine sitting under the desk of one of the original creators of SQL Server. For the longest time, the constants were chosen based on that one machine: how many CPU operations can I do in the time of one IO? Crazy things like that happen; of course, they fixed it eventually. Systems like DB2 and other advanced systems will run a few micro-benchmarks when you install the system to derive those constants, because they change from machine to machine. But those things are important to get right. We also relied on statistics, like the number of records in the tables we looked at. Who creates those? Database systems have special commands, like ANALYZE, that scan the table and produce those stats, which go into the catalog. You can run that periodically; it's not uncommon to run it as a weekly cron job. And as we'll see when we start on transactions, things like ANALYZE that scan everything are run at the lowest isolation level, so it's okay if they see slightly inconsistent data; you just want decent guesses for how many records there are, not numbers that are off by orders of magnitude. Then we also have to do interesting things like estimating what's called the selection cardinality: the cardinality of a select operator. So I've got a predicate here that says age = 9. How do I know how many records will pass that predicate? I need that for the rest of the planning. So you start making assumptions. This isn't realistic, but imagine there are 15 distinct age values in this database and you store exact counts for each of them. Now, for age = 9, I can look up the number of occurrences of the value 9, which is four, out of a total of 45 records.
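Here's that arithmetic as a tiny sketch; aside from the four occurrences of age 9, the per-value counts are made up so they total 45 records.

```python
# Selectivity from exact per-value counts (mirrors the example above:
# 4 occurrences of age 9 out of 45 records; the other counts are made up).

value_counts = {0: 2, 1: 3, 2: 3, 3: 4, 4: 3, 5: 3, 6: 4, 7: 3,
                8: 3, 9: 4, 10: 3, 11: 3, 12: 3, 13: 2, 14: 2}  # 15 values
total = sum(value_counts.values())            # 45 records

selectivity = value_counts[9] / total         # 4/45, about 0.089
est_output  = selectivity * total             # estimated rows passing age = 9
print(f"selectivity = {selectivity:.3f}, estimated rows = {est_output:.0f}")
```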
So I can see the selectivity of this operation is 4/45, and now I can use that to guess the output size. Yep, question: do we only do this for indexed attributes? You'd do this for anything that can appear in a predicate, which can be a non-indexed attribute too. And as we'll see on the very next slide, storing exact counts is obviously too expensive: the space for that per-value detail could be about the same as the space for the original column, so you might be doubling the size of your database, which is a bad idea. You want these statistics to be compact, as accurate as possible, and cheap to compute. Those three goals pull against each other, and you won't get all of them; that's the whole design space. We'll also assume the data is uniformly distributed, like we did when we assumed departments are spread uniformly across employees. We'll assume independence, which we'll talk about next: that two predicates are independent, so we can just combine them. That may not hold in practice, but there are advanced methods to deal with it. Then we use the inclusion principle, like we did when we said every employee has a department ID; they're not all NULLs, so foreign keys point to something real. These are all assumptions we make as we estimate these cardinalities. Related to the question that was asked: sometimes these estimates can be way off. For example, suppose I have two predicates, make = 'Honda' and model = 'Accord', and from the individual column stats I estimate that one in 10 cars is a Honda and one in 100 is an Accord. My estimate says the combined selectivity of both predicates is the product of the two, but I'd be way off. Why? Because only Honda makes the Accord. I assumed the predicates were independent, and that assumption is way off here. So these estimates can be wrong, and that touches on how you sometimes have to dynamically re-optimize these queries, which is a hot topic of research even today. Okay. There are other things you can do. Histograms represent this detailed information with a lot less space, and there are two classic kinds. The first says: take this information and break it up into buckets. Here I've divided it into five buckets, three values per bucket, and for each bucket I store only the total count. Now I'm storing a third as many values, and I could make it a tenth, a hundredth, whatever I want. These are called equi-width histograms, because I pick a width, keep it uniform, stride across the value space, and count within each bucket. The flip side is the equi-depth histogram, which allows the bucket widths to differ but tries to get roughly equal counts in each bucket. Equi-width and equi-depth have different properties, and there are other names you'll hear if you get into this space, like v-optimal histograms, with pros and cons for each; those are topics we cover in the advanced database class. Here, what you need to know is that we use histograms because we want a compact representation that's more accurate than just assuming a uniform distribution; they're very important, and you want to have them. There are also more advanced versions of these approximate statistics based on things called sketches.
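Here's a minimal sketch of the two histogram flavors over a made-up age column, plus the kind of estimate you'd derive from one, assuming uniformity within a bucket. The data and bucket counts are invented for illustration.

```python
# Equi-width vs. equi-depth histograms over a made-up age column, and a
# selectivity estimate for age = 9 assuming uniformity *within* a bucket.

ages = [1]*2 + [3]*5 + [5]*3 + [9]*4 + [10]*6 + [12]*15 + [14]*10  # 45 values

def equi_width(values, lo, hi, nbuckets):
    """Fixed-width buckets; store only a count per bucket."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        counts[min(int((v - lo) / width), nbuckets - 1)] += 1
    return counts

def equi_depth(values, nbuckets):
    """Variable-width buckets with roughly equal counts: store boundaries."""
    s = sorted(values)
    step = len(s) // nbuckets
    return [s[i * step] for i in range(nbuckets)] + [s[-1]]

counts = equi_width(ages, lo=0, hi=15, nbuckets=5)   # [2, 8, 0, 10, 25]
print("equi-width counts:", counts)
print("equi-depth bounds:", equi_depth(ages, 5))

# Estimate selectivity of age = 9: the bucket covering [9, 12) holds
# counts[3] values spread (we assume) uniformly over 3 distinct ages.
est = (counts[3] / 3) / len(ages)
print(f"estimated selectivity of age = 9: {est:.3f}  (true: {4/45:.3f})")
```

Notice the estimate (about 0.074) is close to but not equal to the truth (about 0.089): that's the compactness-versus-accuracy trade-off in one line.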
We're not going to go into the details of sketches; again, that gets covered in the advanced database class. They're more sophisticated ways of building these kinds of approximate statistics. The key one here is HyperLogLog, which is used all over the place: it's a very fast way to get an estimate of the number of distinct values in a set, in a single pass. It gets used everywhere, including in data platforms outside of database systems. So if you're interviewing for a database job, do look up HyperLogLog, because you should be aware of that method. It's a fast way to do approximate counting, which obviously matters for the kinds of things we do in optimization, but it shows up in plenty of other places too. The other thing you can do is sample. I can take a random sample of, say, one-hundredth of the records in the table, then take the predicate I have, which here is age greater than 50, and see what it looks like on my sample. I've only brought a small amount of data in from disk, not the entire table, and whatever selectivity I observe, I'll assume applies to the rest. So instead of blindly assuming a uniform distribution, this is a more educated guess with a little extra cost; that's the trade-off you're making, a little extra cost for more accurate information. I know I'm running out of time; I've got three slides left. This is the conclusion slide, so let's tie everything together, and hopefully you've come to appreciate what query optimization does. It takes the SQL query, converts it into a logical plan, and eventually ends up with a physical plan. As you saw, that logical-to-physical step isn't necessarily a one-shot linear process; you can intertwine the two until you end up with a physical plan, with all the details needed, that you can hand to a scheduler. Before all of that, whether you use the top-down or bottom-up approach, you flatten the query out and do all the expression rewrites and other simplifications, keeping the query correct while making it as compact as possible before sending it to the optimizer. The enumeration, as we said, can be top-down or bottom-up. And for any of this to work, you may have the best enumeration algorithm in the world, but if your cost function is off, you'll get bad plans; the cost functions have to work too. That's all the stuff I need you to know for the exams. Just give me two minutes for those of you who are really curious about query optimization. If you want to dig deeper into this space, and I'm not going to grill you on this for the exams, there are four essential papers, though you'd probably have to read a lot more than that if you were going to build a query optimizer. It's less rocket-sciencey than it seems, and I'll talk about that on the next slide, but it is very, very hard; as I said, optimizer engineers are often among the highest-paid engineers in any data company. The first paper is where you should start. It's written by Surajit Chaudhuri, a giant in the field of optimization at Microsoft. In a small number of pages, eight or ten, he synthesized the whole field as it stood back in '98 and said: here's the whole mess, and here's how you should think about enumeration, costing, and so on. Beautiful paper, very easy to read. If you understood everything today, I strongly recommend you go read that paper; it'll take you to the next level.
Again, we won't grill you on it in the exam, but it's great. Then there's the paper by Goetz Graefe, which is that top-down stuff; very technical and detailed, and it takes a bit of getting used to. The next paper is the System R paper that started this whole field of optimization. And the last one is a beautiful paper with a beautiful title, Of Nests and Trees, which deals with outer joins. Everything we talked about today was about inner joins, but outer joins cause a massive nightmare for query optimization. This paper showed that if you've got a bunch of outer joins and inner joins mixed up in your query, there are, in many cases, clean ways to rewrite a block of outer joins and a block of inner joins so that you can optimize the inner joins with all the techniques we've talked about. Beautiful, beautiful paper; it seemed impossible that you could make that algebra so clean, but if you're into that type of math and algorithms, I strongly recommend it. And then the last slide: next time we'll talk about transactions, which is the next hardest part. Let me flip to the slide I needed; give me a minute as I pull it up. There we go. This is the very last slide: if you're going to go try to build an optimizer, three rules, and these are my rules. One, you'll have to read a lot of papers with lots of algebra; get used to that, it'll be a lot of fun. Two, early on you have to throw a lot of workloads at it, because optimizers get better over time by hardening; that's how the optimizers in DB2, Oracle, and SQL Server became among the best, they've seen so much workload over the last 30 or 40 years. And the last one: no matter how smart you are, you will not get it right the first time. It's not about picking the right enumeration strategy and such; it's nitty-gritty details like, what data structure did I use to keep track of all the tables and all the predicates, and which predicate gets evaluated where? You'll soon find you have a spaghetti of objects in your C++ code that are all over the place. You'll try to do a logical-to-physical transformation and find the data structures are all wrong. You'll patch it, and very soon you'll have code that is unmaintainable, not because the optimizer work itself is that hard (it is hard), but because you made it even harder with spaghetti data structures. I've seen three optimizers. One was written by someone sitting right next to me while I was doing my thesis work, and when I asked him why his code was so hard to read, he said something really interesting: it was hard to write, so it may be hard to read. It is hard to write, and if you don't go back and throw it away, it will be hard to read, because while you're writing it, you're just building on it. I've also looked at the DB2 optimizer, which I worked on as an intern, and it was very, very difficult to read. And in the Quickstep system, we tried to build as clean an optimizer as we could, but in the end we still ended up with that same data-structure mess, because we did not follow rule number three: plan to throw it away and rewrite it from scratch, don't try to patch it. So with that, I'll stop, and we'll pick up with transactions in the next class. Thank you.