Today we'll be talking about query optimization. This is a really important topic in database systems. So again, I'm here at my house, and I have the TA over there who's asking questions as we go along. When we say query optimization, the idea at a high level is that we want to take a query written in a declarative language like SQL and translate it into an execution plan that the database system can execute. And with query optimization, the goal is to find a correct execution plan that has potentially the lowest cost. The two words I'm emphasizing here are correct and cost. Correct is kind of obvious: it doesn't help us if we find a really fast plan that ends up computing the wrong answer, returning the wrong result for the query. That doesn't help us at all. The term cost is in quotes because the cost is a relative metric that's not necessarily tied to a real-world measure; it's just something the database system uses to decide that one plan is better for it to execute than another. The cost could be in terms of execution time, it could be monetary cost if we're running on a pay-as-you-go cloud system, it could be memory usage, it could be almost anything, but at a high level it's some objective function that the database system knows it wants to optimize for. The query optimizer is going to be the hardest component we'll have to implement in the database system, because finding an optimal or near-optimal plan is super challenging; it's been proven to be an NP-complete problem. This means that although the component we're going to build is usually called the query optimizer, we're almost never going to find the truly optimal plan, because it would take too long to look at all possible query plans and pick the best one. So what these upcoming lectures are really about are the techniques we'll use to simplify and reduce the search space, so that it's not an exhaustive search, and the techniques we can use to estimate what a plan's cost will be without actually executing it. This will make more sense as we go further along and talk about cost models, but the way to think about it is this: if we need the true cost of executing one query plan, we could just execute the query, but if we want to look at thousands and thousands of different query plans, that's not feasible, so we need a way to estimate it. And this cost is an internal metric that the database system uses to determine whether one plan is better than another; it may not be tied to wall-clock time or real cost. So this is just to set us up on what we mean when we talk about query optimization. For the next three weeks we're going to be discussing a lot of different aspects of query optimization. As I said at the end of the last class, I will fully admit that query optimization is the one aspect of database systems that I know the least about, and I'm always trying to learn more. So this year we've expanded the discussion to include an extra lecture on additional query optimization methods beyond the traditional two approaches that we'll cover in this lecture and the next one.
So where we're going for the next three weeks, over the next four lectures: we'll spend time at the beginning talking about how to implement a query optimizer, essentially what the search strategy, the search engine inside of it, looks like. Then we'll talk about plan rewriting, query rewriting, and plan enumeration, then more sophisticated techniques for adaptive query optimization, and we'll finish up with cost models. I'll briefly touch on what a cost model is in this lecture, but that will come at the end, because the cost model is what we use to guide all the search machinery we're talking about here. So for today, we're mostly talking about optimization search strategies, because that's one of the two main parts you need to have, and that will segue into these other topics. Again, we'll start with the search strategies, because that's the most important thing and the basis of what you need to actually implement a query optimizer. Then we'll go through the different approaches people have tried for building a query optimizer over the last 40, 50 years. And that will motivate why we want to use a more sophisticated dynamic programming technique, or the Cascades technique, that you'll read about for next class. So today is a bunch of background information that leads up to the modern implementations. At a high level, the part of the system we're talking about, what I've referred to as the front end of the system, looks like this. We start with a SQL query that the application sends us. We can first pass it into a SQL rewriter, which transforms the SQL string itself to annotate or modify it in some way. This is optional; not all systems have it. You see it sometimes in middleware systems that are doing sharding and can rewrite the SQL query to add directives like "go to this node versus that node". But this is SQL query in, SQL query out; you're just modifying the SQL string. Then we take the SQL query and run it through our SQL parser, and that spits out an abstract syntax tree that just has the string tokens of the query: the table names, the column names, and so forth. Then we pass this into the binder, which does lookups in the system catalog of the database system to map the table names and column names referenced in the query into internal identifiers. This lets you get things like the types of the columns and whatever constraints you have on them. The binder then spits out what is called a logical plan: a high-level description of what the query wants to do. So if I have a select query that wants to access table foo, my logical plan will just say "scan table foo". It doesn't say how to scan it, only that I want to access it. Then we can take that logical plan and pass it into another optional step called the tree rewriter. This is again logical plan in, logical plan out.
Think of these as heuristics we can apply to manipulate the structure of the tree for optimizations that we know we always want to do, where we don't need to estimate the cost of the change we're making. So we might do predicate pushdown here, and sometimes you see view queries or CTEs rewritten into nested queries at this step. Again, this is entirely optional. Then we feed the plan into our optimizer. At a high level we'll be focusing on a cost-based query optimizer, where we have cost estimates based on statistics collected from the database itself that allow us to predict or estimate what the execution time of a particular query plan will be. This is where we can enumerate a bunch of different options and choose the one plan that we think will have the lowest cost. Once the optimizer is done with this step, it spits out a physical plan that we can go off and execute. In the case of our system, as we discussed when we talked about query compilation, we take that physical plan and use code generation to convert it into a domain-specific language that we then compile into machine code. In other systems, you could take the physical plan and start interpreting and executing it right away. That's, at a high level, where we're going today and in the upcoming lectures. We're mostly going to focus on two parts: the tree rewriter and the optimizer. SQL rewriting, as I said, isn't really done for optimization; it's for redirecting queries, mostly control-plane stuff, not execution stuff. And in the binder there's not really any magic; it's looking up table names and things like that.
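To make that whole pipeline concrete, here's a minimal sketch of the stages in Python. Everything here is hypothetical: the function names and the toy catalog are placeholders for illustration, not any real system's API.

```python
# Hypothetical sketch of the front-end pipeline described above; each stage
# is a stub so that the overall flow is runnable end to end.

def rewrite_sql(sql):                # optional: SQL string in, SQL string out
    return sql

def parse_sql(sql):                  # parser: SQL string -> abstract syntax tree
    return {"type": "select", "table": sql.split("FROM")[1].strip()}

def bind(ast, catalog):              # binder: resolve names via the system catalog
    table_id = catalog[ast["table"]]
    return ("LogicalScan", table_id) # logical plan: *what* to do, not *how*

def rewrite_tree(logical_plan):      # optional: logical plan in, logical plan out
    return logical_plan

def optimize(logical_plan):          # cost-based search: logical -> physical
    return ("SeqScan", logical_plan[1])

catalog = {"foo": 42}                # table name -> internal identifier
plan = optimize(rewrite_tree(bind(parse_sql(rewrite_sql("SELECT * FROM foo")), catalog)))
print(plan)                          # ('SeqScan', 42)
```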
All right, so let's now discuss the distinction between the logical plan and the physical plan, because this is going to come up multiple times when we talk about query optimizers. Essentially, what the optimizer is trying to do is map a logical algebra expression of the query into an equivalent physical algebra expression. As I said, the logical plan would say "I want to access table foo"; the physical plan would say "access table foo using index x" or "access table foo using a sequential scan or binary search on the table itself". So at a high level, the distinction is that the physical plan tells you how you actually want to execute the query, while the logical plan only says what you want the query result to be. The physical operators will have low-level information that is specific to the physical properties and physical format of the data they're accessing. In some cases, a physical operator will know that it's producing data sorted on a given column, or that it's operating on data compressed in a certain way and maybe needs to decompress it. The logical plan has no notion of any of this; it just knows about relations. One important thing we'll see, and this will come up when we talk about Cascades next class, is that there's not always a one-to-one mapping from logical operators to physical operators. In the case of "access table foo", it's easy to imagine a one-to-one mapping from the logical operator to a sequential scan that accesses table foo. But sometimes a single logical operator will expand into multiple physical operators, and likewise you can take multiple logical operators and coalesce them into a single physical operator. It depends on the implementation of the database system. Usually you see logical operators being exploded into more operators; that's more common than the other direction, in my opinion. Although the other direction does happen: a logical join together with a logical order-by could be merged into a single physical sort-merge join, because you're doing the join and the ordering together.
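Here's a tiny illustration of that, with made-up operator names: a logical scan fans out into multiple physical alternatives, while a logical join followed by a logical order-by can collapse into one physical operator.

```python
# Hypothetical logical-to-physical mapping: one-to-many in the common case,
# many-to-one when a join plus an order-by is satisfied by one sort-merge join.

PHYSICAL_ALTERNATIVES = {
    "LogicalScan": ["SeqScan", "IndexScan"],
    "LogicalJoin": ["HashJoin", "SortMergeJoin", "NestedLoopJoin"],
}

def implementations(logical_ops):
    # Many-to-one: the sort-merge join does the join and the ordering together.
    if logical_ops == ("LogicalJoin", "LogicalOrderBy"):
        return ["SortMergeJoin"]
    return [alt for op in logical_ops for alt in PHYSICAL_ALTERNATIVES[op]]

print(implementations(("LogicalScan",)))                   # one-to-many
print(implementations(("LogicalJoin", "LogicalOrderBy")))  # many-to-one
```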
All right. Another important aspect of what we're doing here relies on the notion of relational algebra equivalences. This is something covered in the introductory class last semester, but the basic idea is that since we know the commutativity and associativity properties of relational algebra expressions, we can define rules that transform the query plan in such a way that we know the end result is equivalent to the original plan. So in this case, if I'm doing natural joins between tables A, B, and C, and I join B and C first and then join the result with A, that's equivalent to joining A and C first and then joining the output with B. I can do this because an inner join or a natural join has the commutativity property; I can reorder them any way I want. With left outer joins you can't do that, because they're asymmetric, whereas these are symmetric, so I can flip them around in any order. So now you can start to see what the optimizer is actually going to do: if it knows it needs to join tables A, B, and C together, it's going to try to figure out the best ordering — join B and C first, or A and C first, and so forth — and it can use a cost model to estimate which join ordering will be most efficient. Then you can think about the physical operators as well: what algorithm do I want to use to actually do each join? The sort-merge join we talked about before, or my parallel hash join, and so forth? For these lectures we're mostly focusing on analytical queries, because in that world the challenge of finding an optimal query plan is much greater than it is for OLTP or transaction processing workloads. This is because most of the time, the queries in OLTP workloads are what's called sargable. Sargable is an acronym for "search-argument-able"; it's some term they made up back in the '80s, I don't know why. The reason sargable queries are easy to plan is that in OLTP workloads you're almost always doing lookups on a small number of tables at a time, and you usually have indexes defined on the attributes you're doing lookups on. So many times the challenge is just picking the best index to use for a particular query. Say I have a simple table foo with two columns, id and name, and id is the primary key. If a query comes along with an equality predicate where id = 123, there's not really any complex search I need to do to figure out the best access method for table foo. I just look at my catalog and say: I'm doing an equality predicate on the id field, I have a primary key on id, so there must be an index. In this case I just use that index to do the lookup, and I don't even need to consider the other possible indexes. Likewise, joins in an OLTP setting are almost always on a foreign key. Some applications are not well designed, so they may not have defined the foreign key; you sometimes see applications written by people who may not know what they're doing with databases, where a column clearly should be a foreign key but wasn't declared that way in the schema. Even then, you can still figure it out fairly easily. If you have a foreign key relationship, there's going to be an index enforcing that referential integrity on the parent table. So when you want to do the join, it's pretty easy to work out: the parent table should be the outer table, the inner table is the foreign key child, and I do a nested-loop index join to join the two tables together. Again, in this environment, for OLTP queries, query planning is usually pretty simple, and you can get by with a simple heuristics-based optimizer, which we'll talk about in a second. MongoDB is probably the most famous database system around today that still uses basic heuristics like this, because in the type of workload they were originally targeting, they weren't doing complex joins — they didn't even add joins until a few years ago. So in that world they could get by with simple heuristics.
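As a sketch of the sargable heuristic we just walked through, the index-picking logic amounts to something like this; the table, columns, and index name are made up for the example.

```python
# Minimal sketch of sargable access-path selection: an equality predicate on
# an indexed column means we use that index; otherwise fall back to a scan.

indexes = {("foo", "id"): "foo_pkey"}   # (table, column) -> index name

def choose_access_path(table, predicates):
    for column, op, _value in predicates:
        if op == "=" and (table, column) in indexes:
            return ("IndexLookup", indexes[(table, column)])
    return ("SeqScan", table)

print(choose_access_path("foo", [("id", "=", 123)]))    # -> index lookup
print(choose_access_path("foo", [("name", "=", "x")]))  # -> sequential scan
```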
All right, I also talked briefly about cost estimation at the beginning. Again, the idea is that a query comes in, we have a logical plan, and we want to figure out the best physical plan to execute for that logical plan. We use a cost estimator to predict what the execution cost of a candidate physical plan would be, and then we use that to determine which physical plan is preferable. There are a bunch of things we could use to compute a cost. We could look at how the query will interact with other queries running at the same time; that's actually pretty tricky, and few systems do it. A really common approach is to estimate the output size of all my operators and use that as a proxy for the resources I'll use. It could also be the asymptotic behavior of the algorithms we're using for our physical operators: if we know a hash join will incur fewer I/Os or less overhead than a nested-loop join, we can bake that into our cost estimate. How much CPU or memory I'm going to use, and then information about what the data actually looks like: where is it physically located, how is it compressed. All these things we can incorporate into our cost model to estimate what the cost of executing a physical plan will be. Now, this is super hard to do, because you're estimating based on a summarization of what the data looks like. I can't just scan the table to find out how many tuples it has or what the distribution of values looks like; at that point I might as well execute the query, and for really large databases that's not feasible. So you're trying to estimate as quickly as possible, because you want to evaluate as many physical plans as you can, so that maybe you find the optimal one. Not next week but in two weeks, we'll discuss cost models in more depth, and we'll see how wrong they can actually get and how they can lead an optimizer to make bad decisions.
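To give a feel for what these estimates look like, here's a toy sketch that derives an output cardinality purely from pre-collected statistics, under the classic uniformity assumption; all the numbers are invented.

```python
# Toy cost estimation from summary statistics. The optimizer never touches
# the data here; it only consults statistics gathered ahead of time, which
# is exactly why the estimates can be badly wrong.

stats = {"foo": {"rows": 1_000_000,
                 "distinct": {"id": 1_000_000, "city": 50}}}

def selectivity(table, column):
    # Uniformity assumption: an equality predicate selects 1/NDV of the rows.
    return 1.0 / stats[table]["distinct"][column]

def estimated_output_rows(table, column):
    return stats[table]["rows"] * selectivity(table, column)

print(estimated_output_rows("foo", "id"))    # ~1 row (unique column)
print(estimated_output_rows("foo", "city"))  # ~20,000 rows
```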
All right. So let's talk about how to actually build an optimizer. There are some design decisions we need to settle before we can say what search strategy we're going to use to enumerate plans and look for an optimal one. We'll go through each of these topics one by one: optimization granularity, meaning the scope of what we're looking at when we do query planning; when to fire off the optimizer; how to handle prepared statements; plan stability; and when to finish the search. The optimization granularity is the scope of the search problem the optimizer has to deal with, and the choices are to either work on a single query or on multiple queries. Single query is the most common approach, because oftentimes that's all you really have: a client connection sends a SQL query, and you're not going to wait around to see what other queries show up from other connections. You take that one query and run it through the optimizer right then. The advantage is that you have a potentially much smaller search space, because you're only dealing with one query at a time. You typically also don't reuse the search results across queries, so you're essentially starting the search from scratch every single time. We'll talk later, under adaptive query optimization, about techniques to carry information you've learned from one search over to the next, but typically most database systems don't do this. The other aspect is that if you want to figure out how many resources you'll use when you execute your query, you may have to consider what other queries are running at the same time. You can do that with multiple queries, but it's much easier with a single query, because you don't have to worry about how those multiple queries interfere with each other. Optimizing multiple queries together is, as I said, more rare. This would be: if I know I'm going to execute a thousand queries in my application right now, rather than issuing them one after another — issue one, execute it, get the results, issue the next — I could in theory hand the database system a batch of queries and have it figure out a single plan for all of them, one that knows how to route data between operators that are shared or not shared and produce the output for each individual query. As I said, this mostly appears in stream processing systems or continuous query systems, where I have queries that run forever because there's a pipe of data coming in all the time and I'm processing it. It did appear in academia: there was an academic system called SharedDB that does something like this. Materialized views are another way to achieve something similar. But for most applications, this is not how they interact with the database system. They do it the first way: open a connection, send a query, the optimizer takes it, plans it, and runs it.
The next question we have to deal with is when we're actually going to fire off the query optimizer. The first option, and again the most common one, is static optimization: the query shows up in the system, I run it through my optimizer, I generate a plan for that query, and once I leave the optimizer I never go back; I just execute the query. At optimization time the optimizer picks what it thinks is the best plan, but when I actually run the query it may turn out not to be as good as it thought — and I can't go back. It's a static plan: once the optimizer says this is what you're going to use, you're stuck with it. As we'll discuss many times, the quality of the plans it generates will be highly dependent on the accuracy of the cost model. That's true for every approach, but the point here is that since you can't go back and tell the optimizer "what you told me was wrong, pick something better", you're more dependent on the cost model than with the other options. The other thing you can do, instead of running the optimizer for every single query over and over, especially if it's the same query, is amortize the cost using prepared statements, and potentially allow a longer time for query planning than you normally would. We'll talk about the cutoff threshold for the optimizer's search in a second, but think about it like this: when I open up Postgres or MySQL or whatever terminal I'm using, I type a SQL query, hit enter, and get back a result. In that time, the system runs the query through the optimizer, generates a physical plan, and executes the query, and in some cases the query runs in milliseconds. If I have a more complex query that I know I'm going to execute all the time, and I therefore want to spend extra time doing a better job looking for an optimal plan, I can potentially do that with prepared statements. We'll come back to that in a few more slides. Another approach, which as far as I know doesn't really exist in relational SQL-based database systems today — it mostly shows up in stream processing systems, but I might be wrong — works like this: my query shows up and I generate a logical plan for it, and then I break that plan up into sub-plans. Before I execute a sub-plan, I run it through the optimizer to generate its physical plan and execute it. Now I've learned some information about the output of that sub-plan, which I can use to guide the optimizer for the next sub-plan, the next pipeline I want to execute: I have it generate a physical plan using the information collected from running the first one. You're iteratively building up the physical plan by going back and forth between the execution engine and the optimizer. This is rare; I don't think any commercial system does this approach. We'll see in a second that Ingres actually did this, and it wasn't because they were being sophisticated — it was because they were limited in what they could actually do in the software and the query plans at the time. The approach exists in academic systems, but again, I don't think it exists in any commercial system. The last approach, and this one does exist in commercial systems today, is called adaptive query optimization, or sometimes hybrid optimization. The idea is that you still do the first step — the query shows up, it goes through the optimizer, the optimizer generates a plan — but the plan also includes built-in watermarks or thresholds you can use to check whether the estimates the cost model made when it generated the plan actually match up with what you're seeing when you access the data. For example, if I thought my predicate was very selective, but then I run the query and see that the predicate is not selective at all — if the error in the estimates exceeds some threshold — then I can either pivot and change the plan on the fly right there, or I just give up and go back to the query optimizer and say: hey, your plan was wrong, make me a new one, make me a better one. We'll talk about this in two weeks. The first variant bakes additional alternatives into the query plan itself, so you can pivot based on the thresholds; the second literally means stop executing the query and go back to the optimizer. The easiest way to do that is to throw everything away and start over from scratch. You're making the gamble that the physical plan you generated the first time is so bad that you're better off discarding any work you've already done and getting a new plan, because that new plan will be more efficient than letting the first one run to the end. Another approach is to recognize how much work you've already done, go back to the optimizer, and say: I've already scanned or joined these two tables, so generate me a new plan that keeps that part fixed, since I've already computed it. I think everyone pretty much does the throw-everything-away-and-start-over version, at least in the commercial systems today.
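Here's a hedged sketch of that adaptive idea, assuming a made-up 10x error threshold and toy plan choices; this is not how any particular commercial system implements it.

```python
# Sketch of adaptive (hybrid) optimization: if the observed cardinality is
# off from the estimate by more than the threshold, throw the work away and
# re-optimize with the true count. Threshold and plan choices are invented.

THRESHOLD = 10.0

def optimize(query, known_rows=None):
    est = known_rows if known_rows is not None else query["estimated_rows"]
    op = "NestedLoopJoin" if est < 1000 else "HashJoin"   # toy decision
    return {"op": op, "estimated_rows": est}

def execute_adaptively(query, observed_rows):
    plan = optimize(query)
    error = max(observed_rows, 1) / max(plan["estimated_rows"], 1)
    if error > THRESHOLD or error < 1.0 / THRESHOLD:
        plan = optimize(query, known_rows=observed_rows)  # start over
    return plan

print(execute_adaptively({"estimated_rows": 50}, observed_rows=500_000))
# -> a HashJoin plan: the initial nested-loop choice was abandoned mid-flight
```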
All right, so: support for prepared statements. Say I have a query that does a three-way join on tables A, B, and C, with some input parameters that filter the tuples of some of these tables. This is a simple query — a three-way join like this would take no time to execute — but suppose it didn't: say it takes ten seconds to run through the query optimizer, and the query itself takes only one second to run. In that case, if I execute this query over and over again, I don't want to pay that ten-second search cost every time for a query that takes one second to run. What I can do instead is declare it as a prepared statement. Essentially, I'm giving this query a name, and now I use the EXECUTE command to invoke it. The database system is told ahead of time what the query is going to be; it can run it through the optimizer, take that ten seconds once, cache the plan, and every time I invoke it afterwards I just get that plan back. Now it runs fast without paying the optimizer cost each time. This is a super simple example, and it would work fine, but say I want to execute this query over and over while changing the values in the predicates of the WHERE clause on every invocation. To do this, I replace the constant values with question marks, or sometimes dollar signs, to indicate that they are input parameters, and when I declare the prepared statement I say: these are the three parameters you can pass in, and these are their types. Invoking the query is then like a function call where I pass in the actual values. The tricky part is: what should the physical plan, in terms of join ordering, look like for this query if I don't know what the values are at the time I'm told "prepare this statement for me"? Maybe for the first set of values I'd want to join A and B first, then C, but for another set of values I'd want to join B and C first, then A. I can't know that, because all I have are these placeholder markers; I don't know what values will actually be used. There are a couple of different ways to handle this in a query optimizer. The easiest thing is to just reuse whatever the last plan was. Say we don't actually optimize when the prepared statement is declared; when someone first invokes and executes it, we have real values, so we optimize assuming those values are representative of what we'll always see, cache that plan, and keep using it. I think that's what Postgres does; it's pretty simple. The next alternative is to run it through the optimizer on every single invocation, at which point you've lost all the benefit of having a prepared statement. You could be a little clever and use the last plan you had as the starting point of the search — have it say "here's a good starting place for my physical plan" and base the search for a better plan off of that rather than starting from scratch every time, essentially using the previous plan as an upper bound. That's tricky, because it requires engineering your optimizer to restart from an existing physical plan, and most optimizers are not written to handle that. Another approach is to generate multiple plans for your prepared statement. If the prepared statement has one input parameter on one column, I can look at the histograms or statistics I've collected through the ANALYZE command to see what the distribution of values looks like for that column, break the values up into buckets from the histogram, and generate one plan per bucket. When I invoke the prepared statement, I look at the value being passed in, figure out which bucket it belongs to, and that's the plan I use. For one variable that's easy to do, but with multiple variables it becomes exponential: the number of possible plans would explode, and that would be expensive. Oracle might do something like this, but I might be wrong; there are more primitive versions of trying to enumerate plans for all the possible parameter values. The last option is to say: I know what my parameters are, so I can infer which columns the lookups are on, look at my statistics, take the average value each of those columns could have, and use those as the input parameters when I run the prepared statement through the query optimizer. Again, there isn't one approach that all systems use. The first one is obviously the easiest, but I think this is actually a good research question that I'd like to investigate at some point.
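Here's an illustrative sketch of that one-plan-per-histogram-bucket idea for a single parameter; the bucket boundaries, plan names, and the idea of keying on the parameter's estimated row count are all made up for the example.

```python
# One cached plan per histogram bucket of a prepared statement's parameter.

import bisect

bucket_bounds = [100, 10_000]            # estimated matching rows: rare/common/hot
plans_by_bucket = ["IndexNestedLoop",    # selective value: index lookup wins
                   "HashJoin",           # mid-range cardinality
                   "SeqScan+HashJoin"]   # hot value: just scan everything

def plan_for(estimated_matching_rows):
    return plans_by_bucket[bisect.bisect(bucket_bounds, estimated_matching_rows)]

print(plan_for(3))        # rare value -> IndexNestedLoop
print(plan_for(250_000))  # hot value  -> SeqScan+HashJoin
```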
All right, the next thing we have to deal with is plan stability. Plan stability means that the performance of a particular query will be consistent from one day to the next. If I run a query today and it takes X amount of time, then if I run that same query tomorrow on the same database, modulo minor changes, it should take roughly the same amount of time. What I don't want is to run my query today and have it take some time, run it tomorrow and have the execution time cut in half, and then the day after that have it back to the original time. I don't want those oscillations, because they make it super tricky to identify performance bottlenecks and performance issues. DBAs are very conservative about upgrading database system software, because they would rather have stability than better performance some of the time. So how can we achieve stable query performance? Again, this is all about the query optimizer: we want to make sure our optimizer generates plans for queries whose performance is stable from one day to the next. One approach is to provide hints to the query planner. This is where you can annotate the query, either through special comments in the header or, if it's a graphical tool, sometimes through the UI, to specify what you want the optimizer to do for certain choices it has to make. Often you'll see hints like: this query should use this index on this table, I don't care about other indexes or scans, make sure you use this one. The join ordering would be another common hint. Another option is that some database systems let you choose which optimizer version to use for a particular query.
A lot of times when people upgrade the database system software — this is the case in particular for Oracle — you can specify which version of the optimizer you want your queries to use. If you've already vetted a previous version of the optimizer, it's generating plans that you understand, and you have stable performance numbers, then when you do the upgrade maybe you don't want to upgrade the optimizer and have it generate different plans whose numbers may vary. Again, DBAs are very conservative. It's not helpful if we upgrade the database software and 99% of the queries now run faster, but 1% of the queries run slower because the optimizer picked a different plan than it used to. That's going to be bad, because the people whose queries run slower are going to get on the phone and complain and say "why does my query run slower, please fix it", while the other 99% are not going to call and say "thank you for making my queries run faster". People don't do that; it's when things go bad that people complain. So in this case, if I upgrade Oracle, I can say I want to run with the previous optimizer, because it will generate the plans I already know about. The last choice is to support backward-compatible plans. The idea is that before I upgrade my database system software, I dump out, on the old system, the plans for the prepared statements and queries I execute all the time, export them to files, then upgrade the software, and then load those plans from the previous version back in, overriding any decision the new optimizer would make. You're saying: for this query, absolutely use this plan. That guarantees you get the same plan you had before. SQL Server allows you to do this: you can dump query plans out as XML and then load them back into new versions. The last thing we'll talk about in this set of decisions is search termination. Again, when you're on the command line, you type a query in, hit enter, and the query optimizer runs — so you need a way for the optimizer to decide "I've run long enough, I need to stop", because for very complex queries the search space can be enormous, and you could keep searching for days and days, and people aren't going to like that. The most obvious approach is a wall-clock time threshold: if my optimizer runs for a certain amount of time, I just kill it and stop. Another approach is a cost threshold: stop once I find a plan whose cost is lower than some target, or, similarly, if I've run for a certain amount of time without finding a plan that's, say, 10% or 20% better than the best plan I've seen so far, I go ahead and stop. The last one is exhaustion: if you can determine that there are no more possible permutations or enumerations for the target plan, then you know you're done with the search. Typically this is done on a per-group basis, because if there are a lot of tables there will be a lot of joins, and you'll never actually exhaust the whole space.
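Here's a small sketch of how those termination conditions might be wired into the search loop; the candidate plans, the time budget, and the cost threshold are all made up.

```python
# Plan search that stops on (1) a wall-clock timeout, (2) finding a plan
# cheaper than a target cost, or (3) exhausting the candidate space.

import time

def search(candidates, time_budget_s=0.05, good_enough_cost=10.0):
    best = None
    deadline = time.monotonic() + time_budget_s
    for plan, cost in candidates:                # (3) loop may simply finish
        if best is None or cost < best[1]:
            best = (plan, cost)
        if best[1] <= good_enough_cost:          # (2) cost threshold reached
            break
        if time.monotonic() >= deadline:         # (1) wall-clock cutoff
            break
    return best

print(search([("planA", 42.0), ("planB", 9.5), ("planC", 7.0)]))
# -> ('planB', 9.5): we stopped as soon as a plan beat the threshold,
#    even though an even cheaper plan existed further along
```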
To give one example of how this is used: in MySQL you can specify the optimizer's search depth in the search tree. This prevents it from going too far down the branches; you artificially cut off the search at some number of levels so that it exhausts its choices more quickly. Some systems try to set this dynamically, and it's often hard to get right. There's a semi-famous blog article about this from Percona, one of the major MySQL consulting companies that does a lot of development with MySQL. Back in 2012 they wrote about a customer that was doing a select query over 20 tables, where the optimizer took five seconds to find the query plan and the query itself ran for only 50 milliseconds, because it was reading about a thousand tuples. By setting the search depth and artificially cutting off the exhaustive search, they got the optimizer to finish in milliseconds. All the various database systems have different knobs to work these things out. All right. So those are all the design decisions we have to consider when we want to build our optimizer: how to handle prepared statements, how to decide when to finish searching, what scope to run the search on — one query or multiple queries — and so on. Now we want to talk about what the search, the optimizer, actually looks like: how it actually generates a physical plan from a logical plan. The way to think about what we'll cover today is that we're moving forward in both time and complexity. Forward in time in the historical sense: heuristics-based optimizers are what people first built in the 1970s, and from there we'll move toward the newer implementations. And forward in complexity: the heuristics-based approach is the simplest way to build an optimizer — although the engineering side can still get very, very messy — and as we go along, the implementations become more complex, and therefore the kinds of query optimizations and transformation rules they can apply become more sophisticated. We'll talk about heuristics first because that's what you often see when people first build a new system: they're not going to have a cost-based search model, they're going to use heuristics, and sometimes heuristics are still widely used today. Then the unified and stratified searches we'll talk about later today will take us up to the Volcano optimizer, and that will segue into the next lecture, when we talk about Cascades in more detail. Okay. So the very first query optimizers for relational database systems in the 1970s — in two of the first three relational database systems — used heuristic-based transformations. The idea here is that in the code itself you define rules that transform the logical operators into physical operators. The most common one would be: if I'm doing a scan, and I know I have an index that matches all the predicates in my WHERE clause, then that's the index I want to use. That's the sargable stuff we talked about a few slides back. Those are the standard logical-to-physical transformations, but some of the optimizations to improve query performance come from applying the basic rules you always want for query execution, like predicate pushdown and limit pushdown. Predicate pushdown means that instead of having the predicate near the top of the query plan, I push it all the way down to my access methods, so that I'm filtering out tuples as early as possible and not passing them from one operator to the next, because that would be expensive. Another common one is to handle join ordering through simple cardinality estimates: when joining two tables, approximate which one will feed more tuples into the join operator, and make one the outer and the other the inner accordingly. Simple rules like that. The important thing to note is that there's no cost model involved in this. It's all static: if a query plan has this pattern in it, I always apply this rule — I always push down the predicate, and so on.
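As a sketch, a static rule like predicate pushdown is a pure pattern match on the plan tree, with no cost estimate anywhere; the nested-tuple plan representation here is just for illustration.

```python
# A heuristic rewrite rule: whenever a Filter sits directly on top of a Scan,
# fold the predicate into the scan so tuples are dropped at the access method.
# Note there is no cost model: if the pattern matches, the rule always fires.

def push_down_predicates(plan):
    if plan[0] == "Filter" and plan[2][0] == "Scan":
        predicate, scan = plan[1], plan[2]
        return ("Scan", scan[1], predicate)
    if plan[0] in ("Filter", "Join"):
        return (plan[0],) + tuple(
            push_down_predicates(c) if isinstance(c, tuple) else c
            for c in plan[1:])
    return plan

before = ("Filter", "a.id = 123", ("Scan", "artist"))
print(push_down_predicates(before))  # -> ('Scan', 'artist', 'a.id = 123')
```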
As I said, two of the three first major relational database systems from the 1970s — Ingres out of Berkeley from Mike Stonebraker, and Oracle from Larry Ellison — used this approach. Back then, the computing resources available to actually run the database system software were quite limited, and the databases people were trying to store in the 1970s were super small. We'll see this in a second: in the Ingres paper, they talk about scanning tables with something like 300 tuples, and that's nothing. So the complexity of the problem, the kinds of queries they were dealing with, was much, much less than what we deal with today. In the case of Oracle: Oracle got huge, with a lot of money riding on the simplest kind of optimizer you can have — one without a cost model or any of the things we're going to talk about going forward. So I don't want to make it sound like I'm saying you can't have a successful database system without a cost-based optimizer; clearly Oracle got quite far without one. But again, back then the queries they were dealing with followed the SQL-92 standard: no CTEs, no window functions, no user-defined aggregates — all of these things didn't exist. So it is actually viable, and Mongo also got really big without a cost-based optimizer. I've talked to some people who worked on the Oracle query optimizer back in the day, in the 1980s and 1990s, and they said it was the largest piece of code in the system and quite sophisticated in what it could do, but it was an engineering nightmare: everything was written in C, there was no high-level DSL for declaring the rules, just a bunch of if-then-else statements. They had a lot of money and some really smart people, and they could have done amazing things by extending it, but it became untenable, and that's why they switched over to the cost-based search we'll talk about in a second. All right, so again, this is what Ingres did at the beginning, and this is what Oracle did at the beginning. I'm not saying this is the right way to do it; in some cases it is, in some cases it isn't. So let's talk about what Ingres did — this is mostly for historical curiosity. Going forward, we're going to use the following three-table schema, with an example query, of something like a Spotify or Apple iTunes database: we have artists, we have albums, and artists appear on albums. We're just keeping track of which albums get put out and which artists are on them.
So let's look at what Ingres did back in the 1970s. For this we'll use a query that's a three-way join between artist, appears, and album, and we want to get all the artists that appear on a particular album. Now — how to put this — Ingres didn't support joins, so it had to rewrite all join queries into single-table queries. The first step is to take a join query like this and decompose it into single-table queries. What do we mean by that? Take this query and rewrite it into two queries. The first query is now just on the album table, doing the lookup on the album name, because that's the predicate we're passing in. It's a select on a single table, and we write its output into a temp table. Then, in the remaining query, we replace the references to album with references to our temp table. Now we recursively rewrite this query to remove the joins it still has: we get a new query that does a lookup on appears with a join on temp1 and writes its output into temp2, and then the query on the artist table does its lookup against temp2. Once you've done this rewriting, you substitute the values these temp tables produce into the next query. Say we execute the first query and it produces album ID 9999 as output. We go to the next query and replace the reference to temp1 with the 9999 generated by the previous query. Now this is no longer a join; it's a single-table query, with the value produced by the previous query substituted in. Same thing with the next step: that query runs and produces two outputs, and now I can rewrite the final query into two queries, one that takes artist ID 123 and one that takes artist ID 456. I execute both, combine their results, and that's the output of my join. This is pretty impressive, right? It's a way of rewriting queries — I'm showing it at the SQL level, but they would be doing this at the plan level — that lets you do joins without actually having a join implementation; you're just doing these single-table selects. It's been argued that this is actually an early example of adaptive query optimization, because rather than the static optimization approach, where I generate the plan all at once for my entire query, I could take the query — and I see I misspoke: they were not operating on physical plans, they were probably operating on logical plans — and only when I get the values I want to substitute into it, like artist ID 123, do I send it off to the optimizer and generate the physical plan. It's a sort of late-binding optimization approach: I can learn from the information coming out of these queries to decide the best way to execute the next one. It's not exactly what adaptive query optimization is, but they're essentially running the optimizer on a per-tuple basis, which nobody does today.
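Here's a toy sketch of that decomposition-and-substitution flow, using the IDs from the walkthrough; the album and artist names are invented, and plain dictionaries stand in for the actual tables.

```python
# Ingres-style decomposition: the join becomes a chain of single-table
# lookups, with each query's output substituted into the next one.

album   = {"Some Album": 9999}                 # name -> album id (name made up)
appears = {9999: [123, 456]}                   # album id -> artist ids
artist  = {123: "Artist A", 456: "Artist B"}   # made-up artist names

def run(album_name):
    temp1 = album[album_name]    # Q1: single-table select on album
    temp2 = appears[temp1]       # Q2: temp1's value substituted in
    # Q3 is rewritten into one single-table query per substituted artist id,
    # and the results are unioned -- a join without a join operator.
    return [artist[a] for a in temp2]

print(run("Some Album"))         # -> ['Artist A', 'Artist B']
```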
Okay. So what are some advantages and disadvantages of heuristic-based optimization? For one, it's easy to implement and easy to debug: you just walk through with gdb, or whatever your favorite debugger is, step through, and see which rules got fired, in what order, to determine why it chose one particular plan over another. And as I said, when you build a new database system today, just to get up and running and execute your first queries, you can get pretty far with a heuristic-based optimizer. The downside is that you often end up with constant thresholds or values baked into the source code, or defined in a configuration file, to help make certain decisions. For example, if I want to convert a logical join operator into a physical join operator and I need to determine whether a sort-merge join is faster than a hash join, I have to have some notion of baked-in cost to make that decision, and it often has no bearing on what the data actually looks like. That can be tricky. The other tricky thing is that it becomes basically impossible to generate good plans when the transformations or optimizations you want to apply have dependencies between them. Say I'm deciding whether to apply optimization A, and based on whether I choose it, I may or may not want to apply optimization B. If I'm writing heuristic rules, I have to write that explicit rule: if I do A, don't do B; if I don't do A, I can do B — all of it manually in the source code. With a cost-based search, which we'll talk about on the next slide, if you write your cost model in the correct way, the search engine finds that for you, and you don't have to code it directly. So again, the heuristics-based approach is what two of the three original major relational database systems from the 1970s used. The other famous approach from the 1970s was a combination of the heuristics we just talked about plus a cost-based search, and this is the technique that all the modern optimizers we'll talk about are based on. They may not use exactly the approach IBM System R used, but the notion of using a cost model to estimate the cost of a plan is the standard approach everyone uses. With System R, the way it works is that static rules first perform initial optimizations, and then a dynamic programming method determines the best join ordering for the tables. I think we've talked about IBM System R a little before. Ted Codd was a mathematician at IBM who wrote a far-reaching and very progressive paper on the relational model, but it was all mathematics. The Ingres team at Berkeley was one group of people that took the paper and tried to build a system; IBM tried to build their own version, called System R, in San Jose, California.
They got a bunch of people with brand-new PhDs in computer science and mathematics, put them in a room at IBM, gave them Codd's paper, and said: all right, go make this. A bunch of people carved off different pieces of the problem: one person went off and invented SQL, others invented storage models or two-phase locking. Pat Selinger was the one assigned to work on the query optimizer, and it turned out to be very seminal work. This was the first example of a cost-based query optimizer. It's also an example of bottom-up planning, or forward chaining — we'll explain in a few slides what that actually means, but basically it's a divide-and-conquer approach: we start with nothing in our query plan and iteratively build it up to produce the final result, figuring out what the join order is going to be along the way. As I said, System R was the first system to use this particular implementation. The first version of DB2 that came out in the 1980s, I think, borrowed System R's query optimizer, and most open-source database systems use some flavor of the System R approach; it's what MySQL, Postgres, and SQLite use. The commercial systems usually do something more sophisticated, using Cascades. All right, let's go through an example of what it's actually doing. The way the System R approach works is that you break the query up into blocks and have logical operators representing each block. For each of those logical operators, you figure out all the possible physical operators that could implement it, and then you iteratively construct the query plan that has the minimal cost — essentially stitching together all the tables you want to access and the joins needed to combine them and produce the final result. One important simplification IBM made at the time was that they would only pursue left-deep join trees. That means all the joins hang off the left side of the tree going up. Contrast this with a bushy tree, where I can join A with B and C with D and then take the outputs of those two separate joins and join them together; a left-deep tree would join A and B first, take that output and join it with C, then take that output and join it with D. They made this decision to only look at left-deep trees purely to reduce the search space of the problem: if you also have to consider bushy trees, the number of choices you have to look at increases, and for the hardware they were dealing with at the time, that was deemed too expensive. You still sometimes see this assumption made in optimizers today, even though a bushy plan will sometimes be preferable; it depends on the system. But there's nothing about the approach I'm talking about here that requires you to support only left-deep trees; that's just a shortcut they took back in the '70s.
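To see how much this shrinks the search space, here's a small sketch that enumerates the left-deep orderings for our three tables; each permutation of the tables yields exactly one left-deep tree.

```python
# Left-deep join trees: the join orderings are just the n! permutations of
# the tables, a much smaller space than all bushy tree shapes.

from itertools import permutations

def left_deep_plans(tables):
    for order in permutations(tables):
        plan = order[0]
        for t in order[1:]:           # ((A join B) join C) join ...
            plan = (plan, "JOIN", t)
        yield plan

plans = list(left_deep_plans(["artist", "appears", "album"]))
print(len(plans))   # 3! = 6 orderings for three tables
print(plans[0])     # (('artist', 'JOIN', 'appears'), 'JOIN', 'album')
```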
All right, let's go back to the query from before. The one difference I'm going to make is that now I'm going to sort the output based on the artist ID. The first step is to choose the best access path for each table. Here are all the tables I'm going to access and all my predicates, and I pick the best access method for each: for artist and appears, I don't have an index I can use, so they get sequential scans; album has an index on name, and we'll use that. Next, we enumerate all possible join orderings for these tables. This is just the permutations of every possible ordering my tables could have — joins, and in principle Cartesian products too, although we can prune those out. Now, among all these possible orderings, I want to figure out which one will have the lowest cost, and I need to base that cost on which physical operator I'm going to use. The ordering alone is just a logical plan — "join artist with appears with album" — but it doesn't say how to actually perform each join. That's what this last step figures out. System R is a bottom-up approach: we start at the bottom of the query plan, where we haven't done any of the joins yet and just have our three base tables, artist, album, and appears, and the top is where we want to get to, with artist, appears, and album all joined together. We figure out how to start joining them together to reach that final result. First we look at all possible join pairs for the first stage: I can join artist and appears, or album and appears, and then all the other pairs I showed on the last slide, which I'm truncating here because we're out of space. Emanating from the starting point, each path to a node uses a particular physical algorithm: I can either do a hash join to join artist and appears, or a sort-merge join to join artist and appears; likewise, to join album and appears, a hash join, a sort-merge join, and so forth. For each of these physical operators I can estimate the cost of executing it, and then, for each node, I select the incoming path with the lowest cost. Say that to get to this node, the hash join has the lower cost, same for that one, but for this other one the sort-merge join is cheaper. Then I do the same thing at the next level of the query plan: which physical operator could I use for the last join? Again, I'm showing a subset here — I could do a sort-merge join or a hash join for these different options — and again I pick the physical operator with the lowest cost for each transition from one node to the next. Once I have all that, I backtrack and ask: what was the path that got me from my starting point to my end point, going through these physical operators, with the lowest total cost? Say it's a hash join followed by a hash join; that's the plan with the lowest cost among all the joins, and that's the one I use. Again, it's divide and conquer: rather than costing complete paths from start to end all at once, I only go from one stage to the next, and once I've figured that stage out, I go to the next one and combine them together to produce the result.
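Here's a compact sketch of that bottom-up dynamic-programming search in the spirit of the System R approach: keep the cheapest plan for every subset of tables and extend each subset one table at a time, trying each join algorithm. The scan costs and per-algorithm cost factors are invented, and a real cost model would account for join output sizes.

```python
# Bottom-up (Selinger-style) join enumeration over left-deep plans: memoize
# the cheapest (cost, plan) for every subset of tables, then grow subsets.

from itertools import combinations

TABLES = {"artist": 100.0, "appears": 500.0, "album": 50.0}  # toy scan costs
JOIN_ALGOS = {"HashJoin": 1.1, "SortMergeJoin": 1.4}         # toy cost factors

def best_plan():
    best = {frozenset([t]): (c, t) for t, c in TABLES.items()}  # access paths
    for size in range(2, len(TABLES) + 1):
        for subset in map(frozenset, combinations(TABLES, size)):
            for right in subset:                  # left-deep: add one table
                left = subset - {right}
                for algo, factor in JOIN_ALGOS.items():
                    cost = best[left][0] + TABLES[right] * factor
                    plan = (algo, best[left][1], right)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, plan)
    return best[frozenset(TABLES)]

print(best_plan())   # cheapest left-deep plan (and cost) over all three tables
```

Real optimizers cost this far more carefully, but the structure of the search — solve small subsets once, reuse them for bigger ones — is the same divide-and-conquer idea.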
For this particular query, though, remember that I've added the ORDER BY, so now I care about the sort order of the output after the joins. But there's nothing in the plan I've generated here that tells me what that sort order will be, and that's because, at least the way System R was set up, there's no notion of the physical properties of the data, like the sort order that I want. In this case I'm doing hash joins, which means the data is going to come out in random order, and then I have to execute a sort operator to put the data into the physical order that I want. But going back here, if I had known that I needed this data sorted a certain way, then maybe I would have chosen a sort-merge join, because that would have been a two-for-one: it does the join and leaves the data sorted the way I wanted. Because of the way this is set up, though, there's nothing about the physical properties of the data that can help make this decision. That's one of the limitations of System R, and we'll see how to fix it in the stratified and unified searches later on. The way System R actually got around this was to bake into the cost model itself some notion of the sort order it wanted the data to be in, so it could identify that the sort-merge join was preferable to the hash join for this particular example. But that's sort of a hack, because now the logic about what the data should look like lives in the cost model, separate from the actual search strategy itself. So that's an issue.
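This is the hack in miniature: since the search itself never reasons about sort order, the only place to reward the two-for-one is inside the cost function. A hedged sketch, reusing op_cost from the sketch above; the ORDER_BY constant and the sort-cost formula are made up.

```python
ORDER_BY = "artist.id"   # the required output order from the query

def cost_with_order(algo, lrows, rrows, out_rows, join_key):
    # A sort-merge join emits its output sorted on the join key, so if
    # that happens to match the ORDER BY we get the sort for free (the
    # two-for-one); any other plan pays for an explicit sort afterwards.
    total = op_cost(algo, lrows, rrows)       # reuses the sketch above
    got_order_free = (algo == "sort_merge_join" and join_key == ORDER_BY)
    if not got_order_free:
        total += 2 * out_rows                 # invented cost of a final sort
    return total
```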
Now, I briefly want to touch on this distinction between bottom-up and top-down. What I showed you is considered a bottom-up query optimization strategy: you start with nothing, and you build the plan up to get to the output that you want. System R and IBM Starburst, which came later, are examples of bottom-up approaches; the dynamic programming approach we'll see in HyPer is also considered bottom-up. The current feeling in the research community is that this approach is better for figuring out the join ordering. The alternative is the top-down approach, which is what we use in our system based on Cascades. There, you start with the outcome you want and work down into the query plan, adding in physical operators that essentially reverse the steps to get you back to the goal you need. Volcano and Exodus came from the same guy that did Cascades, and Cascades is considered the modern implementation of this. We'll cover it further next class, so I don't want to say too much, but this is what SQL Server and CockroachDB use, it's what our system uses, and a bunch of other systems use it too. So, much like the hash join versus sort-merge join debate for join algorithms, if you care about query optimizers, bottom-up versus top-down is the same kind of debate people have. Okay, so let's talk a little about what Postgres does. Postgres basically does the System R approach, at least usually; there are other query optimizers in it, which we'll get to in a second. If you run a query in Postgres that joins up to 12 tables, this is what you get; if you have 13 or more, you get the genetic algorithm we'll talk about in a few more slides. The way it works is that there's a rewriting phase with all the static rules we talked about, which does some initial rewriting and optimization on the query plan. Then they throw it into a cost-based search using the bottom-up approach, the same as System R, to figure out the join ordering. After that, they go back and add on the remaining things that are missing from the query plan; these would be things like the sort order or any kind of aggregation you have to have. And then you just recursively do the same steps for any subqueries. Now, Postgres is phenomenal; I think the Postgres code is some of the most beautiful database system source code I've ever seen, at least for anything written in C. But the query optimizer, in my opinion, at least when we looked at it a few years ago, is one of the dark corners of the system. I've been told it's very brittle, because the source code for the rewriting phase implicitly bakes in assumptions about what the query plan needs to look like in order to be fed into the next set of heuristic transformations. So if you change some assumption about what the next stage expects the query plan to look like, and the plan you generate is now different, that can break everything else. This is why I like the Cascades or unified approach we'll talk about in a second: you declare all your rules, all your transformations, and the physical properties you expect the query plan to maintain, you throw them at the engine, and it figures everything out and enforces all of those things for you. Here, in order to modify anything, you have to understand what's actually happening at every single step. All right, so again, advantages and disadvantages of the System R approach. The advantage is that it usually finds a reasonable plan, at least for join ordering over a moderate number of tables, in a reasonable amount of time, without doing an exhaustive search. The downside is that you inherit all the same problems you had in the heuristic-only approach, because that's the first step you apply before going into the search. Being limited to left-deep trees is really only a limitation of System R itself; I don't think modern systems have that issue. And then, as in the example I showed before, if you have sort orders or compression or other physical properties on the data, you have to bake that awareness into the cost model, because the search algorithm itself can't enforce it. This last one is the worst part; it's the major limitation you're going to have using this System R approach, using the cost model to enforce physical properties. From an engineering standpoint, I think this is problematic.
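As a recap of that staged flow, here is a rough pseudocode sketch. Every function called in it is a hypothetical stand-in, not Postgres's actual internal API; the threshold value is the lecture's recollection, and the real knob to check is Postgres's geqo_threshold setting.

```python
GEQO_THRESHOLD = 13   # the lecture's recollection; the real setting is
                      # Postgres's geqo_threshold, whose default may differ

def plan_query(parsed_query):
    # Rough pseudocode of the staged (stratified) flow described above.
    tree = apply_static_rewrite_rules(parsed_query)   # heuristics, no cost model
    if count_join_items(tree) < GEQO_THRESHOLD:
        joined = dp_join_search(tree)                 # System R-style bottom-up
    else:
        joined = genetic_join_search(tree)            # GEQO, covered shortly
    plan = attach_missing_operators(joined)           # sorts, aggregations, etc.
    for sub in subqueries_of(plan):
        plan_query(sub)                               # recurse into subqueries
    return plan
```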
So before we get into the modern query optimizer implementations, I want to talk about another class of optimizers that use randomized algorithms. This is another way to think about the problem we're trying to solve in the query optimizer. The idea of a randomized algorithm is that rather than doing this branch-and-bound search looking for a better query plan, we just take our query plan, apply random permutations to it, and do a random walk over all possible valid query plans; if we happen to stumble upon one that's actually better, that's the one we end up using. For this you actually need a threshold, like a time threshold, to be able to say you've searched long enough and know when to stop, because otherwise this thing would just run forever. An example of this is Postgres's genetic algorithm, which is actually in a real system today; we'll get to that on the next slide. But before that, there was an implementation using simulated annealing, back in 1987, and as far as I know it never actually made it into a real system. The basic idea is that you start with the query plan you generated using the heuristic-based approach we talked about at the beginning, and then you do random permutations of operators in the query plan, like swapping the join order of two tables. You estimate the cost of that change, of that new query plan, and if it's better, you accept the change; if it's worse, you flip a weighted coin to decide whether to keep it or not. The idea is that every so often, by accepting a change that actually made things worse, you can break out of local minima and potentially find the true optimum. The tricky thing, though, is that you have to write rules that enforce the correctness of the query plan so you don't end up producing incorrect results. For example, if my data needs to be sorted and I randomly permute the query plan by putting the ORDER BY sort operator before the hash join, then my output is not going to be sorted and I'll end up with an incorrect result. You have to write all these rules to make sure this happens correctly, which can be tricky because there are a lot of corner cases.
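Here is a minimal sketch of that accept-or-flip-a-coin loop. The random_neighbor and cost functions are assumed, caller-supplied inputs; in particular, random_neighbor has to return only valid rewrites, which is exactly the corner-case-laden part just described. The temperature schedule is arbitrary.

```python
import math
import random

def anneal(initial_plan, random_neighbor, cost, iterations=10_000):
    # Sketch of a simulated-annealing plan search. `random_neighbor(plan)`
    # must return only valid rewrites (e.g. it may never move an ORDER BY
    # below a hash join); enforcing that is the hard part.
    best = current = initial_plan
    temperature = 1.0
    for _ in range(iterations):
        candidate = random_neighbor(current)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept regressions with a weighted
        # coin flip that gets stingier as the temperature drops.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temperature *= 0.999   # made-up cooling schedule
    return best
```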
An example of a randomized algorithm that is actually used in production is the Postgres genetic optimizer. Again, the way this works is that Postgres uses the System R approach we talked about before if your query joins up to 12 tables; if you try to join 13 or more, you get this one. You can turn it on and off, and there's a threshold setting for when it actually kicks in, but by default I think it's 13 tables. It uses a genetic algorithm: at the beginning of every round it generates a bunch of different permutations or random variations of the query plan, then it estimates the cost of all those permutations and picks some number of the lowest-cost ones to propagate to the next generation. Then a mutator function flips around or randomly permutes the best plans; the idea is that you're taking genes from the best plans and hopefully producing offspring that are better plans. Visually it looks like this. Let's say in our first generation, for a simple three-table join, we estimate the cost of each of these plans, and this one ends up being the lowest cost, so we update our lowest-cost marker over here. Then we randomly permute aspects of this plan, feed that over to this one, and kill off that one. Same thing again: we estimate the costs and find the one with the lowest cost; in this case this one is lower than that one, so it becomes the best plan we've ever seen. We keep these two guys around, flip some stuff, generate new plans, and we just keep going over and over again until we run out of time. So what are the advantages of this? Well, jumping around to random locations allows us to get out of local minima, so potentially we could find a better plan. To the best of my knowledge, though, I've not seen any detailed analysis or experiments about how good the PostgreSQL genetic optimizer actually is compared to, say, letting the regular optimizer run for longer. We also have low memory overhead for doing the search, because if we don't need to maintain the history from one generation to the next, or of our random walk, then the memory footprint is just whatever we're using for the current round. The biggest disadvantage, though, is that unless we do extra work to make sure the randomness is deterministic, we could end up with a system where it's difficult to debug or determine why the database decided to use one plan versus another. In the case of Postgres, they always make sure the random seed is consistent for a given query, so if I run a query through the genetic optimizer today and run the same query tomorrow, it will always produce the same plan. And we still have to maintain the correctness rules to make sure we don't do any mutations that break the query.
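A hedged sketch of that generational loop is below. The real Postgres GEQO differs in its details (for instance, it encodes a plan as a string of join-order genes and recombines edges); the cost, crossover, and mutate functions here are assumed, caller-supplied stand-ins.

```python
import random

def genetic_search(seed_plans, cost, crossover, mutate,
                   pop_size=16, survivors=4, generations=100):
    # Sketch of a GEQO-style generational search over query plans.
    population = list(seed_plans)
    best = min(population, key=cost)
    for _ in range(generations):
        parents = sorted(population, key=cost)[:survivors]   # cheapest survive
        best = min(best, parents[0], key=cost)               # track best ever
        population = parents[:]
        while len(population) < pop_size:
            a, b = random.sample(parents, 2)                 # genes from the best
            population.append(mutate(crossover(a, b)))
        # Postgres additionally fixes the random seed per query, so the
        # same query always yields the same plan (the determinism point).
    return best
```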
Okay, we're going a bit long here, but this is good stuff; I enjoy it. One observation we can make is that everything we've talked about so far, you can imagine writing all these transformation rules and algorithms in a procedural language like C or C++. That's tricky, because there's no way to verify precisely, through formal methods, that the transformations we're applying are correct, that when we transform one query plan into another, those two algebraic expressions are actually equivalent. We could try to verify this by running lots of fuzz tests with random queries and checking that the output is correct, but of course that's like trying to prove a negative, which is not easy to do. (Actually, let me let the terrier out in one sec.) A better approach would be to have a way to declare what our transformation rules are through a high-level language or DSL, and then feed that into an optimizer engine that generates the code or applies the rules for us. That way, with this declarative rule set, we can do verification or analysis on the rules themselves to determine that they're correct, and as long as we're reasonably sure our rule engine executes them correctly, we know we're always applying transformations that are valid. So this was the movement in the late 1980s and early 1990s to develop what are called optimizer generators. Think of these as frameworks that allow a database system developer to declare the rules for transforming queries to generate optimal plans, all separate from the search strategy we talked about before. Take predicate pushdown: you declare a rule to do that, and then no matter whether we're doing heuristics or the System R-style search, we're guaranteed to apply that transformation correctly. The first example of this kind of optimizer was in the Starburst system at IBM. Exodus was another famous one, an academic system, and then there were variations and improvements from the guy that built Exodus: Volcano, and then Cascades. OPT++ was at Wisconsin, and Columbia was another query optimizer from the late 1990s at Portland State. Since then there really hasn't been much research on these kinds of optimizer generators, and what people primarily use today is based on this work from the 1990s. We'll go into more detail about Starburst and Cascades next class, but I'll briefly talk about them now. These optimizer generators have a rule engine that applies the transformations, and you can declare in a rule what physical properties an operator should enforce on the data it processes, like the sort-order example I showed before. But now the question is how we're actually going to apply these transformations and do the search for an optimal plan. The two approaches are stratified search and unified search. Stratified search is essentially what I said the IBM or System R approach was: I first apply my transformations using heuristics, without a cost model, and then I take that query plan and run it through a cost-based search that finds the best join ordering. Unified search is where we do all aspects of query planning at once: the transformations for predicate pushdown and all those static rules, if you will, are done at the same time, in the same search space, as figuring out the best join order. So let's talk about the first one. In stratified search, you first rewrite the logical plan using all the transformation rules, going from a logical plan to a logical plan; this is the same thing we do in the heuristic-based approach. The rule engine checks whether the transformation you want to do is even allowed before it actually applies it, to make sure, again, that you don't end up in an invalid query plan state. There's no cost model for this; these are the same rules we talked about at the beginning: I always want to do a predicate pushdown or a limit pushdown, and I can write a rule to do that transformation. Then, once this step is done, you do the cost-based search to find a mapping from the logical plan to a physical plan, which covers the join ordering as well as choosing between a nested-loop join, a sort-merge join, or a hash join.
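To show what declaring a rule might look like, here is a toy sketch where a rule is just data plus a guard, and a generic engine (not the rule author) decides when it fires. The Node representation and the columns_of helper are invented for illustration; real systems like Starburst and Cascades each have their own rule languages.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str            # "FILTER", "JOIN", "SCAN", ...
    inputs: tuple = ()
    info: tuple = ()     # e.g. predicate columns or a table name

def filter_below_join(node):
    # Rule: FILTER(JOIN(L, R)) -> JOIN(FILTER(L), R), allowed only when
    # the predicate touches columns produced by L. That guard is what the
    # engine checks before the rule may fire; skipping it could leave the
    # plan in an invalid state.
    if node.kind == "FILTER" and node.inputs and node.inputs[0].kind == "JOIN":
        join = node.inputs[0]
        left, right = join.inputs
        if set(node.info) <= columns_of(left):    # columns_of is hypothetical
            pushed = Node("FILTER", (left,), node.info)
            return Node("JOIN", (pushed, right), join.info)
    return None          # rule does not apply at this node

REWRITE_RULES = [filter_below_join]   # an engine applies these to a fixpoint
```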
The most famous of these optimizer generators was Starburst at IBM, led by Guy Lohman, and this just repeats what I said before: you have the query rewrite phase, which computes the transformations based on rule blocks and doesn't have a cost model, and then you do the System R-style search to find the join orders and things like that. As far as I know, the latest version of IBM DB2, their enterprise relational database system, uses this approach. Starburst was a system developed at IBM, so it obviously makes sense that it went into DB2, but I don't know of any other database system that follows this approach. So what are the advantages? Well, in practice it works well and gets fast performance. The downside is that you have no way to clearly define priorities for your transformations. Again, this will make more sense when we talk about Cascades, but basically, since I can't exhaustively search everything, maybe there are some transformations I want to look at first because I think I'll get more benefit from them; applying those first helps lower the upper bound for my query, so I can then prune out the things that are less helpful. At least in that version of Starburst, you couldn't do that. The other aspect they talked about in the paper as a big pain was that because these transformation rules were based on relational calculus, they became difficult to maintain, and it was hard to find programmers to write them, because writing relational calculus code doesn't come naturally to regular systems programmers. The Starburst papers are from around 1988, and I don't know whether that's been rectified since. All right, so quickly, the unified search: here the logical-to-logical and logical-to-physical transformations are done all together, and you don't have separate stages, because everything is just a transformation. You throw it into your rules engine, it does the search, and it applies everything for you. The major tricky thing you have to deal with in unified search is that there are going to be so many transformations that you have to use a memoization table and cache some of them, so that you're not applying the same change over and over again. This makes the memory footprint a bit larger, because I have to maintain some history of the transformations applied in the past to know whether I want to apply a new one in the future. Goetz Graefe is the same guy that did the Volcano iterator model we talked about for query processing models; there's a Volcano optimizer that's part of the same project. He also did some of the work we talked about for latching and locking in B+ trees. When he was an academic, he built a series of these systems, each of which had an optimizer generator. The second one was Volcano, a general-purpose cost-based optimizer that had equivalence rules for relational algebra baked into it. It made it really easy to add new rules, and it treated the physical properties of the data as first-class entities or components in the system itself, so you could tell whether a transformation you were making was valid as you went along, rather than trying to clean things up after the fact.
This is an example of the top-down approach, the same thing that Cascades, which we'll talk about next class, will use; Volcano was sort of an early prototype, a predecessor to Cascades. As far as I know, the Volcano system itself used this optimizer, and other academic prototypes at the time used it too, but this exact approach is not used in any major system. So let's look at an example here. (God damn, I don't want to restart.) Okay, the first thing we do is start with a logical plan of what we want the query to be. This is what we want our output to be: a three-way join between these tables, with an ORDER BY on the artist id. So again, this is the opposite of what we saw in System R. In System R, we started down here, said here are all the tables we have, and figured out how to get up to the top. Here we start at the top with what we want, and then we go down, applying transformation rules to get there. As we go down, we can invoke rules that do either logical-to-logical transformations, like taking your join on A and B and reversing it into a join on B and A, or logical-to-physical transformations, like taking your join on A and B and saying it will now be executed as a hash join on A and B. We add these nodes to apply the transformations, connect them to record how we got there, and keep going down like this until we reach the bottom. At that point we can compute the cost of the query plan as it stands and keep track of it as the lowest cost so far. Then we just keep looking at all the other possible permutations, and so forth. In this case, if we try to do a hash join here, we know that it would violate the physical ordering we need, so we can't do it as-is; we can introduce, say, a quicksort to do the ORDER BY, and then do the hash join after that. But say that doing the quicksort plus the hash join is more expensive than doing the sort-merge join followed by either the hash join or the sort-merge join; then we can cut it off there. You're doing branch and bound: as I'm traversing the tree, if I reach a point where my query plan is already more expensive than the best plan I've seen before, I don't need to look at anything below it in the tree, because I know I'll never do better than that. All right, so the advantage of the Volcano optimizer is that we have declarative rules to do our transformations. Starburst had that too, but Volcano gives you better extensibility with an efficient search algorithm, because we can reduce the number of cost estimations we have to do by caching our transformations. The problem, though, at least the way Volcano was implemented, is that for every single operator I'm looking at, I would expand out all possible transformations first and then start searching over all of them, rather than maybe traversing all the way to the bottom first. And it was not easy to modify predicates. I don't remember exactly what this was about; I think it essentially meant that you couldn't do rewriting of things like WHERE clauses, because the optimizer itself only knew how to do transformations on the plan operators and not on the expression trees inside of them.
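Here is a hedged sketch of what that top-down, memoized, branch-and-bound search can look like. The implementations and local_cost functions are assumed stand-ins for the rule engine and the cost model, and real Cascades organizes this into groups and scheduled tasks rather than plain recursion.

```python
def search(expr, required_props, memo, limit=float("inf")):
    # Sketch of top-down plan search with a memo table plus branch-and-
    # bound pruning, in the spirit of Volcano.
    key = (expr, required_props)
    if key in memo:                          # never re-derive a solved subgoal
        return memo[key]
    best_cost, best_plan = float("inf"), None
    for plan in implementations(expr, required_props):   # assumed rule engine
        cost = local_cost(plan)                          # assumed cost model
        for child, child_props in plan.inputs:
            budget = min(limit, best_cost) - cost
            child_cost, _ = search(child, child_props, memo, budget)
            cost += child_cost
            if cost >= min(limit, best_cost):
                break                        # already worse than best: prune
        else:
            best_cost, best_plan = cost, plan
    # NB: caching a result computed under a tight `limit` is a subtlety
    # real implementations must handle; a later caller may allow a larger
    # budget and deserve a real answer rather than a pruned one.
    memo[key] = (best_cost, best_plan)
    return memo[key]
```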
I rushed this at the end, but it was just meant to show you what a top-down optimizer looks like; next class we'll pick up with Cascades. Hopefully the main takeaway from all of this is that query optimization is super hard. This is part of the reason why the NewSQL systems, when they first came out maybe a decade ago, basically didn't implement a query optimizer: they said you don't need joins, you don't need a declarative language like SQL, and part of the advantage of not supporting those things is that you didn't have to build a query optimizer. But now, for the systems that did add something that looks like SQL, or at least a declarative language, they're having to go implement query optimizers, and the quality of them can vary greatly; they're usually going to be heuristic-based. All right, so next class we'll pick up discussing more optimizers, and the real debate will again be dynamic programming versus the Cascades approach: the bottom-up approach we saw in System R versus the top-down approach, with Cascades being the more sophisticated implementation and what we use in our system today. Okay, so again, I'm meeting with everyone for the projects this week, and then on Wednesday next week we'll do the in-class presentations over Zoom. Okay, wash your hands.