All right, so where we're at in the course: we spent the last couple of weeks going down the stack through the storage architecture. We talked about concurrency control a little bit in the beginning, but then we went from indexes to storage models, to the storage format, to compression, and then we did logging and recovery. So at this point in the course we're going back around to the top of the system: now we're actually going to take in queries and figure out how to generate query plans for them. Then we'll talk about how to execute and schedule them, then different join algorithm implementations, and then we'll get into the really good optimizations with compilation and vectorization. So the first pass down the stack was about how to store data; now we're coming back around and asking how to actually execute queries on it. I started with concurrency control in the beginning because I think keeping it in the back of your mind throughout everything we talk about is really important; it affects pretty much everything. As we saw, it affects what recovery algorithm you use, what logging scheme you can use, and what indexes you can use. That's why I focused on that first, then we talked about storage, and now we're going back to execution. So for today's class and then on Tuesday next week, we're going to talk about how you actually implement a query optimizer. Again, a query shows up, you need to generate a query plan for it and fire it off. For today's agenda, we're first going to do a rough, high-level background on what query optimization actually means in the context of databases. Then we'll spend a little time on the basics, the things we're going to care about in our optimizer implementation. And then we'll spend most of our time talking about the different ways to actually implement an optimizer. You read the Volcano optimizer paper, which is one way to do it, and then on Tuesday next week we'll talk about Orca, which is an implementation of the Cascades optimizer strategy. We're not going to cover Cascades in this class; we'll focus on that next class, and we'll only touch briefly on what Volcano does at the end, because that sets up everything we talk about next Tuesday. We're going to go through all these different search strategies because I think it's important to see all the different approaches people have tried, and that helps motivate why you want to use Cascades. So, in a database system, query optimization is basically trying to solve this problem: given some query from the application, whether it's in SQL or whatever language you want, we want to find a correct execution plan that has the overall lowest cost among all the query plans for that same query. I'm emphasizing two points in that sentence. First, the word correct: obviously, if we choose a plan that may be the fastest but produces an incorrect result, it's useless to us. So as we transform our query and explore different plan options, we have to make sure we always produce a correct result.
And I'm putting "lowest cost" in quotation marks because the cost isn't something that can be measured outside the system; it's not going to be wall-clock time or anything in the physical world. This cost is internal and relative to other query plans for that particular database instance. Again, we'll talk more about cost estimation and cost models next week. But the thing to be mindful of is that the database system needs some way to say, for this query plan versus that query plan for the same query, which one do I think is going to be better? And that depends on your operating environment and a bunch of other factors, which I'll cover in a few slides. So I'd argue that building a query optimizer well is actually the hardest thing you can do in a database system, and we'll see why as we go along. Part of the reason it's so hard is that the problem has been proven to be NP-complete. For some simple queries, query optimization is not very difficult: you just find the exact index you need. But if you have, say, a 35-way or 40-way join, then figuring out the right ordering of all those joins gives you an exponential number of choices, so it's going to be very, very difficult. So we have this tricky problem: we need to find a good query plan, but we kind of need to do it in real time. If someone is at a terminal and hits enter to send a query in, we want to generate a good plan right then and there. We don't want a situation where the optimal plan might take five seconds to run but it takes us 20 seconds to find it; that's a bad trade-off. Because of this, no real optimizer is actually going to produce what I'll call the true optimal plan. For certain things you can always find the optimal plan because the choice is very easy; we'll see this in a second. But for more complicated joins it's simply infeasible, because you could crunch on it for years and years. So instead we're going to use estimation techniques to guess what a plan's cost is going to be, and some of these techniques will apply heuristics and other rules to cut down the number of plans we have to look at, to reduce the search space and make this tractable in a reasonable amount of time. So here's a quick, high-level overview of what this query optimizer architecture looks like. Say we have an application and it submits a SQL query. The first phase it can go into is what we usually call the SQL rewriter. In the pipeline I'm showing here, anything in the dark boxes, the parser, binder, and optimizer, are things you actually need to have; you can't do query optimization without them. The boxes in light gray are what I consider optional: some systems have them, some don't, and you don't always need them. We're not really going to discuss whether you want to use them or not. I'll just say that sometimes they're called different things; sometimes the binder has one of these components inside of it, sometimes it's the other way around. The high-level idea is roughly the same.
And in the end you're still going to produce a query plan; how you get there depends on the implementation. So the SQL query comes in, and the first component it could go into is the SQL rewriter. This operates on the raw SQL string without inferring any semantics about what tables it's trying to access or what it's actually trying to do. There are some basic, regular-expression-style rewrites you can apply, maybe to make the query more tractable or easier to optimize. MySQL 8, which is actually not out yet, has plugins that let you do basic, simple transformations like this, for example converting a complex bunch of ORs into an IN clause. The SQL rewriter still spits out a SQL query, and then you pass that into your parser. Like any parser for a programming language, it breaks the string up into tokens and generates an abstract syntax tree that describes what the SQL query was actually about. Then we feed that abstract syntax tree into what's called the binder, and this is where we convert the string names of tables and other objects in our database into the actual identifiers the system knows about. Here you use the system catalog to say: for the table named foo, give me back its unique identifier, and you use that going forward in all your planning. What the binder spits out is called an annotated AST, which is basically the same thing as the AST, except that instead of names for tables it has identifiers, and it has things like offsets for columns. Then you can pass this into what's called a tree rewriter; in the case of Postgres this is usually embedded inside the binder, it doesn't have to be a separate component. This lets you do some early optimizations directly on the AST that you know you'll want, simple things like predicate pushdown. And then that feeds into what we're going to focus on in this class: the optimizer. That's what does the more complex search to figure out how to turn the logical plan, the annotated AST you get from the previous stage, into an actual physical plan. The physical plan specifies exactly how to execute that particular query. For some of these techniques there will be some kind of cost model or estimator that can say, for this physical plan and this state of the database, what the estimated cost of executing the query is. Again, it's an internal cost; you can't compare cost models across different database systems. It just lets the system guess that one query plan will be better than another. In our current system we have a parser from a project called HYRISE, but it turns out that it implements SQL incorrectly, so we're ripping it out and replacing it with the Postgres parser. Patrick and his team just wrote the new binder for our database system, Peloton, so we can now pull things from the catalog and get annotated ASTs, and Patrick's team is also implementing the new optimizer for us. So we're actively working on this topic right now, which is part of why I'm interested in spending three days on it.
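To make that pipeline a little more concrete, here's a minimal sketch of the stages as plain Python functions. All of the function names and the dict-based AST/plan representation here are made up for illustration; real systems use much richer data structures for each stage.

```python
# Hypothetical sketch of the front-end pipeline: rewriter -> parser ->
# binder -> tree rewriter -> optimizer. Everything here is a stand-in.

def rewrite_sql(sql: str) -> str:
    # String-level rewrites only; no knowledge of tables or columns.
    # (A real rewriter might turn a chain of ORs into an IN clause.)
    return sql.strip()

def parse(sql: str) -> dict:
    # Would tokenize the string and build an abstract syntax tree;
    # here we just fake the result for one hard-coded query shape.
    return {"type": "select", "table": "foo", "predicate": ("id", "=", 4)}

def bind(ast: dict, catalog: dict) -> dict:
    # Replace object names with catalog identifiers -> the "annotated AST".
    return {**ast, "table_oid": catalog[ast["table"]]}

def rewrite_tree(ast: dict) -> dict:
    # Early, cost-free transformations on the AST (e.g., predicate pushdown).
    return ast

def optimize(annotated_ast: dict) -> dict:
    # The part this lecture is about: search for a physical plan.
    return {"op": "index_scan",
            "table_oid": annotated_ast["table_oid"],
            "key": annotated_ast["predicate"][2]}

catalog = {"foo": 1001}
sql = "SELECT * FROM foo WHERE id = 4"
plan = optimize(rewrite_tree(bind(parse(rewrite_sql(sql)), catalog)))
print(plan)   # {'op': 'index_scan', 'table_oid': 1001, 'key': 4}
```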
So the big thing to understand about what happens in a query optimizer is that we're converting a logical plan into a physical plan. You can think of the annotated AST as the logical plan, because it just says at a high level what the query wants to do: I want to read from table foo, I want to join table A with table B. But it doesn't say anything about how you should actually execute that; that information is specified in the physical plan. The physical operators specify the exact strategy or algorithm used to execute whatever operation you're trying to do. Say you want to join tables A and B: the logical plan says "join A with B"; the physical plan says "join A with B using a hash join on this key, with a hash table of this size, with this many threads, using this type of hash table." All of that is specified in the physical plan. So what we're doing in our optimizer is basically converting logical algebra expressions into physical algebra expressions, and what makes this tricky is that there isn't always a one-to-one mapping from a logical operator to a physical operator; sometimes they get split, sometimes they get merged together. An easy example: say your relational algebra plan does a lookup on a table, followed by a filter, followed by a projection. In the physical plan you can condense those three operators into a single scan operator and do the projection and filtering as you read tuples out; you don't need a separate physical operator just for the filtering. That's what I mean when I say it's not always a straight one-to-one mapping, and when we talk more about Cascades next week we'll see cases where it can combine and split these things on the fly. The way we ensure that we generate correct plans at the logical level is that we can rely on the semantics of relational algebra to guarantee that the plans we generate are equivalent. In the original papers from Ted Codd on the relational model, he lays out this algebra and all these commutativity and associativity properties that allow us to reorder the logical and physical operators in our query plan and still end up with the correct result. This gives us more flexibility in deciding how to order the different operations in the query plan; the downside is that because you can generate a ton of different equivalent expressions, the search space gets much larger, so we'll have to come up with ways to deal with that. Again, this is basically the same stuff you covered in an intro class: if I'm doing a join between A, B, and C, I can join B and C first and then A, and that's considered equivalent to joining A and C first, followed by B. Our optimizers will know what these equivalence rules are, generate different permutations of the query plan, and we throw all of that into our search algorithm to find the one with the lowest cost.
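Here's a minimal sketch of the logical-versus-physical distinction, using made-up dataclasses rather than any real system's operator representation. The point is that one logical operator can map to several physical alternatives, and the optimizer's job is to pick among them.

```python
# Hypothetical operator classes illustrating logical vs. physical plans.
from dataclasses import dataclass

@dataclass
class LogicalJoin:          # "join A with B" -- says nothing about how
    left: str
    right: str
    predicate: str

@dataclass
class HashJoin:             # one possible physical implementation
    left: str
    right: str
    predicate: str
    build_side: str         # which input builds the hash table
    est_hash_table_mb: int  # sizing decisions live here, not in the logical plan

@dataclass
class SortMergeJoin:        # another physical alternative for the same logical op
    left: str
    right: str
    predicate: str

logical = LogicalJoin("artist", "appears", "artist.id = appears.artist_id")
candidates = [
    HashJoin(logical.left, logical.right, logical.predicate,
             build_side="artist", est_hash_table_mb=64),
    SortMergeJoin(logical.left, logical.right, logical.predicate),
]
# A cost model would now rank `candidates` and keep the cheapest one.
print(candidates)
```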
One key observation to make at the beginning is that query planning and optimization for transactional (OLTP) workloads is really easy to do, and that's because OLTP queries are typically what's known as being sargable. Sargable stands for "search-argument-able," and the basic way to think about it is that the query's WHERE clause tells you exactly what attributes it wants to look up, and if you have an index on exactly those columns or attributes, you don't need to look at all the different possible combinations; you know exactly which index to pick and exactly how to run the query in the most efficient manner. Say I have a table with a primary key on the column id: if my query is SELECT * FROM table WHERE id = 4, I don't even need to look at other indexes; I know exactly which index to use to go directly to the data I need. That's what's called a sargable query. When we talk about the heuristic-based early optimizers, this is essentially the problem they're solving, because it's really easy to do. For joins in OLTP settings, they're almost always lookups on foreign-key relationships, so again you know exactly which index to use to combine the two tables. If I want to get all the orders for my customer account, I do an index lookup to get my account and then an index lookup to get my orders; I don't need to consider different permutations of join orders and things like that. So you can implement this with simple heuristics, as we'll see in a second. The main takeaway is that for sargable queries in OLTP workloads, query optimization is actually really easy; I can't prove this, but probably 95% of those queries can be handled this way. This is essentially what we do in Peloton's current optimizer: we can only handle sargable queries. So for this lecture and next week, we're really going to focus on OLAP queries, because they aren't always sargable, there isn't always an index, and it's way more complicated to figure out the optimal plan, so we have to do much more work. Just keep in the back of your mind: for OLTP it's pretty easy, for OLAP it's much harder, and that's what we're focusing on here. I alluded at the beginning to determining whether one plan is better than another based on a cost estimate, and again we'll talk about this more next week, but the basic idea is that the cost of executing a query depends heavily on what kind of database system you're running and its operating environment. If I'm running on a mobile phone, maybe my cost model should be about how much energy my join algorithm uses, because I don't want to drain the battery. If I have a distributed database system, my cost model would estimate the amount of data I have to move back and forth over the network. For our purposes, we're mostly going to focus on the size of intermediate results, but there are a bunch of different things you can consider when you're doing cost estimation, which we'll focus on more next week.
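Here's a toy cost model in that spirit: score a join plan by the sum of its estimated intermediate result sizes. The cardinalities and the constant join selectivity below are invented purely for illustration; a real system would pull statistics from the catalog.

```python
# Toy cost model: cost of a left-deep join order = sum of estimated
# intermediate result sizes. All numbers are made up.

table_cardinality = {"artist": 100_000, "appears": 1_000_000, "album": 50_000}
join_selectivity = 0.00001   # assumed constant selectivity for every join predicate

def estimate_plan_cost(join_order):
    """join_order is a left-deep order, e.g. ("album", "appears", "artist")."""
    intermediate = table_cardinality[join_order[0]]
    cost = 0.0
    for t in join_order[1:]:
        # Output cardinality of a join ~= |L| * |R| * selectivity.
        intermediate = intermediate * table_cardinality[t] * join_selectivity
        cost += intermediate          # pay for producing each intermediate result
    return cost

for order in [("album", "appears", "artist"), ("artist", "appears", "album")]:
    print(order, estimate_plan_cost(order))
```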
One of the hard things in cost estimation is dealing with how queries interact with other queries. This is hard because when you want to estimate the cost of executing a query, you now have to take into account whatever else is running at the exact same time. Say I have a query that's been running for a minute, and then I have a bunch of short queries I want to run: the best plan for a short query running by itself might take one second, but if that long query is still running it might take five seconds, and there might be another plan that would normally be the worst choice but under this situation is actually the best. All of these trade-offs and the ways queries interact with each other are really, really hard to model. There are many situations where, if you did this correctly, you could get better performance, but in practice nobody actually does it. So again, the main takeaway is that cost estimation lets us figure out, without actually running the real query, which query plans will be better than others, and it's always an approximation, always specific to that database instance at that particular point in time. Now let's talk about the different design choices we have when we implement our optimizer. The three things we're going to care about are the granularity, or scope, of the queries we try to optimize; when we actually do our optimization; and this thing called plan stability, which is about making sure the optimizer produces the same plan for the same query even if we ask it multiple times. For granularity there are basically two choices. The first is to generate a plan for just one query at a time. This is what people normally think of when you learn about query optimizers in an intro class; it's pretty much how we tell you query optimization works. The query shows up, I ignore everything else running inside the system at the same time, and I do my optimization for just that one query. This is nice because it's easier to implement and optimization is faster, since the search space is much smaller, but the downside is that you can't easily model the interaction of other queries running at the same time, like I just showed on the previous slide, because you have no notion of what else is running. You assume your query runs in isolation, which in reality may not be the case. The other choice is to do query optimization across multiple queries. Think of this as submitting a batch of queries all at once, figuring out how they'll interact with each other, and choosing plans that may be globally optimal for all of them even if they're not the optimal choice for each one individually; in the long run, if I run all these queries at the same time, I get better performance. This is obviously harder to do because the search space is much larger, and it's really only used for a technique called scan sharing, which I'll talk about later in the semester. The basic idea is that when a query shows up, if you know another query is going to run at the same time and do more or less the same thing, like scan the same table, then rather than having two separate threads scan through the table separately and produce results, you combine them so they can piggyback off each other and reuse the same tuples as they go along. None of the open-source systems do this, but the commercial systems can. For example, in a disk-based system, say one thread is doing a complete sequential scan over a table and gets about halfway through before another query shows up that wants to do the same kind of sequential scan. Rather than having the second query start at the beginning and thrash your buffer pool, you have it jump on with the first one; they scan down to the end together, the first one finishes and gets off, and the second one recognizes it missed a bunch of stuff at the beginning and goes back to read it. That's what's called scan sharing, and it'll show up later in the semester.
For single-query optimization this can be expensive to do, because you're doing it on a per-query basis, but you can use prepared statements to do a bunch of the planning ahead of time so you're not repeating it on every single invocation. That's related to the next design choice we have to deal with: when do we actually perform our optimization? The first option is what we teach you in the intro class, called static optimization: the query shows up, I do my optimization for that particular moment in time, then I go ahead and execute it, and I never go back while the query runs to figure out whether that was the right choice. This is highly dependent on how good your cost model is, because if your estimates are really bad then you may choose a non-optimal plan, and of course that depends on how good the statistics and other catalog information you maintain about your data are, which is hard to keep up to date; again, we'll focus more on that next week. The way to avoid doing this every single time for every single query, as I said, is to use prepared statements to do the optimization ahead of time. When you actually want to run the query, you just pass in the handle or name of the prepared statement, the database system goes to its plan cache, grabs the pre-optimized plan, and runs that. The tricky thing with prepared statements, though, is that in many cases they have placeholder variables in the predicates of the WHERE clause, like WHERE id = ?, an input parameter. So the optimizer can't actually find the true optimal plan for that query, because you don't know what the value will be at runtime each time you invoke the prepared statement. Typically what systems do is take the average value and substitute that for the placeholder, and do all the optimization based on that. But obviously, depending on the cardinality of the data and whatever input parameter you actually use, you can get wildly different performance if you have a lot of skew in your data set. So it's not ideal, but it's better than having to re-optimize every single time.
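To make the sargable-lookup and placeholder ideas concrete, here's a small example using SQLite, chosen only because it ships with Python; it's not one of the systems discussed in the lecture, and the exact EXPLAIN QUERY PLAN output varies across SQLite versions.

```python
# Small illustration of a sargable lookup and a prepared-statement style
# placeholder. Schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO artist VALUES (?, ?)",
                 [(1, "Artist A"), (2, "Artist B"), (3, "Artist C")])

# Sargable: the WHERE clause names the primary key, so the planner can go
# straight to that index -- there is no join-order search to do.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM artist WHERE id = ?", (2,)):
    print(row)   # expect something like "SEARCH artist USING INTEGER PRIMARY KEY"

# The plan for the parameterized statement is chosen without knowing the
# runtime value -- exactly the skew problem described above.
print(conn.execute("SELECT name FROM artist WHERE id = ?", (2,)).fetchone())
```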
Another approach is dynamic optimization. Here you literally take the logical plan, the annotated AST that comes out of the binder before it would go to the optimizer, and do a straight mapping of logical operators to physical operators. Then, as you start executing, you convert operators from logical to physical on the fly, and you decide, based on the data you're actually seeing, which physical operator you want to use. Sometimes this approach can also do the following: say you started running the query with one type of operator, like one kind of hash join, and you recognize that your assumptions about what the data looks like turned out to be incorrect; you can modify the plan on the fly and switch to another join algorithm. This is not easy to do, and I don't think anybody really does it in the context of a data warehouse; there you typically see the next approach instead. Where this is actually useful is in stream-processing or continuous-query systems. That would be something like: I want to run a query that notifies me when some event occurs in my data stream. You do the initial planning, install the query sort of like a trigger that fires when the event occurs, and then over time, if you notice that the data coming in through the stream looks a lot different from your original assumptions, you can just re-plan the whole thing over and over again. That's a different scope: one is a one-off query you fire off, the other is something that runs continuously. In addition to being difficult to implement, this is also difficult to debug, because the plan can change dynamically on the fly, and it's hard to go back and figure out whether a choice was good or not, or to understand why your query is running slower than you expect. The last choice is hybrid optimization, which is a combination of the two. The query comes in, you immediately generate a plan for it, and then over time, as it runs, if you recognize that your estimates about what would happen turned out to be incorrect, you can modify how you're going to finish the query. The tricky thing to be mindful of is that if the benefit you'd get from re-optimizing is less than what the re-optimization itself costs, you don't want to do it. Say you have a query that's been running for 60 seconds and has another 60 seconds to go; if re-optimizing the plan could get the remainder down to 40 seconds, but the re-optimization itself takes 50 seconds, that's not a good trade-off. So typically these systems have some way to estimate whether it's actually worth re-optimizing what's left to do. One example of this approach is in Greenplum. Greenplum won't change the actual join algorithm, but if it starts doing a hash join and recognizes that the hash table is too small because the estimates were off, it can dynamically resize the hash table on the fly to reduce the number of collisions, which can speed things up. So you're not really changing the algorithm; you're changing the characteristics of the physical operator. All right, the last thing we need to talk about is plan stability. The idea is that in enterprise settings, people typically don't like wild fluctuations in the run times of their queries. It's always good when things get faster, but if some things get faster and other things get slower, that's bad. So the commercial databases have a lot of additional mechanisms that let you, as the human, instruct the database system in how it should generate query plans.
The first option is hints. This is where you, the human, write annotations in your SQL queries to tell the optimizer what you want it to do. If you know you really want a hash join for a particular query, you can annotate it with a hint and say: don't use sort-merge, use a hash join here; or you can change join orders and things like that, and the optimizer takes that into consideration when it generates the plan. MySQL supports this, and all the commercial systems support it; in SQL Server, for example, if you have a prepared statement you can extract the query plan, annotate and tweak it, and feed it back into the system. Another approach is to allow the DBA to specify which version of the optimizer the system should use to generate query plans. Say you're running Oracle 10 and you want to upgrade to Oracle 12, and for most of your queries everything gets faster, but a small percentage of them actually get slower. You'd like to go back to the old version of the optimizer for those, because it generated plans whose costs were uniform and not wildly different. So when you create a prepared statement or load a query, you can specify: here's the new query, but use the optimizer as it was in version 10. What actually happens is that the Oracle binaries include all the different versions of the optimizer from previous releases, so they can switch one on versus another while still being compatible with the new version of the system. There may be new operators or new optimizations in the newer version, but you want the old plans that don't take advantage of them, because that avoids slowing down other queries. The last one is related: backward-compatible plans. The idea is that before you upgrade from the old version of the system to the new one, you dump out all your cached query plans into something like an XML file, and when you install the new system you load them back in. The system knows these plans were generated by the previous optimizer and uses them rather than re-planning. Again, this matters in enterprise settings because it's not good to upgrade to a new version where 90% of your queries get faster but 10% get slower; people will complain about the 10% that got slower rather than thank you for the ones that got faster. That's what they care about, and we'll see that with randomized algorithms, because they can be non-deterministic, it's important to have these overrides. All right, any questions about the basics of what we want in an ideal optimizer, the things we're going to care about? So now we can talk about the different search strategies. There will be five different approaches I'll go through, and as we go along I'll talk about which one I think produces the best plans and which one I think is faster. But one thing that's going to be really important throughout is that we'll also discuss what it takes to actually implement these approaches; there's a software-engineering side to this.
It's great that we can have an awesome optimizer that generates awesome query plans, but if it's impossible to maintain or extend, then it's essentially useless, because no one except one or two people will actually be able to work on it, and it'll make it very hard to add new features later on. That's one of the main deficiencies of the first two approaches, and for the last two we'll see how they solve it using a sort of declarative rule set. Another thing I'll say: I mentioned at the beginning of the course that this class was not going to be about reading classic database papers, and I realize the Volcano paper you read was from, I think, 1993, which was probably before some of you were born; the first two approaches we'll cover are from the 70s, so I wasn't even born then. The reason we can still read these older papers is that, although the hardware has changed and the complexity of the workloads has changed, they get at the fundamental algorithms and approaches for query optimization that are still applicable today. It's sort of like how nobody has really come up with a better quicksort: quicksort is really old, but the basics are still sound, and then we can talk about how to adapt them to the modern context. All right, so the first approach is heuristic-based optimization. This is essentially what you'd end up doing if you built a database system from scratch and said, "oh, I need a query optimizer"; it's what pretty much everyone does first, and it's what our current optimizer does. The basic idea is that you have static rules written into your database system that transform the logical operators into a physical plan. I've listed the four main rules you always end up with: simple things like performing the most restrictive selection first, so you get rid of as much data as possible as early as possible; performing all your filters during the scans, before you do your joins; doing predicate and projection pushdowns; and then, to figure out the join order for tables, using a simple heuristic like picking the table with the smallest cardinality to decide which one goes before the other. This is what Mike Stonebraker and his team at Berkeley implemented when they built Ingres, and, surprisingly, this is what Oracle shipped up until the mid-1990s. You might think that's crazy: how did Oracle get so big and so popular with the most simplistic query optimizer? I'll say two things about that. One, they had a lot of people, because they had a lot of money, who could really hammer on the heuristic-based optimizer and make it as good as it could be; by the end it was actually a pretty impressive piece of code, though I've heard it was terrible to maintain, and they apparently had a whole floor of people just working on it, but it got them pretty far. The other thing is that you have to understand that back then, not only were the databases people dealt with much smaller than what we have today, but the complexity of the queries they wanted to execute on that data was also much lower. Think of the SQL-92 standard: it has your basic selects, joins, and aggregate functions, but it doesn't have the more complicated things like CTEs, window functions, and everything else you can do in SQL today. So the queries they had to support back then operated on smaller databases and were less complex, and that's why they got pretty far with the heuristic-based approach.
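To make those static heuristics concrete, here's a small sketch of two of them: apply the most restrictive selection first, and pick the join order by smallest cardinality. The selectivity numbers and table cardinalities are invented; a real system would pull them from catalog statistics or just hard-code "magic constants."

```python
# Sketch of two of the classic static heuristics. All numbers are made up.

table_cardinality = {"artist": 100_000, "appears": 1_000_000, "album": 50_000}

def order_predicates(predicates):
    # Rule: apply the most restrictive selection (lowest selectivity) first.
    return sorted(predicates, key=lambda p: p["selectivity"])

def choose_join_order(tables):
    # Rule: start the join order with the smallest table and grow from there.
    return sorted(tables, key=lambda t: table_cardinality[t])

predicates = [
    {"expr": "album.year = 2017",        "selectivity": 0.05},
    {"expr": "album.name = 'My Mixtape'", "selectivity": 0.00001},
]
print(order_predicates(predicates))
print(choose_join_order(["artist", "appears", "album"]))
```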
I want to show an example of how the Ingres optimizer worked, and I'll use this same running example through all the different search strategies. The sample database is sort of a clone of a Spotify-style music application: one table contains all the artists, another contains all the albums they put out, and then there's a cross-reference table called appears that says, for a particular album, here are the artists that appeared on it. The query we're going to run gets the names of all the people that appeared on my mixtape that I dropped last year, so we're doing a three-way join between the artist table, appears, and album, where the appears table cross-references both artist and album. The way the Ingres optimizer worked is that it had to decompose this into single-table queries, because early versions of Ingres did not support joins, so it had to rewrite the query into single-table SELECT statements. The first step is to decompose this into single-variable, or single-table, queries. We start with the original query, break out the selection on the album table, write its result into a temporary table temp1, and then rewrite the original query to replace album with temp1, the temp table we generated above. Then we do this again: we take that query and decompose it into two separate SELECT statements, so we end up with query 3 and query 4, and query 2 goes away. Query 3 takes the output of temp1 and does the join with the appears table; query 4 takes the output of temp2 and does the join with the artist table. But remember, you can't do joins in Ingres, so what the system then does is execute these queries one by one and substitute the values they produce into the WHERE clauses of the later queries. In this example, we execute the first query and get some album with id 999; then we feed that into the next query and replace the join lookup with just 999, so we now have a single-table SELECT statement. We get the output of that, which is two artists, and then we rewrite the next query into two separate single-table SELECT statements, one looking up the first result and one looking up the second, and that produces our correct result. So they did all of this with heuristics; they converted a three-way join, without ever having to figure out a join order, into these single-table SELECT statements. You may think this looks really inefficient, since I went from one SELECT statement to four, but understand that back in the 70s they weren't running on anywhere near the hardware we have today, and what they could do was very limited, so they couldn't do a complex search to find the best way to execute this.
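Here's a toy reproduction of that Ingres-style decomposition, using sqlite3 purely as a scratchpad. The schema (artist, album, appears) and the album id 999 follow the example above; everything else is invented. The point is that the three-way join is answered entirely with single-table queries plus value substitution, never an actual join.

```python
# Toy reproduction of the Ingres decomposition: single-table queries plus
# substitution of intermediate results.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE artist  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE appears (artist_id INTEGER, album_id INTEGER);
    INSERT INTO artist  VALUES (11, 'Artist A'), (22, 'Artist B');
    INSERT INTO album   VALUES (999, 'My Mixtape');
    INSERT INTO appears VALUES (11, 999), (22, 999);
""")

# Q1: single-table select on album -> the "temp table" of matching album ids.
album_ids = [r[0] for r in db.execute(
    "SELECT id FROM album WHERE name = 'My Mixtape'")]          # e.g. [999]

# Q2: substitute those ids into a single-table select on appears.
artist_ids = []
for aid in album_ids:
    artist_ids += [r[0] for r in db.execute(
        "SELECT artist_id FROM appears WHERE album_id = ?", (aid,))]

# Q3: substitute again, one single-table query per artist id.
for artist_id in artist_ids:
    print(db.execute("SELECT name FROM artist WHERE id = ?", (artist_id,)).fetchone())
```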
I'll also say, when you read the paper about this approach, they talk about operating on databases with 100 or 300 tuples. That's nothing today, but back then it was a lot, so they could do this and still produce reasonable results. The advantage of the heuristic optimization approach is that it's really easy to implement, easy within reason, though the code gets very difficult over time. It's also easy to debug, because you know exactly what the transformation is going to be: you just look at the code, trace through it, and see what steps it takes. And it works reasonably well and fast for simple queries, because it's almost a one-to-one mapping from the logical plan to the physical plan. The downsides, obviously, are that it's essentially impossible to generate good plans when you have complex interdependencies and the more complex operators we have in SQL today, and that it relies on magic constants to predict whether you're making a good decision. In the example, the order of the single-table SELECT statements was chosen based on what the system thought the cardinalities of these different operators would be; if those estimates baked into the rules are wrong, you may end up choosing a bad plan. As I said, this is what Oracle did and what Ingres did in the 1970s, and remember those are two of the three original relational database systems developed at that time. The third was IBM's System R, and they came up with a completely different approach: doing query optimization using a cost model. This was developed by a woman named Pat Selinger, who was part of the original System R team at IBM; everyone on the team went off and did their own piece of the system and then combined it all together, and she worked on the cost-based optimizer. This was really groundbreaking work, because at a high level it's pretty much how everyone does query optimization in current systems today. It was the first cost-based query optimizer, where we have some way to estimate whether one plan is better than another, and it works in two steps. The first step uses the same kinds of heuristics, or static rules, that Ingres and Oracle did; but then, when we want to figure out the best join order for our tables, we do a dynamic programming search to evaluate the different plans we could choose. We'll talk more about this later in the class, but this approach uses what's called bottom-up planning, or, if you're coming from an AI background, forward chaining. The idea is that we start with nothing in our query plan and build it up until we reach the goal we want. In the Volcano and Cascades approaches we'll see the opposite, a top-down model, where you start from the goal you want to achieve and work your way down to figure out how to get there. As I said, this was implemented in System R, but System R itself was never actually commercialized or released; this approach first shipped in the early versions of IBM DB2 in the early 1980s.
I'll also say this is pretty much what every open-source database system does today: it's what MySQL does, what Postgres does, and as far as I know what SQLite does as well. The specifics of the algorithm may differ slightly, but it's the same high-level approach. So let's go through what System R does. System R breaks the query up into blocks and has logical operators for each block. Think of a block this way: if you have a nested query, one block is the inner query and the other block is the outer query. It does planning on individual blocks and then stitches them together at the end. For each logical operator in a block, we generate the set of physical operators that could implement it; in this version of System R, that only covers join algorithms and access paths. We generate all the possible choices, and then a dynamic programming search tries to build up a plan with the minimal cost. Because this could be an exhaustive search, they apply a bunch of heuristics and rules to cut down the number of plans they have to look at. The most famous one is that they only consider what are called left-deep trees, where the joins all hang off the left side going up to the root, as opposed to right-deep trees or bushy trees. They did this for two reasons: it reduces the number of things you have to look at, but also, back in the old days when you assumed the data was out on disk, left-deep trees allow you to pipeline the query plan; you can take the output of one join operator and feed it directly as input to the next join operator, without having to materialize it out to disk as you might with a bushy tree. In the context of an in-memory database, where we assume everything is in memory, we don't have to worry about this as much, and we'll see later optimizers that do consider bushy plans, because they can actually turn out to be the optimal plan. All right, let's go back to our example. This is the same query where we look up my mixtape and figure out who was on it, but the one change is that I'm adding an ORDER BY clause at the end, saying I want the output sorted on the artist id; we'll see the ramifications of that in a second. The first thing System R does is choose the best access path for each individual table, using basic heuristics. For this particular query, since we're joining the artist table and the appears table and there's nothing in the WHERE clause we could use to probe an index, those two are assumed to be sequential scans. But for the album table, assuming we have an index on the name field, we know we want to do an index lookup there. So those are the access methods we choose, and we assume they'll feed into our join operators.
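Before going on, it's worth seeing how much the left-deep restriction mentioned above shrinks the search space. The quick count below is standard combinatorics (it isn't from the System R paper): a left-deep plan is just an ordering of the n tables, while a general join tree also gets to pick among all bushy tree shapes.

```python
# Back-of-the-envelope: left-deep join orders vs. all possible join trees.
from math import comb, factorial

def catalan(n):
    return comb(2 * n, n) // (n + 1)

for n in (3, 6, 10, 12):
    left_deep = factorial(n)                      # one tree shape, n! leaf orders
    all_trees = factorial(n) * catalan(n - 1)     # every bushy shape times leaf orders
    print(f"{n:>2} tables: {left_deep:,} left-deep orders vs {all_trees:,} total join trees")
```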
Now we generate a table with all possible enumerations of join orders. Again, this is still at the logical level; we're not saying anything about how we'll physically execute them, but you can think of reading each entry left to right as the join order. Two things to point out about this table, which could go on forever with all possible combinations: it includes both the natural joins and the Cartesian products, but another heuristic that's usually applied is to prune the Cartesian products, because they're almost never going to be the best approach or have the lowest cost. Then we use our dynamic programming search to find the join order with the lowest cost. You can think of the search space like this: I'm only showing two possible join orders here, but really you'd have entries for everything in the table from the previous slide. These are all logical operators; nothing in them says how to actually perform the joins. What you do is generate edges from the base, where nothing is joined yet, and each edge specifies which join algorithm you'd use to get to an intermediate result joining two of the tables; here you see some annotated with hash join and some with sort-merge join. Then you pick whichever one has the lowest cost going from the beginning to that next stage, and you do the same thing again: generate the different algorithms that can get you to the next logical operator, and throw away the ones with the higher cost. At that point we want to find the path from beginning to end with the lowest total cost, and that's the plan we choose. But the tricky thing about this particular example is that the query added that ORDER BY at the end, where we care that the final result is sorted on the artist id. In that case the sort-merge join is better for us, because the merge produces its output in sorted order, so we don't need an additional sorting step. The problem is, as you can see, nothing in this search tree says anything about the physical properties of the data; I just know I executed some algorithm, but I don't know what the data coming out of it actually looks like. The correct answer is that the sort-merge join is what we want, but the way you'd actually make the search pick it is to write rules inside your cost model that say: I know my output should be sorted, therefore give the sort-merge join a lower cost than the hash join. This becomes very cumbersome, because it's not just sort order; it could be location, it could be the layout of the data; there are all these other physical properties of what the operators produce that you have to care about, and you have to bake all of that information into your cost model.
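Here's a compact sketch of that bottom-up dynamic programming search: keep the cheapest plan found for every subset of tables, then extend each subset one table at a time to build left-deep orders. The cost model is the same toy one from before (sum of estimated intermediate sizes), and all numbers are invented; real System R-style planners also track interesting orders and per-operator costs.

```python
# Sketch of a Selinger-style bottom-up DP search over left-deep join orders.
from itertools import combinations

card = {"artist": 100_000, "appears": 1_000_000, "album": 50_000}
selectivity = 0.00001

def join_cost_and_size(left_size, right_table):
    out = left_size * card[right_table] * selectivity
    return out, out            # cost of this step ~ size of its output

def best_left_deep_plan(tables):
    # best[subset] = (total_cost, output_size, join_order)
    best = {frozenset([t]): (0.0, card[t], (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, k)):
            for t in subset:                      # t is the last table joined in
                rest = subset - {t}
                if rest not in best:
                    continue
                rest_cost, rest_size, rest_order = best[rest]
                step_cost, out_size = join_cost_and_size(rest_size, t)
                total = rest_cost + step_cost
                if subset not in best or total < best[subset][0]:
                    best[subset] = (total, out_size, rest_order + (t,))
    return best[frozenset(tables)]

print(best_left_deep_plan(["artist", "appears", "album"]))
```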
All right, so what are the advantages and disadvantages? The first advantage is that it usually finds a reasonable plan without having to perform a completely exhaustive search, and for most reasonable queries it does well enough. The problem is that we still have all the same issues we had with the heuristic-based approach, where we have to write procedural code for the rules that do the initial transformations; in the System R case we're also pruning out anything that isn't a left-deep join tree, which may not be the optimal thing; and I think the biggest issue is that we have to bake the physical properties of the data into the cost model, which also makes it very difficult to maintain. We're short on time, but I'll talk a little bit about randomized algorithms, because I want to get to the Starburst and Volcano stuff at the end. With a randomized algorithm, instead of an exact search like branch-and-bound or divide-and-conquer, we perform a random walk over the solution space of all possible query plans and see whether we find one with a low enough cost. The way this typically works is that you keep searching until you hit some threshold: either I found a query plan whose cost is good enough, or I ran out of time, like maybe I only want to search for 500 milliseconds and then I give up and take the best plan I've seen so far. The only database system I know of that actually implements something like this is Postgres, with a genetic algorithm, which I'll show in a second. The first paper to discuss this came out of Wisconsin in 1987, using simulated annealing. The basic idea is that you first generate a query plan with a heuristic-based approach and then do random permutations of the physical operators to see whether you produce a new plan with a lower cost. If it has a lower cost, you always accept it; if the cost actually gets worse, you flip a coin to decide whether to take it anyway and move in that direction in the solution space. The idea is that this lets you break out of local minima, because it may take you down a path in the search space you wouldn't normally choose, because the first solution you looked at there was worse. The tricky thing is that you again need a bunch of rules to make sure you don't violate the correctness guarantees of the query; for example, if the final result has to be sorted, you have to generate a query plan that actually sorts it, so there are rules you have to enforce as you go along so that no permutation violates these requirements. This is essentially also what Postgres does in its genetic algorithm. Postgres has the heuristics and a System R-style search model to start with, but if a query comes in with a join over more than 12 tables, it falls back to the genetic algorithm. The basic way it works is that you use the heuristics to pick an initial plan, do a bunch of permutations of that plan to try different solutions, choose the ones with the lowest cost and move them into the next generation, then mutate them again, and you keep doing this until you find the one you think will do best. And again, the key thing is that when we mutate from one generation to the next, we have to make sure we always produce valid plans.
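Here's a bare-bones sketch of the simulated-annealing idea just described: start from some plan, randomly perturb the join order, always keep improvements, and occasionally keep a worse plan to escape local minima. The cost function is the same toy model as before, and the table names, temperatures, and iteration counts are all invented.

```python
# Sketch of simulated annealing over join orders. All numbers are made up.
import math
import random

card = {"a": 10_000, "b": 500_000, "c": 2_000, "d": 80_000}
selectivity = 0.0001

def cost(order):
    size, total = card[order[0]], 0.0
    for t in order[1:]:
        size = size * card[t] * selectivity
        total += size
    return total

def anneal(tables, steps=2_000, temp=1e6, cooling=0.995, seed=42):
    rng = random.Random(seed)          # seeded so the "optimizer" is deterministic
    current = list(tables)
    best = (cost(current), tuple(current))
    for _ in range(steps):
        candidate = current[:]
        i, j = rng.sample(range(len(candidate)), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]   # random swap
        delta = cost(candidate) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < best[0]:
                best = (cost(current), tuple(current))
        temp *= cooling
    return best

print(anneal(["a", "b", "c", "d"]))
```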
I will say I think randomized algorithms are interesting, because they let us break out of local minima and possibly explore things we wouldn't otherwise be able to explore, and if you don't care about keeping the history of the plans you've already examined, the memory overhead is actually very, very low. The downside, obviously, is that you may end up trying a permutation or a solution you've already seen in the past and know is bad; I think in the Postgres case they do keep the history for this reason. The thing that makes this difficult in practice, though, is that it's very hard to figure out why the database system chose the plan it chose, because the process is random, and there's a bunch of extra work you have to do to make the randomized algorithm deterministic, meaning if I run it today and get one query plan, I want to get the same query plan tomorrow. In the case of Postgres, the documentation talks about how the random number generator is seeded so that the same query always generates the same random permutations from one generation to the next. And again, the recurring theme we've seen over and over: you have to write procedural code to implement the correctness rules so that you always get a valid plan. This last piece keeps coming up: we have to write procedural code in Java, C++, or whatever language we're using to make sure we don't generate invalid plans, and that's really hard to do and very error-prone, because there's no way to formally verify that your transformations are correct. If you're doing it procedurally, the best you can do is write a fuzz-test generator that throws a bunch of random queries at your optimizer and checks whether anything breaks. The other issue is that the generation of physical operators from logical ones is done with a narrow viewpoint of the query, without understanding what's truly going on; you're just doing the transformation for that one piece of the logical plan. We saw this in the System R approach: when we were going from one side of the search to the other, we only looked at each level of the join when deciding which operator to use, without understanding anything about what was coming later. That's a byproduct of using a bottom-up approach: we don't know how we're going to get where we need to go, we just look at things one level at a time. The better approach is to use a declarative, domain-specific language, a DSL, that allows us, as the database implementers, to write our transformation rules in a high-level, declarative manner, and then have the optimizer essentially act as a compiler that enforces those rules when we do query optimization. This is why the Volcano paper you read talks about the system as a framework or toolkit for people to write optimizers: it lets you write these rules, whether for the relational model or whatever data model you want, in a declarative way, and the compiler takes care of everything for you. This class of systems is called optimizer generators; the only one I know of that was seriously used in practice outside of academia is the one in the Orca paper you're reading. The basic idea is to think of the optimizer as a rules engine that takes the rules you declare and applies them to generate new query plans.
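Here's a minimal sketch of the "rules as data" flavor: each transformation is a (pattern, substitute) pair plus a match check, and a tiny engine applies whatever matches. This is only meant to convey the idea; it is not the actual Volcano or Cascades rule grammar.

```python
# Hypothetical rules-engine sketch: transformation rules declared as data,
# applied by a generic engine rather than hand-written procedural code.

RULES = [
    {
        "name": "join commutativity (logical -> logical)",
        "matches": lambda op: op["op"] == "join",
        "apply":  lambda op: {**op, "left": op["right"], "right": op["left"]},
    },
    {
        "name": "join -> hash join (logical -> physical)",
        "matches": lambda op: op["op"] == "join" and op.get("equi", False),
        "apply":  lambda op: {**op, "op": "hash_join"},
    },
]

def expand(op):
    """Return every expression reachable from `op` by one rule application."""
    return [(r["name"], r["apply"](op)) for r in RULES if r["matches"](op)]

logical = {"op": "join", "left": "artist", "right": "appears", "equi": True}
for name, alternative in expand(logical):
    print(name, "=>", alternative)
```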
What happens now is that the physical properties of the data produced by each operator are embedded in the plan itself, so we can reason about whether we're generating a correct result, or a better result, without having to bake additional information into the cost model. All the information we need, like what the sort order is or whether the data is compressed, lives in the actual physical plan and not in some other code base. There are two ways to do this: stratified search, where planning happens in multiple stages, and unified search, where all the query planning logic is embedded inside a single search space; the latter is what the Cascades/Volcano model does, and we'll see in a second that the former is what Starburst does. In stratified search, you first rewrite the logical plan, using transformation rules, into an optimized logical form. The key thing is that this is like the tree-rewrite component I showed at the very beginning: it doesn't look at the cost model, it just knows how to do some basic transformations to put the plan into a better form, so cost isn't considered at all. After that, you have a second stage that maps the logical plan into a physical plan. You can think of this as similar to the System R approach, where we apply a bunch of heuristics up front to do some initial optimization and then do a search to find the best physical plan, except that we're not writing procedural if-this-then-that code for the rules; we can declare the rules in a high-level language and have the engine enforce them for us. The most famous example of this was Starburst, a project out of IBM Research in the late 80s led by Guy Lohman, a pretty famous IBM researcher who retired last year. Just like in System R, the first phase does the query rewrite and the second phase does the System R-style dynamic programming. This is actually what IBM DB2 uses now: they threw away the System R-style optimizer we talked about earlier, and what they use today is based on Starburst. The Starburst optimizer works well in practice and gets really good performance. The problematic parts are that the grammar, at least as defined in the original paper, doesn't let you assign priorities to transformations; say I can transform my logical join into either a Cartesian product or a hash join: clearly I want to try the hash join first and evaluate that, but there's no easy way to express that in Starburst. Also, some transformations are done in multiple steps, and for each step you have to compute another cost estimate, which gets expensive because you keep going back into the cost model for another computation. The last piece, which we won't really get into, is that the rules you define in the first stage are based on a derivative of relational calculus that isn't very natural in some cases, and there are papers from around that time about how this was difficult for average developers to maintain; we don't need to get into that too much.
All right, so the next model is unified search, and in my opinion this is actually the better way to do it; it seems more natural to me. The idea is that we have a single search tree, and we can do logical-to-logical and logical-to-physical transformations all within that single space. We don't need multiple stages, like a rewriter followed by a search; it's all done in the context of one search tree. The problem is that this generates a lot more transformations, because it's looking at everything, and the way to cut down on redundant calculations is to make heavy use of memoization: a hash table that saves the cost of transformations you've already evaluated so that you don't have to revisit them. We'll see more about this when we talk about Cascades on Tuesday, but because the search is top-down, it can take advantage of this and avoid those redundant computations.

So the Volcano optimizer, the paper you read, is from the early 1990s. To put it in perspective, the paper talks about a system called Exodus, which was the first-generation optimizer generator that Goetz Graefe worked on; Volcano is the second generation, and Cascades will be the third, which is still considered state of the art. This is the same Goetz Graefe who wrote the survey of B-tree locking and latching that you read a few weeks ago, and this Volcano is the same Volcano we used to describe the iterator model of processing queries a tuple at a time; it's all part of the same project.

So the Volcano optimizer is a general cost-based optimizer with a bunch of equivalence rules that let you do these transformations, and we can write them in a declarative manner. The key thing is that it records the physical properties directly inside the operators, so that we know whether something is sorted correctly or not without needing extra logic inside the cost model. And the key difference from everything else we've talked about so far is that it uses a top-down approach, or in AI terms backward chaining: you start with what you want the goal to be and figure out what it takes to actually get there. As far as I know, Volcano itself was never implemented in any real system outside of academia; the paper mentions a couple of systems from other universities that used it, but it's the Cascades approach, which we'll talk about next time, that's actually used in production systems.

We talked at the beginning about the decision between top-down and bottom-up, and the way to think about it is this: in the top-down approach, you start with what you want the answer to be. I know I want these tables joined, I know I want my data sorted, or I want some kind of aggregation, and then you go down and start adding operators to a query plan that will produce the answer you're looking for. The nice thing about top-down is that you can apply branch-and-bound search: if you add an operator whose cost is already higher than the lowest cost you've seen for any complete query plan, you don't need to explore all the way down that part of the tree; you can prune it off early.
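Here's a minimal sketch of that top-down pruning combined with a memo table, in the flavor of what Volcano and Cascades do. The plan representation is just strings and the per-operator costs are invented; the point is the pruning against the best complete plan seen so far, plus the memoized sub-goals:

```python
OP_COST = {"SortMergeJoin": 80, "HashJoin": 60, "NestedLoopJoin": 400,
           "SeqScan": 40, "IndexScan": 15}

memo = {}   # sub-goal (table name) -> best (cost, plan fragment) found so far

def best_scan(table):
    """Optimize a leaf sub-goal once and remember the answer (memoization)."""
    if table not in memo:
        choice = min(["SeqScan", "IndexScan"], key=OP_COST.get)
        memo[table] = (OP_COST[choice], f"{choice}({table})")
    return memo[table]

def optimize_join(left, right):
    """Try each join implementation top-down, abandoning a branch as soon as
    its running cost reaches the best complete plan seen so far (branch-and-bound)."""
    best = (float("inf"), None)
    for join_op in ["NestedLoopJoin", "HashJoin", "SortMergeJoin"]:
        cost = OP_COST[join_op]
        if cost >= best[0]:
            continue                      # prune before even looking at the children
        for child in (left, right):
            cost += best_scan(child)[0]
            if cost >= best[0]:
                break                     # prune partway down this branch
        else:
            best = (cost, f"{join_op}({left}, {right})")
    return best

print(optimize_join("ARTIST", "APPEARS"))   # the hash-join plan wins here
print(memo)                                 # each table was optimized only once
```

The memo means each base table is optimized once and its answer reused, and the cost bound is what lets the search abandon a branch the moment its running cost passes the best plan already found.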
In the bottom-up approach, by contrast, you may have to traverse down paths on the way toward the goal that end up being useless to you and having a higher cost, so you end up doing more work than you actually need to.

So let me show an example of what this looks like with Volcano. Say we have the same query plan I showed before: we want to get all the people that appeared on my mixtape. You start at the very top, at the root: you have your join, and you know you're doing an ORDER BY. So here there's a logical operator, but we're annotating it with what we want the physical data to look like; we want our data ordered by the artist ID. Then we invoke the transformation rules that we defined in our DSL to do either logical-to-logical transformations, in this case taking a join of A and B and flipping the order so it's a join of B and A, or logical-to-physical transformations, where we take the join of A and B and say we want to implement it with a hash join. So we have a bunch of these logical operators, and now we want to figure out what transformations get us to the physical operators we need. In the first case, we could do a sort-merge join using the output of the join on ARTIST and APPEARS and the output of a scan on the ALBUM table. I'm not showing the access path methods here, which are another transformation from a logical operator to a physical operator, but assume you'd have those as well. Then we can keep going down this tree.

Actually, why don't we pause the example here, since we're short on time and this walkthrough will take a few more minutes; I'll say a bit about Cascades first and then come back and finish the Volcano example. The main thing to understand when you read the Orca paper is that it's basically the same top-down approach I was about to show you here. Cascades makes a big deal about being object-oriented, because that was in vogue in the 90s, but there are other aspects that are more relevant. For example, you can delay expanding the relational algebra into the different equivalence classes until you actually need them. In Volcano, when you start off, you generate all possible logical operators that are equivalent to each other, which increases the search space, and some of them you may never actually need; in Cascades you can generate them on demand, so you prune down the number of things you have to evaluate by not expanding the entire search space all at once.

All right, back to the Volcano example. We were generating physical operators whose inputs come from logical operators, which in turn get expanded into other physical operators, and we have rules to do these transformations. We can also generate all the different combinations of physical operators, one doing a hash join, one doing a sort-merge join, and so on, and we can evaluate the costs of these different options as we go from the top down.

What also happens in Volcano is that there are enforcer rules in the search space that let us require the output of a physical operator to be in a certain form. In this case, we know we want the result ordered by artist ID, so if we have a physical operator that we know is a hash join, we can immediately prune it, because it would violate that physical-property requirement. Or, since we know the output needs to be sorted, we can add a quicksort physical operator and pair it with the hash join; but at that point we know the cost of the quicksort plus the hash join is greater than the cost of doing the straight sort-merge join, so we can cut that branch off and stop exploring those options.
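Here's a rough sketch of that enforcer idea with made-up numbers: the parent requires the output to be sorted on artist_id, so a candidate that doesn't naturally provide that order either gets pruned outright or gets an explicit sort added on top, and the result is then costed against the best satisfying plan. None of the numbers or names come from the paper:

```python
REQUIRED_ORDER = "artist_id"   # physical property the parent requires

# (operator, cost, sort order the operator naturally produces, or None)
JOIN_CANDIDATES = [
    ("SortMergeJoin", 80, "artist_id"),
    ("HashJoin",      60, None),
]
QUICKSORT_COST = 35            # cost of the sort enforcer added on top

best = (float("inf"), None)
for op, cost, provided_order in JOIN_CANDIDATES:
    if provided_order != REQUIRED_ORDER:
        # Property not satisfied: add an enforcer (an explicit sort) and keep costing.
        cost += QUICKSORT_COST
        op = f"QuickSort({op})"
    if cost >= best[0]:
        continue               # branch-and-bound: worse than what we already have
    best = (cost, op)

print(best)   # (80, 'SortMergeJoin'); QuickSort(HashJoin) at 95 gets pruned
```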
And this is just basic branch-and-bound search, which you can't do in the System R-style bottom-up approach. We'll discuss this more next class.

So the advantages of Volcano: just like in Starburst, you have declarative rules that let you generate these transformations, and it does a better job with the search itself, because you can start reducing the number of plans you have to look at once you know a branch can't beat the best plan you've seen so far. I mentioned this a minute ago, but the way Volcano works is that you generate all possible permutations of your logical operators up front; here I'm only showing three tables, or three join orderings, but there could be many more. What Cascades does instead is let you generate these lazily, on the fly, as you need them, so that you don't expand the search space all at once. The other thing that's hard to do in Volcano is rewriting predicates, which isn't something we've even talked about so far: if I have a WHERE clause, maybe I want to modify what it looks like to put it into a more optimal form. You can do that with heuristics, you can do it in the tree rewriter, but ideally you want to be able to do it all in the context of a single optimizer, and Cascades solves that problem.
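To picture that difference between expanding everything up front and expanding on demand, here's a toy example using the three tables from the running query; the "rule" is just join-order permutation, and nothing about this is Volcano's or Cascades' actual machinery:

```python
from itertools import permutations

TABLES = ("ARTIST", "APPEARS", "ALBUM")

def eager_alternatives(tables):
    """Volcano-style: materialize every equivalent join ordering up front."""
    return list(permutations(tables))

def lazy_alternatives(tables):
    """Cascades-style: a generator, so orderings are only produced when the
    search actually asks for the next one (and it can stop early after pruning)."""
    for ordering in permutations(tables):
        yield ordering

print(len(eager_alternatives(TABLES)), "orderings built up front")   # 6
lazy = lazy_alternatives(TABLES)
print("first ordering considered:", next(lazy))   # the rest are never built
```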
All right, so what are my parting thoughts? It should go without saying that query optimization is hard: not only do you need a good, efficient search algorithm, you also need a good cost model and a good cost estimator. This is part of the reason why none of the NoSQL systems supported SQL or did any of this when they first came out; whatever algorithm you wanted, you had to write it yourself. Think of things like Cassandra and MongoDB, although both of them are now trying to add a cost-based optimizer back in. Or think of Hadoop: when Hadoop first came out, you had to write all your MapReduce jobs to compute joins manually yourself, and now things like Presto and Hive sit on top of it, and those have a cost-based optimizer like System R's. So this is really important, it's hard to do right, and that's why you're starting to see some of these systems add it back after the fact.

All right, so for next class we're going to spend most of our time talking about Cascades, and you'll see how closely it's related to Volcano. The paper I assigned you to read is a newer paper from the Greenplum folks called Orca, which is a modern implementation of Cascades. They designed it to be extensible, so that you could run your query optimizer on one machine and have different distributed databases, or different databases in general, connect to it and take advantage of its resources; they talk about having HAWQ or Greenplum or whatever system you want connect to the Cascades optimizer, which does the search for them. I'll also talk a little bit about Columbia, an academic project also based on Cascades that came out of Portland State in the late 90s.

Any questions? Okay, awesome. All right, we're done. I'll see you on Tuesday next week. Thanks.