Today is our second lecture on query optimization. So again, the high-level idea of what we're trying to do here is: given a SQL query, we want to generate a physical execution plan for that query that is both correct, meaning it will generate the answer we expect, and has the lowest cost of all the query plans we could examine. Again, we said last class that this is super hard to do because the problem of finding the optimal query plan is proven to be NP-complete, and therefore, even though the thing we're building is called the optimizer, for complex queries we're never really going to find the true optimal plan. So last class, we looked over a bunch of different approaches to doing this search. We first talked about how to use heuristics, where you have if-then-else statements baked into the source code that look for certain query patterns and apply changes to put the query plan into a different form. Then we talked about how to combine the heuristics from the first approach with a cost-based search approach that came from System R at IBM, which allows us to find an optimal join ordering. The key idea, what makes the cost-model-based one different, is that the cost model is guiding our search. It's a way for us to approximate the cost of a query plan and then use that approximation to compare one plan against another. We then talked about how to do a randomized search with simulated annealing or the genetic algorithm from Postgres. As far as I know, Postgres is the only one using this third choice, and they only use it for queries with twelve or more tables; otherwise, by default, you get the second approach. Then we talked about the movement in the late 1980s and early 1990s towards optimizer generators, where instead of having imperative code in C or C++ to do all these steps for the query plan, we can instead write our optimizer rules in a high-level language and have a compiler generate the optimizer for us. Starburst from IBM was the first one in this space, and it used a stratified search, which is the heuristics plus the cost-based join search, while the unified search does the logical-to-logical and logical-to-physical transformations all in one. That's what was used in Cascades, from the paper you guys read. Again, just reiterating the last two points: the stratified search is where we have rules to do the logical query plan transformations. The way this works is that there isn't just a list of rules that we apply one after another; there's a rule engine that looks at the query plan, figures out which patterns match the rules we have defined, and then fires them off to make the change. In this first stage, it's entirely done without consulting a cost model, because there isn't a cost model yet. Then after this, we do the same cost-based search to map the logical plan to the physical plan, and again, that was the IBM technique. The idea of the unified search is that, rather than having these two separate stages, you have a single search procedure that can do both the logical-to-logical and logical-to-physical transformations. You no longer need separate stages because you're doing everything all at once.
What we'll see when we talk about Cascades is that this approach is going to generate a lot of redundant transformations, because in the stratified search, once I get past that first stage, I'm never going to reconsider those rules again, but in the unified search it's all combined together, so you may end up evaluating the same rule over and over. So you have to use memoization to recognize that you're doing a redundant transformation and use a cached result rather than applying it again. We'll cover this in more detail when we talk about Cascades. In addition to unified versus stratified, another design decision you have to make in your query optimizer is whether you want to do top-down or bottom-up optimization. Top-down will be Cascades, and that's where you start with the final outcome that you want (I want to join these three tables together) and then you traverse down into a search tree and start adding in the physical or logical operators you need to get to that end goal. You can use the classic branch-and-bound pruning technique to recognize that if a branch's current cost is greater than the lowest cost you've seen for the current best plan, you don't need to traverse down into it. The dynamic programming technique we saw from System R is an example of bottom-up optimization, where you start with nothing at the bottom and literally add in the operators you need for the query plan to get to that end goal. So the current literature suggests that the dynamic programming, bottom-up optimization is better. We'll talk a little bit today about why that's the case. In my opinion, from a software engineering standpoint, I feel that I'm able to reason about the unified approach and the top-down approach better, and this is what we implemented in our system. Again, there's the research literature, and then there's what actually gets implemented in real systems. You will see from the paper next week that SQL Server arguably has the best query optimizer, and they're using a top-down Cascades approach, even though the literature shows that the dynamic programming, bottom-up one is better. There's a lot more going on than just which direction you're searching in: the cost model matters a lot, and how you collect statistics and do the cardinality estimations all contribute to the overall efficacy and quality of the query plans you generate, not just which search strategy you use. All right. So for today's agenda, I want to spend some time at the beginning talking about how we do logical query optimization. The things we'll talk about here apply whether you're doing a stratified or unified search. Then we'll spend time talking about what Cascades is and what a Cascades optimizer looks like. We'll do this in the context of the paper I had you guys read, a master's thesis from the late 1990s on a query optimizer based on Cascades called Columbia. The reason I had you read that instead of the original Cascades paper is that the original Cascades paper is not very good. It doesn't really describe what they're actually doing.
They keep banging on about how great it is that they're object-oriented, because that was in vogue in the early 1990s, whereas those 30 pages from the Columbia master's thesis lay out, in my opinion, the best description of how Cascades actually works and how to implement it. We'll briefly talk about some things that dynamic programming can do that Cascades can't do very easily, at least in the modern implementation used by the HyPer guys in Germany. Then we'll talk about a bunch of other optimizer implementations that are out there. Again, these will be categorized by whether they're doing dynamic programming or Cascades; I just want you to be aware of what's out there. Then we'll finish up with a quick overview of what's expected, starting this Wednesday, for your project three code review submissions. We'll do two rounds of code reviews. I'll assign you to another group on Wednesday: you submit your code on GitHub, they do a code review of your code, and you do a code review of their code. I want to talk about what's expected of you as a student participating in this step of the project. Again, there are going to be two types of optimizations that we can do: logical optimization and physical optimization. Logical optimizations are transformations that we can apply to the logical query plan that we get from the binder, or further up in our front-end stack, where we apply rules based on matching certain patterns in the logical query plan. We make changes to the plan in such a way that we potentially set ourselves up so that when we do the plan enumeration in the physical optimization step, we can find the optimal plan more easily. It's not always going to be the case that the changes we make here guarantee that we'll find the optimal plan, but if we know a little bit about how our optimizer works and what our plan enumeration algorithm actually is, then we can apply these changes so that we prepare ourselves accordingly. An important thing to understand about this step, and how we distinguish it from physical query optimization, is that because there's no cost model here, we have no way to determine whether one query plan is better than another. So when we take a logical query plan and transform it into another logical query plan, the optimizer at that stage can't say, yes, that transformation was a good thing, it's going to help me generate a query plan with a lower cost. We don't know that. But since we know some things about relational databases, the kinds of workloads we're trying to run, what our query looks like, and how our system actually performs, we can direct our transformations so that we prefer one plan over another. At the end of the day, though, the important thing is that the new plan we generate after these transformations has to be equivalent to the original plan, because it doesn't help us that we found a faster query if we produce the incorrect result. People will get pissed, and then your database system is considered unreliable, or there's something wrong with the data, and that's hard to debug. So for any changes we make, we ensure that the query is equivalent.
So I want to go through four different examples of the types of logical query optimizations we can do. These actually come from Thomas Neumann, the HyPer guy in Germany. He teaches a whole class on query optimization and goes into much more detail over multiple lectures than I can in just four. Some of these examples come from his slides. We'll see how to split conjunctive predicates, do predicate pushdown, replace Cartesian products with joins (because at the very beginning all we have are Cartesian products), and then finish up with projection pushdown. For each of these, we're going to use the same example query from last class, where we do a three-way join between artist, appears, and album to find all the artists that appeared on my mixtape. At the very beginning, what we get out of the binder, which is essentially just a copy of what comes out of the SQL parser, has all our predicates combined together into a single filter clause. We have three predicate expressions in our WHERE clause, so we have a single filter expression in our query plan. Again, this is a logical query plan; it's not telling us how to execute anything. We have a single filter up here that has all three predicates ANDed together, and we apply that filter after we've taken the Cartesian product of all three tables. So, to make it easier to start doing predicate pushdown and other optimizations on the individual predicate expressions, we want to decompose the filter into multiple filters that each have one conjunct of the WHERE clause. We just split it on the ANDs and get it like this. ORs are a little trickier: you essentially have to start duplicating predicates across the different conjunctions. NOTs, negations, are a little tricky as well, but the high-level idea is essentially the same. All right. Now that we have these filters split up, we can start moving them separately into different parts of the query plan. This is how the classic predicate pushdown works. The idea is that we move each predicate to the lowest point we can in the query plan where we still have all the attributes the filter needs. So in this case, after we do the Cartesian product, we can apply artist.id = appears.artist_id. We can't do it earlier on either side because we don't yet have the two tuples combined together. Same thing for the join predicate above it. But for album.name = 'Andy's Remix', we can put that immediately after the scan on the album table, because there we have all the information we need.
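To make that split-on-AND step concrete, here's a minimal sketch in C++ of how a rewriter might decompose a conjunctive predicate tree into individual filter expressions. The Expr type and its node kinds are hypothetical stand-ins for illustration, not the actual classes from any real system:

```cpp
#include <memory>
#include <vector>

// Hypothetical predicate tree node; real systems carry much more info.
struct Expr {
  enum class Type { AND, OR, NOT, COMPARISON } type;
  std::vector<std::shared_ptr<Expr>> children;
};

// Recursively split a predicate on its top-level ANDs. Each element of
// the output can then become its own Filter operator in the logical
// plan. ORs and NOTs are left intact, matching the caveat above.
void SplitConjunction(const std::shared_ptr<Expr> &pred,
                      std::vector<std::shared_ptr<Expr>> *out) {
  if (pred->type == Expr::Type::AND) {
    for (const auto &child : pred->children) {
      SplitConjunction(child, out);
    }
  } else {
    out->push_back(pred);  // A single conjunct: emit it as one filter.
  }
}
```

Each conjunct that comes out of this can then be pushed down independently, which is exactly what the pushdown step relies on.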
Now we want to replace all the Cartesian products in the query plan with inner joins. And this is pretty simple, right? It's just identifying that we have a Cartesian product and, right above it, a filter that takes the output of the Cartesian product from the table on the left and the table on the right, combines them together, and does our equality match. So we can just replace those Cartesian products here and here with a join operator, where the join clause is the filter that used to sit right above it. And we can do this regardless of whether we're using the old-style SQL where the join predicate is in the WHERE clause or the more modern SQL syntax where you have the JOIN ... ON clause; the same idea works either way. The last optimization we can apply is projection pushdown. Now, we're being pedantic here about what our query plan looks like: in a column store, you could implicitly assume that I'm not scanning the entire tuple, that maybe I'm scanning just the columns I need, but you still need some information to say which columns you actually need to propagate up into the query plan. So in this case, we can recognize which attributes we actually need at different levels in the tree and introduce projection operators that only propagate up the attributes we need at each step. The way to think about this is that there are implicitly pipeline breakers here, here, and up above here, and at each pipeline breaker we introduce a projection to make sure we only propagate the data we need. The reason we don't do this at every possible point is that, if you're doing a push-based approach like HyPer, you take a single tuple and drive it as far up the query plan as you can, and only materialize it when you get to the pipeline breaker. Other systems doing Volcano-style processing go top-down, pulling, and the boundaries aren't exactly the same. Okay. And as I was saying, in this case we only need the album name and the album ID, so we could have a projection here just to filter down to album.id and album.name. But what will happen is that this projection actually gets combined with the access method here, with how we're accessing the tuples. So we don't have projections everywhere we could, because some of it is implicitly represented this way. All right. So again, these are all logical optimizations. I didn't look at a cost model; I just said, I know I want to do these things, and I can bake the transformation rules into my optimizer to do them. This one's the most obvious: if I have a Cartesian product and right above it a filter with a join clause, then I can easily write a rule that combines them, and I don't need to consult a cost model, because I know that in my database system an inner join is always going to be preferable to a Cartesian product. You never really want to execute Cartesian products unless you're explicitly told to do a cross join.
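Here's a hedged sketch of what that Filter-over-Cartesian-product rule might look like, reusing the hypothetical Expr type from the earlier sketch. A real rule would also verify that the predicate is an equality referencing both inputs before firing:

```cpp
#include <memory>
#include <vector>

struct Expr;  // Predicate tree from the earlier sketch.

// Hypothetical logical plan node, for illustration only.
struct PlanNode {
  enum class Kind { FILTER, CARTESIAN_PRODUCT, INNER_JOIN, OTHER } kind;
  std::vector<std::shared_ptr<PlanNode>> children;
  std::shared_ptr<Expr> predicate;  // Set on FILTER / INNER_JOIN nodes.
};

// If a Filter sits directly on top of a Cartesian product, fuse the
// two into an inner join whose join clause is the filter's predicate.
std::shared_ptr<PlanNode> FuseFilterIntoJoin(
    const std::shared_ptr<PlanNode> &node) {
  if (node->kind == PlanNode::Kind::FILTER && node->children.size() == 1 &&
      node->children[0]->kind == PlanNode::Kind::CARTESIAN_PRODUCT) {
    auto join = std::make_shared<PlanNode>();
    join->kind = PlanNode::Kind::INNER_JOIN;
    join->predicate = node->predicate;             // Filter becomes join clause.
    join->children = node->children[0]->children;  // Keep both inputs.
    return join;
  }
  return node;  // Pattern didn't match; leave the plan unchanged.
}
```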
All right. So now let's talk about how to do physical query optimization. This is where we have a logical query plan with logical operators, and we want to transform those into physical operators that we can then either codegen, if we're doing query compilation, or execute directly in our execution engine. So what kinds of transformations are we going to want to do? The most obvious ones are things like specifying how we're actually going to access the table or the data we're targeting: are we going to use a sequential scan or an index scan, and if an index scan, which index do we pick? We can add additional execution information about how we expect the data to be represented, sorted, or compressed. We choose operator implementations for the joins, like which join algorithm we want to use and how we actually want to do aggregation. And in some cases, which we won't talk about today, we identify when you actually want to materialize the data for one portion of the query so that another part of the query plan can access it, like if you're doing subqueries, views, or CTEs. So at this stage we now have to consult cost model estimates, because now we want to determine whether one physical plan is going to be better than another, and a cost model allows us to do this. Before we go a little deeper into Cascades and how we actually do these physical transformations, I want to add a caveat: everything we've talked about so far has been the simple case of always doing equijoins or inner joins. Things become trickier when you see more realistic queries with outer joins, semi-joins, or anti-joins, because now we're not always going to have symmetric transformations that let us move any operator around or reorder the query plan easily. There are cases where you're not allowed to do that, because it would end up with an incorrect result. So let's take an example. Say I'm again doing a three-way join on A, B, and C, but I do a left outer join on A and B and then a full outer join with C. What we're getting into now is how we explore the different join orderings available to us in a query plan: whether we join B and C first, or A and C first, and so forth, as we saw with System R. But in this particular example, there are actually no other join orderings available or possible, because any reordering would end up with an incorrect result. And this is because of the outer joins. In particular, the left outer join on A and B is not commutable with the full outer join with C, because I don't know what the value of B.val coming out of this join will be in order to join it with C. I have to execute this join first, because I don't know whether B.val is going to be a real value or NULL. Since it's a left outer join, if A.id doesn't match anything with B.id, then the B portion of the combined tuple will just be NULL. So B.val could be NULL here, but I don't know that until I actually do the join. So when we start doing these join reorderings during plan enumeration, looking at the different possible choices, we need to be mindful that there are cases where we can't reorder. I'm bringing this up now because I'm going to show examples where we assume that everything's commutative.
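As a toy illustration of the kind of guard an enumerator needs, here's a sketch of a per-join commutativity check. This is deliberately simplified; real systems also need associativity rules keyed on pairs of join types, which is what actually blocks the left outer / full outer example above:

```cpp
enum class JoinType { INNER, CROSS, LEFT_OUTER, FULL_OUTER, SEMI, ANTI };

// Can we swap the two inputs of a single join? Inner/cross joins and
// full outer joins are commutative; left outer, semi, and anti joins
// are not (swapping them changes which side's rows get NULL-padded or
// filtered out). An enumerator consults a check like this before
// firing a join-commute rule.
bool IsCommutative(JoinType type) {
  return type == JoinType::INNER || type == JoinType::CROSS ||
         type == JoinType::FULL_OUTER;
}
```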
We'll assume that everything can be flipped around, but just keep in the back of your mind that there's additional machinery we need to make sure we don't reorder things incorrectly. Dynamic programming makes this a little easier, because you have an explicit graph where you know what you can move around. In the case of Cascades, to the best of my knowledge, you need additional property enforcers to recognize that a reordering is invalid. So, at a high level, there are two different approaches to enumerating plans. There's the transformation approach, where you take an existing plan and start permuting it in different ways to generate new plans that are equivalent to each other. You still have to check that the transformation you're doing is valid. It's almost like a random walk, except that you have these rules that apply changes in a directed way. The other approach, as used in dynamic programming, is generative enumeration, where you take the parts of the query you want to execute as building blocks and start grafting them together to construct the query plan. Again, you have to be mindful not to add things incorrectly and produce an invalid query plan, but you start with nothing and keep adding pieces. I'm not saying one way is better than the other; these are just, at a high level, the two approaches we could pursue. The state of the art, at least academically, for dynamic programming is the method used in HyPer, where you do hypergraph-based enumeration: the generative approach where you add building blocks to the query one by one, checking as you go whether the change you want to make by adding a piece is valid or not. This is the approach used in HyPer, and it's the approach used in DuckDB. I don't know of any commercial system that uses it; the HyPer guys might know, but I don't know the answer. Our system, though, is based on Cascades, so I'm going to spend more time talking about that. As I said, there's an original Cascades paper from 1995, but I didn't have you guys read it because, like I said, I don't think it's very good. It's not that it's poorly written in terms of English grammar; I just don't think, from a pedagogical standpoint for this course, that it provides a clear explanation of what Cascades is actually doing. That's why I had you read the master's thesis, because in my opinion that's the best explanation of Cascades. So what is Cascades? Last class, we talked about Volcano, which was a top-down approach doing a branch-and-bound search to construct a query plan. Cascades is the third-generation optimizer that Goetz Graefe built, and it's the state-of-the-art approach as far as I'm aware. The big difference in Cascades, which we didn't have in Volcano, is that the optimizer generates the transformations on the fly as it needs them, rather than pre-computing all possible transformations at every level as we search down into the query plan.
So in the case of Volcano, what happened is that when I went from one group to the next, down one level in the tree, before I evaluated any of the transformations I could apply, I generated all possible transformations, which would explode the search space. And it might be the case that you got stuck in a local minimum and didn't explore the rest of the tree more carefully. The other cool thing we're going to be able to do in Cascades is simple rewriting of multi-expressions, or expressions within the query plan: we can have functions that just apply the rewrites without having to do an exhaustive search over all possible combinations of how to execute that multi-expression. That'll make more sense in a few more slides. The basic idea is that I can have a placeholder that says: here's some lower portion of the tree; I don't know how I'm going to implement it just yet, but I don't need to search it right now. I can search other parts of the tree and use that placeholder. So for Cascades, there are four main design ideas. First, all the optimizations are expressed as simple pattern/rule pairs: there's the pattern you want to match in the query plan, and if it matches, a rule fires to do some kind of transformation. Contrast this with the heuristic-based approach used in Postgres and other systems, where you have a bunch of if-then-else statements. Here it's always these rules: there's the pattern, and then the rule that does the transformation, and it works the same way for logical-to-logical and logical-to-physical transformations. The next thing is that we're going to have rules to enforce the property requirements on the data, or on the intermediate results produced by a given operator. These are things like the sort order, or whether we care about compression; we can enforce those properties, and the rule engine knows it can't apply a transformation if that transformation would end up violating a property it has to maintain. Contrast this with the System R approach, where the rules doing the enumeration, the search for the join ordering, had no notion of the physical properties of the data; it all had to be baked into the cost model. Now, in one place, I can define a rule that says my data has to be sorted this way, and make sure we don't do a transformation that violates that. The next thing is that we can define the priority or ordering of the rules that get applied based on what are called promises. The way to think about this is that rather than randomly picking which transformation to apply at a given level in the tree, we can specify an order, and we want to do this dynamically, so that we can identify aspects of our query plan and say: we think these transformations will lead us to the optimal plan more quickly, so apply them first, and only evaluate the other ones if we have more time and want to continue searching at that level in the tree.
And the last one that's important is that they treat predicates as first-class entities in the system, which allows us to do transformations on them just as we would for operators. This is how we're able to do the things I showed at the beginning, like predicate pushdown and projection pushdown. Predicate pushdown is a good example: even though these are predicates inside the WHERE clause that would normally be baked inside a select or scan operator, we can have the engine transform them and optimize them just as it would with logical and physical operators. And this is actually something we're trying to build now in our own system. We had a project in 721 last year where they built an expression rewriter using our Cascades engine, and the student that worked on it last year is working on it again this year to help finish porting it into the new system. We won't talk about this in depth today, but it allows you to do things like: if I have a WHERE clause that says WHERE 1 = 1, we can rewrite that to just TRUE instead of having to evaluate that expression tree for every single tuple (there's a sketch of this kind of rewrite below). Now, we compile with LLVM, so the LLVM compiler could easily wipe that away with constant folding, but these are the kinds of things we can now do inside the same Cascades framework that we normally use for optimizing query trees. All right, so the first thing we have to talk about to understand Cascades is what an expression is. This part's a bit confusing, because normally I've been talking about expressions in terms of the predicates in WHERE clauses, join clauses, or select outputs, and certainly in the code of our system we use "expression" to mean those expression trees, like 1 = 1. But in Cascades, an expression is declared to be a high-level operation, an operator in the query plan, that has zero or more input expressions. So say I have my query here: SELECT * joining A, B, and C together. I can have a logical expression that is the join of A and B followed by the join with C, and I can have a physical expression that is a sequential scan on A hash-joined with a sequential scan on B, then nested-loop-joined with an index scan on C. So an expression is essentially the relational algebra for the query, but if I annotate the operators with these subscripts, I'm specifying the physical operator I want to use to perform each relational operator. In this example, the two expressions are logically equivalent: even though one specifies physical operators and the other just logical operators, they're both computing the join of A, B, and C. So now we're going to rely on this commutativity property to enumerate the different join orderings as transformations, exploring the search space of this query plan to try to find the optimal one. And this is what I was saying before with the left outer join / full outer join example: in this case I can move these operators around any way I want because they're commutative, but if one of them were a left outer join or full outer join, then I would have to restrict the transformations to make sure I do things in the correct order. For simplicity, in our example here, we won't worry about that.
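Going back to that WHERE 1 = 1 example, here's a minimal sketch of a constant-folding rewrite over a hypothetical predicate tree. The types are made up for illustration; the point is just that the rewrite happens once at optimization time rather than once per tuple at runtime:

```cpp
#include <memory>

// Hypothetical predicate node, for illustration only.
struct Pred {
  enum class Kind { CONST_INT, TRUE_LIT, FALSE_LIT, EQUALS, OTHER } kind;
  long value = 0;                  // Used when kind == CONST_INT.
  std::shared_ptr<Pred> lhs, rhs;  // Used when kind == EQUALS.
};

// Fold "const = const" comparisons into TRUE/FALSE literals, e.g.
// WHERE 1 = 1 becomes WHERE TRUE, which downstream rules can then
// eliminate entirely.
std::shared_ptr<Pred> FoldConstants(std::shared_ptr<Pred> p) {
  if (p->kind == Pred::Kind::EQUALS) {
    p->lhs = FoldConstants(p->lhs);
    p->rhs = FoldConstants(p->rhs);
    if (p->lhs->kind == Pred::Kind::CONST_INT &&
        p->rhs->kind == Pred::Kind::CONST_INT) {
      auto lit = std::make_shared<Pred>();
      lit->kind = (p->lhs->value == p->rhs->value) ? Pred::Kind::TRUE_LIT
                                                   : Pred::Kind::FALSE_LIT;
      return lit;
    }
  }
  return p;
}
```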
All right, so now we have expressions, which are relational algebra expressions over operators. Next we have to define groups. A group is the set of logically equivalent logical and physical expressions that all produce the same output. For a given output expression, the group will contain all the logical forms, and all the physical expressions we can derive from those logical forms by converting each logical operator into its allowable physical operators. So what I mean by this is: this entire unit here is a group, and the group represents the output expression A join B join C. Inside it we have the logical and physical expressions that are all equivalent to that output expression. So the contents of the group are defined by its output, and I can't have another group with the same output; it has to be unique within my query plan. It's important to note that the expressions in here are unordered. This is just an illustration of what's contained inside the group; the fact that one is listed first doesn't mean anything about whether we should examine it first. That ordering comes from the priorities, the promises, computed when you apply your transformations. The other important thing to point out is that, as I was saying, in the case of Volcano, for this output expression they would have to generate all possible permutations of the logical expressions and, for each of those, all possible permutations of physical expressions. In Cascades, these are computed on demand, just in time, so that I don't have to materialize every possible logical expression and, for each of those, every possible physical expression. I can apply my transformation rule, it generates some equivalent logical expressions, and then I can decide whether to even explore them further and look at their physical expressions.
In the case of Volcano, since it did everything all at once, the amount of state you had to represent got really large, and the on-demand generation is how we reduce that state. But another optimization we can do in Cascades is to use what are called multi-expressions. The idea here is that instead of instantiating the fully verbose form of all possible expressions for a given group, we can represent a sub-expression with a placeholder that refers to some group lower in the query plan. An expression whose inputs are these group placeholders, rather than fully specified sub-expressions, is called a multi-expression. So in this case, before, I was saying A join B join C and explicitly spelling out that I'm joining A and B. But now, in the multi-expression, I just have a placeholder that says: A and B are being combined together, and I don't know how; somebody below me in the query plan will figure out how to do it. Same thing for this case here: I have a scan on C, but I'm not saying how to actually scan it, whereas before I was explicitly saying it's a sequential scan. It makes even more sense when you look at the physical multi-expressions: I'm not specifying how A and B get joined or how C is accessed, but I am specifying that I have a physical join operator at this level, say a sort-merge join. The reason we do this is that in a top-down search strategy like Cascades, we can then make decisions about whether we want to traverse lower down into the tree to figure out how we actually want to join these things together, without explicitly blowing it all out. A bottom-up optimizer, by contrast, enumerates the multi-expressions one group at a time, in the order the tables are generated: it starts with A and figures out how to access that, then figures out how to access B, then C, then figures out how to combine A and B together, and so forth. It's a different way of looking at it. As we'll see in the next slide or two, before we can make a decision about this physical operator here, we have to make sure that, for all the placeholder expressions below us in the tree, we've figured out how we're actually going to access them: what's the best plan with the lowest cost for the physical operators below us. Then we can combine that when we decide which operator to use at this level. When we go through the example, this will make more sense.
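To pin down the vocabulary, here's a rough sketch of what a group and a multi-expression might look like as data structures. The field names are illustrative guesses, not Columbia's or Cascades' actual layout:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

using GroupId = int32_t;

// A multi-expression: an operator whose inputs are references to
// groups (the placeholders), not fully specified sub-plans.
struct MultiExpression {
  int op_id;                      // Which logical or physical operator.
  bool is_physical;               // Logical vs. physical expression.
  std::vector<GroupId> children;  // e.g. { [A join B], [C] }.
};

// A group: all logically equivalent expressions producing the same
// output, plus the best (lowest-cost) physical winner found so far.
struct Group {
  std::vector<MultiExpression> logical_exprs;
  std::vector<MultiExpression> physical_exprs;
  double best_cost = std::numeric_limits<double>::infinity();
  int best_expr_idx = -1;  // Index into physical_exprs; -1 = no winner yet.
};
```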
The last thing we need to understand in Cascades is the rule system. A rule is some transformation we can apply to an expression that converts it into a logically equivalent expression, meaning it will produce the same logical result when we actually execute it. There are two types of rules. A transformation rule is logical-to-logical: it converts a logical plan, or some subset of the plan, into another logical plan. An implementation rule converts a logical operator, or a set of logical operators, into physical operators. And as I said, every rule is defined by a pattern that we match to identify a sub-plan: a portion of the plan with a specific layout or configuration that corresponds to our rule. If we find one, we apply the substitute, the transformation that converts that plan structure into a logically equivalent new plan. So let's look at an example. Say this is our pattern: we want to match a sub-plan that has an equijoin on top of another equijoin, and we have these placeholder groups where we're saying we don't know and don't care what those other groups are actually doing. We just care about having a left-deep tree with an equijoin followed by another equijoin. Now say this is a query that matches: I have a group here that combines A and B and joins with C, and below it another group that does the join between A and B. These are both equijoins, assuming you're doing a natural join. Below each join operator I have these GET expressions, which say, somehow, here's how I'm going to access tables A, B, and C, but we don't care what that actually is in our rule, because we only care about the two matching joins. So the first type of change we can do is a transformation rule, where we rotate the joins: in this case, we switch the join below the top join to now be on the right side of the tree, whereas originally it was on the left. An implementation rule would then convert the logical operators for these joins into physical operators that specify which algorithm we want to use; in this case, we convert the joins to use a sort-merge join. One thing we have to be mindful of in Cascades is making sure we don't get stuck in infinite loops. In the same way that I have a transformation rule that rotates the joins from left to right, there's another transformation rule that converts right back to left, so unless I'm careful, I could get stuck in a loop going left to right, right to left, back and forth over and over. So we need a way to recognize that we've already applied a transformation and therefore don't need to do it again, or to re-search what the query plan looks like after applying it. This is what the memo table, the memoization table, in Cascades does for us. In the original Cascades paper, I think they talk about using a graph data structure; in our implementation we use a hash table, and as far as I know the other open-source Cascades implementations also use hash tables. What it allows us to do is identify that we've generated a group we've visited before, and therefore we don't need to examine it again, because we already know what the query plan looks like after applying that transformation. So if we do that left-to-right rotation and then try to apply the transformation rule to go back right to left, we'd see that we've already been there; that result is already in our memo table, so we don't need to apply it.
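Putting these pieces together, here's a hedged sketch of what a rule's interface might look like, reusing the MultiExpression type from the previous sketch. The method names are my own guesses at the shape of the thing, not the real Cascades or Columbia API:

```cpp
#include <memory>
#include <vector>

// Shape of the sub-plan a rule applies to, e.g. an equijoin whose left
// child is another equijoin and whose other inputs are wildcard groups.
struct Pattern {
  int op_id = -1;  // Operator to match; ignored if this is a wildcard.
  bool wildcard = false;
  std::vector<std::shared_ptr<Pattern>> children;
};

class Rule {
 public:
  virtual ~Rule() = default;
  // The pattern the engine matches against the memo's multi-expressions.
  virtual const Pattern &GetMatchPattern() const = 0;
  // The "promise": how likely this rule is to lead to a good plan.
  // The engine fires high-promise rules first and defers the rest.
  virtual double Promise(const MultiExpression &expr) const = 0;
  // Extra validity checks, e.g. refuse to commute a left outer join.
  virtual bool Check(const MultiExpression &expr) const = 0;
  // Emit substitutes: equivalent logical expressions (transformation
  // rule) or physical implementations (implementation rule).
  virtual void Transform(const MultiExpression &expr,
                         std::vector<MultiExpression> *out) const = 0;
};
```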
The way we're going to do this is that equivalent operator trees and their corresponding groups are stored together in the memo table, and we just need a way to quickly hash them and identify duplicates. In addition to avoiding duplicate computations and duplicate query plan transformations, we can also use the memo table to quickly identify that we've gone down some part of the query plan far enough and don't need to traverse it any further. This is tied to the principle of optimality that Cascades is based on: Bellman's principle of optimality, from control theory. It's not specific to Cascades' top-down search, but it's how we're able to converge on an optimal query plan. All it really says is that if I have an optimal query plan comprised of two groups, then there can't be another, better physical plan for one of those groups, because if there were, this wouldn't be the optimal plan. It's almost a tautology. Look at it this way: I have the best plan I've ever seen, I'm at some point in my search tree, and I now have a cost that's greater than that best plan. If I traverse further down into my query plan tree, I'm not going to magically find a plan that beats the best one I've ever seen, because the way the search works, you only add more cost as you go down; you're never going to end up with less than what you already have. This is just the branch-and-bound search method over a solution tree: if we get to a point in our query plan where we have a sub-plan that's worse than the best complete plan we've ever seen, we don't need to do any further exploration on it.
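Here's a sketch of the duplicate detection just described, hashing multi-expressions so that re-deriving one we've already seen (like rotating a join back to where it started) is caught immediately. This reuses the MultiExpression and GroupId types from the earlier sketch and is, again, only illustrative:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

struct MultiExprHash {
  size_t operator()(const MultiExpression &e) const {
    size_t h = std::hash<int>()(e.op_id) ^ (e.is_physical ? 0x9e3779b9u : 0u);
    for (GroupId g : e.children) h = h * 31 + std::hash<GroupId>()(g);
    return h;
  }
};

struct MultiExprEq {
  bool operator()(const MultiExpression &a, const MultiExpression &b) const {
    return a.op_id == b.op_id && a.is_physical == b.is_physical &&
           a.children == b.children;
  }
};

class Memo {
 public:
  // Returns the group this expression belongs to, registering it under
  // a new group if we've never derived it before. Duplicate derivations
  // collapse here instead of being explored again.
  GroupId Insert(const MultiExpression &expr) {
    auto it = seen_.find(expr);
    if (it != seen_.end()) return it->second;  // Already derived: reuse.
    GroupId gid = next_gid_++;
    seen_.emplace(expr, gid);
    return gid;
  }

 private:
  std::unordered_map<MultiExpression, GroupId, MultiExprHash, MultiExprEq>
      seen_;
  GroupId next_gid_ = 0;
};
```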
Okay, so now let's walk through a full end-to-end example of a search. Say we want to do that three-way join between A, B, and C. At the very beginning, I have a group up here that says the output expression should be A join B join C. The first step is to fire off a transformation rule that generates a logical multi-expression: now we have A and B combined together, and then we join with C. Now I want to traverse down into the tree and figure out the cost of executing the two components of my expression. I'll start down here with A and B. Again, I get a new group, but the output is A join B, which matches my placeholder up here. Same thing: I want a logical expression to represent how to compute this output, which is just some way of accessing A joined with some way of accessing B. So now I start with the first component of my expression and go down to examine how I'm actually going to access it. My logical operator is just GET(A); I'm going to access A, but I'm not saying how. So I can apply logical-to-physical transformations to say I could do an index scan on A or a sequential scan on A. At this point, these are physical expressions with nobody below them in the query plan tree, so we can compute the cost of executing both of them. Let's say the sequential scan, for whatever reason, is the cheapest access method on A. We compute its cost and update our memo table; it's cut off on the slide, but this says that the best expression we've ever seen for producing the output of just A is the sequential scan on A, with a cost of 10. Now I do the same thing on the other side with B: I get the logical expression GET(B), compute all the physical expressions I could possibly have, sequential scan and index scan; the sequential scan turns out to be the cheapest, with a cost of 20; and I update my memo table to say the fastest way to access B is a sequential scan with a cost of 20. Now I come back up to this group, and I can keep generating more logical expressions to produce this output. Say I now want B join A, because I applied the rotation and changed the order. Again, I need the cost of accessing B and the cost of accessing A before I can compute the cost of converting this into a physical operator. But in this case, when I go down and ask what the cost of accessing B or A is, I've actually already done this. I don't need to do that search again or apply more transformations; I'm done, and I just reuse the costs I already have in my memo table. I could keep doing this (it's only A join B or B join A, there are only two choices), but now I can look at all possible combinations of physical operators for computing the join, and here I'm only showing a subset. Again, to get the total physical cost, I need the cost of accessing A and the cost of accessing B, and I already have those. So the cost of, say, this one here, doing a sort-merge join, is the cost of accessing A, which is 10, plus the cost of accessing B, which is 20, plus the cost of the sort-merge join itself, say 50. I write that into my memo table and say the cost of doing this is 80. Now I come back up to the top group, and I still need to examine the other side: accessing C. Same thing: I jump down here, I can do either a sequential scan or an index scan; let's say for whatever reason the index scan is faster, so I record its cost here as 5. And then I can either do more permutations of different join orderings of A, B, and C, or start converting the logical expressions into physical operators and examining the different ways to do the actual join, and then pick whichever one has the lowest cost based on the summation of what I've computed below me in the query plan. So that's essentially how Cascades works at a high level: we keep doing more and more permutations, use the memo table as much as possible, and invoke the cost model once we have enough information to estimate a physical cost.
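The walkthrough above maps to a recursive search loop something like the following sketch, with branch-and-bound pruning and memoized winners. OptimizeGroup, LocalCost, and GroupOf are hypothetical helpers layered on the Group type from earlier; a real implementation also interleaves rule application to generate expressions on demand and handles re-optimization when the cost limit changes, which this sketch glosses over:

```cpp
#include <cstddef>

double LocalCost(const MultiExpression &e);  // Cost of just this operator.
Group &GroupOf(GroupId id);                  // Group lookup in the memo.

// Find the cheapest physical plan for a group, never spending more
// than `limit` (the best complete cost seen so far on this path).
double OptimizeGroup(Group &g, double limit) {
  if (g.best_expr_idx >= 0) return g.best_cost;  // Memoized winner: reuse.
  for (size_t i = 0; i < g.physical_exprs.size(); i++) {
    const MultiExpression &e = g.physical_exprs[i];
    double cost = LocalCost(e);  // e.g. 50 for the sort-merge join above.
    for (GroupId child : e.children) {
      if (cost >= limit) break;  // Branch-and-bound: this can't win.
      // The child may spend at most what's left of our budget.
      cost += OptimizeGroup(GroupOf(child), limit - cost);
    }
    if (cost < limit && cost < g.best_cost) {  // New winner for the group.
      g.best_cost = cost;
      g.best_expr_idx = static_cast<int>(i);
      limit = cost;  // Tighten the bound for the remaining candidates.
    }
  }
  return g.best_cost;
}
```

In the example above, the call for the [A join B] group would return 80 (10 + 20 + 50), and that memoized winner gets reused when the top group considers its own alternatives.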
This also makes the search termination discussion from last time make more sense: you can see how, for really complex queries with a lot of joins, I could keep going really deep into my query plan and just keep doing this search forever. So you need a way to determine when you should stop. We talked about how wall-clock time just says I stop when I run out of time, after some configurable amount. I could have a cost threshold that says: if I haven't found a better plan, say one that's 10% better than the best one I've seen so far, then I should just stop, because I'm getting diminishing returns on continuing the search. And the last one is that if the search exhausts all the different options, then I can just stop. So, Cascades. As far as I know, there never was a standalone Cascades implementation; at least I don't think so. I believe Goetz Graefe worked with Microsoft in the 1990s to implement Cascades in SQL Server, and to my knowledge they're still based on that. But it's been used in a bunch of different systems as well. Most database systems, as far as I know, use the dynamic programming method, but some of the newer ones from the last 20 years or so use Cascades. Wisconsin's OPT++, like Columbia, was a sort of standalone optimizer generator: you define the rules, and it generates the optimizer for you. Pivotal's Orca and Apache Calcite are two interesting projects we'll talk about in a second; think of these as standalone services that do query optimization. You can have your database system running on one machine and your query optimization server running on another machine; you send query requests over to the optimizer service, it crunches on them and spits back an optimized plan that you then execute. These two are not tied to any specific database system. Among the ones that are actually integrated into a system, SQL Server is the famous one. NonStop SQL is the distributed, parallel database from the 1980s that's still widely used today in a lot of banks and financial systems, and they rewrote their query optimizer in the 1990s. Clustrix, which still exists, is a distributed MySQL-compatible database that was bought by the MariaDB people a couple of years ago, and according to their documentation they use a Cascades optimizer to replace the original MySQL optimizer. And then there's our system, Peloton, which we started a few years back and killed off two years ago, but since then 721 students have ported its optimizer over into the newer system that you guys are working on today, and it's all open source. All right, let's talk about Pivotal's Orca and Calcite, because like I said, these are standalone optimizer systems that aren't tied to any one database system. We did examine Orca in the very beginning days of Peloton to decide whether we wanted to use it. They didn't have really good documentation, and the way it worked was that you would run the service and send over XML files specifying the query plan, the rules, and the catalog statistics about what the data looks like. Like I said, it wasn't well documented; it's gotten much better since we last looked at it, but at the time it wasn't in a shape where we could actually use it. So Orca is a standalone Cascades implementation that was originally written for Greenplum. Greenplum was one of the original data warehouse systems
from the mid-2000s that came out alongside Vertica, DATAllegro, and Aster Data. What happened was that EMC bought Greenplum and VMware bought GemFire (SQLFire); EMC didn't want a database division and VMware didn't want a database division, so they both divested their database products and combined them into a single company called Pivotal. So Orca was originally written for Greenplum, but then they broke it out to be a standalone system so they could also use it for their SQL-on-Hadoop engine called HAWQ. Again, it's agnostic to the actual data model and the actual database system: you specify through these XML files what rules you have and how it should form plans, and then it's up to your database system, when you get back the response saying here's the optimized plan, to be able to execute it. One interesting thing about Orca that we haven't talked about yet is that, because it's a standalone service, you can imagine having a cluster of database machines plus one machine dedicated as the query optimizer machine, which means that machine can use all the cores and resources available to it to do query optimization. So they can actually do multi-threaded search. How would that look? Well, whether you're doing dynamic programming or the Cascades approach, it's easy to imagine that there are different points in the query tree where I could fork off or spawn different threads to explore them. They're all using the cost model and doing their own transformations; you still have to update a global memo table to avoid redundant calculations, or to record the best plan seen so far, but for the most part the searches themselves are independent. So Orca is one of the few optimizers I know of that is multi-threaded, which I think is kind of cool. Our optimizer right now can optimize different queries on different threads, but we can't take one query and run the search for it on multiple threads. There are two interesting engineering things in the Orca paper that I find really fascinating and that I haven't really seen discussed in other descriptions of optimizers. The first is that in Orca, they've set things up to enable remote debugging more easily. Greenplum is designed as on-premises software, not a hosted service in the cloud: if you're a corporation, you download the Greenplum software, with Orca, and run it on your own machines in your own data center. Now the problem is that if the optimizer breaks or throws an error, how do you actually debug that if you're an Orca developer back at Pivotal? They're not going to let you SSH into the machine and rerun the query that failed, and getting a dump of the data to run locally would be way too big. So what they did was set it up so that any time there's an error, the system dumps out its state in such a way that the developer can take that state dump, load it back locally, and walk through exactly what the state of the optimizer was when that query hit the error,
so that they can easily reproduce the problem and fix it. Think of this as a core dump for your optimizer, with additional information like what transformations were applied and what was in the memo table. It's more than a core dump; it's customized specifically for query optimization, to make their lives easier. Again, I have not seen this elsewhere. Maybe the commercial guys have it, but I've not seen it discussed for other optimizers. The other cool thing they do in Orca, which I find super interesting, is a testing framework that lets them easily check that the cost model and the optimizer are truly picking the best query plan, the one that has the lowest cost. They have this thing called TAQO, Testing the Accuracy of the Query Optimizer. They claim it's open source; I haven't been able to find it on GitHub, so I need to email them. The way it works is: I take my query and put it into the optimizer, and it spits out what it thinks is the best plan, but I can also have it spit out what it thinks is the worst plan. Now I can take the worst plan and the best plan, actually execute them on my database system, and see whether the best one truly is faster than the worst one. In practice, you don't want those two extremes: the worst plan would probably be so slow, a bunch of Cartesian products that take forever to run, that the result is obvious. So you want to narrow the gap and look at how many steps away a less optimal plan is from the optimal one: take the best plan and the second-best plan, run them both, and prove that the best plan truly is the better plan. This is something we want to do in our own system. We have the infrastructure in the optimizer to potentially do it; we just haven't built out the testing framework to make it happen. So Orca is super cool. Orca was one of the first standalone query optimizer frameworks, at least in the modern era. OPT++ and Columbia were around in the '90s, but as far as I know, while their code is online, it's dated, nobody actually uses it anymore, and it's not maintained, whereas Orca is actively maintained. The other optimizer framework, which I think is actually more popular than Orca, at least from talking to database companies, is this thing called Calcite. Again, this is a standalone query optimization framework where you can specify the data model you want to use, the query language you're going to support, the cost models, and all the transformations you want to apply, and it can do all of this for you. It originally came out of LucidDB. LucidDB was a Java-based database system startup from the late 2000s that failed, as many database startups actually fail; there are a lot of them. But what came out of the ashes of LucidDB was its query optimizer, which then got open-sourced as Calcite. Everything is written in Java, and there are quite a number of database projects, open source and closed source, that are using Calcite or actively pursuing its adoption. Some examples are OmniSci, HerdDB, BlazingSQL, Apache Hive, and Splice Machine, and I know of maybe two others that have talked about how they
So Orca is super cool. It was one of the first standalone query optimizer frameworks, at least in the modern era; OPT++ and Columbia were around in the '90s, and as far as I know the code is online, but it's dated, nobody actually uses it anymore, and it's not maintained, whereas Orca is actively maintained.

The other optimizer framework, which I think is actually more popular than Orca, at least from talking to database companies, is this thing called Calcite. Again, it's a standalone query optimization framework where you can specify the data model you want to use, the query language you're going to support, your cost models, and all the transformations you want to apply, and it can do all of this for you. It originally came out of LucidDB, a Java-based database startup from the late 2000s that failed, as many database startups do; there are a lot of them. But what came out of the ashes of LucidDB was the query optimizer, which then got open-sourced as Calcite. Everything is written in Java, and quite a number of database projects, open source and closed source, are actively using Calcite or pursuing its adoption. Some examples are OmniSci, HerdDB, BlazingSQL, Apache Hive, and Splice Machine, and I know of maybe two others that have said they want to start using it. Now, they all use it to different degrees. Calcite also provides a SQL parser, so some of them use it to parse SQL (Calcite has its own dialect), and some of them use it to actually do the query optimization. I have not seen reports on how good it actually is, but there seems to be a lot of traction around it because it's an Apache project, whereas with Orca, for whatever reason, maybe because Pivotal still owns it, as far as I know only Pivotal is using it even though it's open source. So I think Calcite is the one you'll see a lot more of in systems going forward.

The last optimizer implementation I'm going to talk about is from MemSQL. It's not anything mind-blowing; the reason I'm bringing it up is that they do one thing slightly differently from the other optimizers we've talked about, probably because they're targeting a distributed database environment, which we haven't really focused on this semester, but it's worth bringing up to understand the kinds of things people can do in their query optimizers. It has three stages, like the stratified approach. First there's a rewriter that does logical-to-logical transformations, but these logical-to-logical transformations can actually use a cost model. Then there's an enumerator that does the logical-to-physical transformations, mostly using the dynamic programming approach for join reordering. But in the last stage, they take the physical plan that comes out of the enumerator and convert it back into SQL, annotated with additional MemSQL-specific commands for moving data, things that plain SQL can't express. Again, SQL is high-level and doesn't say where the data actually is, but now they add commands that say, in effect, "run this query, but send the data to this particular machine in your cluster." This part is super interesting, and I've never heard of it from an optimizer standpoint. I've heard of it for testing, where people take the physical plan produced by the optimizer, reverse it back into SQL, run that through the optimizer again, and check whether it generates the same plan or a logically equivalent plan. But I've never heard of converting back to SQL and then actually executing it and parsing it all over again.

So this is their pipeline for the query optimizer, and it looks a lot like what we've talked about so far: you parse the query, you bind it, and then there's the rewrite and enumeration phase. But then you take the physical plan, this planner component converts it to SQL, and you send that to all the individual execution nodes, which then do all these same steps all over again, except for that last one. The idea is that each local node that has the data gets to make its own decisions about the lowest levels of query optimization, like access paths.
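To illustrate the shape of this (and only the shape: the BROADCAST hint syntax below is made up for this sketch and is not actual MemSQL syntax), here is roughly what emitting a per-node SQL fragment with a data-movement annotation might look like:

```python
def plan_to_annotated_sql(fragment_sql, small_table):
    """Wrap a per-node SQL fragment with a (made-up) data-movement hint.

    A real system would derive both pieces from the physical plan; this
    just shows the shape of what the global planner hands to each node.
    """
    # Global decision: ship the small side of the join to every node, then
    # let each node re-optimize the fragment against its local statistics.
    return f"/*+ BROADCAST({small_table}) */ {fragment_sql}"

fragment = ("SELECT o.id, c.name "
            "FROM orders AS o JOIN customers AS c ON o.cust_id = c.id "
            "WHERE o.total > 100")
print(plan_to_annotated_sql(fragment, "customers"))
```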
The way to think about this: say I want to do a scan on a table spread across five nodes, and on some nodes the data has a distribution that's different from the others. Rather than having a global planner try to decide the best way to execute this query on all five nodes, I come up with a high-level global plan for how to move data around in that first phase, and then I send the SQL query that I regenerated from the physical plan to each local node, which can now make a better decision about how to execute that plan, because it has a more accurate view of what its local data looks like. Again, most distributed databases, as far as I know, don't do this last step; they take the physical plan and ship it off to all the individual nodes. I think this is kind of interesting.

Okay, so just to finish up: the main takeaway you should get out of all this is that we looked at the dynamic programming approach and the Cascades approach, which are just two ways to go about searching for a physical plan. All of this, though, is highly dependent on having a good cost model. If you don't have a good cost model, it doesn't matter whether you go top-down or bottom-up: you're going to make bad decisions, and the plans are going to be terrible. We'll see next Wednesday just how bad these cost models can actually get, in particular when we do cardinality estimation with joins. There's no magic way to make this work, and when I say SQL Server has the best query optimizer, the research shows it actually picks the best plans, and that might be because they have the best cost model and the best estimates.

All right, so let's quickly go through what's expected of you for the project 3 code review. As I said, this Wednesday, as part of the project updates, every group has to send a pull request to the master branch on GitHub. This will kick off a build on Travis; if that fails, you can ignore it. But it will also kick off a build on Jenkins, and that one we do need to pay attention to, and it will report your code coverage as well. For this, your PR needs to merge cleanly into the master branch, which means you'll want to rebase to get all of our latest changes; depending on when you forked the code from us, this shouldn't be that many changes. Then you'll go to the Google spreadsheet, which I'll announce on Piazza, and put the link to your pull request there. I'll assign you to another group, I'll give you access on GitHub, and you'll go do a code review on GitHub against the other team's pull request; I'll keep track of who's participating. The idea is that when the code review is done, a week from now, the team you reviewed will take your suggestions and comments and actually incorporate them as changes into their code, and we'll do another round of code reviews when we get near the end of the semester.

Above all else, your code reviews should be helpful, and you want to be courteous; you don't want to be overly disparaging or harsh to your fellow students. Don't say things like "this is the worst code I've ever seen in my life"; that doesn't help anybody. You want to give constructive criticism about their project.
Here are some general tips on how to proceed with the code review. For the group that submits the pull request: include a summary with information about which files, functions, and methods you want the reviewing team to look at. It doesn't help if you have a giant pull request full of changes where some of them are works in progress that you don't actually want feedback on yet. So in your summary, provide the list of files or functions you want the other team to review for you.

Now, when you do the review, try to limit yourself to looking at maybe 400 lines of code at a time, and set a maximum of about 60 minutes for any one sitting; after 60 minutes it all becomes a blur, and you're either leaving stupid comments or missing obvious mistakes. And when you go into the code review, have a checklist at your side of what you're actually looking for and how you can provide help. Here are some things I suggest you look into or be mindful of when doing a code review.

Above all else, obviously: does the code work? It should compile and pass any tests they have for their code in the pull request. Is all the code easily understandable? It shouldn't be arcane assembly-style code or weird intrinsics that aren't obvious. Make sure people don't have giant functions that do a bunch of things; things should be broken up into nice composable chunks. Avoid duplicate code and avoid commented-out code. Is the code as modular as possible? There should be no global variables; I don't think our scripts check for that, so as a human you have to check for it. For debug output, make sure they're using the debug functions we give you, not printf's and std::cout's; the scripts will check that when you submit the pull request on GitHub, but in general make sure they're doing these things correctly. Everything should be documented: the comments in the code should describe the true intent of the code, and all the functions should follow the Doxygen/Javadoc format. There's a check for this in Jenkins, so make sure every function is actually documented. If anything weird is happening in the code that's not trivial to understand just by reading it, make sure the expected behavior of that component of the system is fully documented. If they're using third-party libraries, make sure it's understandable why that library is being used. And any code that's incomplete should have TODO statements that are labeled correctly to say what still has to be done.

Testing is super important. The overarching rule for your projects is that code coverage never goes down, and your tests should actually be meaningful. That means you can't just have your test run through some code, print to standard out, and call that checking whether it did something correctly; it should do rigorous checking to make sure the functionality and behavior are operating correctly. Are the tests complete and comprehensive? Are they actually testing what they're meant to be testing? And we want to minimize the amount of hard-coded answers in our tests. Not all of our tests work this way, but if there's a way to derive the correct expected output without hard-coding a constant value, that's preferable, so that if we tweak your test case we don't get a bunch of erroneous errors or have to spend time figuring out what the new constant value should be.
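As a tiny illustration of that "derive, don't hard-code" style, here is a hypothetical test where the expected answer is computed from the input data, so tweaking the data doesn't break the assertion. The run_sum_query function is a stand-in for whatever call into the system you're actually testing.

```python
import unittest

def run_sum_query(rows):
    """Stand-in for the real engine call; swap in the system under test."""
    return sum(value for _, value in rows)

class AggregationTest(unittest.TestCase):
    def test_sum_is_derived_not_hard_coded(self):
        rows = [(i, i * 10) for i in range(100)]    # tweakable input data
        expected = sum(value for _, value in rows)  # derived from the input
        # If someone changes `rows`, the expected value updates with it;
        # there is no stale hard-coded constant to chase down.
        self.assertEqual(run_sum_query(rows), expected)

if __name__ == "__main__":
    unittest.main()
```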
Okay, I'll post these updates and this information on Piazza, along with what's expected of you, and I'll discuss it on Wednesday this week at the project updates. For next class, we'll have a new lecture on what I call non-traditional query optimization methods. The idea is that now that we understand the background, dynamic programming and Cascades, the core fundamentals of query optimization, we can look at different ways people extend these techniques or try additional things to improve the efficacy of a query optimizer. All right, guys, wash your hands. See you Wednesday.