So this is the second lecture we're going to do on query optimizers. Last class we spent our time talking about the history of query optimizers, and we started getting into Cascades when we ran out of time, so we'll focus most of our time today on that part. Before we get into the material: we have a couple of guest talks coming up. We have one today immediately after class, Ippokratis Pandis. He graduated from CMU around 2010; I started grad school in 2007, so he and I overlapped. He's a CMU alum, he did his PhD here with Anastasia Ailamaki, who used to teach 721 before she left to go to Switzerland. He's an engineer working on Amazon Redshift, that's their cloud system based on ParAccel, which is based on Postgres. So they're going to talk about some of the stuff they're building there. Then next week we're going to have two talks on SAP HANA. Neil will be coming down from Waterloo, because SAP bought Sybase and Sybase had a big office in Waterloo in Canada. He's coming down to give a guest lecture in class, and then he'll be giving another, more research-y tech talk on Thursday. So Wednesday he'll be in class, and then Thursday will be a separate thing at 12 o'clock over in the CIC. Again, I encourage you to come to both of these. The great thing about these talks is that I tell you "this is how real systems do this" or "these are the things you've got to worry about," and then these guys come and validate what I say. So it makes me feel like I'm actually telling you useful things.

As I said, today the focus is Cascades, or Columbia, which was the paper you read for last class, but we didn't have enough time to get into it. Then we'll briefly talk about the plan enumeration problem, which is what the paper you read from the Germans was trying to solve. Then we'll finish off talking about some other implementations of query optimizers that are out there.

Okay. So just to refresh where we were from last class: I laid out five different ways to implement the search strategy in a query optimizer. The first approach was just using heuristics, which is doing the rewrite rules without a cost model. I know that I always want to push down my predicates and my limits, so I write rules to do that. Then we talked about how to do heuristics plus cost-based search from IBM System R, where you do the same rewriting in the first phase, but then you do a search to get the best join order for your particular query. In the System R approach, we looked at how to do this with a bottom-up search using dynamic programming. Then we saw a slightly different way of doing the join-ordering search using a randomized algorithm, where at every step you permute the query plan, jump to a new point in the solution space, and see whether that generates a better plan for you. The last two approaches, the ones more relevant to modern systems and to us today, are stratified search and unified search. Stratified search is essentially heuristics plus cost-based search, but you have declarative rules that define what the transformations are actually going to be, whereas in the earlier approach you're just writing if-then-else statements to apply the changes.
Then unified search is the Cascades model you read about, where the transformations from logical to logical and from logical to physical operators in your query plan are all done within a single search model. You don't have separate stages.

Where the Cascades work and the Starburst work came out of, in the late 1980s and early 1990s, was this movement toward optimizer generators. Rather than implementing the rules for how to transform your queries in C or C++ or Java, you would instead write something in a higher-level language, a DSL, to say: here are the transformations or optimizations I can do. You then feed that into a compilation engine, which crunches on it and spits out the C++ or Java code that will do those transformations. The idea is that it's easier for humans to reason about these algebraic transformations on query plans in a higher-level language than in something low-level like C or C++, right? The examples here are Starburst, Exodus, Volcano, and Cascades, and then OPT++ from Wisconsin, which we'll talk about a little later on. I don't know whether this is how all commercial systems do it. Certainly in our own Cascades implementation we don't rely on an optimizer generator; we just bake the rules and transformations directly into C++. Starburst is IBM's, so I know they're doing this, but I don't know about Oracle and I don't know about SQL Server.

So again, there are two classes of algorithms we could use in our optimizer: stratified versus unified. Stratified, as we already said, means you have a rewrite phase that does the logical-to-logical transformations, and then a second step that does the cost-based search to go from a logical to a physical plan. Logical would be "join tables A and B"; physical would be "join tables A and B with the hash join algorithm." That second step is primarily spent on join ordering. The one we care about in this lecture is unified search, where all our transformation rules are put uniformly into a single search model. We can define priorities for how we want to do our transformations, but there are no separate stages; everything happens all at once. Now, the tricky thing about this, which we'll talk about a bit more next class, is that since this is still cost-based search, I need to be able to determine whether one logical plan is better than another logical plan, even though I don't have the actual physical plan or physical operators that my cost model would need. This gets into the part of the Cascades stuff I don't fully understand: how can you do cost-model estimation for logical query plans even though you don't know how you're actually going to read the table or how you're actually going to do the join? You may think, all right, if I have to read a table, maybe I'll just assume a sequential scan. The problem with that is: how do you determine whether the sequential scan is going to be better than an index scan without actually knowing whether you're doing an index scan or not?
So this is the stuff I don't fully understand about how Cascades fits all of this together. It'll make more sense in a second, and we'll see more of it next lecture. Since we're doing a lot more transformations in this approach than in stratified search — stratified only goes from logical to logical, here we're doing logical-to-logical and logical-to-physical — we could get stuck in an infinite loop, which we'll see in our example in a few more slides. To avoid transforming the query plan one way, then transforming it back the other way, and going back and forth forever, they make heavy use of a memoization table, or memo table, to keep track of which transformations they've already applied and what the cost of each one was, so they can avoid redundant computation. Again, this cuts down the number of transformations we have to do.

The other main difference is top-down versus bottom-up. Cascades is an example of a top-down approach. That's where you start with what you want the outcome to be — I want to join tables A, B, and C — and then you traverse down into the query plan and start transforming the constructs of that higher-level expression: how you're going to join A and B, or how you're going to join A and C. You build the query plan by going down and filling in the missing pieces. Bottom-up optimization is where you start with nothing, incrementally add in the elements of your query plan, march toward your goal, and figure out which path had the lowest cost to get you there. We'll talk about this a bit later, but honestly, I don't know which one is actually better. I like the Cascades top-down approach because as a human I can reason about it. One set of Germans says this approach is better, don't do the other one. Then another set of Germans says no, the other one is better. And then the first set of Germans says actually, ours is better again. The guy that invented this one says this one's better. So I don't know. I'm sorry. Okay. This is research. All right, let's get into Cascades.

So again, Cascades was invented by Goetz Graefe, and he had two other optimizer projects before this: Volcano, and before that Exodus. At a high level Cascades is the same thing as Volcano, except it was written in the 1990s, so they make heavy use of how great it is to be object-oriented, because that was in vogue at the time. But there's a bunch of other stuff that goes a bit deeper than that, real improvements over Volcano. The main one is that they materialize the transformations for the operators in your query plan on the fly, as you examine them — as you enumerate the plan, as it's called — rather than saying "all right, I need to look at this operator, let me materialize every possible transformation, and then I'll figure out which ones I actually want to examine and maybe traverse down into the tree."
So at a high level, the way to think about this is that the rewriting is essentially a direct mapping function that lets you go from one operator to the next for the simple cases, rather than doing an exhaustive search every single time, and that lets them be more efficient. At a high level, the main points we want to focus on are the following. Part of the reason I had you read the Columbia paper — a master's thesis from the 1990s — rather than the Cascades paper is that the Cascades paper is not very good. It keeps banging on about how it's object-oriented and isn't that great, without really getting into the details of what's actually going on. Those 30 pages from the Columbia thesis are, in my opinion, the best description of what Cascades actually does. It's not an exact implementation — they deviate on some minor things — but at a high level, if you understand Columbia then I'd say you understand Cascades, at least the high-level constructs.

All right, so the first point is that all our optimization tasks, our rules, are implemented essentially as structs or data structures, defined by the pattern we want to match and the rule we fire — the modification we make. Think of this as how you'd write it in a declarative language: match this pattern, apply this rule, rather than if-then-else code. Second, we can use rules to enforce the physical properties of the data that we want. Say we know this output needs to be sorted on some key: we declare that on one of the groups we're going to create, and the rules machinery makes sure it gets enforced and stays valid as we do our transformations. Third, we can prioritize the ordering of moves through a promise mechanism. Basically, you define priorities for your transformations, and then as you traverse the tree you can reshuffle those priorities based on what you start seeing as you go down. The way to think about it: as I make a move, as I traverse the tree, I can check whether my priorities for transformations have changed, because from where I am higher up in the tree I can infer something about what I'll see down different branches, and maybe want to go down one branch versus another. I'm being very hand-wavy here; it'll make more sense when I show an example. And the last one is that predicates are treated as first-class entities in the system, like logical and physical operators, so we can optimize and transform them just as we would relational algebra operators. This is what Eric Newton and William are doing for their class project: they're using our Cascades optimizer framework to do the query rewriting stuff, because at a high level it works the same way.

All right, so let's understand now, at the lowest level, what Cascades is actually doing. The first thing we need to understand is how they represent the operators within the query plan. They use the term expression to mean an operator with zero or more input expressions.
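Before we get into expressions, here's a rough sketch of what one of these rule objects could look like if you bake it directly into C++ the way we do. All the names here (Rule, Expr, OpType, and so on) are hypothetical — they're not from Cascades, Columbia, or any real system; the point is just the pattern/substitute split and the promise hook described above.

```cpp
// Hypothetical sketch of a Cascades-style rule declared as an object.
#include <memory>
#include <vector>

enum class OpType { AnyGroup, LogicalJoin, LogicalGet, PhysicalHashJoin };

struct Expr {
  OpType op;
  std::vector<std::shared_ptr<Expr>> children;  // input expressions or group placeholders
};

struct Rule {
  virtual ~Rule() = default;
  // Structure we look for in the plan (e.g., a join sitting on top of a join).
  virtual std::shared_ptr<Expr> Pattern() const = 0;
  // Cheap check on a matched expression before we bother firing the rule.
  virtual bool Check(const Expr &matched) const = 0;
  // Produce the logically equivalent substitute expression(s).
  virtual std::vector<std::shared_ptr<Expr>> Transform(const Expr &matched) const = 0;
  // Promise: a priority hint the search uses to decide which rules to try first.
  virtual int Promise() const { return 1; }
};
```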
I realize that for the entire semester I've been using the term expression to mean a piece of a predicate, like a WHERE clause, like A = 1. In Cascades it's a bigger thing — it's something like joining A with B, right? So for the particular query we'll use as a running example: SELECT * FROM A JOIN B ON A.id = B.id JOIN C ON C.id = A.id. A logical expression to represent this join operation could be A join B, followed by a join with C, right? And as we'll see, we're going to enumerate different join orderings, because we can rely on the associativity and commutativity properties of relational algebra to say B join A join C, or C join B join A — we can shuffle this around. Then we also have physical expressions, which have to be logically equivalent to the logical expression they came from; from the logical expression we derive the physical expression that actually implements or performs what that expression wants to do. On the slide we use subscripts to say how we're actually going to execute each component of the expression: A with a little subscript that says "seq" for a sequential scan, and "HJ" for hash join. So this is a sequential scan on A and a sequential scan on B, hash-joined together, and then a nested loop join with a sequential scan on C. Or if we have indexes, we can say which index we're going to look up on for our access methods, and swap in the different join algorithms we want to use.

So now we have logical expressions and physical expressions, and we're going to combine them into what are called groups. A group is a set of logically equivalent expressions defined by the output they produce. This right here is a group: its output is the join of A, B, and C, and it holds the equivalent logical expressions and the equivalent physical expressions, and all of them are equivalent. The logical expressions, in this case, are all the possible join-ordering permutations that produce our target output. As I was saying on the last slide, for this particular query that's A join B join C, B join C join A, and so forth — the three dots stand in for the rest; exhaustively, these are all of them. Then for each logical expression I can have one or more physical expressions that are derived from it; they all produce the correct output. So this one here is a sequential scan on A, nested-loop joined with a sequential scan on B, and then a nested loop join with a sequential scan on C. And again I can have an exhaustive list of everything here, right? So this obviously gets really huge for a single group. I'm only joining three tables here, and for a single logical expression I can have every possible index I could look up on, every possible join algorithm, sequential scan or not. This thing can get quite big. So we're going to need a way to cut down on how much state we have to store for our groups.
And we also want to cut down the number of groups, or the number of these expressions we have to examine, right? Because as we're doing our traversal, in order to understand whether an expression is going to be better than the best expression we've seen so far for this query, I have to go to the cost model and ask it for the estimated cost of executing that expression. That's expensive, and I want to avoid doing it when I can. So the way they cut down the representation of the search space is to combine things into what are called multi-expressions. The idea is that a multi-expression represents a portion of an expression where the actual implementation — how it's going to be executed or performed, whether logical or physical — is defined lower in the tree, and I don't need to define it at my particular group. So my output is still A join B join C, but now my logical expression, instead of saying anything about how I'm joining A and B, just says: I know there's something below me in the tree that will tell me how to form the logical expression A join B. And likewise for C: somewhere below me in the tree, something will tell me how to actually access C. Taking the example A join B join C, when I convert that to a physical multi-expression, I still have [A B] and [C], because those brackets say "below me in the tree it'll tell you how to do this join from a physical operator standpoint, and below me in the tree it'll tell you how to access C" — but now I am actually defining the join algorithm I'm using at this level. The bracket is just a placeholder, a virtual reference to something below in the tree; the decision about how to actually perform that piece gets made later on. At the point where I'm looking at this group in my search tree, I don't need to know anything about how that lower piece gets done, right?

The idea here is essentially to eliminate the fan-out. This is the example I was giving before: because we're going top-down, we can infer something about the priorities for traversing the tree and doing our transformations based on what we already know from where we are. In this particular case, I could either go down to figure out how to join A and B, or go down to figure out how to scan C. If there's something I know about my query and the data I'm accessing, I may choose to go down one path versus the other. Maybe I know C has no indexes, so it's always going to be a sequential scan — I won't go down there for now, I'll go the other way. Or there's some aspect of the predicate that tells me A join B, no matter how I join it, is always going to be an empty set — so don't traverse that way, traverse the other way, because I'll get more information going down the other path. You can't do that in a bottom-up approach, because you're starting from nothing and you can't peek ahead at what the query plan is going to look like at the top. Well, I suppose you could, but I don't think anybody implements it that way. So, is this multi-expression stuff clear? All right — again, they're just placeholders where we can assume the piece below is completely optimized and we'll figure out how to deal with it later.
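Here's a tiny sketch of what these structures might look like in code, under the assumption that a multi-expression's children are just group IDs rather than concrete subtrees. The names (Group, MultiExpr, Memo) are mine, not from Columbia or Cascades, but the shape is the same idea: a group collects logically equivalent forms, and the children of each multi-expression are placeholders that point at other groups.

```cpp
// Hypothetical sketch of groups and multi-expressions.
#include <limits>
#include <string>
#include <vector>

using GroupId = int;

struct MultiExpr {
  std::string op;                 // e.g. "Join", "HashJoin", "SeqScan(A)"
  bool physical = false;          // logical vs. physical multi-expression
  std::vector<GroupId> children;  // placeholders: [A B] and [C] as group ids
};

struct Group {
  std::vector<MultiExpr> logical;   // all equivalent logical forms
  std::vector<MultiExpr> physical;  // costed physical forms
  double best_cost = std::numeric_limits<double>::infinity();
  int best_physical = -1;           // index of the cheapest physical expr found so far
};

struct Memo {
  std::vector<Group> groups;        // group id -> group
};
```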
All right, so now that we have groups and multi-expressions, we can define what our rules look like. A rule is a transformation of an expression into a new expression that's logically equivalent, and there are two classes of rules. If we go from a logical operator to another logical operator, we call that a transformation rule. If we go from logical to physical, we call that an implementation rule. Those are just the labels they use. As I was saying, each rule is defined as a struct or an object comprised of two parts. The first is the pattern: the structure we look for in the logical expressions of our query plan to determine whether we should fire the rule. If the pattern matches, we fire the substitution, which describes the effect after we apply the transformation — how we modify the expression to put it into a new form, whether that's logical-to-logical or logical-to-physical. It's like a trigger: my pattern matches, and that fires the substitution.

Let's look at an example. They say this is my pattern here, and the pattern is essentially looking for two joins one after another. I'm looking for inner joins, equijoins, so I have two equijoins here, where the output of this equijoin feeds into that equijoin. Over here, these are just generic groups: I don't care what's actually in them — whether it's a scan or another join. All my pattern needs to match is two joins stacked like this, and then I'll do a transformation based on that. So this particular pattern would match this query plan here: I have an equijoin, and I see a multi-expression saying I'm joining A and B — I don't know how — and then I'm joining that with C. So the two equijoins feed into each other. I don't care what these other pieces are; for simplicity we just say get A, get B, get C — the access methods that produce the tuples we feed into these operators.

So now I could have a transformation rule that does a rotation: it moves the child equijoin of my parent equijoin from the left side to the right side of the query plan. I'm moving the lower join over here and flipping things around, so now it's A joined with (B join C), whereas the original was (A join B) followed by C. And an implementation rule could substitute these equijoins to say I'm doing a sort-merge join — actually defining the algorithm I want to use for my join operator. The overall structure is the same, but what were logical expressions are now physical expressions in my query plan, all right?

Okay, so what's one problem we could have with this transformation rule here, the logical-to-logical one? You could have a rotate from the right back to the left.
Exactly. So he says: if you also have a rotate from right to left, which you would, then you can get stuck in an infinite loop. I rotate left to right, get this plan, then another rule fires that rotates right to left, and I'm back where I started — stuck in an infinite loop. This is where the memo table helps us, because we'll be able to identify that we've already been to this query plan before, so we don't need to fire that rule and look at it again. The way they do that is essentially to maintain a hash table. The original Cascades paper describes a graph structure; I think we use a hash table, and other implementations I've seen use a hash table. The basic idea is that you have a way to hash a group in a descriptive manner that uniquely identifies it, so you know this group has already been examined. You also keep track of the cost you computed when you saw that group before, if there's one associated with it, so you know you don't need to redo the transformation and hit the cost model again — you just say, I've already seen this, let me use what I cached before. The idea is that all equivalent operator trees and their plans are stored together in a group, they land in the same entry in our memoization table, and we don't have to revisit things over and over again. All right? So this is pretty straightforward.

The other important thing we need to be mindful of as we do our optimization is the principle of optimality, sometimes called Bellman's principle of optimality. It came from, I think, the 1960s or 70s, from control theory, but the basic idea is that if we have an optimal plan, then every sub-plan inside that optimal plan has to be optimal as well. It's one of those tautologies, a self-evident truth: if I have the optimal plan, and my optimal plan is comprised of sub-plans — a group, for example — then for me to be the optimal plan, all my sub-plans have to be optimal, because if there were a better alternative for some sub-plan, then I wouldn't be the optimal plan, right? That lets us reason about sub-plans by themselves, find the optimal plan for a sub-plan, and once we find it, cache that result in the memo table and not revisit it. Now, the principle of optimality is not specific to Cascades or top-down search; it applies in the bottom-up approach as well. The difference is that because we're doing a top-down search, we can use branch-and-bound: if the path we've gone down so far is already worse than the best plan we've seen, then no matter how much further we go down that path in the search tree, we won't find a better plan, so we can cut it off and not traverse any further. This is just the standard branch-and-bound technique for optimization problems, and we can rely on it to decide not to go down certain paths and prune our search space. Okay, so let's look at an example here.
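Before the walkthrough, here's a tiny sketch of the duplicate-detection idea just described: apply a rotation rule, record a canonical key for each expression produced, and refuse to re-explore an expression we've already seen. The key scheme and names are made up for illustration; a real implementation hashes the multi-expression (operator plus child group IDs) into the memo rather than building strings.

```cpp
// Toy demonstration: rotating a join back and forth produces a duplicate,
// which the "seen" set catches instead of looping forever.
#include <iostream>
#include <memory>
#include <string>
#include <unordered_set>

struct Node {
  std::string op;                     // "EquiJoin", "Get(A)", ...
  std::shared_ptr<Node> left, right;  // null for leaves
};

// Canonical key for duplicate detection (stand-in for hashing into the memo).
std::string Key(const std::shared_ptr<Node> &n) {
  if (!n) return "-";
  return "(" + n->op + " " + Key(n->left) + " " + Key(n->right) + ")";
}

// Pattern: EquiJoin(EquiJoin(x, y), z)  ->  Substitute: EquiJoin(x, EquiJoin(y, z)).
std::shared_ptr<Node> RotateLeftToRight(const std::shared_ptr<Node> &e) {
  if (!e || e->op != "EquiJoin" || !e->left || e->left->op != "EquiJoin") return nullptr;
  auto inner = std::make_shared<Node>(Node{"EquiJoin", e->left->right, e->right});
  return std::make_shared<Node>(Node{"EquiJoin", e->left->left, inner});
}

// The inverse rule, which is what can send us around in circles.
std::shared_ptr<Node> RotateRightToLeft(const std::shared_ptr<Node> &e) {
  if (!e || e->op != "EquiJoin" || !e->right || e->right->op != "EquiJoin") return nullptr;
  auto inner = std::make_shared<Node>(Node{"EquiJoin", e->left, e->right->left});
  return std::make_shared<Node>(Node{"EquiJoin", inner, e->right->right});
}

int main() {
  auto a = std::make_shared<Node>(Node{"Get(A)", nullptr, nullptr});
  auto b = std::make_shared<Node>(Node{"Get(B)", nullptr, nullptr});
  auto c = std::make_shared<Node>(Node{"Get(C)", nullptr, nullptr});
  auto ab = std::make_shared<Node>(Node{"EquiJoin", a, b});
  auto abc = std::make_shared<Node>(Node{"EquiJoin", ab, c});  // (A join B) join C

  std::unordered_set<std::string> seen{Key(abc)};
  auto rotated = RotateLeftToRight(abc);                       // A join (B join C)
  std::cout << "new plan? " << seen.insert(Key(rotated)).second << "\n";  // 1: not seen before
  auto back = RotateRightToLeft(rotated);                      // back to (A join B) join C
  std::cout << "new plan? " << seen.insert(Key(back)).second << "\n";     // 0: duplicate, stop
}
```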
For this one, for simplicity, I've already exploded out some of the groups. In practice, at the very beginning you would only have the group at the top with your target output, but for the sake of my animations I just show everything all at once. At the beginning our memo table is empty. We start at the top, and we want to fire a rule that takes our basic logical expression — we want to join A, B, and C — and generates an equivalent logical expression for it. The first one could be A join B combined together as a multi-expression, joined with a multi-expression that tells me how to access C. Now I need to decide which path to go down. For simplicity's sake, let's figure out how to join A and B before we deal with C. So we go down here, and again we apply our transformation rules to generate the different logical expressions, the multi-expressions, that could produce the output we expect. Here we have a multi-expression for A — meaning A by itself, we'll get the data somehow, we don't know how yet — and we join that with B, which is also a multi-expression: we don't know how we're going to access it either. Then we go down this path, and we have the logical expression "get A" that implements the output we want. We can transform that into different physical expressions, which could be a sequential scan on A — the default thing; if you have no indexes, you just read the data — or an index scan on A. Again, for simplicity I'm just saying "an index scan"; I'm not saying which index, but in practice, for every possible index that could satisfy the predicate I'm looking up on for that table, I'd have a separate physical expression. Each of those gets costed separately, because they'd have different selectivity characteristics or different cardinalities, producing more or fewer tuples than other indexes, and the cost model uses that to determine which index gets me the data I need most quickly.

Let's say for this one the index is a bad choice and the sequential scan is what we want; for whatever reason the sequential scan is faster, so we assign it a cost of 10. This is an artificial, synthetic number, internal to the database system; it doesn't really mean anything in the real world. I'll talk about cost models next class, but it's just some way for us to say this one is better than that one. Say the index scan costs 20 for whatever reason, and the sequential scan costs 10: the sequential scan is better, so we choose it. So in my memo table, for the multi-expression of just accessing A, I record that the sequential scan on A is the best physical operator I've seen so far for accessing it, with a cost of 10. Then I go back up the plan and go down the other side and do the same thing on B: say it's a sequential scan with a cost of 20, that's the best I've seen, so I add that entry in there as well.
Now other rules could fire that explode out my logical expressions. So far we only looked at A join B; now I could look at B join A. I would need to do the same thing again — go down into these two groups and perform the same cost evaluation to determine whether a sequential scan beats an index scan — but I don't need to, because I look in my memo table and see that I already know the best way to access A and B: these two sequential scans. I don't even bother going down and looking, right? That avoids redundant work.

Now I want to fire the rules that convert the logical expressions into physical expressions — this is just a subset of them, right? I could do A join B with a nested loop join or a sort-merge join, and possibly flip their order: all the combinations my rules tell me to generate. Then I need the physical cost. I already know how to access A, I know how to access B, and let's say for whatever reason the sort-merge join has the lowest cost. The cost for this particular group in my query plan is the sum over all my sub-plans plus whatever my own operator costs: the cost of accessing A was 10, the cost of accessing B was 20, plus the sort-merge join itself, say 50, so 50 plus 10 plus 20. And this is where the branch-and-bound stuff comes into play: if I go down some other path and my cost so far at that point in the tree is already greater than the best plan I've found, then no matter what I do below that point in the query plan, I'll never do better, because the cost only increases as you go down — you're just adding in more operators that add more cost. So I know I don't need to traverse down there and look at other things. Yes?

You said in an earlier lecture that sort-merge joins are nice because if anything higher up in the pipeline wants the data sorted, the sort-merge join already gives you that. But in this model, you can't really detect that someone above you will want something sorted, right? You're assuming all the optimization benefit comes from below you, not from the decision you're making now.

Yes — so his point is that in my example here I'm just saying the sort-merge is better, and I'm ignoring the physical properties that the operators above expect the data to have. Of course, if we had an ORDER BY, we want things sorted on a certain key, and I'm joining on that key, then a sort-merge join would be preferable to a hash join. In my example I'm not including physical properties; you're right, I'm ignoring that. The way it works is that for the group, you define enforcer rules for the physical properties you want. Say A-B-C needs to be sorted on B.id: then I know, as I go down here, that any physical operator I pick has to generate output sorted the way the thing above wants. Your point is that I could keep going down here looking at a bunch of stuff, then come back up and pick a nested loop join, which would generate an invalid output, so I shouldn't have looked at all that in the first place.
But think about a real query plan: the very bottom leaves of the query plan are all these access methods on the actual tables, so that's not actually wasted work, because I'm going to cache it in my memo table. Even if I end up choosing an invalid query plan going down this path — one that violates my enforcer rules — I still did useful work, and I can come back a second time and look at the sort-merge without going all the way down, because I've already computed the costs of those access methods. And you could pass hints down, I guess. I actually don't know how they do it, because my understanding is that the rules are always defined in terms of what the consumer expects the data to look like, and I don't know whether you can pass that down and say "oh, by the way, pick something that gives me the data in the order I want." I suppose you could; there's no reason you couldn't, and then everything just works. Yes?

The question is: can I join A and C first, and then join B? Absolutely, yeah. I don't think I have an example of this, but it's just the same thing going down the other side. I started with A join B followed by a join with C — that's just the way I did it — but there's no reason I couldn't look at the other possible permutations. In fact, when I showed you what a group looked like a few slides back — yeah, here — in my example I just started off doing A join B join C, but in actuality all these different logical expressions are equivalent to each other. Depending on the priorities I set up in my query optimizer, I could say: look at B-C first, then join A. For simplicity we just did it in the order they came in. But again, say I know something about what the data looks like in B and in C such that maybe I want to figure that one out first, because it'll get me to a better query plan more quickly. So yeah, you define priorities for the transformations that let you determine in what order you apply them. But that's hard.

His question is whether the time complexity is still exponential. Absolutely, yeah. But as we said last class, we're not doing an exhaustive search here. It's exponential to look at all possible join orderings — the problem is NP-complete — and nobody's going to let you run the query optimizer for days to figure out the best plan. You want to come up with something in a reasonable amount of time, so the amount of time you search is bounded. Rather than looking at useless things, you can use these priorities to transform things into query plans you think are going to be near-optimal more quickly. And again, this is the black art of these optimizers that the commercial guys are much better at than the open-source ones. We'll see this in the paper you'll read for next class: SQL Server does the best, because SQL Server has the best optimizer — and they have the best cost model too. It's more than just the search; anybody can write a branch-and-bound search. It's all the extra stuff about inferring the semantics of what the query is actually doing, married with what you know about the data from the statistics you've collected, that helps decide how to do these transformations.
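To make the costing and pruning from this walkthrough a bit more concrete, here's a toy branch-and-bound cost computation using the numbers from the example (sequential scan on A = 10, index scan on A = 20, sequential scan on B = 20, sort-merge join = 50). Everything here is a simplification I made up: a real optimizer walks the memo structure and fires tasks, it doesn't recurse over a little tree like this, and the nested-loop cost of 90 is invented just so there's something to prune.

```cpp
// Toy top-down costing with branch-and-bound pruning.
#include <iostream>
#include <limits>
#include <vector>

struct GroupT;  // forward declaration

struct PhysExpr {
  double local_cost;                   // cost of this operator by itself
  std::vector<const GroupT *> inputs;  // child groups
};

struct GroupT {
  std::vector<PhysExpr> alternatives;  // physical expressions for this group
};

// Best cost for this group, giving up on any alternative whose running total
// already exceeds `limit` (the best complete plan found so far).
double BestCost(const GroupT &g, double limit) {
  double best = std::numeric_limits<double>::infinity();
  for (const auto &alt : g.alternatives) {
    double total = alt.local_cost;
    for (const GroupT *in : alt.inputs) {
      if (total >= limit) break;            // branch-and-bound: prune this alternative
      total += BestCost(*in, limit - total);
    }
    if (total < limit) { best = total; limit = best; }
  }
  return best;
}

int main() {
  GroupT a{{{10, {}}, {20, {}}}};               // seq scan (10) vs. index scan (20)
  GroupT b{{{20, {}}}};                         // seq scan on B (20)
  GroupT ab{{{50, {&a, &b}}, {90, {&a, &b}}}};  // sort-merge (50) vs. made-up nested loop (90)
  std::cout << BestCost(ab, std::numeric_limits<double>::infinity()) << "\n";  // prints 80
}
```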
But we don't do any of that fancy stuff in our system — God, no. Any other questions? Okay, so going back to our example here. We left off here, and now we just fill this out more — this is exactly what I just described. I start looking at the other permutations of my logical expressions, the different join orderings, and I start looking at all these other groups. But because I have the memoization table, I never need to re-compute the cost of the low-level access methods; I just reuse what's in the table.

And now, to what I was saying before — we showed this slide last class, but now it should make more sense — how do you decide when to stop searching? Because I don't want to search forever. The most common technique I know of is wall-clock time, or, obviously, when you exhaust all your transformations: if I've fired all my rules and there's nothing else I could possibly look at, I just stop. For really complex queries you never get to that point, so in practice people just use a timeout: I'll stop after 500 milliseconds, or 10 seconds, or whatever, and then whatever the best plan I've seen so far is, I go with it. Now, our really simple example here was joining three tables, and you might think, how hard can it actually be? But you start having more complex queries, non-trivial join predicates... The number of tables is, I think, a bad metric for how good somebody's optimizer is. I remember talking to the Orca guys from Greenplum: "oh, we can do 35-table joins." Then I talked to MemSQL: "we can do 75-table joins." Then the Splice Machine guys told me they can do 135. So I don't know whether that's actually a good way to think about this; I think it's not the number of tables but how complex the predicates joining them are.

His question is how much of the memoization is saved between individual queries. As far as I know, nothing. Going back here: the cost, for an in-memory system, is usually based on the number of tuples you're inputting and outputting, so at the lowest level of the query plan, the access method, it's the number of tuples I'm outputting. But that depends on the selectivity of my predicate, and that varies from one query to the next. So I would need to encode some way to say "my predicate is on this table, therefore I can cache this," and then I'd have to include information about the state of the database when I computed that cost. Because say I keep entries in my memo table from things I saw yesterday, but then I dump the database and reload it with all new data — all my cost estimates are now completely off. So they usually do it on a per-query basis. There is starting to be some research into using deep learning or other machine learning to build models that can do these kinds of estimates; we'll talk about that next class.

All of this hinges on so many different facets you can optimize and tweak to make it work better. How good your cost model is matters a lot, because that affects how you pick the join orders.
How you prioritize which transformations to apply determines whether you spend time looking at useful parts of the plan space or waste it on useless things. All of these things greatly affect the quality of the query plan, and again I would say the commercial guys are much better at this than the open-source guys.

Okay, so I just want to go through some examples of Cascades implementations. There are a bunch of standalone ones, which are actually kind of cool. These are not the optimizer generators I talked about at the beginning, where you write declarative rules, feed them into an engine, and it spits out compiled code or source code that you drop into your own system — actually, OPT++ does do that, but these other ones are meant to be standalone optimizers. The idea is that I have my database system, and then, on another machine or the same one, a separate service that's my query optimizer. I send it statistics, my query plan, and other metadata about my database; it crunches on that and sends back the query plan that I run.

You already read about Columbia. The two more modern ones are Pivotal's Orca and Apache Calcite. We'll talk about this in a few more slides, but Orca was developed by Pivotal for Greenplum, and then they broke it out of the Greenplum source code to make it a standalone thing, because they wanted to use it for other parts of their database ecosystem. Calcite came out of the LucidDB project, and it's the same idea: a standalone service that does query optimization.

For integrated Cascades systems: SQL Server is based on Cascades — they don't say this publicly, but it's the conventional wisdom in academia based on talking to people who work there. Tandem NonStop SQL was one of the first distributed, fault-tolerant databases, from the 1980s. Tandem got bought by Compaq, which then got bought by HP. It's still there; as far as I know it's in maintenance mode, but a large number of ATM networks and banks are still running it — you'd be surprised, it's still very common — and it runs on specialized fault-tolerant hardware, which is pretty cool. Clustrix was, or is, a distributed version of MySQL; they just got bought by the MariaDB folks in the last year. There's a little snippet in their documentation saying they have their own custom optimizer called Sierra, and they briefly mention that it's based on Cascades. And our beloved dead database Peloton was using Cascades, and still is — we haven't ported that code over, but our goal for this year is to bring it back. Though all the fancy stuff I was describing, the priority rules, declaring rules in a DSL — we don't do any of that. We just have a basic search engine.

Okay, so I briefly want to talk about the paper you read. I apologize, I just don't have time to go into the details of it, plus I don't fully understand it myself — let's leave it at that. But it brings up a really important point I want to show you: why you have to do something beyond just Cascades and get into the things the German guys are talking about.
All the examples we showed, in this lecture and the last one, made the following assumptions. We only looked at simple queries that were doing equi-joins, inner joins. They had really simple join predicates that only referenced two tables at a time, like a.id = b.id. And there were no cross-product joins, no Cartesian products — always equi-joins. But in the real world, queries are way more complex: you have outer joins, semi-joins, and anti-joins (an anti-join is basically "give me the rows where something does not match"). What happens is that the transformation rules, as I described them, will still fire, but they may generate invalid query plans. In the Cascades example I showed you, I assumed the joins were commutative and associative: I could do A-B followed by C, or B-C followed by A, and all of those were still valid. But when you start looking at more complex queries, not all of those reorderings are valid.

A really simple example is a query like this: SELECT * FROM A, left outer join with B, and then a full outer join with C where the predicate references B's value. So what does the query plan look like? A join B, then join C. Right. And in this case I cannot reorder this join: I have to do A-B followed by C, because of the full outer join. In order to do the join with C, I need the value of B.val as it comes out of the first join, and since I don't know that before doing the first join, I can't do the second join first. But there was nothing in my transformation rules in Cascades, as I showed them so far, that took this into consideration. I'm not saying this is incompatible with Cascades — you can write rules that enforce this for you and make sure you don't get these funky reorderings — but the traditional way of doing that is kind of inefficient.

So this problem is called plan enumeration: how am I going to efficiently generate join orderings that don't violate the ordering correctness of the query? The basic approach is what I did in Cascades: generate all possible join orderings, and then after the fact test whether each one would produce an incorrect result. In the paper you read, they're essentially doing graph partitioning — hypergraph partitioning — to incrementally add portions of the query plan into a hypergraph, making sure they only ever add things that would not violate the correctness guarantees.

This is part of what I was saying at the beginning: it's not clear to me whether the top-down Cascades model or the bottom-up approach is better, because there are papers going back and forth, with one side saying theirs is better and then the other side saying theirs is. This problem of generating correct join orderings occurs in both bottom-up and top-down. The paper I had you read came out in 2008 and is called "Dynamic Programming Strikes Back," and it makes the case that the top-down approach is too slow: because of this particular problem, it's not going to be able to efficiently generate only the proper join orderings that produce the correct result.
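Just to make the "not every reordering is legal" point concrete, here's a deliberately crude check of whether the associativity rewrite (e1 ⋈ e2) ⋈ e3 → e1 ⋈ (e2 ⋈ e3) is allowed, based only on the join operator types. This is an over-conservative toy — the real answer depends on the predicates too, and the paper's whole contribution is doing this precisely and efficiently — but it shows the kind of validity test the naive "generate then check" approach has to run.

```cpp
// Crude illustration: which join operators can be freely re-associated?
#include <iostream>

enum class JoinOp { Inner, LeftOuter, FullOuter, Semi, Anti };

// Conservative check: only allow the rewrite when both joins are plain inner
// joins. Real systems allow more cases, but never the full-outer-join example
// from the lecture.
bool AssocAllowed(JoinOp upper, JoinOp lower) {
  return upper == JoinOp::Inner && lower == JoinOp::Inner;
}

int main() {
  // (A LEFT OUTER JOIN B) FULL OUTER JOIN C: cannot be reordered.
  std::cout << AssocAllowed(JoinOp::FullOuter, JoinOp::LeftOuter) << "\n";  // 0
  // (A JOIN B) JOIN C with plain inner joins: fine.
  std::cout << AssocAllowed(JoinOp::Inner, JoinOp::Inner) << "\n";          // 1
}
```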
But then the first author of that paper wrote a second paper five years later, "Counter Strikes Back: Top-Down Join Enumeration for Hypergraphs," which basically says: yes, we solved this problem for bottom-up, but here's how to solve it for top-down, so now top-down is better, right? And the first author of that one wrote a whole PhD thesis on this. So I actually don't know what the answer is; I have no idea. Germans. No, this is super hard. You know how a rabbit hole works: you go on Wikipedia, you start reading one thing after another, you keep following the links, and before you know it you started with something about North Dakota and ended up reading about Hitler. This was a rabbit hole for me: I started reading this paper, then I looked at who cites them and who they cite, and I ended up at this other paper, and now I'm more confused than I was last week.

The main takeaway I want you to get from this paper is the following. If you only consider inner joins and simple predicates, then you can represent the query as a simple undirected graph, where each vertex represents a table you're joining, and enumerating the correct join orderings over it is simple. These guys argue that you actually need to model it as a hypergraph, where a single edge can be comprised of multiple vertices. By modeling it that way, the algorithm basically iterates over subgraphs and incrementally adds new edges to other nodes to complete a new query plan, and every time you add a node you check whether that would violate the correctness criteria of the query plan, in which case you don't add it — and they show they can do that check efficiently. To be clear on the lineage: Guido was Thomas's PhD advisor; Thomas went to Munich and does more of the dynamic programming stuff, and Guido then wrote the later paper saying here's how to do it top-down. So in the paper you read they can do this efficiently, and I think in the subsequent papers they can as well. This is a bit unsatisfying, because I'm waving at the big idea without fully understanding what they're doing myself. The basic idea is that they incrementally add things, make sure as they add them that the plan stays correct, and do that efficiently, rather than checking after the fact whether they violated the ordering correctness. At some point I'd be interested in learning more about this; I'll make a comment about that at the end.

All right, in the short amount of time left I want to talk about some other things. The other aspect is that we've been ignoring predicate optimizations — how to deal with predicates. We've just been using them to figure out what we need to join, without reasoning about what other optimizations we can apply. As I already said, one team in the class is applying the Cascades search model — the search framework we've built — to do query rewriting without a cost model, to fix Boolean logic and other things.
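Circling back to the hypergraph representation for a second, here's roughly what that structure might look like in code. The names are mine, and the actual enumeration algorithm (DPhyp and its top-down counterpart) is much more involved and not shown; the only point is that a hyperedge connects two sets of tables, and an edge is only usable for joining two subplans when each endpoint set is fully contained in one of them — that containment test is what keeps invalid orderings from being generated in the first place.

```cpp
// Sketch of a query hypergraph: vertices are base tables, edges connect sets.
#include <bitset>
#include <vector>

constexpr int kMaxTables = 64;
using TableSet = std::bitset<kMaxTables>;  // bitmap of base tables

struct Hyperedge {
  TableSet left;   // tables referenced on one side of the predicate
  TableSet right;  // tables referenced on the other side
};

struct QueryHypergraph {
  int num_tables = 0;
  std::vector<Hyperedge> edges;
};

// True if this edge can connect subplan s1 to subplan s2: one endpoint must be
// entirely inside s1 and the other entirely inside s2 (or vice versa).
bool Connects(const Hyperedge &e, const TableSet &s1, const TableSet &s2) {
  return ((e.left & s1) == e.left && (e.right & s2) == e.right) ||
         ((e.left & s2) == e.left && (e.right & s1) == e.right);
}
```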
In the same way that we do transformations from one expression type to another in Cascades, we can do the same thing with our predicates, our WHERE clauses, and there are a bunch of simple tricks to improve the efficiency of the query plan without going through the full Cascades search. Probably the most important one is predicate pushdown, and there are different ways to actually do it. You can do it as a logical transformation inside Cascades: you have a transformation rule that says I have a separate filter node, but I want to roll it in and combine it with the scan operator at the bottom of the query plan, and then later, once you transform it into a physical expression, you use the cost model to determine whether that was the right thing to do. I think this is what we do in our own system. The second approach is to do it as a separate rewrite phase: before the query plan goes into Cascades, you have rules that say I always want my predicates at the bottom, so go ahead and do that. That's simple for easy predicates, but when you have more complex predicates whose expressions combine multiple tables, it gets trickier. The last approach, which I forget who does but I've seen it before, is what's called late binding: you strip out all the expressions, ignore them while running Cascades, and then after the fact you say "oh right, I have these predicates, let me add them back onto the different operators." I think that's a bad idea, because you need the expressions to know the selectivity of your scan operators and joins in order to determine whether one plan is better than another. It makes it easy to reason about what the query plan is doing, but it's basically doing query optimization disconnected from what the database actually looks like, so it's a terrible idea. One of the first two approaches is what you actually want.

The other thing we can do, which doesn't come up in the literature much, is predicate reordering. In all my examples so far, if I have a WHERE clause like A.id equals something AND A.val equals something, we just assumed we'd apply the expressions in the order they're written in the WHERE clause. What you really want to do is determine which predicates are the most selective — meaning which ones prune out the most data the fastest — and apply the most selective predicate first, then the second most selective, and so forth. The idea is to cut out as much useless data as early as possible. That's an easy optimization, and it's where the cost model and the statistics we collect about the data can help us. But there's another wrinkle: there may be cases where a predicate is very selective but expensive to compute, so maybe I don't want to do it first — maybe I don't even want to do it while I'm scanning the data, and instead push it up to some later part of the query plan, because the cost of computing that predicate is greater than the cost of moving useless data up the query plan.
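Here's roughly what that reordering looks like if you have estimates for both selectivity and per-tuple evaluation cost. One common way to combine the two is the rank metric from the expensive-predicate literature, rank = (selectivity − 1) / cost-per-tuple, applying predicates in ascending rank order; the predicate names and all of the numbers below are made up.

```cpp
// Order predicates by rank = (selectivity - 1) / cost_per_tuple (ascending):
// cheap, highly selective predicates run first; expensive ones sink to the end.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Predicate {
  std::string text;
  double selectivity;     // fraction of tuples that pass (0..1)
  double cost_per_tuple;  // estimated evaluation cost per tuple
  double Rank() const { return (selectivity - 1.0) / cost_per_tuple; }
};

int main() {
  std::vector<Predicate> preds = {
      {"SHA512(val) = '...'", 0.05, 500.0},  // very selective but very expensive
      {"id = 1234",           0.10,   1.0},  // selective and cheap
      {"status = 'active'",   0.60,   1.0},  // not very selective, cheap
  };
  std::sort(preds.begin(), preds.end(),
            [](const Predicate &a, const Predicate &b) { return a.Rank() < b.Rank(); });
  for (const auto &p : preds) std::cout << p.text << "\n";
  // Prints: id = 1234, then status = 'active', then the SHA-512 predicate last.
}
```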
So say I have something like this: one predicate WHERE foo.id = 1234, and another predicate that computes the SHA-512 hash of one of my values and checks whether it matches some other hash. It's no longer just something-equals-something; I'm invoking a hash function, and an expensive one at that. For this one, if my optimizer can consider not only the selectivity of each predicate but also the cost of computing it, then for this particular query — and this is a simple example, there's no join — maybe I want to do the id predicate first, even though it may not prune out a lot, and do the hash predicate last, because it's going to be very expensive. As far as I know, no commercial system does the full version of this. In this case it's a built-in function, so you don't have to treat it as a black box — you can estimate what its execution time will be — but nobody does the migration of a predicate up and down the query plan; at most they reorder predicates based on their cost. And that's something you'd have to incorporate into your cost model as you do the transformations on your predicates. This comes from a 1993 paper by Joe Hellerstein and Mike Stonebraker; they did it in Postgres, but as far as I know nobody actually implements the technique they propose.

All right, so now I want to quickly go through some examples of other optimizers that are out there. I've already talked about this a little. Pivotal — well, the history is that EMC bought Greenplum, which is a distributed version of Postgres, so EMC had a database division. VMware had SQLFire and GemFire, so they had a database division too. Neither company wanted a database division, so they broke them off and formed a new company together called Pivotal — basically the database divisions of VMware and EMC put together. So they had Greenplum, and they built their own standalone Cascades optimizer called Orca. The idea was that it wasn't just for query optimization in Greenplum: they also had this thing called HAWQ, which is a SQL query engine on top of Hadoop, and that needed a query optimizer too. Rather than re-implementing it all over again, they made Orca a standalone service that both HAWQ and Greenplum can connect to for optimization, right?
So the way it works is that you send ORCA your query plan plus a bunch of XML describing the statistics, what the catalog looks like, and any other metadata about the database; it runs the Cascades search, does the transformations and the optimization, and spits back a query plan you can actually run. It also supports multi-threaded search. So far we've assumed the search is single-threaded, but there's no reason you couldn't do a multi-threaded branch-and-bound search; you can parallelize both the top-down and the bottom-up dynamic programming approaches, and ORCA supports this. We actually looked at ORCA at the very beginning of Peloton. At the time, the project looked dead on GitHub; there weren't a lot of active commits, and there wasn't any documentation on how to prepare all the XML you had to send it. They've since rectified that and it's gotten much better, but at that point we had already written our own, so it was too late for us. There are two interesting things they talk about in their paper that are worth considering if you ever have to build your own optimizer. First, since they build software that runs on premises rather than as a cloud service (you download Greenplum and ORCA and run them on your local cluster), any time there's a crash or the optimizer generates a bad query plan it's hard to reproduce, because the developers can't just log into the machine and look around. So they have the ability to dump out the entire state of the optimizer when something bad happens and ship that back to the developers at Pivotal, who can put the optimizer back into the state it was in when the crash occurred and figure out what steps led to the failure. This lets them debug these problems much more easily, and because the optimizer is a complex beast, you definitely need something like this; of course, we don't do it. The other thing they do that's kind of cool is a framework called TAQO, Testing the Accuracy of the Query Optimizer. It's basically a fuzz tester: they generate a bunch of random queries, have the optimizer spit out the best plan it found along with, say, the second or third best, actually run those plans in the database system, and check whether the real runtimes match the ordering the optimizer predicted. If the plan it picked as best turns out to be the worst, or the second-best plan turns out to run faster, then they know something is not right in the optimizer, because it isn't ordering plans correctly. We wanted to do this in our own system; we do have a framework written in Scala that can do something like this, but we haven't hooked it up to our nightly build-and-test infrastructure. We have a bunch of query traces from SQLite and other database projects that we could feed into it to see whether we produce the correct ordering; we just haven't done that yet.
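Here is a small sketch of that TAQO-style check: take the optimizer's ranked candidate plans for one query, execute them, and count the pairs whose measured runtimes contradict the optimizer's cost ordering. All type names here are hypothetical, and the execute callback is only a stand-in for actually running the plan against the target database.

```cpp
// Hypothetical TAQO-style harness: does the real runtime ordering of
// candidate plans agree with the optimizer's estimated cost ordering?
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct CandidatePlan {
  std::string name;
  double estimated_cost;          // what the optimizer believed
  std::function<void()> execute;  // stand-in for running the plan
};

double MeasureSeconds(const std::function<void()>& run) {
  auto start = std::chrono::steady_clock::now();
  run();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

// Returns the number of plan pairs whose real runtimes contradict the
// optimizer's cost ordering (0 means the ranking held up).
int CountOrderingViolations(std::vector<CandidatePlan>& plans) {
  std::vector<double> runtime(plans.size());
  for (size_t i = 0; i < plans.size(); i++)
    runtime[i] = MeasureSeconds(plans[i].execute);

  int violations = 0;
  for (size_t i = 0; i < plans.size(); i++)
    for (size_t j = i + 1; j < plans.size(); j++) {
      bool est_i_cheaper = plans[i].estimated_cost < plans[j].estimated_cost;
      bool real_i_cheaper = runtime[i] < runtime[j];
      if (est_i_cheaper != real_i_cheaper) {
        violations++;
        std::printf("mismatch: %s vs %s\n", plans[i].name.c_str(),
                    plans[j].name.c_str());
      }
    }
  return violations;
}
```

A count of zero means the optimizer at least ranked these candidates correctly relative to each other, which is the property the fuzz tester is probing.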
The other standalone query optimizer out there, and I'd say this one is more widely used than ORCA, is Calcite. It was originally part of LucidDB, a Java-based database system from around 2006 or 2007. I think they were a startup that went under, and one of the remnants of the project was Calcite; somebody picked it up, kept working on it, and made it much more full-featured. Like ORCA, it's meant to run as a standalone service, but it's meant to be more extensible: you can plug in different data models, query plan types, cost models, and rules, and it has an optimization framework in the spirit of Cascades (I'm not sure it's strictly Cascades, so I don't want to overstate that) that runs the search and produces an optimized query plan for you. What's interesting is that unlike Cascades, where you have distinct logical versus physical operators, Calcite doesn't make that distinction. You just have operators, and you add annotations, essentially tagging an operator to say, you're a join, and here's the algorithm you're going to use to do that join. The search is the same; I don't know exactly how this changes the way they write transformation rules, but instead of having distinct object types for logical and physical operators, they add these annotations as they go. Calcite is used in way more systems than ORCA; I don't know if ORCA is even used outside of Pivotal, but Calcite is used in OmniSci (which used to be MapD), BlazingSQL, HerdDB, and now Apache Hive, so it's definitely very popular. I've heard Splice Machine and some others say they were thinking about using it too, and in the same way that our system tries to follow the PostgreSQL dialect for SQL, some systems are now talking about following the Calcite dialect for SQL, which is very cool.
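Here is a rough C++ illustration of that "one operator type plus annotations" idea. It is only a sketch of the concept: Calcite itself is Java and has its own operator and trait machinery, so none of these names come from the real API.

```cpp
// Sketch of a single operator type whose "physical" identity is carried by
// optional annotations rather than by a separate physical operator class.
#include <memory>
#include <optional>
#include <string>
#include <vector>

enum class JoinAlgorithm { kHashJoin, kMergeJoin, kNestedLoopJoin };

struct Operator {
  std::string kind;  // "SCAN", "JOIN", ... ; no logical/physical split
  std::vector<std::unique_ptr<Operator>> children;

  // Annotations: unset while the operator is still "logical", filled in as
  // the search commits to an implementation.
  std::optional<JoinAlgorithm> join_algorithm;  // only meaningful for joins
  std::optional<std::string> access_path;       // e.g. "seq_scan" or an index name

  // The plan is executable once every operator carries its annotation.
  bool IsFullyAnnotated() const {
    if (kind == "JOIN" && !join_algorithm.has_value()) return false;
    if (kind == "SCAN" && !access_path.has_value()) return false;
    for (const auto& child : children)
      if (!child->IsFullyAnnotated()) return false;
    return true;
  }
};
```

The search manipulates a single node type throughout; "becoming physical" just means the annotations have been filled in.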
All right, the last one I'll talk about is MemSQL. MemSQL has a distributed query optimizer, because it's a distributed system, and they use a stratified search with a bottom-up approach. What's interesting, and what makes it different from the others, is what happens at the end. In all the systems we've talked about so far, once the optimizer produces the physical query plan, we just go run it. What MemSQL does instead is take the physical query plan, convert it back into SQL, and send that SQL out to the different nodes in the system. So it looks like this: the SQL query goes through the parser and the binder, which produce a logical plan; because it's a stratified search, there's a rewriter phase based on heuristics and then an enumerator that considers different join orderings; but when that produces the physical plan, it gets converted back into SQL that is sent to the different nodes. The reason they do this is that in a distributed database, the statistics that the machine running the planner has about the data on the other nodes are not going to be very accurate. So the planner figures out the high-level plan, who needs to send data to which node and how the join is going to be executed across the entire cluster, but when the SQL lands on the leaf nodes, they redo all the planning steps, only for their local data. They have a local view of the data they'll be processing, and therefore better statistics, so they can make better decisions about how to execute the query optimally. The high-level decisions, like I'm joining on this key and I need to ship data from this node to that node, are made in the first planning phase; the low-level decisions, like which index to use, are all made locally. As far as I know, this is the only system I've heard of that goes from SQL to a physical plan and then back to SQL; everyone else just ships the physical plan around, so it's an interesting decision. They wrote their optimizer in C++, and they claim that writing it in C++11 with lambda functions made it way easier than not having lambdas; that's what they told me, but I don't know if I buy it. Okay, so this was a bit of a brain dump of a bunch of different optimizers out there. Calcite, I think, is the most popular one going forward, but it's in Java, and we want to avoid Java in our own system, so we can't use it. One of my parting thoughts: again, this is the part of the database system I understand the least, and I'm trying to figure out the right way to learn more about it if people are interested. We could do a reading group in the fall based on query optimization papers (we'd spend a lot of time reading stuff from the Germans, but that's okay), or maybe a mini course or a graduate-level course; I need to think about it. If people are as interested in this stuff as I am, we can expand on it in the fall. And again, we have our own query optimizer, but we haven't even gotten to the tricky parts of plan enumeration that I talked about; we're probably only handling fairly simplistic queries right now. Also, as I said, all of this relies on having a good cost model, because if I can't determine whether one query plan is truly better than another through my cost model, then who cares whether I search top-down or bottom-up, it's all gonna be crap anyway. So we need to make sure we have a good cost model, and we need good statistics, because that's how we estimate how much work flows from one operator to the next and whether one physical manifestation of a query plan is better than another. That's what we're going to focus on Wednesday, and that's the paper you're going to read: it basically shows that all cost models and query plans get things wrong, and it shows how much better you could do if you had perfectly accurate statistics. I've already sort of spoiled it: SQL Server will be the best one in that paper. I don't think the commercial systems are labeled, but see if you can guess who does the worst. All right, I'll show you the results on Wednesday. Any questions?