Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience.

Actually, I have one of the chips in my office for some reason. They took me on a boat in Hawaii, and they're like, here's a burnt-out SPARC chip that has, you know, Oracle compression built inside of it. I'm like, okay, I have it, I don't know what I'd do with it. All right, anyway, sorry, we could go on forever. Databases, query optimization.

So last class we didn't get through everything, so we'll go over this again, and I'll go a bit slower than maybe I did last time: walk through what Cascades is doing, and then finish up with randomized search before we jump into what today's paper is about. The thing we discussed last class was this progression of how to do more sophisticated things with query optimizers. Query optimizers started from just if/else rules that look for patterns and then do some kind of rewrite, which is what everyone uses when they first start out. Then we saw how System R in the 1970s introduced cost-based join search, and that's the backbone that stratified search and unified search built on. Randomized search is just a variation of this.

So the distinction we were talking about last time was, again, between stratified search and unified search. And as I was saying, the lines are kind of blurry. Cascades in SQL Server is technically a unified search, but the way they invoke some of the transformation rules, as we'll see in a second, is basically done without a cost-based search: they do the transformations that you always want to apply first, and then they do the more exploratory search that you would expect in Cascades. And the thing that maybe matters the most is that we want to be able to define the transformation rules, ideally in the same dialect or DSL, whatever you want to call it, in a way that lets us decide: do I want this rule to live in the front phase that I always run, or in the second phase where I do the more exploratory stuff? So stratified search has two separate phases: the first one uses the rules but without a cost model, and the second one does cost-based search. The unified search idea is that you do everything all at once, to avoid a bunch of expensive transformations or getting stuck in infinite loops; we saw how the memo table is going to help with that.

So, less important than stratified versus unified, the thing that really matters to understand is the distinction between top-down and bottom-up. The Cascades approach is top-down, meaning I start with the outcome that I want: this is the final output I want for my query. I don't know yet how I get there in my query plan, but I'm going to go down the tree and assemble the pieces, assemble the operators I need to feed into that final output. Whereas bottom-up optimization, which is what the System R guys started with, is where you start with nothing and then add the operators you need to build up to the goal that you want. And again, me standing up here making hand gestures like this maybe doesn't really make it sink in, but there are certain optimizations you can apply more easily in one versus the other.
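To make the bottom-up flavor concrete, here is a minimal sketch of System R-style dynamic programming over join orders. Everything in it — the table names, row counts, and the toy cost and selectivity functions — is invented for illustration; a real optimizer would pull cardinality estimates from its catalog.

```python
from itertools import combinations

TABLES = {"A": 100, "B": 1000, "C": 10}  # hypothetical row counts

def join_cost(lrows, rrows):
    # Toy cost model: pay for both inputs plus the estimated output.
    return lrows + rrows + (lrows * rrows) // 100

def out_rows(lrows, rrows):
    return (lrows * rrows) // 100        # toy selectivity of 1%

def best_plans(tables):
    # Dynamic programming over subsets: start from single tables and
    # build larger join sets from the best smaller ones.
    best = {frozenset([t]): (r, r, t) for t, r in tables.items()}
    names = list(tables)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            candidates = []
            # Try every way of splitting the subset into left/right inputs.
            for lsize in range(1, size):
                for lcombo in combinations(combo, lsize):
                    left = frozenset(lcombo)
                    right = subset - left
                    lcost, lrows, lplan = best[left]
                    rcost, rrows, rplan = best[right]
                    cost = lcost + rcost + join_cost(lrows, rrows)
                    candidates.append((cost, out_rows(lrows, rrows),
                                       f"({lplan} JOIN {rplan})"))
            best[subset] = min(candidates)  # principle of optimality
    return best[frozenset(names)]

print(best_plans(TABLES))  # -> (cost, rows, '(... JOIN ...)')
```

The top-down version inverts this: you start from the full goal {A, B, C} and recurse downward, which is what the Cascades walkthrough later in this lecture does.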
At a high level they are composable, they're commutative — you could use one or the other — but there are some optimizations you can do top-down that maybe you can't do bottom-up, and likewise some you can do bottom-up that you can't do top-down. Yes? Is stratified always bottom-up? His question is: is stratified always bottom-up? No. That's what I'm saying: is Cascades as described by Microsoft stratified, because they have a bunch of rules they always run, and then later on they run the cost-based search? In that cost-based search it's doing top-down, and the transformation rules are operating the same way — they're generating the logical plans in a top-down manner too. But technically that would be a stratified search, because you have something you always do, and then you have a cost-based part where you can be dynamic.

Right. So most of the open-source systems are going to be bottom-up. Even the guy that invented Cascades said the way you probably want to do query optimization now is to start with Cascades in a sort of stratified manner — run some initial rules, maybe not necessarily the full cost-based search; it wasn't clear, it was an off-hand comment made at a conference — and then do bottom-up optimization to pick the join ordering.

All right. So for today's class, I want to spend the first half going back over what we missed and rushed through at the end of last class on unified search in Cascades, and then we'll see how to apply randomized search. Then I want to go quickly over some real-world examples of what some closed-source and open-source query optimizers actually look like; the main ones are going to be SQL Server, Calcite, Orca, and CockroachDB. And then we'll finish up by going over, at a high level, the unnesting-subqueries paper you started reading. Hopefully that wasn't too dense on the relational algebra, but it's one of those things where I think everyone should know it — we probably should teach it in the intro class. Very few systems actually implement it entirely; as far as I know, only DuckDB, Umbra, and Hyper. My former student who took 701 is at Databricks; she implemented almost all of it in Databricks, but not everything. We'll walk through an example. Unfortunately, given the time, we're not going to be able to cover how Hyper does their dynamic programming, or DP, approach for finding join orders, but we can cover that offline separately if you want.

All right, so this is just a repeat of what we talked about last class. The Cascades optimizer is the third version, or third generation, of query optimizer that this guy Goetz Graefe built. Exodus came first, then Volcano — the same Volcano as when we describe the Volcano iterator model; in that paper he also describes how to do parallel queries with the exchange operator — and he had an optimizer generator that he built with Volcano. After that, he built Cascades. But as far as I know, he just wrote the paper on Cascades. It's very heavy-handed on object-oriented programming, because that was the hot thing in the late 80s and early 90s. And as far as I know, he didn't actually implement it, other than being involved in a master's thesis a few years later at Portland State. But then Microsoft hired him to go build their new query optimizer, because they were rewriting everything from their fork of Sybase.
And that's why, again, Microsoft famously is using Cascades. So it's going to be a top-down approach using a branch-and-bound search. The key idea that makes Cascades interesting, and better than what came before, is the incremental materialization of the possible ways to represent some expression — I'll define expressions in a second — some operator in the query plan. In Volcano, when you searched down and landed on some node in the tree, you immediately materialized everything, which would explode your search space: even though you might run out of time to examine everything at that operator, at that level of the tree, you still materialized everything. So Volcano's memory cost would explode.

So the four key ideas of Cascades, as we talked about last class. First, everything is going to be represented as data structures. Within a task we'd have: here's the pattern I'm looking for in my query plan, here's the transformation I want to apply, and you can attach additional things like a priority that says this rule should be considered sooner than another one. In the paper you can actually modify these priorities on the fly; in practice, CockroachDB does that, Microsoft does not. Second, you have explicit definitions of what properties we need our operators to have, to ensure that if data needs to be sorted a certain way, or needs to come in a certain compressed form, then any operator we would generate at some level of a query plan — sorry, anything below us in the tree — meets the expectations of the parent it's feeding into. Third, as we talked about, you can reorder things on the fly to find the best plan more quickly, based on what you know the search has done so far. And the last one is important: within the same search engine and the same rule definitions, we can do pattern matching and transformation of expressions — WHERE clauses, HAVING clauses, join clauses, whatever — the same way we do the search for converting logical operators to physical operators. This is nice because you don't have to do one pass for the WHERE clause to optimize that, and then a separate pass for the query plan itself; it's all done in a single engine. I think MySQL treats the expressions in the WHERE clauses separately from the query operators, so they have sort of two optimization passes.

All right, so the definitions we care about in Cascades. First there's the notion of an expression, and it's just some operation that we want to do in our query plan, with zero or more inputs coming into it. It could be a leaf in the query plan or some middle operator. And we can group operations together and say, here's a higher-level thing we want to do, and represent that as an expression too. For example, if I want to join A, B, and C on their IDs, I can have a logical expression that says: join A with B, then join that with C. If I flip the order, that's considered another expression. And then I can have a physical manifestation of that expression by actually specifying here's the scan method, or access method, I'm going to use for a given table, and here's the join algorithm I want to use.
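As a sketch of what those expressions look like as plain data — the operator names and fields here are made up for illustration, not any system's actual API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Expr:
    op: str                          # e.g. "LogicalJoin", "HashJoin", "SeqScan"
    inputs: Tuple["Expr", ...] = ()  # zero or more child expressions
    arg: str = ""                    # table name, predicate, etc.

# Two equivalent logical expressions for A join B join C:
ab_c = Expr("LogicalJoin", (Expr("LogicalJoin", (Expr("Get", arg="A"),
                                                 Expr("Get", arg="B"))),
                            Expr("Get", arg="C")))
# Flipping the join order yields a *different* expression:
ba_c = Expr("LogicalJoin", (Expr("LogicalJoin", (Expr("Get", arg="B"),
                                                 Expr("Get", arg="A"))),
                            Expr("Get", arg="C")))

# A physical expression pins down access methods and join algorithms:
physical = Expr("HashJoin", (Expr("SortMergeJoin", (Expr("SeqScan", arg="A"),
                                                    Expr("IndexScan", arg="B"))),
                             Expr("SeqScan", arg="C")))
```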
And now I'm going to combine these equivalent expressions — equivalent based on what we know from the relational algebra rules — into what is called a group. The idea is that within one container, one construct, one entry in our query optimizer, we say: here's the output that I expect, A join B join C; here are all the equivalent logical expressions, the permutations of the join order; and here are all the equivalent physical expressions. So within one entry in our system, we have all the different possible expressions that produce the output we need. The entire thing is the group, and these are all just its equivalent expressions. You also embed additional information, like the properties that the data coming in needs to have.

Then we can abstract expressions even further into what's called a multi-expression, and this is basically a placeholder that says: there's some expression below me in the query plan, and at this point, at this group in my search tree, I don't actually know what it is. I'm just going to keep track of the fact that something below me can tell me how that thing would actually be executed. For example, if I need to join A, B, and C, I could have a multi-expression that says: I'm going to join A and B — I don't know how, and I don't know in what order — and then join whatever the output of that is with C. And then you have the various combinations of join orders, and likewise the various combinations of which algorithm to join with. In a bottom-up approach, you basically start with the smallest atomic multi-expression you could have — go read A — and then you add the pieces and components on top of that, going up towards the final output. Here, because we're going top-down, we start with a multi-expression that says: I want some output of A join B join C, and then I go down and fill in the specification, the exact details, of how to do that. Again, this is how they get away with not having to materialize everything all at once when you land in a group as you search down the tree: now I have a placeholder for A and B, someone below me is going to tell me how to handle that, and right now I can reason about whatever expressions I'm looking at within my group directly. And likewise, you can make further decisions based on those priorities we talked about: if I'm trying to decide, should I go look at what happens when I join A and B, or should I go look at what happens when I scan C, you can decide dynamically, as you go down the tree, which of these paths you want to look at first.

So this we've already defined: the rules are just ways to convert logical operators to logical operators, or logical to physical. In the end, the database system needs physical operators, because that's what tells the system what to actually do when it executes things. And within a rule we say: here's the pattern I want to match on, either in a logical expression or a physical expression, and here's the transformation I want to apply to put the query plan — or the group — I'm looking at into a new form. So this is the example we saw before.
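The part that's easy to miss is that a multi-expression's inputs are group IDs, not concrete subplans — that's exactly what makes it a placeholder. Here's a rough sketch of the structures, with illustrative field names; real systems like SQL Server and CockroachDB differ in the details.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class MultiExpr:
    op: str                        # "Join", "HashJoin", "Get", ...
    input_groups: Tuple[int, ...]  # placeholders: group IDs, not subplans
    arg: str = ""

@dataclass
class Group:
    gid: int
    logical: List[MultiExpr] = field(default_factory=list)
    physical: List[MultiExpr] = field(default_factory=list)
    # Best-so-far physical expression and cost; a real memo tracks this
    # per set of required properties (e.g. sort order), simplified here.
    best: Optional[MultiExpr] = None
    best_cost: float = float("inf")

# Group 0 = {A}, 1 = {B}, 2 = {C}, 3 = {A join B}, 4 = the root goal.
groups = {
    0: Group(0, logical=[MultiExpr("Get", (), "A")]),
    1: Group(1, logical=[MultiExpr("Get", (), "B")]),
    2: Group(2, logical=[MultiExpr("Get", (), "C")]),
    3: Group(3, logical=[MultiExpr("Join", (0, 1)),    # A join B
                         MultiExpr("Join", (1, 0))]),  # B join A
    4: Group(4, logical=[MultiExpr("Join", (3, 2))]),  # (A join B) join C
}
```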
My pattern is to identify when I have two joins across three groups, and then I'd have a logical plan that looks like this. Note here, again, that I have the multi-expression that says I'm joining A and B, but down below me it just says join A with B, and below that I can specify how I'm accessing the data. So I could have a transformation rule that says: rotate it left to right. Or I could have an implementation rule that says: convert all the logical joins I have in here into a sort-merge join, hash join, or nested-loop join, and so forth.

And as we said before, to avoid getting stuck in infinite loops — because I could go left to right and right to left, doing this transformation over and over again — we're going to use the memo table to keep track of: do I already know something about the outcome of this transformation, and therefore don't need to apply it because I've already seen it? Instead of checking that in every individual rule for every single possible state of the query plan, my memo table just keeps track of: for this multi-expression, I have a cost, and I don't need to go look at what happens when I make a transformation if the cost of the query plan after that transformation is worse than the best I've seen.

There are different ways to implement the memo table: I think Microsoft actually puts it inside the groups themselves, and I think CockroachDB maintains it as a separate hash table, but it's basically some table that keeps track of, for a given multi-expression, the best physical operator I've seen for it and its lowest cost. You also keep track of the properties you care about in the data coming out of that operator — for a given multi-expression, what physical properties it provides. That way, if I'm looking at the same operator in the group but with a different required physical property, I know I may still need to evaluate further and go deeper. You'd use this for things like: the best cost I've seen so far is a sequential scan, but for the group I'm at, I really want the data to be sorted, because that changes my cost expectations; so if the group below me doesn't enforce or provide that property, I can still go down and look for other options.

The basic idea of how this is going to work — not to get too theoretical — is the principle of optimality, and all it says is that if we have the true optimal plan, then any sub-plan of that optimal plan is also optimal. It's sort of a tautology, sort of self-defining: if I have the optimal plan, then any portion of that query plan is optimal, because if it weren't optimal, then that wouldn't be the optimal plan.
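Here's a standalone sketch of what a rule looks like as a (match, apply) pair over multi-expressions, in the spirit of Cascades rules (and of what CockroachDB's optgen DSL, which comes up later, compiles down to). The structures are made up, and the `seen` set is playing the memo table's deduplication role.

```python
from collections import namedtuple

MultiExpr = namedtuple("MultiExpr", ["op", "input_groups"])

def matches_commute(m):
    # Pattern: a logical join of two input groups.
    return m.op == "Join" and len(m.input_groups) == 2

def apply_commute(m):
    # Transformation: swap the inputs (join commutativity).
    l, r = m.input_groups
    return MultiExpr("Join", (r, l))

def explore(group_exprs, rules):
    # Keep firing rules until nothing new is produced. The `seen` set is
    # the memo table's dedup job: it stops the commute rule from
    # ping-ponging A⋈B -> B⋈A -> A⋈B forever.
    seen = set(group_exprs)
    frontier = list(group_exprs)
    while frontier:
        m = frontier.pop()
        for match, apply in rules:
            if match(m):
                new = apply(m)
                if new not in seen:
                    seen.add(new)
                    frontier.append(new)
    return seen

print(explore([MultiExpr("Join", (0, 1))],
              [(matches_commute, apply_commute)]))
# -> {Join (0,1), Join (1,0)}, and then it stops.
```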
So because of that, this is how we're able to do branch-and-bound search: if I'm at some level in my search tree, and the cost to traverse down into the query plan below me is already worse than the best complete plan I've seen so far, then I know I don't need to go down and look further. There's no magic way that cost is going to become smaller: as you go down a level, you're adding more physical operators, you're accumulating more cost estimates, and that's just a summation — you're not magically going to get faster. So we can cut things off and avoid having to search further.

So here's the example we had before: we want to join A, B, and C. At the very beginning, we start at the root and apply logical transformations to generate different logical multi-expressions, and then, rather than materializing all of them, we jump down into the first one we have and see what the cost is. So now we come down here and do the same thing: we have the multi-expression AB, and we do a logical transformation to convert it into a multi-expression on A joined with a multi-expression on B. But now we need the cost of these inner multi-expressions, so we go down again; this is the access on A, and there's nothing other than "get A" — there's no other logical multi-expression — so we materialize the different physical multi-expressions: either a sequential scan on A or an index scan on A. For brevity, I'm not showing which index I'm actually using, but you can imagine that for all my possible index choices, I'd have an entry for each index scan I could do on the table. And again, you can do ranking in the priority list: rank the order in which I evaluate the indexes in my multi-expression list based on their selectivity, or on which ones cover the most columns I need in my query — there are a bunch of different heuristics you can use rather than blindly looking at all possible indexes.

All right, so now that we have these two physical multi-expressions, we do a cost model lookup and see that the sequential scan is going to be the fastest; it has a cost of 10. So we add the multi-expression on A into our memo table, with the best physical expression and its cost of 10. Then we go back up the tree and do the same thing down on B: we convert it to physical multi-expressions, end up with a sequential scan and an index scan on B, the sequential scan is again faster, and we get a cost of 20. Now we bounce back up to our multi-expression on AB and do further transformations, flipping the order of the joins so that now B joins A. We again need the costs of the different tables, but when we go down to those lower nodes, we see that we've already done this costing — it's in our memo table — so we don't even do the traversal at this point; we just look up the memo table for the best costs for scanning B and scanning A, and reuse those. Say it's only A join B and B join A: we generate all the physical multi-expressions, and then, using our cost model, we figure out that the hash join is the fastest, so the cost of the hash join plus the cost of the scans on A and B produces a cost of 80, and that goes into our memo table.
Then we bounce back up, go over to the other side with C, and do the same evaluation, yada yada, and so forth, until we maybe exhaust the search over the different join orders for A, B, and C, and we end up with the lowest cost of 125. Is this clear? This is what Cascades is doing: you start with, in this case, "I'm going to join A, B, and C" — not specifying the order, not specifying how I'm accessing the tables, not specifying the join algorithm — and you traverse down, using the memo table to keep track of the best costs you have.

Yes? So this is entirely cost-based, but you said there were a few rules it tries to apply every single time — would you first build this table, or would it do that somehow before? So, like what I showed here: I have "get A", and I take the logical multi-expression and convert it to physical multi-expressions. The rule could be: if I land here, I know I'm looking at a logical expression for a scan, for accessing a table, so always run the transformation rule — the implementation rule — to generate the different physical multi-expressions. So the rule would simply say: always pick sequential scan for A? No, no, no: what I'm saying is that you always generate the multi-expressions for a sequential scan and all possible index scans on A — that always gets fired — and then you do a cost-based search, or selection, to decide which of those is best. Now, whether or not you materialize all possible physical multi-expressions depends on the complexity of what you're doing, because that might balloon up; higher in the tree, for the joins, I might say: I think I should look at the join orders first.

Yes? What's the difference between this and Volcano? The difference is that in Volcano, when you land — say, at the very, very beginning — I generate all the logical multi-expressions and all the physical multi-expressions. It seems kind of obvious that you wouldn't want to do that, but I don't know, it was the 80s and 90s. Volcano also didn't have priorities, and I don't think you could evaluate the WHERE clause expressions with the same machinery either.

Question — you mean this right here? Right here it's saying I want to join A and B together, but here you're not defining how to do it; you then go look up the memo table and ask what the best way to have done that is. Yeah. So why do we say we're applying the rules "before", if we're doing it while we're doing all the cost estimation? This is still entirely unified: there isn't a set of things that you do beforehand — you're doing this entire thing all together, applying these transformations, estimating the costs, picking the best one. Yeah, so his question is: it's not two-stage? How can I say that? You could use Cascades to do a two-stage approach. Think of when you first show up: you convert the AST from the parsed SQL query into groups like this — I know I'm going to access A, so you start populating those things. I don't know how Microsoft actually does it, whether they can inject themselves at any point in the tree and start doing transformations, or whether they always start at the top and go down. But you could say: I'm going to traverse the tree and apply only these rules first, to put the plan into a form where, when I do the cost-based search, I'm pushed in the direction where I know it's going to be a better plan.
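To make that walkthrough concrete, here is a compressed sketch of the top-down search with a memo table and a simple branch-and-bound cutoff. It's nothing close to real Cascades — no groups, no properties, only single-table splits, and pruning only at the top of each split — and the cost numbers are toy values picked so the A⋈B subplan comes out to 80 like the walkthrough (the final total won't match the slide's 125, since that used its own cost model).

```python
SCAN_COSTS = {"A": {"SeqScan": 10, "IndexScan": 15},
              "B": {"SeqScan": 20, "IndexScan": 25},
              "C": {"SeqScan": 30, "IndexScan": 40}}
JOIN_ALGOS = {"HashJoin": 50, "SortMergeJoin": 70, "NestedLoopJoin": 90}

memo = {}  # frozenset of tables -> (best_cost, best_plan)

def optimize(tables):
    group = frozenset(tables)
    if group in memo:                  # memo hit: reuse, don't re-search
        return memo[group]
    if len(group) == 1:
        (t,) = group                   # pick the cheapest access method
        algo, cost = min(SCAN_COSTS[t].items(), key=lambda kv: kv[1])
        memo[group] = (cost, f"{algo}({t})")
        return memo[group]
    best_cost, best_plan = float("inf"), None
    for left in sorted(group):         # enumerate (single table ⋈ rest) splits
        lcost, lplan = optimize({left})
        if lcost >= best_cost:         # branch and bound: this split can't win
            continue
        rcost, rplan = optimize(group - {left})
        for algo, acost in sorted(JOIN_ALGOS.items()):
            total = lcost + rcost + acost
            if total < best_cost:
                best_cost, best_plan = total, f"{algo}({lplan}, {rplan})"
    memo[group] = (best_cost, best_plan)
    return memo[group]

print(optimize({"A", "B"}))      # -> (80, 'HashJoin(SeqScan(A), SeqScan(B))')
print(optimize({"A", "B", "C"}))  # -> (160, ...) with these toy numbers
```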
Another question: you said Volcano does everything exhaustively, but in the diagram you made — I put dot-dot-dot because it's PowerPoint, but yes, Volcano looks at all of them. But isn't this exhaustive too, though? No. Again, we talked about the termination conditions: it could be a timer, it could be how many transformations I've done, it could be that I haven't seen anything better in a while. Volcano did not have that. And the incremental part — not materializing all possible things ahead of time before you even look at them — Volcano did not do that either. There's other object-oriented programming stuff in the paper, which we don't care about.

I was going to ask what data it relies on to make these estimations — for example, page information, like the aggregate min/max information for a page that tells you whether you can skip it in a scan? For the costs, that's next week — that comes later — but in a traditional database system, you have statistics that you've collected by running ANALYZE or whatever, and the cost model tells you what the selectivity of a low-level operator like this would be: how much data I think I'm going to read if it's a straight sequential scan versus an index scan. This is all data you have to have collected. Do you touch the data when you're costing this? What do you mean, touch the data? There are only certain things you can know after you run the query, obviously — so these statistics come from either previous runs or estimations? Correct: the costs as I'm describing them here are computed using summarizations, or sketches, whatever you want to call them, of the data, which you derive either from previous runs or from your own analysis. I'd also throw in a third category, which you'll see on Monday: sometimes there's a sample of the data kept locally, and you can run the WHERE clause on that sample just to see what the estimate is going to be. SQL Server does that. But all of that is abstracted away from this part through the cost model, yes.

Why might one prefer bottom-up? The question is: why do you want to do bottom-up versus top-down? Good question. There are some quirks. In top-down, you may not be able to do all the branch-and-bound stuff we want right away, because I have to get to the bottom first: if I'm up at this point in the tree — think of doing a lot of joins — I can't do any pruning of the search tree below me until something gets all the way to the bottom and produces what I think is the best complete plan. So you always have to generate a full plan before you can start pruning other things. And sometimes you really do have to materialize the logical-to-physical conversion, because I can't do cost estimates — I can't get the lower bound — on just a logical expression; I have to convert it to a physical operator to know what the true cost is going to be. There are tricks you can play, like worst-case estimates. But if you're going bottom-up, then along each step I have a physical operator, and I can cost things right there. So starting from the bottom in some ways gives you more options.
It also lets you materialize physical operators as you go — materialize what, sorry? — materialize physical operators as you go, instead of converting them later. There are other optimizations we don't have time to cover, from the Hyper guys, like grouping things together with hypergraphs and costing those pieces separately from other pieces, where this approach looks at everything all at once. It's a 2006 paper, it's in the reading list; I don't have time to cover it, but they show that, in practice, they find the optimal join ordering faster than Cascades does — and, you know, Microsoft has put a ton of effort and money into theirs.

Yes? The question is: how do the data's properties come into play in any of this? Say I've picked a merge join here, so I assume my inputs have to be sorted. As I traverse down, I can then say: don't give me anything that's not sorted. And I can record that in my memo table: for the scan on B, here's the lowest cost with this set of properties, and here's the lowest cost with that other set of properties. Then, if I need certain properties, I can check the memo table to see whether that operator provides the input I need. I'm not showing it here, because it's PowerPoint, but in the memo table you'd keep track of what properties a given multi-expression — and the best expression for it — provides, and then you can decide whether that gives you what you need or not. Sorting is the most common one; you could also say compression, or — we're obviously not showing projections here — I need these columns, stuff like that. And that can change based on whether it's a column store or a row store, and if it's disaggregated across different nodes, it becomes more complicated.

Would you have to store every single subset of properties for every single combination of tables? The question is whether you'd have to store every subset of properties. If you just think about "sorted or not", that's a single bit — no, no, it would be sorted on which column. Oh yeah, that as well. So you have to keep track of: I'm accessing this data, and it's coming to me sorted on this column. Or I can do a transformation that says: I'm looking at a merge join here, so populate the group below me to make sure I'm only looking at plans that are sorted. Or you can pass hints down and say: at this point, maybe I don't want things sorted, so don't consider it.

So when this memo table is generated, are there, let's say, combinations of tables that we may want to exclude from the memo table at the very beginning? The question is: are there combinations of tables we may want to exclude from the memo table at the beginning?
Like, for example, if I have a table with five columns, and I only ever want to sort on one column? Yeah — so Microsoft does this. Microsoft will pre-populate the memo table in that first stage and seed it with the things it knows it should probably consider first, and that way, when you do the search, you prefer the things you already know about.

Yes? Can I say that if we want to do adaptivity — I know we're not at next week's topic yet — but if you want adaptivity, you definitely need to do bottom-up? The question is: if you want adaptivity, do you have to go bottom-up? Adaptivity when — while the query is running? Yeah. Why? Why would you have to go bottom-up? Because if you're bottom-up, you start with, say, a sequential scan, and that's where your adaptivity comes in. In that case, if you're doing bottom-up, wouldn't it be slightly easier to control what's happening, as opposed to having to re-explore all the way from the top? Is your point that at some point we say, oh — let's punt this to next week and see how they do it. One of the approaches we'll see for adaptive query execution is that you just inject a new physical operator. See, that's bad — we probably won't do that. Why is that bad? Because if you're continuously injecting another operator in the middle, you're adding cost to everything above it, to the total. But the physical operator is — I don't want to use the word sentinel — it's basically a gate that checks: is the data coming out what I expected it to look like? And if not, you then make a decision. It's like a trigger in the query plan. It has no substantial cost to the query plan, because it literally just checks: is my selectivity what I thought it was going to be? Is that really adaptive?
Yeah, that's next class — that's Monday; he's jumping ahead. It depends on the implementation. Whether you inject the adaptivity pieces while you're generating the query plan, or you just say: I've got my physical plan, I embed my cost estimates in the operators, and then here's my leaf-node scan — I'll inject that trigger check right above it, and if I see that the data coming out doesn't match what my scan below estimated, then I'll do something else. You can also imagine a scenario where you include this in your cost estimates: this plan is going to be more expensive to do, but it's going to be really expensive if I get it wrong, so I'll add my trigger check to make sure I don't get it wrong, and then choose the thing that's more expensive. You can play games like that. Let's punt on this until next week. Most of the systems that do adaptivity do simple pushdowns of certain operations — Snowflake will push an aggregation down to a remote node above a join if it sees certain things. Other triggers are like: this query plan is just terrible, stop, throw everything away, and go back to the query optimizer. And then others can be more clever about switching plans on the fly, but nobody does that in practice as far as I know; they do simple movements.

Okay, I have a question: so there are a bunch of implementations of Cascades? Cascades is the name from the paper, and as far as I know there was no reference implementation of it. Unlike Volcano: as far as I know there was an actual system and codebase called Volcano that people were using. The actual implementations out there in the early 90s were this thing called OPT++ out of Wisconsin — I think Jignesh's officemate worked on it, since Jignesh did his PhD there — and then Portland State, again, with the Columbia system; Goetz Graefe was involved in that. Goetz Graefe's PhD advisor was also Jignesh's PhD advisor: this guy Dave DeWitt, who invented a lot of the early parallel, shared-nothing database stuff. Greenplum built this thing called Orca, which we'll cover in a second — it's sort of like Calcite, a standalone optimizer-as-a-service, similar to what we've tried to do ourselves. And then there are a bunch of others implemented inside actual full systems.

So let's come back to Cascades implementations — actually, hold on, let me do more Cascades before we go to randomized algorithms. SQL Server: the Cascades paper is from like '94, and Microsoft hired Goetz Graefe to start building this into SQL Server around '95. There was a single code base for the Cascades optimizer, but my understanding from talking to Microsoft folks is that they forked it many, many times, and basically all the major database products Microsoft sells, on-prem and in the cloud — Synapse, which we'll cover later, Cosmos DB, Scope down the line — are all using some derivation of the original Cascades optimizer. Everything is written in C++; there's no DSL, and the transformations of WHERE clause expressions and other clauses run as rules in the same search engine. So, as we said before, their Cascades has separate stages that define which rules it will actually consider at each stage, and the idea is that you start with the ones you know you always want to run, like predicate pushdown.
You always want to rewrite something like 1 = 2 to false, and things like that, so you always run these transformations at the very beginning, and then at some point you get to the cost-based search, where you can tweak and specify priorities for some of the transformations based on what you saw in the earlier stages you evaluated.

So at the very, very beginning, you do basic simplification and normalization — logical-tree-to-logical-tree transformations: the subquery stuff we'll talk about in a second, rules for converting outer joins to inner joins, predicate pushdown, and empty-result pruning, like SELECT * FROM table WHERE false — you can throw those things out. Then you get to what they call pre-exploration. This is not the actual cost-based search; this is applying rules to populate the memo table with the things you think you'll want to look at as you go down. Because, again, you don't want to search blindly: at the beginning you can seed the memo table with the things that are going to be interesting, and guide the search toward those. Trivial-plan shortcuts — SELECT 1, or SELECT * FROM table LIMIT 0, things like that. Projection normalization, which is cleaning up what's in the SELECT output. They also do an interesting thing with statistics: if they identify at this point that they don't have all the stats their cost model would need to give good estimates, they will actually stop the query planning, tell the system to go run ANALYZE now and collect the data it's missing — imagine you've just bulk loaded a table — and then, when that's available, come back and start doing the initial cardinality estimates. And then join collapsing, when possible.

Then they get to the cost-based search, and this can itself be multi-stage: they have a first group of transformations they're allowed to consider, and then, over time — measured as the number of transformations they've applied, because that's how they keep track of how long the search has been running — they expand the set of transformations they're looking at to do more complicated things. At the very beginning, it's for dealing with trivial plans, like lookups on a single table where there's a primary-key index, and stuff like that. If you still haven't found a good plan, you expand to a quick search for a parallel plan — now you may be looking at joining multiple tables, more than a single table. And if there's still time on the clock as you go along, you open up a larger search over the full plan. Then in the last step — still within the Cascades optimizer rules — for all the different database engines or systems they're trying to support through Cascades, they have engine-specific transformations they can apply. For example, Synapse, which is the distributed data warehouse based on SQL Server that we'll cover at the end of the semester: you want distributed joins or broadcast joins and so forth, and all those get applied separately in this last stage, because that builds upon the plan you've built up so far.
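The shape of that pipeline — unconditional rewrite phases up front, then cost-based stages cut off by a transformation-count budget rather than wall-clock time — looks roughly like the sketch below. Every rule, cost number, and "plan" here is invented for illustration; it's only the staging structure that mirrors the description above.

```python
# Stage 0: rewrites you always want, no cost model involved.
def remove_dead_subplans(plan):
    # e.g. "WHERE 1 = 2" prunes an empty result; modeled as dropping "dead".
    return [op for op in plan if op != "dead"]

ALWAYS_RUN = [remove_dead_subplans]

def cost(plan):
    return sum({"scan": 10, "hash_join": 50, "nl_join": 90}[op] for op in plan)

def neighbors(plan):
    # Cost-based stage: try swapping join implementations, one at a time.
    for i, op in enumerate(plan):
        if op == "nl_join":
            yield plan[:i] + ["hash_join"] + plan[i + 1:]

def optimize(plan, budget=8):
    for rule in ALWAYS_RUN:            # phase 1: unconditional rewrites
        plan = rule(plan)
    best, frontier, used = plan, [plan], 0
    while frontier and used < budget:  # phase 2: bounded cost-based search;
        cand = frontier.pop()          # counting transformations (not time)
        used += 1                      # keeps plans machine-load independent
        if cost(cand) < cost(best):
            best = cand
        frontier.extend(neighbors(cand))
    return best

print(optimize(["scan", "scan", "nl_join", "dead"]))
# -> ['scan', 'scan', 'hash_join']: the rewrite phase dropped the dead
#    subplan, then one budgeted transformation swapped the join algorithm.
```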
Yes? The question is: how do you pre-populate the memo table to seed the search ahead of time? You'd be populating the memo table, but also populating the groups you've generated. For a multi-expression joining three tables, if I know from these early stages that I always want this one to be the outer table and that one the inner table, I'll seed that transformation ahead of time; and if I already have an estimate by the time I get past that step, I can put that cost into the memo table right away.

So, as I said before, the budgets are always based on the number of transformations, not wall-clock time. And what's nice about this, as they bring up, is that no matter what hardware you're running on, you always generate the same query plan given the same database. In this environment — traditional SQL Server, a single box where the optimizer runs on the same hardware as the execution engine — if the system gets overloaded, then the time it takes to optimize a query gets longer, because the threads are busy running queries or doing whatever. So if you based the cutoff on wall-clock time, I might run a query today while the system is overloaded and get a different plan than when I run the same query tomorrow, just because everything was so much slower. Counting transformations guarantees that, no matter what, I always get the same plan. And, as we said before, you want to pre-populate the memo table with join orderings; again, these are just rules to make sure you're seeding the search in a way that finds the best join earlier. This is similar to what Oracle did for a long time: the order the tables appear in the SQL query is the order that gets seeded into the memo table, which sometimes makes sense.

Calcite and Orca I don't want to talk about too much, and I think everyone here is aware of them. These are standalone query optimizers as a service. Calcite is used way more than Orca. It's written in Java, which limits how many people actually want to use it, but Calcite is an interesting piece of software: it's more than just a query optimizer — I think they have their own connectors to actually run queries and things like that — but it's mostly used for the SQL parser and the query optimizer. It comes with a bunch of existing rules, but if you want to extend it to support whatever system you have, you define, through their Java APIs, here are the rules for how I want to transform things, here's how to do cost model estimates, and so forth. It originally came out of a system called LucidDB; that went under, they pulled the pieces out of it, and that became Calcite, and again, it's used by a lot of systems. With Calcite you can say: here's my Snowflake dialect, here's my Postgres dialect — it supports a bunch of different dialects coming in, and I think a bunch of different dialects going out. Orca, from Greenplum, is the equivalent of Calcite, but maybe less pluggable, and it's focused on just the query optimization. It was originally written at — I lose track of who bought what: Greenplum got bought by EMC, and EMC had database products it didn't know what to do with, so those divisions got spun out and formed Pivotal.
Pivotal eventually got bought by VMware. Jignesh was out there for a while, because his startup got bought by them. So they were supporting Greenplum, which is still around today — it's widely used, a fork of Postgres that's distributed and runs OLAP queries — and they had this thing called HAWQ, which was their version of Hive: SQL on top of Hadoop. Rather than building a separate query optimizer for HAWQ and a separate query optimizer for Greenplum, they decided to build a single optimizer as a service and have it support both HAWQ and Greenplum, and whatever other system you want to hook up to it. HAWQ is still around, and nobody actually uses it; Greenplum is still around, and it's the primary user of Orca. But it's like Calcite, where again you have an API where you implement: here's what my catalog looks like, here are the rules I want you to support — you plug all that in, and you can use it. We looked at it almost ten years ago, to use in the system we were building at the time. They had essentially zero documentation — you send it a bunch of XML files and it spits things out — and we decided not to pursue it. It's still actively maintained today, but I don't think it's used anywhere outside of Greenplum; Calcite is more common.

So there's a paper they wrote on Orca, and they talk about a couple of interesting things that are relevant to some of the discussions we've had. Maybe this is less of an issue if you're running in the cloud, because you control the machines, you control everything, you can see everything; but at the time, for on-prem, this was tricky. If someone's running your software on their hardware, and your optimizer crashes or produces a bad query plan, how do you actually debug that and improve things? If someone sends a support ticket saying a query is slow, they can give you the SQL query, but if you don't have the right environment, if you don't have the right data, and you don't know what choices the optimizer made when it was doing the search, it's hard to debug. So they built the ability for the optimizer to dump the complete state of its search for a given query and send that back to home base, where the developers can regenerate the query plan and use it to figure out why the optimizer made certain choices over others. Again, if you're running in the cloud, you control everything and can figure things out, so you don't have to do this.

Another interesting thing they did was to make sure the cost model's estimates are accurate. They would take a SQL query, run it through the optimizer, and keep track of the best plan and the second-best plan — or maybe the top ten plans — and then, in the background, run all of them and check that the relative ordering the cost model predicted actually matches up with reality when you run the queries. They'd use that to tweak and improve things. That's like the MongoDB thing, right? Yes, but at a high level — MongoDB didn't have a cost model; it literally was just: here they are, run them, and whichever comes back best wins. Here, if the cost model isn't predicting the right physical costs, they adjust it. So at a high level it's the same as Mongo, but Mongo isn't trying to verify whether cost model estimates are correct; they're just running everything. I should go double-check whether that's still what Mongo does today — I guess you'll find out.
It looks like they might be trying to use an actual cost model now; when I talked to them before the pandemic, they didn't have logical and physical operators — it was just, here's the JSON stuff.

All right. CockroachDB rewrote — or really wrote from scratch — their query optimizer. Just like everyone else, they started with something based on heuristics, and then they wrote one based on Cascades. Everything's written in Go, and they're more faithful to the Cascades model than maybe Microsoft was: they have a DSL for specifying the rules and the transformations, and in cases where you can't express a transformation entirely through the DSL, you can escape into Go code. So here's basically what their rules look like: you have the pattern matching you want in this DSL, plus whatever other things you want to apply, and this then gets transpiled into Go that gets compiled into the system.

All right, we're short on time, so let me rush through randomized search really quickly, because it exists — Postgres does this. Rather than going top-down or bottom-up, what if you just do a random walk? You have to start with some query plan, and for that you can do a straight conversion of the AST's logical plan into a physical plan, and then permute it. The idea is that you look at a bunch of different possible query plans, see which one has the best cost, keep track of that one, throw away the ones with the worst costs, do some kind of randomization or permutation to change things, and then do another round and check all over again. So you're sort of randomly walking around, trying to see whether you stumble upon the best query plan.

There's an early paper from 1987 that does this with simulated annealing. Again, the idea is that you randomly swap operators — say, the join ordering of two tables — and if the change makes the plan worse, you flip a coin to decide whether to keep pursuing that path anyway; otherwise you flip it back. Now, there are a bunch of rules you have to write to make sure you don't do invalid things: if it's a left outer join, where you have to care about one side being joined before the other, you can't put things in the wrong order. So you have to check whether the random change still produces a correct plan, and if nothing is violated, then you can flip things around. Nobody does this.

What Postgres does do is use a genetic algorithm — I think this version was introduced in the mid-2000s. They have these different generations of query plans: you pick which ones are the best, throw away the ones that are worse, then permute the best ones, trying to find the traits — the genes — of the query plans that make a plan good, and hoping that you sort of stumble upon the best one. Postgres does this, but you only get it when you give it a query that joins 12 or more tables in a single query; there's a flag that specifies the cutoff, but by default you don't get this unless it's a 12-way join or higher. So it looks like this: say we get some random query plans, some combinations of join orders and join algorithms. First you cost all of them, keep track of the one with the lowest cost up in the corner — the best one ever seen — throw away the one with the worst cost, and then permute some portion of the plans you keep.
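The whole loop, sketched below, just repeats that generation step. The cost function, table names, and parameters are all fake, and real GEQO also does recombination/crossover between parent plans rather than only the single-plan mutation shown here.

```python
import random

random.seed(42)                      # deterministic, like Postgres intends
TABLES = list("RSTUVW")

def cost(order):
    # Fake cost model: pretend joining alphabetically-adjacent tables
    # early in the order is cheaper.
    return sum(abs(ord(a) - ord(b)) * (len(order) - i)
               for i, (a, b) in enumerate(zip(order, order[1:])))

def mutate(order):
    # Permute a surviving plan: swap the positions of two tables.
    i, j = random.sample(range(len(order)), 2)
    child = order[:]
    child[i], child[j] = child[j], child[i]
    return child

def geqo(generations=50, pop_size=8, keep=4):
    # First generation is random; after that, rank, cull, and mutate.
    pop = [random.sample(TABLES, len(TABLES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)           # rank by estimated cost
        pop = pop[:keep]             # throw away the worst plans
        pop += [mutate(random.choice(pop)) for _ in range(pop_size - keep)]
    return min(pop, key=cost)

print(geqo())   # with this fake cost model, close to alphabetical order
```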
Then for the next generation, you populate those and do the same thing: check which one has the lowest cost — this one does, but it doesn't beat our best cost — throw away the weakest one, permute them, and so forth.

These are estimated costs? Why wouldn't they be real? Right — they take these random plans and cost each one; this is all within the query optimizer. Only Mongo actually runs every candidate. This is using the cost model I'm supposed to talk about next week. So in this case, how does it come up with the first generation? The question is how you come up with the first generation: it's random. After that, it's specifically picking things. I know I have to join R, S, and T, so I generate different join orders and different physical operators and just try them out. I know that in the case of Postgres, they make sure the random permutations are deterministic, because if you throw the same query at it again, you want to end up with the same permutations and the same plan.

This seems like a good idea — why is it a good idea? Because if you have 30 joins, that's a lot of combinations, so rather than approaching it from an almost exhaustive top-down view, which is just too slow — and you're unlikely to get the right join order for every single one of them — going random seems better. Is that not true? Am I just spitballing? On first intuition, it seems like this should work. Well, there's an engineering cost to making sure you don't permute things incorrectly, so now you're maintaining code to do that. And for very large numbers of tables, a greedy search would probably be better — again, think of joining 30 or 40 tables. But doing randomized seems so much better for that, because there are so many possibilities. So — there's a talk from the guy who actually works on this part of Postgres, from a few years ago, and he basically said the one in Postgres is broken: it's not truly randomized as much as it should be, and I don't think they're actually incorporating the traits that are best in one generation into the next. I don't want to say it's a hack, but... We don't have time to cover the hybrid approach; the hybrid one is the better way to do this — there's basically an adaptive algorithm that can figure out that above a certain number of joins, a greedy-based search is actually better. We covered it in the reading group; I'll mention it next class, I just don't have slides for it. So again, nobody actually does this.

All right, with 20 minutes to go, any questions before we switch over to the paper? Yes — do you do it over the logical plan first and then insert the physical components? The question is how you populate this. You have a logical plan — you could just use the order the tables appear in the SQL query as the initial ordering — and then you're just flipping things around. But a logical plan already has structure? Yes, and even with a different permutation you can get something completely different. Right — your statement is that the logical plan has some structure. Yes, and again, if it's an inner join, those things are commutative and you can swap them anywhere; but if it's an outer join, or one of these correlated subqueries, you have to be careful
to make sure you don't violate the semantics. But if things are commutative, then you can take the logical plan — join R, join S, join T — and just commute it. At the very beginning, you have to generate that first generation, right? But once you — so what do you mean? All these changes, you're changing one part of the tree? Correct, yes. Because the idea is that, like in this case here, I somehow figured out this hash join on top of a hash join, where S is the outer and R is the inner, and you don't know whether that's why this one is the best plan; so you want to carry it over to the next generation, and maybe promote other things about it too, so that if it truly is the reason your costs are lower, that trait gets carried forward over time.

Okay, so let's see how far we can get through subqueries. It should be no surprise to anyone that subqueries are important in SQL, and you can put them almost anywhere in your SQL query. I tried last night and wasn't able to get a nested query to do anything useful inside my ORDER BY clause — well, it runs, but I don't think it actually does anything; I was trying to have it return a string with the name of the table I wanted to look up, and I think it just gets converted to true. So you can put a SELECT statement inside the ORDER BY clause in some cases, but I don't think it actually works.

The way to think of a nested query, or subquery, is that it's basically like a function: I have some outer query, and I'm going to invoke the inner one, maybe passing in some information — or not — about the tuple I'm looking at in the outer query, and it produces some result that I can then use in my outer query. This is important because it allows people to express more complex things in a single SQL query, rather than having to run multiple queries, stage results in temp tables or whatever, and then put it all together at the end.

So there's this key distinction that's going to matter a lot: uncorrelated versus correlated subqueries. The TL;DR is that uncorrelated ones are easy — most database systems can handle those — and it's the correlated ones that cause problems. Uncorrelated basically means that my subquery doesn't depend on anything in the outer query: I'm not using any information, any attributes, any tuple data from the outer query to run the inner query, so I only need to logically execute the inner query once. Whether or not the system is smart enough to do that depends on the implementation, but the ones we'll look at can handle this. So in this example here, I want to find the name of the student with the highest grade across all students. The inner query — SELECT MAX(score) over the students — doesn't rely on anything from the outer query on students, so I can just run it once, materialize the result, and substitute that in for every tuple I'm looking at in the outer query. For most systems, I'd think they can handle all the cases here — actually, that's not true; there are some cases where they can't. The basic idea of what you're trying to do is to lift the inner query up and get it joined up above.
Again, I think most systems should be able to handle that case; actually, that's not entirely true, there are some cases where they can't. For correlated subqueries, the basic idea of what you're trying to do is that you want to lift the inner query up and get it joined up above. Yes? Which one? Yeah, it does not, thank you, I'll fix that. So, correlated subqueries are the ones we care about. Again, this is where the subquery references something in the outer query. If we now modify the query to say, give me all the students that have the highest grade in their major, across all the students in the same major, then the basic idea is that there's a for loop going over every single tuple in the outer query, and for each of those tuples I need to do a complete scan over the inner subquery's input. So say I start my outer query: the first tuple I look at has some major, so when I invoke the inner query I scan through and do that matching, taking the outer query's major and matching it against the inner query's major, and get the max score. I run the inner query entirely, produce a max score of 90, it matches, and I populate my output result here. For the next tuple, also in CompSci, do the same thing: start from the beginning, scan again, get a max score of 90; that doesn't match this student's score, so it does not produce any output. For ODB, he's the only one in his major, so again start from the beginning, scan through, it matches, and I produce my output. So what I just showed is basically the worst thing you could possibly do, because for every single tuple in the outer table I'm rerunning that inner join query over again on the inner table. Yes? The question is whether you can have this inner query in a WHERE clause. Again, anywhere: I can have it in the FROM clause, I can have it in the projection output. You can even have nested queries in LIMIT clauses, although I actually don't know whether you can reference the outer query from a LIMIT clause. A lateral join is basically the same thing. All right, so the paper you read, we'll see how they do it in a second. The goal is basically that we want to lift this inner query up to be at the same level as the outer query, because then we can convert it into a join, and we know how to run those efficiently and how to optimize them. From what we talked about over the last two classes, we know how to pick the best join order for these things as well. So for this query here, ideally we want to move the inner query to now be in the FROM clause, joined against the outer query. But we have to be mindful that we need the score of the student, and that should not say s2, my bad, sorry, and now we want to group by major, because that's how we're going to do our join down here: we want to see, for my outer tuple's major, what's the best score, and is it my score. So for this example we can look at this and say, okay, we could write something reasonably easily to do this kind of manipulation. But these nested queries can appear anywhere, and here it's a straight equality predicate; you can imagine equals ANY, EXISTS, less than, greater than, anti-joins, semi-joins, all different kinds of combinations, and it's very difficult to write rules that capture everything. But this is basically what people have been doing for the last 30 years.
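To make that rewrite concrete, here is a sketch of both forms, again assuming the hypothetical students(name, major, score) table; the second query is the kind of decorrelated shape the optimizer wants to end up with:

    -- Correlated form: the inner query references s1.major, so a naive plan
    -- re-executes it for every tuple of the outer scan.
    SELECT s1.name, s1.major
      FROM students s1
     WHERE s1.score = (SELECT MAX(s2.score)
                         FROM students s2
                        WHERE s2.major = s1.major);

    -- Decorrelated form: compute the max score once per major, then join.
    SELECT s1.name, s1.major
      FROM students s1
      JOIN (SELECT major, MAX(score) AS max_score
              FROM students
             GROUP BY major) m
        ON m.major = s1.major
       AND m.max_score = s1.score;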
Yes? Quick question: in the paper, when they specify a join, they don't actually use the JOIN keyword; is that just a difference in era? The question is that in the paper they don't spell out JOIN; are the two forms equivalent? Yes, you could just have a comma-separated FROM clause and put the predicate in the WHERE clause. That's the normalization step I talked about: in something like SQL Server, they figure out what's in the WHERE clause that should be part of a join and move it into the join clause. They're equivalent. All right, so again, the first paper on nested queries is from, like, '82 or '84 I think, and I think the relevant SQL standard is SQL-99, but since then people have been writing piles of rules to handle all the different combinations that they're aware of. So this is from the SQL Server paper from 2001 that roughly defines the rules they apply. I'm not going to go through all of them, but this is how they decide when they can decorrelate nested queries in SQL Server. If you look at the documentation, they have something like 22 rules, a bunch of conditions like "if it does this and this and this", and that defines how they do unnesting. But again, all of these are based on heuristics and rules. On the plus side, these are somewhat easy to write if you know the pattern you're looking for, and you have complete control over when things get moved and how they get moved. But to handle all possible situations you have to deal with all possible WHERE clauses and every combination of these nested queries, and lateral joins are another complication, so writing these rules by hand is simply too ineffective. So this is where the Germans show up, in this paper from 2015. There are actually two papers on the reading list; this is the first one, and there's a follow-up in 2017 that covers more about how they handle other kinds of joins and other things, but I wanted to focus on this one because it's an easy read, ish, on how to handle these correlated subqueries. They provide, as far as they know, the first general-purpose method to take all correlated subqueries and rewrite them into regular queries without the nesting. The goal is basically to convert all of these subqueries into joins, regular inner joins, outer joins, whatever it's supposed to be, because again, we know how to optimize those. In the best case we go from doing something really stupid, like running that nested query for every single tuple in the outer query, which is quadratic cost like a nested-loop join, to something like a hash join lookup. So let me quickly go through one example; again, we can cover more of the Hyper stuff next class. Here's that same query we had before: we want to find all the students and their major if they have the highest grade for their major. The key idea they introduce in this paper is a logical concept, a logical operator, called a dependent join. It's basically a cross-product join, a cross join, except that there's a demarcation, a marker, saying this thing is specifically here because I'm doing a correlated subquery. The whole goal of the process is to try to get rid of that dependent join, convert it into regular joins, and then optimize the plan like we normally would.
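One way to see what the dependent join means at the SQL level is through LATERAL, which, as mentioned above, is basically the same idea: the right-hand side gets re-evaluated once per tuple of the left-hand side. A minimal sketch, again on the hypothetical students table, in Postgres-style SQL:

    -- A dependent join written directly in SQL via LATERAL: for each outer student,
    -- the subquery on the right is evaluated with s1's attributes bound.
    SELECT s1.name, s1.major
      FROM students s1,
           LATERAL (SELECT MAX(s2.score) AS max_score
                      FROM students s2
                     WHERE s2.major = s1.major) m
     WHERE s1.score = m.max_score;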
So in this case here, the outer query by itself is just a projection with a filter and a scan, but inside that filter clause we have the subquery. If we take this and expand it, we can convert it into a dependent join: the left-hand side is the scan of the outer table, and the right-hand side is the inner query, which in its filter predicate, its WHERE clause, references the thing on the other side; that's what makes it correlated. As I said, this dependent join isn't actually a new physical operator; it's not something we implement in the system. It's just an extension to the relational algebra in our query plans that lets us reason about the fact that we're doing certain transformations in the optimizer because there's a correlated subquery. You could just convert this to a cross join and add a little flag that says, hey, by the way, I'm using this to keep track of dependencies, but to keep things clean they define it as a separate operator. Again, all it means is that for every single tuple on the left-hand side you rerun whatever is on the right-hand side and populate the output, and so forth; so it behaves just like a cross product. What they're going to try to do now is push the dependent join down the right-hand side of the query plan, where we have our inner query, eventually get it to the bottom, and then convert it into a regular join. How you actually do this depends on the semantics of the query and of the inner query; that determines which transformations apply. So in this case here, I have the scan on the left-hand side and the inner query on the right-hand side. I can move the dependent join down one level and now do a regular join between the outer query and whatever comes up from the right-hand side. For this particular query, to make that work, I have to introduce an additional scan that does duplicate elimination: think of a SELECT where the entire projection output list is also the GROUP BY, so there's a guarantee that I always get a distinct set of values across those attributes. Now, with this eliminating all the duplicates on that side, when I do the dependent join, it guarantees that I only spit out the output I need to do the join, without duplicates, as if I were running the right-hand side query once per tuple of the outer table. Because what am I doing? I'm taking the max of the student score, looked up by major, so all I really want is, for every major, what's the max score. To avoid duplicates, say there are two people in computer science and I don't want two entries with the same major and score, the duplicate-elimination scan, when I do my join, removes all of that for me. Again, this is all logical; it's just keeping track of what the dependencies are from the right-hand side to the left-hand side as I push down. So I want to keep pushing down the dependent join. At the next stage, going back here, I want to get below the aggregation, so I move my duplicate-elimination scan down with me but put the aggregation above me. And again, at this point nothing has changed in terms of what the output is going to be; everything is still the same.
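Roughly, that duplicate-elimination step corresponds to computing the distinct domain of the correlated attribute first, evaluating the inner query once per distinct value, and joining the result back; a sketch of that shape in SQL, with hypothetical names:

    -- d: the distinct values of the correlated attribute from the outer side.
    -- m: the inner aggregate evaluated once per distinct major instead of once per tuple.
    WITH d AS (SELECT DISTINCT major FROM students),
         m AS (SELECT d.major, MAX(s2.score) AS max_score
                 FROM d JOIN students s2 ON s2.major = d.major
                GROUP BY d.major)
    SELECT s1.name, s1.major
      FROM students s1
      JOIN m ON m.major = s1.major AND m.max_score = s1.score;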
Because what am I doing here? I'm getting all the unique majors joined against the scan of the students table, so this produces, for every student in every major, their score, and then when I feed that up out of my dependent join, thinking of it as a cross product, and do my GROUP BY on the major up here, I get, for each major, the max score. Yes? Yes, you have to identify that, because you have logic that says, okay, I know I only want a single score per major, so the equivalent thing you do here, by putting the aggregation above, is make sure I only get one result per major, and you add the GROUP BY to make that happen. Yes, there's logic; you have to reason about what the inner query actually wants when you do the join up above, to make sure you don't produce duplicates. Is there a way to standardize that, or does it get complicated? The question is whether there's a way to standardize that. It's in the original paper, and it varies based on what the WHERE clause is: this one is a plain equality, but if it's less than or greater than it's different, and NULLs complicate things too; that makes everything harder. We're well over time, so if you have to go, please go; we can do what we did last time and pick this up at the beginning of next class. All right, so we pushed the aggregation above. We want to go further: we want to get the dependent join below this filter. Well, that's easy to do, because there aren't actually any changes we need to make: whether we put the filter before or after the join is the same thing, just as it would be for a regular query. You could do a join, stupidly, without any predicate, get the Cartesian product, and then apply the filter above it; same thing here. So in this case, again, it's just a Cartesian product: every unique major and its score gets joined against the whole students table, producing all possible combinations, and then the filter keeps only the rows where the major of the student on this side equals the major on that side. All right, so at this point my dependent join is down as far as it can go; I can't push it below these nodes, because that doesn't make any sense: they're the leaf nodes, I'm scanning the tables. So now I want to do further optimizations to put it into a physical form, sorry, into a form that I know how to optimize like any other query; we're still entirely at the logical level, none of these are physical operators. I can convert this dependent join into what it really is, which is just a cross join, a Cartesian product: there's no predicate, it's just everything combined with everything on the other side. But now, as I was saying before, I have a filter above a join, and that's just an inner join, so I can collapse those two into a single join operator and move the predicates up. I can go even further and recognize that all this subtree is really doing is, for every single student, getting their major, because my GROUP BY is going to handle removing the duplicates, so I can convert it now into just a scan of the students table by itself.
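The step that collapses the filter and the cross product into a single inner join is just the standard relational-algebra identity, stated here as a one-line reminder rather than anything specific to this paper:

    \sigma_{\theta}(R \times S) \equiv R \bowtie_{\theta} S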
But now I have to rewrite the join I had up here, because it was referencing the deduplication scan on the major and I've gotten rid of that, so I can pull that condition up as well. I can then combine everything into a single join where the major on this side equals the major on that side, and the score on this side equals the max score produced as the output here. Because I have the GROUP BY on the major, I'm guaranteed that for every single major I have exactly one score. So this is the easy case, and again there are details I'm glossing over; for example, he was asking how I know that I need to push up the GROUP BY. Again, it depends on examining the WHERE clause and the expression trees. This can then be extended to all possible combinations of correlated queries, and you can convert everything into joins. Okay, we're well over time. As I was saying, only Hyper, Umbra, and DuckDB can do this fully; Databricks can do some of it; I don't know about other systems, I haven't checked whether they handle these cases as well. But if you're going to build a new system today, this is the way to do it. Again, we'll cover some more cases in the next class, and we'll also cover how they handle picking join orders. The other thing we already alluded to today is: everything I've talked about assumes you have a cost model. What if your cost estimates are wrong, or you just don't know, because you haven't even looked at the data before? In a lakehouse environment, some new files show up in S3 and you don't know what's in them. What do you do? That's what we'll cover next class: how to do adaptive query optimization. The TLDR is going to be that I still have to generate some query plans and run something, but I can put in hooks, and we'll see how to get feedback from scanning the tables, feed that back into our cost models, and see whether that helped or not. Enjoy the weekend.