All right, so today is the first part of a two-part lecture on query optimizer implementation. To me, this is the most interesting part of a database system, and it's also the part I fully admit I know the least about, because this is the black art of a database system: getting these things to actually work well. So today we'll cover some background about what query optimization actually means and what the system is doing on the inside, then we'll talk about some things we need to keep in the back of our minds as we implement a query optimizer. But we're going to spend most of our time talking about search strategies, because that's the key part of how you would actually implement this. The paper I had you read was about an influential system from the 1990s called Cascades, and that's one search strategy you could use for a query optimizer; we'll see variants of it. We'll also see how we got to the point where something like Cascades came about, what was being done before, and that will motivate why we actually want to use something like Cascades. I'll try to go slow today because there's a lot to cover, so we might run out of time on Cascades, but we'll finish that up on Monday next week, okay?

Right, so the idea of what a query optimizer is doing should be obvious to everyone in this course by now. At a high level, we're given a query from the application written in a declarative language, for us SQL, but it doesn't have to be. The optimizer's goal is to figure out a correct execution plan for that query that it thinks has the lowest cost. I'm emphasizing the words "correct" and "cost" here. For correctness, it's obvious, but it doesn't help us to generate a query plan that's super fast but produces the wrong answer. So however we transform our query into an actual execution plan, we have to make sure it still produces the right result. And I'm putting "cost" in quotes because what cost actually means depends on the particular database system: on its target operating environment, or what it assumes would cause one query plan to perform worse than another. In our world it's going to be computational cost, which plans will be faster than others, but this can be measured in different ways. It could be measured by the amount of data I'm going to process, how many tuples I produce from one operator to the next. If you're a disk-based system, it can be based on how much disk I/O you're doing. If you're a distributed database, it's how much network I/O you're doing. So the cost will change depending on what the system is actually targeting. But the search strategy we'll talk about today doesn't change; the cost model is the piece you can slide in and out to change how you're choosing your plans. So as I said before, and I'll emphasize many times today, the query optimizer is the hardest part of a database system to actually implement well.
We'll see some simplistic heuristics that maybe get you some of the way there for simple workloads, but for anything more complex, things more complex than TPC-H or TPC-DS, real workloads, having a robust optimizer is a challenge. There was theoretical work in the 1980s proving that selecting the optimal query plan is an NP-complete problem. So even though it's called an optimizer, we're obviously not always going to be able to generate the optimal plan. The way we get around this, and avoid an exhaustive search over all possible execution plans for a query, is to rely on techniques that trim down the search space. We'll use estimation methods to figure out the cost of executing a plan without actually having to run it, and we'll use techniques like heuristics to cut down on the number of plans we actually consider: some things we know we're never going to want to use, so we just throw them out without making them part of our search.

So the high-level architecture looks like this. At the very beginning the application sends us a SQL query, and the checkered boxes here indicate components that are optional. When a SQL query shows up, some systems have what's called a SQL rewriter, where you transform the SQL string itself. You often see this in middleware, things like a MySQL proxy, where a query comes in and maybe gets routed to a different machine. You can use this to route the query to the correct shard by rewriting it, or to do some transformation on table names or column names. We're not really doing any optimization here; it's mostly about where to send the query. Like I said, it's optional, and most systems don't actually do this. Then the SQL query goes into our parser, and the parser spits out an abstract syntax tree, which is just the tokens of what was actually in the SQL statement. Then we pass this into our binder, and the binder maps names in the SQL query to database entities, to internal identifiers. So if the query is SELECT * FROM foo, we want to check whether foo is actually a valid table by looking in the catalog, and then map it to an internal ID or a pointer to the object that represents that table. In our system it's an object ID, or OID, because we follow what Postgres does. Now you have what's called a logical plan, which is essentially the relational algebra representation of what the query is trying to do. Then you can feed this into an optional tree rewriter, which can do conversions based on heuristics to optimize the plan or prune the complexity of the search. The team of Newton, William, and Eric are doing this now: they're building a rewriter for expressions, but this component could rewrite the query plan itself or the expressions and predicates within the operators. Again, not every system does this, and you don't have to. Then you feed this into the optimizer, and the optimizers we're talking about today do a cost-based search to figure out an optimal query plan.
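Just to make that pipeline concrete before we keep going, here is a minimal toy sketch of the stages we just walked through. This is my own Python pseudocode, not any real system's API; the `parse`, `bind`, `rewrite`, and `optimize` names and the toy catalog are all made up for illustration.

```python
# Toy sketch of the pipeline: parser -> binder -> (optional) rewriter -> optimizer.
# Everything here is hypothetical; real systems do far more work at every stage.

CATALOG = {"foo": {"oid": 16384, "columns": ["id", "name"]}}  # toy catalog

def parse(sql):
    # A real parser builds a full abstract syntax tree; here we just tokenize.
    return sql.replace(",", " ").split()

def bind(tokens, catalog):
    # Map table names to internal object IDs, erroring out on unknown names.
    tables = [t for t in tokens if t.lower() in catalog]
    if not tables:
        raise ValueError("query references no known table")
    return {"op": "LogicalScan", "table_oid": catalog[tables[0].lower()]["oid"]}

def rewrite(logical_plan):
    # Optional heuristic rewrites (constant folding, etc.) would go here.
    return logical_plan

def optimize(logical_plan):
    # The cost-based search goes here; we just pick one physical operator.
    return {"op": "SeqScan", "table_oid": logical_plan["table_oid"]}

print(optimize(rewrite(bind(parse("SELECT name FROM foo"), CATALOG))))
```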
The optimizer has internal cost estimates, based on statistics it has gathered about the data, that let it reason about whether one plan is better than another. So that's the high-level pipeline, and the output is always what's called a physical plan.

The distinction we need to make, because it's going to matter today, is the idea of a logical plan versus a physical plan, a logical operator versus a physical operator. The easiest way to think about it is that a logical operator is almost equivalent to the relational algebra expression in your query: I want to join table A with table B. The logical plan does not specify how you read data from table A or table B; it doesn't say whether you use an index or a sequential scan. And it doesn't say how we're actually going to do the join: I just say join table A and table B, not whether I want a hash join or a sort-merge join. Those low-level details about how you would actually execute the query are part of the physical plan. So what happens in this query optimization pipeline is that we take the logical plan and convert it into a physical plan that we can execute in our execution engine to produce results. The tricky thing we'll see is that it's not always a one-to-one mapping from a logical plan to a physical plan, or from a logical operator to a physical operator. The logical operators also have no notion of what the data actually looks like; that's all encapsulated in the physical operators. So if I'm scanning table A, the logical operator just says read table A. It doesn't say whether the table is a column store, or compressed, or sorted a certain way; all of that is encapsulated in the physical operators. And again, it's not always one-to-one. For example, say I have a logical join followed by a logical order-by: I can replace those two logical operators with a single physical operator that does a sort-merge join, because as long as the order-by key is what I'm joining on, those two logical operators can be collapsed into a single physical operator. And sometimes it goes the other direction: one logical operator can explode into multiple physical operators. This is the key thing to understand as we go forward, because when we start doing these transformations, especially in Cascades, we'll be transforming logical operators to physical operators, or logical to logical. But you don't really go back; you don't go from physical back to logical.

So when we talk about correctness, the idea is that we want to transform or convert our logical plans in different ways, and we want to make sure they always produce a correct result. We rely on the equivalence rules of relational algebra to identify when one plan is equivalent to another. The obvious example: I need to join tables A, B, and C. I can join B and C first and then join that with A, or, based on the commutativity and associativity properties of joins, I can join A and C first and then join that with B, and this will produce the same result.
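To make the logical-versus-physical distinction from a moment ago concrete, here is a toy sketch of one logical operator mapping to several physical alternatives, including the join-plus-order-by case collapsing into a single sort-merge join. The class names and the `physical_alternatives` function are illustrative only, not how any real optimizer represents its rules.

```python
# Toy illustration of logical vs. physical operators: a logical operator says
# *what* to do, physical operators say *how*.
from dataclasses import dataclass

@dataclass
class LogicalJoin:
    left: str
    right: str
    key: str

@dataclass
class LogicalOrderBy:
    child: LogicalJoin
    key: str

def physical_alternatives(op):
    """One logical operator can map to several physical implementations."""
    if isinstance(op, LogicalJoin):
        return [("HashJoin", op.left, op.right, op.key),
                ("SortMergeJoin", op.left, op.right, op.key)]
    if isinstance(op, LogicalOrderBy):
        # If the order-by key matches the join key, the join and the sort can
        # collapse into a single sort-merge join, as in the lecture example.
        if isinstance(op.child, LogicalJoin) and op.child.key == op.key:
            return [("SortMergeJoin", op.child.left, op.child.right, op.key)]
        return [("Sort", physical_alternatives(op.child), op.key)]
    raise TypeError(op)

plan = LogicalOrderBy(LogicalJoin("A", "B", "id"), "id")
print(physical_alternatives(plan))   # -> just a SortMergeJoin, no separate sort
```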
So associativity, commutativity, distributivity, all of these rules of relational algebra are what we're going to leverage in our query optimizer to do transformations, logical to logical and logical to physical, to generate more optimal plans that still produce the correct result. And one of the overarching themes when we talk about these search strategies is that the software engineering overhead of maintaining these equivalences, of making sure we're generating correct plans, is harder in some implementations than others. Especially when we compare stratified search versus unified search: they're both essentially going to produce the same results, but how you enforce these rules varies. We don't want to have to check these rules manually; we want our transformations to be unable to violate them as they're applied.

The other important thing is that today's lecture, and the next two when we talk about query optimizers, is primarily focused on analytical workloads, because that's where the more complex queries are. In OLTP systems, the queries are quite simple, and they're often referred to as being sargable, which is short for "search argument able." The term is from the 1980s; I didn't make it up. The basic idea is that in OLTP queries, most of the time you're doing lookups: your predicates are equality matches on single columns, or on columns that you have indexes built for. So when the query shows up, figuring out the optimal query plan really just comes down to picking the best index for the query. If I have "column A equals some value" and I have an index on column A, I don't need to do any exhaustive search: I have the index, I just pick it. Here's a really simple example, basically what I just said. I have table foo with a primary key on id, which means that to enforce the primary key the database system creates a unique index under the covers. Now when my query shows up and says WHERE id = 123, I look at that, it's an equality predicate, so I know I can just use this index. Picking the right query plan for this query is super simple. You see this in other cases too in OLTP workloads: if you're joining on foreign keys, well, how do you enforce foreign keys? You have an index, so you just use that index to figure out how to join the tables together. So for systems targeting OLTP workloads, you don't need something very sophisticated like Cascades or the System R approach to do query optimization; you can get pretty far with a trick like this. Most people, when they build a database system for the first time, unless they're doing analytical workloads, get by with something like this. Oracle did this for about 20 years; we'll talk about that in a second. And Mongo, as far as I know, still doesn't have a query optimizer; they only started supporting joins in the last couple of years. What they actually do is kind of hacky, but it is kind of clever, and it works for them. So in this case here it's obvious, right? id = 123, I know to pick this index.
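Here is a minimal sketch of that "sargable" heuristic: if there's an equality predicate on an indexed column, just pick that index, no search needed. The catalog dictionary and the `pick_access_path` function are hypothetical names I'm using for illustration, not any particular system's interface.

```python
# Sketch of the sargable-query heuristic: if there is an equality predicate
# on an indexed column, just use that index -- no search required.
INDEXES = {"foo": {"id": "foo_pkey"}}   # toy catalog: table -> {column: index}

def pick_access_path(table, predicates):
    """predicates: list of (column, operator, value) tuples."""
    for col, op, _ in predicates:
        if op == "=" and col in INDEXES.get(table, {}):
            return ("IndexScan", INDEXES[table][col])
    return ("SeqScan", table)

print(pick_access_path("foo", [("id", "=", 123)]))    # -> IndexScan on foo_pkey
print(pick_access_path("foo", [("name", ">", "a")]))  # -> fall back to SeqScan
```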
But what if it was id = 123 AND name = something? Now I have two possible indexes I could choose. What MongoDB would do is generate both query plans, one using each index, pick one, run it, and see how long it took. The same query shows up again, they pick the other index, use that, see how long it took, and then just keep whichever one was fastest. It's simple, and it worked for them. For the operational, OLTP-style workloads that MongoDB was originally targeting, that works fine, and you don't have to maintain statistics about anything. So again, for simplistic workloads, this is okay.

All right, we talked a little bit at the beginning about cost estimation. Everything we're talking about today, except the pure heuristics, all the cost-based search, none of it works unless you have a good cost model, unless you have a way to make good estimates about whether one plan is going to be better than another. We'll read a paper and discuss this on Wednesday next week; for now just keep in the back of your mind the question of how we actually compare costs between different plans. The basic idea is that the cost is usually not tied to anything in the real world, like wall clock time. It's an approximation of the resources the query will consume when it actually runs. There are a bunch of different ways to derive a cost for a particular query, some of which we've already mentioned. In our case, an in-memory database, it's usually going to be the size of intermediate results. You could also look at the complexity of the algorithms and access methods, how much hardware resource you're using, or what the data looks like. The commercial guys have very sophisticated cost models that try to include all of these things; in many open source systems it's usually just how much disk or memory you're reading to process the query. We'll cover this in more detail next class, but when I say we do a cost-based search, we're relying on a cost model that derives, from what it thinks the query is going to do, some value that lets us compare the relative performance of two different plans.

All right, so before we get into the search strategies, let's talk about what we need to be mindful of when we build a query optimizer. We have to care about the granularity of optimization and the timing of when we invoke it, how we handle prepared statements, how we ensure stability of our plans, which is very important in enterprise settings, and, if we're doing a search-based approach, when we stop. We'll go through each of these one by one. The first is granularity, and the basic question is: what is the scope of queries that the optimizer is looking at? What people normally think of is single-query granularity, where the application sends a single query, the optimizer runs, and it only reasons about that one query it was sent. We do this because by only worrying about one query during the search, we limit the scope of the complexity of the problem.
Well, it's still hard, it's still NP-complete, but it cuts down some constant factors. What usually happens in this case is that we don't reuse any of the results of our search across multiple queries. That means, with the exception of prepared statements, where I run the optimizer once, generate a query plan, and can invoke it multiple times, if I'm not caching any query plans, then the first query shows up, I run the optimizer, the same query shows up a little later, and I'm not reusing any of the search information I collected the previous time. Every single query gets optimized from scratch every time. Also, even though I'm optimizing a single query, I may still want to account for what other queries are running at the same time, and to do that I have to embed some knowledge in my cost model about the resources being consumed by the other queries running concurrently. I know Vertica does something like this: when a query shows up, it may choose a plan for that query that is not the most efficient one if the query were running by itself, but is more efficient when it knows other queries are running at the same time. So again, single-query granularity is: a single query shows up, I run the optimizer for it, and I produce a single query plan. Another approach is multi-query planning. The idea here is that if I'm told ahead of time, here's a bunch of queries that are going to run together, the optimizer can try to meld them into a single plan where we can reuse data structures and intermediate results generated by different parts of different queries, instead of every query running by itself. You take a global view of what the workload looks like and try to generate an optimized plan for all of those queries together. Of course the search space is now much larger because I'm considering more things at once, but I may be able to get a better global optimum for my queries than if I was looking at a single query at a time. There was a system out of Europe called SharedDB where, again, you give it all the queries ahead of time and it tries to figure out how to reuse data across all of them. I don't know how many commercial systems actually do this, but there are academic papers on the problem.

All right, the next question is: when do we actually want to invoke the query optimizer? What most people think of is what's called static optimization. The query shows up, you generate the best query plan you can find for it, and then you execute it. You don't go back and revisit the plan as the query runs; you set it and forget it, and if it turns out to be a bad plan while it's running, you just stick with it. This is how most systems are actually implemented. You can amortize the cost of invoking the optimizer for really expensive queries by using prepared statements: prepare the statement ahead of time, run the optimizer, generate the query plan, and then every time I invoke that same query, I just reuse that plan.
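Here is a toy sketch of that plan-caching idea for prepared statements: pay the optimization cost once at PREPARE time, then skip the optimizer on every EXECUTE. The cache, the `prepare`/`execute` functions, and the stub executor are all hypothetical; real systems keep this state per connection or in a shared plan cache.

```python
# Toy sketch of plan caching for prepared statements: optimize once at PREPARE,
# reuse the cached plan on every EXECUTE. Parameters are substituted at run
# time, which is exactly what causes the problem discussed next.
PLAN_CACHE = {}

def prepare(name, logical_plan, optimizer):
    PLAN_CACHE[name] = optimizer(logical_plan)   # pay the search cost once

def execute(name, params):
    plan = PLAN_CACHE[name]                      # skip the optimizer entirely
    return run(plan, params)

def run(plan, params):
    return f"running {plan} with {params}"       # stand-in for the executor

prepare("get_artists", "JOIN(artist, appears, album)", lambda p: f"optimized({p})")
print(execute("get_artists", {"album_name": "my mixtape"}))
print(execute("get_artists", {"album_name": "another album"}))  # same cached plan
```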
We'll see in the next slide, though, that this can cause problems when you have parameters or variables in your prepared statement. The next approach is called dynamic optimization, where you take the logical plan you have, you don't run an optimizer up front at all, and then when you actually invoke the logical plan you do optimization passes to figure out the right thing to do for that part of the query. You're doing optimization on subsets or groups of the query plan itself as it executes. As far as I know, nobody actually does this; we'll see something in Ingres in a second that does sort of a version of it. Part of the reason this is hard is that it's difficult to implement: you have to be able to re-enter the optimizer and optimize a subset of the query every single time. It also makes things hard to debug because it's non-deterministic: you may run a query the first time and get one plan, run the same query again and get a completely different plan, because it's re-optimizing as it goes. There are ways to handle that, as we'll see with Postgres in a few slides, but as far as I know traditional databases don't do this; you might see it more in stream processing systems.

An approach that combines the two, trying to get the benefits of both, is adaptive optimization, or what I'll call hybrid optimization. The idea is that when the query shows up, you first run it through a pass to get your first optimized query plan, then you start running that query, but you observe whether the assumptions the optimizer made about the data when it generated that plan actually match what you're seeing on the real data. For example, if I assumed my predicate was super selective and would filter out a lot of tuples, but when I start running it I'm getting way more tuples than I expected, then I know the optimizer's estimates were incorrect and I may not have the most efficient plan for what the data actually looks like. So if the error rate for the query goes above some threshold, you go back and re-invoke the optimizer, providing hints about what the data actually looks like so it can update its estimates, and it generates a new query plan. This idea is about 15 years old; it came out of the University of Wisconsin. Only in very recent years, the last couple of years, have commercial systems actually started supporting this adaptive query optimization: Oracle has it, SQL Server added it in 2017, and I don't know about DB2. There are different ways to implement it. The easiest is: I run my query plan, and if I see my estimates are wrong, I go back to the optimizer, throw away any intermediate results I've already generated, and start over from scratch. The more sophisticated approaches, which I don't know whether the commercial versions actually use, try to say: well, I've already scanned this table, let me keep those results around, go back to the optimizer, and generate a new query plan while pinning the operators I've already executed. Obviously that may not lead to the most optimal plan, so there are tradeoffs in both approaches.
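Here is a rough sketch of the adaptive idea: compare observed cardinalities against the optimizer's estimates as operators run, and re-plan when the error passes a threshold. Everything here is made up for illustration; the `toy_optimize` and `toy_execute` stubs are not how any commercial system exposes this, and a real implementation would also have to decide what to do with intermediate results.

```python
# Sketch of adaptive (hybrid) optimization: run the plan, watch whether the
# cardinality estimates hold up, and re-optimize with feedback if they don't.
def run_adaptively(logical_plan, optimize, execute_step, threshold=2.0):
    feedback = {}                                  # observed cardinalities so far
    plan = optimize(logical_plan, feedback)
    for op in list(plan):
        observed = execute_step(op)                # actual tuples produced
        error = observed / max(op["est"], 1)
        if error > threshold or error < 1 / threshold:
            # Estimate was badly wrong: remember what we saw and re-plan.
            feedback[op["name"]] = observed
            plan = optimize(logical_plan, feedback)
    return plan

# Toy demo: the default scan estimate (10 rows) is wildly off, so we re-plan.
def toy_optimize(lp, feedback):
    scan_est = feedback.get("scan", 10)
    return [{"name": "scan", "est": scan_est},
            {"name": "join", "est": scan_est * 2}]

def toy_execute(op):
    return {"scan": 1000, "join": 2000}[op["name"]]

print(run_adaptively("artist JOIN appears", toy_optimize, toy_execute))
```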
So again, pretty much everyone does static optimization, and that's what we'll focus on today. Historically Ingres did the dynamic approach, but they don't anymore, and it mostly shows up in stream processing systems now. And adaptive optimization is something the commercial systems now actually support.

All right, the next thing is prepared statements. Again, the optimizer is a search algorithm; for complex queries it's a complex, heavyweight search, and the amount of time it takes can be non-trivial. Ideally I don't want to run a 10-second search to find a query plan for a query that's going to run for one second, so I want ways to reduce the amount of time I spend in the optimizer. Say that, for whatever reason, this query is super expensive to optimize. It's a three-way join, which is actually nothing special, but assume that it is expensive. If I'm going to execute this query over and over again, rather than running the query optimizer every single time it shows up, I can declare it as a prepared statement, and then invoke it from the terminal with the EXECUTE command using the prepared statement handle I defined above. Now when the EXECUTE shows up, the database system says: oh, I know this prepared statement, I have a cached query plan for it, let me just reuse that, and it skips the optimizer. But now say that for this query I've hardcoded some parameters, and I want to be able to change them to run the query on different parts of the table. That's not a problem: in prepared statements you can rewrite these as placeholder variables, and at runtime I pass values in like a function call and they get substituted in.

So what's the problem? [Student:] What if you pass in bad variables? Say the predicate is normally highly selective, but you pass in a value that matches basically everything. Yes, exactly. Let's keep it simple: I see the query the first time, I run the query optimizer, and I generate a query plan like this. But if I pass in different values for these parameters, that can change the selectivity of these filters, and what was the optimal plan, say join A and B and then join C, may no longer be optimal; maybe it should be A and C and then B, or B and C and then A. At prepare time, before we've ever invoked the query, we don't know anything about what values people are actually going to pass us. So how do we derive what the optimal plan should be? There are three choices. One is we just punt and re-optimize every single time we get invoked with new variables we've never seen before. If we're clever in our software engineering, we can keep track of where we left off in our search last time and maybe use the best plan we've seen before as a starting point, instead of starting from scratch all over again. The next approach is to look at the values these parameters could possibly have, break them up into quantiles or buckets, and generate a query plan for the average value of each bucket. Then when an invocation shows up, you say: my variables look like this, that falls in this bucket, and therefore I can use the cached query plan I generated for that bucket.
Of course, the problem there is that if you have multiple variables it's a multidimensional search space, and how many plans you need to generate is not obvious; it can be quite expensive. The last approach, which as far as I know is the most common one in commercial systems, is that you just choose the average value for every single parameter and use that for all invocations of that prepared statement. You can also reason about the predicates in the query: if instead of a greater-than this is an equality predicate and you have a unique index on the column, then you know the selectivity is always going to be one tuple, so you can fix that part of the plan to use the index, and for everything else you pick the average or something like that. The main thing I want to emphasize is that this is hard with prepared statements because you just don't know what's going to show up until later on.

The next issue is that we want plan stability. Stability basically means that if I run a query today and it takes X amount of time, when I run it tomorrow I want it to take X amount of time, modulo maybe 10% or so. People don't like it if the query is super fast today, super slow tomorrow, and super fast again the day after; that kind of oscillation is bad because it's hard to pin down what's actually causing the performance problems. So one way to support stability in your optimizer is to allow the DBA to provide hints about how it should choose a query plan. The most common things are picking the join order, or telling it to use a hash join, or to use a particular index somewhere in the query plan. The next approach is that you can tell the database system to use an optimizer version that you've vetted, one you know produces query plans with reasonable performance. Oracle does this: if you download the latest one, Oracle 18, included in the Oracle binary are the query optimizers from a bunch of previous versions, and you can specify on a per-query basis which version of the optimizer you want Oracle to use. The idea is that you know that optimizer is deterministic, so you know what kind of plans it's going to generate, and that gives you the stability you want. Of course it also means that if there's some new feature or technique in the newer optimizer that could speed up your query, you're not going to get it, because you've pinned your plan to the old optimizer. But again, nobody wants to upgrade their database system and have 99% of their queries get super fast while 1% get super, super slow. People will call and complain. No one calls to say, hey, my query is so much faster, thanks; they call to complain that their query is slower and to tell you to fix it. So this is one way to get around that problem. Another approach, sort of similar to the first one, is that instead of giving hints, you can export the query plan generated by the old version of the system, upgrade, and then feed that plan back in, essentially as a prepared statement, to fix the query plan for that query.
SQL Server allows you to do this: it can dump out the query plan as XML, and then after you upgrade your system you load it back in. Again, this avoids having the optimizer try something weird that may cause your query to get slow.

The last consideration is when we actually want to stop our search, and this one's sort of obvious. The basic options are: let it run for a certain amount of wall-clock time and stop at that limit; stop if you find a query plan whose cost is below some threshold, where how you define that threshold can vary; or, if you know you've exhausted the search space entirely, stop and take the best plan you've seen. As far as I know, most systems do the last one: if I exhaust my search space, I stop. But most of the time you also set a timeout. The cost-threshold option is trickier, and balancing it against what the query is actually going to do is hard: if it's going to take me 10 seconds to find an optimal query plan and that plan runs in one second, but I could find a plan in two seconds that takes two seconds to run, I'm better off not doing the full search. What the right cutoff is can vary, and as far as I know most systems just have a hard-coded value; you can set it in the config file, but it's set to some arbitrary number, okay? Any questions about these things? These are the things we have to worry about if we actually want to build a real optimizer.

Okay, so let's talk about how we actually build an optimizer. We're going to look at five different approaches, and the way to think about them is that we're going forward in time and forward in complexity. The heuristic-only approaches were the very first ones and are the simplest, and the stratified and unified search approaches are the more modern ones and the most complex to implement. The spoiler is that Cascades is going to be the last one: it's an example of a unified search strategy. All right, so the first approach is heuristic-based optimization, where you just have static rules in your optimizer that know how to transform a logical plan into a physical plan. These are the standard optimization techniques we teach in the undergrad class: predicate pushdown, projection pushdown, doing selections and filtering before you do any joins, trying to throw out data as early as possible, and ordering joins based on cardinality, with the bigger table as the outer table and the smaller table as the inner table. These are all hard-coded rules for things you're pretty sure you always want to do. I'm almost always going to want to do projection pushdown; there are rare cases where I might want to keep the projection at the very top and not prune early, but you can just hard-code a rule to always do it. And contrast this with the next strategies, where we do cost-based search: here there's no cost model involved at all. You always apply the rule, unless there's some criterion telling you not to.
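To show what a purely heuristic rule looks like in practice, here is a toy predicate-pushdown rewrite: it always fires when its pattern matches, with no cost model anywhere. The nested-dict plan representation and the `push_down_filters` function are made up for illustration.

```python
# Sketch of a heuristic rewrite: push a filter below a join whenever the
# predicate only references one side. No cost model -- the rule always fires.
def push_down_filters(plan):
    # plan is a toy nested dict: {"op": "Filter", "pred_table": ..., "child": {...}}
    if plan["op"] == "Filter" and plan["child"]["op"] == "Join":
        join = plan["child"]
        side = "left" if plan["pred_table"] == join["left"]["table"] else "right"
        join[side] = {"op": "Filter", "pred_table": plan["pred_table"],
                      "pred": plan["pred"], "child": join[side]}
        return join            # the filter now runs before the join
    return plan

before = {"op": "Filter", "pred_table": "album", "pred": "name = 'my mixtape'",
          "child": {"op": "Join",
                    "left": {"op": "Scan", "table": "appears"},
                    "right": {"op": "Scan", "table": "album"}}}
print(push_down_filters(before))
```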
This is how things were actually implemented in Stonebraker's first system, Ingres, at Berkeley. And this is what Oracle did, surprisingly, up until the 1990s. It's crazy to think about how popular and how big they got while running for roughly 20 years with a heuristic-based optimizer. Talking with people who used to work on Oracle's old optimizer back in the day, they said it was the nastiest, largest piece of code in the entire system: very impressive, but essentially impossible to maintain, modify, or extend, because it had to deal with all the corner cases of all these rules, and it was all written in C. There wasn't any high-level language describing what a rule actually was. So you may think this is crazy: how did they get so far with the simplest approach? Well, back in the day, at least through the 1990s, people weren't running the super complex queries they run today. 75-way joins were probably not that common back then, there were no CTEs, no window functions, so you didn't have to worry about those things. The other thing I'll say is that when you build a database system for the first time, this is pretty much what everyone does. We did this in H-Store and VoltDB, and we did this in Peloton, because this is the sargable stuff: I want to pick an index to use for my query, and that's a heuristic. That's an example of what this is.

So I want to go through an example of what Ingres did, because I think it's kind of interesting. It's more of an intellectual curiosity; I'm not saying we should do it this way, it's just an example of what people have done, and it'll help motivate why we're going to do things in a much more disciplined manner with Cascades or Starburst. We'll use this simple database here: three tables, artist, album, and appears, like Spotify or some music web app. An artist appears on albums, so there's a foreign key reference from the appears table to the other two tables. We're going to run a query that gets the names of all the artists that appear on my mixtape, so it's a three-way join between artist, appears, and album, with a lookup on "my mixtape."

All right, the way Ingres is going to handle this is that it can't do joins; it can only execute single-table queries. So it has to use heuristics to rewrite the query into single-table form. The first thing it does is decompose the query into single-value queries, meaning each query only does a lookup on a single value in its WHERE clause. So first we break out the lookup on album, move it into a separate query, and write its result into temp table one, and then we rewrite the original query so that instead of joining on album, it joins on the temp table generated by that first query. Then we take that query and decompose it further, breaking it out so that there's a join between temp one and the appears table, and then a join between artist and the temp table generated by that second query. Again, this is just heuristics; it can do this rewriting, and it's operating on the logical query plan, not the actual SQL string, as far as I understand from the paper.
All right, so the next step is to remove all of these join dependencies and substitute in the values generated by the previous queries. This very top query here isn't accessing any temp table; it's the root of the execution plan. So I execute it first and it produces a value, which I then substitute into the next query to replace the value it was trying to join on. So I've replaced the lookup on the temp table's album ID with 999, because that's what came out of the first query. That query then produces the results 123 and 456. Then I come back down to the last query, and because I have two values here, and they didn't support IN clauses in the 1970s, they had to rewrite it as two separate queries, one on 123 and one on 456. You run those, coalesce the two results, and that's your answer. Crazy, right? It's kind of cool. Obviously super inefficient, but you have to understand, it's the 1970s. The paper that describes this talks about operating on a table with 300 tuples; the hardware was super limited and the data they had to manage was much smaller, so in that environment this works.

Okay, so what are the advantages of heuristic-based optimization? Well, it's easy-ish to implement and debug, because you know how to write test cases that target the rules you define and check whether they produce the correct result. For simple queries this works reasonably well, and it's actually super fast, because there's no search: does my query look like this? If yes, rewrite it this way. The downside is that it relies on magic constants in your code to predict whether a choice is good or not. Should this table be the outer table or the inner table? There's some magic number hard-coded in there trying to figure that out. And when you have complex queries, this just doesn't work at all, because there are complex interdependencies between different operators in the query plan that you can't reason about with these simplistic rules. So, like I said, when you build a database system for the first time, your basic query optimizer looks something like this, and then eventually you throw it away and start over, or you build something more sophisticated, which is what we did.

All right, so this was the approach used by two of the three first relational database systems of the 1970s, Ingres and Oracle. The other relational database of the 1970s was IBM's System R, and they did what is most often used even today in systems like Postgres and MySQL and most open source database systems: a combination of the heuristic-based optimization we saw in Ingres plus a cost-based search to make decisions like join ordering. As I've said many times, with the System R project they got a bunch of people with fresh PhDs into a room, took Ted Codd's paper, said let's go build this, and everyone carved off their own piece. Pat Selinger worked on the query optimizer. She just retired recently; she was at Salesforce.
And I had a great student who took my class two years ago and is now at Salesforce working on their database system, and she didn't realize until after the retirement party that the Pat Selinger she was having cake with was the Pat Selinger from class who built the query optimizer. So yeah, she's quite famous in the database world. This is an example of the first cost-based query optimizer for a database system. And the argument back then, the same one we talked about with compilers, was that people claimed a declarative language like SQL or Quel would never be able to generate a query plan as efficient as one written by a human. This was the first approach to show that's not the case: the optimizer can do it better than a human can.

We won't define this in detail just yet, but the System R approach is an example of what's called bottom-up planning, where you start with nothing in your query plan and build it up from the bottom, adding in the pieces you need to compute the answer you want, whereas Cascades is a top-down model, which we'll see in a second. As I said, IBM implemented this in System R, and when they went off and built DB2 in the early 1980s, as far as I know they took the System R query optimizer and ported it over to work in DB2. Eventually, later in the 1980s, they got rid of it. But this is the approach now used in pretty much every open source relational database system out there, the major ones, MySQL, Postgres. Postgres has a variant of a different type of optimizer we'll see in a second, but if you have fewer than 13 tables in your query, you get the System R approach.

All right, so let's see how System R does this. First they break up the query into blocks, and again we're operating on logical operators here, not the actual SQL. Then for each logical operator in a block, they generate the set of physical operators that could be used to implement it, so all possible physical operators. And then they do a search to construct a left-deep tree that minimizes the amount of work, the amount of data they have to read from disk, for that physical plan. This is an example of the methods they use to cut down the search space of the problem they're trying to solve. They only look at left-deep trees, where the joins only grow up the left side of the query plan, as opposed to a bushy tree, where joins can be in any part of the tree, or a right-deep tree; they throw all of those away and only consider left-deep ones. They also throw away Cartesian products, because you almost never need those.

So let's say we extend the query we had before, looking up the names of the people on my mixtape, but now we add an ORDER BY clause: we want to sort them by artist ID. I'm doing this because it's going to highlight a key deficiency of System R's approach. All right, so the first thing we do is choose the best access path for each table.
That just means looking at the three tables I'm scanning in my query and using a heuristic: do I have an index for what the query is looking up on? If yes, I pick that; otherwise I fall back to a sequential scan. In this case we're doing a lookup on name on the album table, and assuming we have an index for that, we can use an index scan there; everything else is a sequential scan. Then, for my joins, I enumerate all possible join orders, again at the logical level: artist join appears join album, appears join album join artist, and so forth. I'm throwing out the Cartesian products, and what's left is still exponential: all possible orderings of these tables. Now I want to determine which join ordering from my list has the lowest cost, and to do that they do a search with a divide-and-conquer flavor: estimate the cost of doing the joins in stages, get to the endpoint, which is the final result I want, and then backtrack to figure out what was the cheapest way to get there.

So it looks like this. I should rotate this: think of this side as the bottom and that side as the top. Here at the bottom, nothing is joined together; at the next step, each node is either album joined with appears, or artist joined with appears, and so on for all the other possible orderings; and my final destination is the node where all three tables are joined together. The edges between one node and the next are all the physical operators I could use: I could do either a sort-merge join or a hash join for each of these, and for each one I compute its cost using my cost model to figure out which is cheapest. So to get from this node to this node, I pick whichever operator is best; then at the next stage I do the same thing, looking at all possible physical operators and picking the one with the lowest cost, until I get to my endpoint. Then I backtrack and figure out which path through these nodes produces the plan with the lowest total cost, and that's how I pick my physical plan.

But what's missing here? How did I change the query? I added the ORDER BY, and there's nothing in here that talks about sort order. The way they got around that was to bake into the cost model some information, some expectation, about what the data needs to look like in the final output, and use that to change the cost estimates. So in this case I'd say: I know my query needs to be sorted by artist ID; I have a sort-merge join on artist ID here, and that will put the data in the order I want, therefore it should get a lower cost than the hash join. There's no way to directly introduce an ORDER BY requirement into this search. We're baking knowledge about what the data needs to look like into the cost model, disconnecting it from the search strategy, and that's arguably not the right place to have it, okay?
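For reference, here is a stripped-down sketch of the bottom-up, Selinger-style dynamic programming search we just walked through, over left-deep join orders only. The cardinalities, the `join_cost` formula, and the fake selectivity factor are all made up; a real cost model is far richer, and, as just noted, this sketch has no notion of sort order at all, which is exactly the deficiency being pointed out.

```python
# Toy Selinger-style bottom-up join ordering: grow the best left-deep plan
# one table at a time using dynamic programming over table subsets.
from itertools import combinations

CARD = {"artist": 10000, "appears": 50000, "album": 2000}   # made-up table sizes

def join_cost(outer_card, inner_card):
    return outer_card + inner_card          # stand-in for a real cost formula

def selinger(tables):
    # best[set of tables] = (total cost, estimated cardinality, left-deep plan)
    best = {frozenset([t]): (0, CARD[t], t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for inner in subset:            # the table joined in last
                rest = s - {inner}
                rest_cost, rest_card, rest_plan = best[rest]
                cost = rest_cost + join_cost(rest_card, CARD[inner])
                card = max(rest_card * CARD[inner] // 1000, 1)   # fake selectivity
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, card, (rest_plan, "JOIN", inner))
    return best[frozenset(tables)]

print(selinger(["artist", "appears", "album"]))
```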
All right, we touched on this briefly, but I want to bring up the distinction between top-down and bottom-up. The System R approach I just showed you is bottom-up, even though the picture went from left to right: you start with nothing, with all your tables unjoined and unprocessed, and you build up your join orderings until you get to the very top, where all the data is joined the way you expect. Top-down, which is what Volcano and Cascades use, starts with what you want the answer to be and then searches down, figuring out what nodes you need to add to the query plan to get you where you want to go. At a high level these two approaches are semantically the same: they'll both produce a near-optimal query plan that gives the result you want. This is something of a religious debate, but in my opinion top-down is the better way to do it, because you can reason about the query plan as you go down and prune out branches where you know there's never going to be a better path to your goal. This is also independent of whether the search is stratified or unified; Cascades is a unified approach, and in my opinion unified top-down is the way to go. The Germans disagree with me; that's fine. This will make more sense as we talk about Cascades, Volcano, and Starburst, but the thing to understand is that System R is a bottom-up approach and Cascades is a top-down approach.

So I want to talk a little bit about what Postgres does, at least what you normally get in Postgres when you run queries with fewer than 13 tables; we'll talk about what happens with more than 13 tables in a second. Postgres is an example of heuristics plus a cost-based search. The way it works is that there are explicit stages that do different types of optimizations on your query plan, and baked into the code are assumptions about what the output of one stage needs to look like as input to the next stage. This is why I like the unified model: you just throw everything into a single search and it does the optimization for you, whereas here you have to do the rewriting first, then the cost-based search, and then everything else, I think aggregations and ORDER BY, gets added back into the query plan independently of the search you actually did. One of the main developers of Postgres, Tom Lane, actually lives in Pittsburgh; he's a CMU alum, he got his PhD here in the 1990s, not in databases but in software engineering. When we first started building Peloton, we were using Postgres as a starting point, and we had lunch with him to discuss the various parts of the system we thought we wanted to reuse, one of the big ones being the optimizer. This was maybe three or four years ago, and when we started asking questions like how to fix this or change that in the optimizer, he said: yeah, that's a nasty piece of code that nobody really touches, because it's brittle and difficult to understand. He basically said the darkest corners of Postgres are in the query optimizer, and when we actually looked at the code, I agreed with him. It may have changed.
I haven't looked at it in several years, but it was a pain to modify or extend, again because baked into the code were all these assumptions about what the query plan needed to look like as it went from one stage to the next.

All right, so what are the advantages of the System R approach? It usually finds a reasonably good plan without having to perform an exhaustive search. The downside is that we still have all the same problems we had with the heuristic-based approach, because that's the first stage of this: we do heuristics and then we do our search, so all of those problems carry over. Only taking left-deep join trees is a System R-specific restriction; modern systems can also consider right-deep or bushy trees, and nothing about the search strategy forces that assumption. But I'd say the biggest problem is the one from the example: the sort order, where you have to bake into the cost model some notion of the physical properties the data should have, and use the cost model to enforce them by assigning a high cost to plans or operators that leave the data in a form that can't produce the correct result.

All right, the next class of algorithms, our next approach to optimization, is randomized algorithms. The idea here is that rather than doing a guided search with a branch-and-bound model or the System R dynamic programming approach, you just take a random walk over the space of all possible query plans and hope you find one that's more efficient. This is a search you could keep running forever, so obviously you want to stop after a certain amount of time or when you pass some cost threshold. I'll show you what Postgres does in a second, but the first approach that did something like this was from the 1980s, using simulated annealing. The basic idea is that you take a query plan generated using a heuristics-only approach, and then you start perturbing it in different ways, like changing the join order of two tables, and check whether the cost of the new query plan is better than the best one you've seen so far. If a permutation produces a plan with a lower cost, you always accept that change, and then you permute that one and jump somewhere else in the solution space. If the permutation makes things worse, you flip a weighted coin to decide whether to keep it anyway; if not, you revert and flip something else, and if yes, you continue forward from there. The point of sometimes accepting a worse plan is that it lets you break out of local minima. But, and this goes back to our relational algebra equivalence rules, it's super important that any time we flip something in a query plan, the result is equivalent to the plan we just came from, because we don't want to permute something and have it produce incorrect results. So every single time you make a transformation, you do that check. As far as I know, nobody actually implements this one.
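Here is a toy sketch of that simulated-annealing idea: always accept an improvement, sometimes accept a regression with a probability that shrinks as the "temperature" cools. The cost function and the swap move are stand-ins I made up, and a real system would also have to verify that every permutation is still equivalent to the original plan, which this sketch skips entirely.

```python
# Toy simulated annealing over join orders. cost() is a fake stand-in; the
# correctness check on each permutation is omitted here but is mandatory in
# a real optimizer.
import math, random

random.seed(42)                      # deterministic seeding, for stable plans
CARD = {"A": 10000, "B": 500, "C": 200, "D": 80000}

def cost(order):                     # fake cost: penalizes big intermediate results
    total, running = 0, 1
    for t in order:
        running *= CARD[t]
        total += running
    return total

def anneal(order, rounds=200, temp=1e9, cooling=0.95):
    best = list(order)
    for _ in range(rounds):
        candidate = list(best)
        i, j = random.sample(range(len(candidate)), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]   # swap two tables
        delta = cost(candidate) - cost(best)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            best = candidate         # always accept if better, sometimes if worse
        temp *= cooling
    return best, cost(best)

print(anneal(["A", "B", "C", "D"]))
```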
But Postgres has something interesting: for large joins it uses a genetic algorithm, which is sort of similar to simulated annealing, except that in each round it generates a bunch of query plans, looks at which one has the worst cost and throws it away, and then takes some "genes," if you want to call them that, from the query plans with the best costs, mixes them together, and does another round to see whether that produces a better query plan. The idea is that you take query plans that are good, mate them together, and produce new offspring that may be even more efficient. Postgres only does this if you give it a query joining 13 or more tables; otherwise it falls back to the System R approach. I don't know how well this works; I actually have not seen any studies showing whether it's better than the System R approach for Postgres' implementation. So here's a super simple example: I'm joining three tables, and in the very first generation I generate a bunch of random query plans. I cost them, figure out which one is cheapest, and that becomes the best plan I've seen. I take the one with the worst cost, throw it away, pick some random genes from my best ones, and produce new offspring. Then I do the same thing: cost them, keep the best, throw away the worst, have them mate, spit out new offspring, and keep going like that. Again, the idea is that this lets you break out of local minima in your multidimensional search space, so it may stumble upon a better query plan, but it's not guaranteed to, because it's random.

So the advantage is basically what I said: you jump around and may find plans you would never reach in a dynamic programming or branch-and-bound search. It also has low memory overhead if you don't keep track of information from one generation to the next; there's no path to keep track of, you don't record how you actually got to the offspring you're at now, you just flip a coin and here's where you ended up. What are the downsides? One is that you may have no idea why you ended up choosing a particular query plan, which is bad because you can't go back and reproduce it. You also have to do extra work to make sure your query plans are deterministic and stable, meaning that if I throw the same query through my genetic algorithm today, I want to end up with the same query plan tomorrow. The way Postgres handles this is that it seeds the random number generator deterministically, so the same query always starts off with the same random values. And we still have to implement our correctness rules to make sure the permutations don't land us in an invalid state, okay?

All right, we'll get through Volcano and then I'll stop there, because we only have 15 minutes left. So far I've basically shown you the first 25 or 30 years of query optimizers. All of these implementations were written in a procedural language, C or whatever the base database system was written in.
And so the problem with this is that because these are imperative languages, it's hard for us to reason about whether the transformations or the changes we're making to our query plan to optimize it are producing correct results. Again, I just said this on the last slide: we have to hard-code a bunch of extra checks, like Postgres does, to make sure that our permutations are valid. So there's no easy way for us to verify that our optimizer is gonna generate correct query plans without basically doing fuzz or soak testing: take a bunch of sample queries, throw them through the optimizer, and make sure they produce the correct results. The other aspect of this is that in a bunch of these examples there's a separation between the transformation from logical to physical and the overall semantics of what the query is actually trying to do. That means all of these approaches are doing their operations on a localized operator in the query plan; they're not looking at the bigger picture of what the query plan is trying to do. They have to do this because query planning is very complex, so they carve off a small piece and take a divide-and-conquer approach.

So the techniques we're gonna get into now are called optimizer generators. The idea is that we wanna be able to declare our transformation rules for query optimization in a high-level declarative language, like a DSL, and then feed them into a compiler which generates an optimizer for us that can apply those transformation rules. This was a big thing in the 1980s and 1990s. Rather than hard-coding the rules in your system, you have a separate file that contains all your rules, and you feed that into a compiler that spits out the C or C++ code that does those transformations. There are two different ways you can organize these optimizers: a stratified search or a unified search. The other big thing about these newer implementations, and again newer means the 1980s and 1990s, before some of you were born, but that's okay, is that instead of having the cost model reason about the physical properties of the data, like whether it's sorted or not, we're now gonna treat that as a first-class concept in our transformation rules, so we can reason about what the data should look like when it feeds into an operator and what it looks like when it comes out. The sort order one is the most obvious example: in the System R approach, the search algorithm itself wasn't actually considering sort order; it was only accounted for in the cost model. But now we can include it directly in our search.

So stratified search is essentially the same structure as the System R approach, where you have heuristics followed by a cost-based search. The difference here is that we're gonna write our transformation rules in something that looks like a relational calculus, and the engine that we feed those rules into can then reason about whether they're correct or not. We're defining what it means to do a transformation based on the equivalence rules, and then the engine can say, yes, this is a correct transformation, or no, it's not, right?
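Just to give a flavor of what "rules declared as data instead of hard-coded if-clauses" means, here's a minimal sketch with a toy plan representation. Everything here is hypothetical: a real optimizer generator compiles the rules from a separate DSL file rather than having you write C++ lambdas, but the shape is the same, a pattern to match and a substitute to produce, with a generic engine that applies whatever rules it is given.

```cpp
#include <cstdio>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct PlanNode {
  std::string op;                                     // e.g. "JOIN", "SCAN A"
  std::vector<std::shared_ptr<PlanNode>> children;
};
using PlanRef = std::shared_ptr<PlanNode>;

struct Rule {
  std::string name;
  std::function<bool(const PlanRef&)> matches;        // the "pattern"
  std::function<PlanRef(const PlanRef&)> apply;       // the "substitute"
};

// A commutativity rule, declared as data: A JOIN B  =>  B JOIN A.
static Rule JoinCommute() {
  return Rule{
      "JoinCommute",
      [](const PlanRef& n) { return n->op == "JOIN" && n->children.size() == 2; },
      [](const PlanRef& n) {
        auto out = std::make_shared<PlanNode>(*n);    // copy, then swap inputs
        std::swap(out->children[0], out->children[1]);
        return out;
      }};
}

// The generic engine: try every rule at a node and collect the alternatives.
// A real engine would also derive/check properties and feed these to a cost model.
static std::vector<PlanRef> ApplyRules(const PlanRef& node,
                                       const std::vector<Rule>& rules) {
  std::vector<PlanRef> alternatives;
  for (const auto& rule : rules)
    if (rule.matches(node)) alternatives.push_back(rule.apply(node));
  return alternatives;
}

int main() {
  auto a = std::make_shared<PlanNode>(PlanNode{"SCAN A", {}});
  auto b = std::make_shared<PlanNode>(PlanNode{"SCAN B", {}});
  auto join = std::make_shared<PlanNode>(PlanNode{"JOIN", {a, b}});
  auto alts = ApplyRules(join, {JoinCommute()});
  std::printf("%zu alternative(s); first input is now %s\n",
              alts.size(), alts[0]->children[0]->op.c_str());
}
```

The win is the one I keep coming back to: adding a new transformation means declaring one more rule, not touching the engine.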
So the first stage is just heuristics, where you apply these declarative rules without a cost-based search, and the second stage is the cost-based search that converts the logical plan into a physical plan. So the first stage is logical-to-logical and the second is logical-to-physical. The most famous one was a project out of IBM Research called Starburst. As I said, when they first started building DB2 they used the System R optimizer, and at some point that became too difficult to maintain, so they started the Starburst project, and as far as I know DB2 still uses the Starburst approach. It's probably not the same code, it's been modified over several decades, but at a high level it works basically the same way. Again, this is the one where I don't pretend that I fully understand what they're actually doing. If you go read the original paper you'll see all these relational calculus rules, which are a pain to read, but at a high level it's pretty straightforward: you define some kind of rules, in their case over their query graph model, in something that looks like relational calculus, the engine knows how to do those transformation rewrites, and then you do a System R dynamic programming search once that part is done. And I think Oracle uses this approach in the way they implement their optimizer today.

The advantage of this is that in practice it works well, at least according to their papers and what people say and what IBM developers have told me. Of course, it was Guy Lohman who implemented it, and he said it worked great, which you should take with a grain of salt, I suppose. But I think the big advantage you're getting is from a software engineering standpoint: you're not having these hard-coded if-clauses everywhere to figure out whether you have a match. And actually this is similar to what Eric and his team were talking about when they presented how they define the rules in their engine and it does the transformations. Last year, the team that did transformation rules in our system had all these if-clauses scattered throughout different parts of the system to do rewriting; now we have a single engine that can do all of this. So that's what this approach gives us.

The disadvantage of the Starburst approach is that, at least in the original implementation, it was hard to assign priorities to transformations. Some transformations may be more important to look at first because they'll get you the biggest performance speedup, but you couldn't express that in the original implementation. For some transformations it's also not possible to assess whether they're good or not without making multiple invocations of the cost model; I think that was an engineering issue that was later sort of fixed. And then this one I actually still buy: if they're still using the rule language that was defined in the original Starburst paper, maintaining those rules is a big pain, because it's not C or C++ code that most people understand, it's this other DSL that is unnatural for systems programmers to write.

So to finish up real quickly, we'll discuss the unified approach. And again, I'm biased: this is what I like and this is what we use in our system. Rather than having a separate stage to go from logical to logical and then another stage to go from logical to physical, we're gonna have a single search model that does everything, right?
And everything is just gonna be defined as transformations. The rule engine can figure out whether it makes sense to do a logical-to-logical transformation here or a logical-to-physical transformation there, and I don't have to reason about whether I do one before the other, right? I can do a logical-to-logical and a logical-to-physical transformation and determine whether one is better than the other at that moment, without waiting to get to a separate stage. Now, the thing is that because we're doing logical-to-logical transformations inside our search, this is gonna end up generating way more transformations, and a lot of them may be redundant. If I can do a logical-to-logical transformation from A join B to B join A, the rule may then want to fire to go back the other way to A join B. So to avoid all this redundant work, I'm gonna use a memoization table to keep track of what I've already explored, so I know I don't need to revisit it in my search. Of course this table becomes expensive to maintain, but that's unavoidable.

There are actually three of these systems. Goetz Graefe did the Cascades model that you guys read about, but he built two other optimizer generators before that one: the first was called Exodus, the second was Volcano, and the third was Cascades, which was his last one. This is the same Volcano as when we talk about the Volcano iterator model for query processing, and the same Goetz Graefe who wrote the paper on index latching that you read at the beginning of the semester. What Volcano does differently is that it embeds the physical properties in the actual operators themselves, and the rules can then reason about whether the transformation you're trying to do is valid based on what physical properties it expects to have. And this is also a top-down approach, which we'll see on the next slide: we start from the outcome we know we want and figure out what steps we need to take to get us to that point. Exodus was in the 1980s and I don't think it went anywhere. Volcano, as far as I know, at least from reading the papers, was never actually implemented in a commercial system; a bunch of academic systems took his optimizer framework and used it in their own systems, but nobody ended up putting it in a commercial system. Whereas Cascades is being used in SQL Server today, and SQL Server probably has the best query optimizer.

Okay, so let's look at the example here. Again, the idea of a top-down approach is that we start at the very beginning with what we want our final result to be, right? So for us, for that query looking up the artists on my mixtape, we want artist join appears join album, and we want to order by the artist ID, so in my top-level operator I'm embedding what I want the sort order of my data to be. Then I have a bunch of rules that I've predefined that allow me to do transformations to figure out how to get me up to where I want to be. So say you have a bunch of these logical operators here, right, joining the tables in different ways and scanning them in different ways.
I can either do a logical-to-logical transformation, like A join B to B join A, or I can do a logical-to-physical transformation where I take one of these logical joins and pick the actual join algorithm I want to use. So let's say in this one here I do a transformation to join these tables with a sort-merge join, and now I can figure out what I need to do to feed the operators into that physical operator. I want a scan on album, and then I have a join on artist and appears, and I can keep going down here and say, all right, that could be a hash join, and how do I feed data into that? And I can cost this as I go down, right? I could also do a sort-merge join there instead.

What's different here is that we now have what are called enforcer rules. Basically, they ensure that the data looks like what we want as our input. So in this case, this is a sort-merge join, so we know it would satisfy the enforcer rule for the physical property of the data, and that plan is valid. But if I have a hash join like this, it's gonna generate an essentially random sort order, so I know that violates my enforcer rule, and I can go ahead and just cut that off. And then this is a branch-and-bound search. So if I now introduce something like a quicksort, I can cost whatever that is and figure out how much it's gonna cost me to do it. Then as I traverse down the tree, say I add a hash join: as soon as I know that this path is more expensive than the best path I've seen so far, say the sort-merge one here, I can just prune it off and not traverse down into the other parts of that query plan, right? So again, for me, for whatever reason, it's easier to reason about this kind of stuff than about the heuristic approaches, although this approach isn't perfect either, and I think it's also easier to implement and maintain.

To finish up real quickly, what are the advantages of the Volcano approach? Again, we have these declarative rules that allow us to implement the transformations, and it's easy to extend: once we have the basic search engine, we can add new rules without having to modify any aspect of the actual search algorithm itself. We just declare the rules and they get fired off by the engine as it does the traversal. The downside specific to Volcano is that every single time I land on a logical operator and want to figure out what alternatives I could consider, it fires all the rules first, which explodes your search space, and then you have to traverse down to figure out what all of those cost, right? Whereas Cascades is not going to do that: Cascades is actually going to apply the transformations on the fly, as needed. Oh yeah, and this approach could not do the expression tree rewriting or predicate rewriting that we talked about before with Eric and Newton and William's group, okay?
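Here's a rough sketch of what that top-down, branch-and-bound search with a memo table looks like, stripped down to nothing but join ordering. The cardinalities and cost formula are fake, and physical properties and enforcer rules are left out entirely; the point is just the memo lookup per group and the prune-when-over-budget check.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <unordered_map>

// Fake cardinality of a set of tables, encoded as a bitmask: table i has size i + 2.
static double Card(uint32_t table_set) {
  double card = 1.0;
  for (int t = 0; t < 32; ++t)
    if (table_set & (1u << t)) card *= static_cast<double>(t + 2);
  return card;
}

// Memo table: one entry per "group" (set of tables), holding the best cost found.
struct Memo {
  std::unordered_map<uint32_t, double> best;
};

// Best cost of producing the join of `table_set`, top-down.  Each split of the
// set is a candidate top join; the branch-and-bound check skips splits that are
// already more expensive than the best alternative found for this group.
static double BestCost(uint32_t table_set, Memo& memo) {
  auto it = memo.best.find(table_set);
  if (it != memo.best.end()) return it->second;        // group already optimized
  if ((table_set & (table_set - 1)) == 0) {            // single table: just a scan
    memo.best[table_set] = 0.0;
    return 0.0;
  }

  double best = std::numeric_limits<double>::infinity();
  // Enumerate every way to split the set into left/right inputs of the top join.
  for (uint32_t left = (table_set - 1) & table_set; left != 0;
       left = (left - 1) & table_set) {
    uint32_t right = table_set & ~left;
    double join_cost = Card(left) + Card(right);       // fake hash-join cost
    if (join_cost >= best) continue;                    // branch and bound: prune
    double total = join_cost + BestCost(left, memo) + BestCost(right, memo);
    best = std::min(best, total);
  }
  memo.best[table_set] = best;
  return best;
}

int main() {
  Memo memo;
  std::printf("best cost for joining 4 tables: %.0f\n", BestCost(0b1111u, memo));
}
```

A real Volcano- or Cascades-style optimizer memoizes whole groups of equivalent expressions along with their winning physical plans per required property, not just a single number, but the control flow is the same top-down recursion.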
All right, just as a teaser, because I should have made this three lectures and there's no way I'm gonna get through all of it today, sorry: what Cascades is gonna do that's different from Volcano is that it's the same kind of thing, you have all these logical-to-logical and logical-to-physical transformations, and you still have the enforcer rules for the physical properties, but the difference is that it's gonna materialize those transformations on the fly rather than all at once every time you land on an operator. And it's gonna use this grouping technique that lets you reason about portions of the query plan without enumerating exactly how each portion would actually be implemented. So if I wanna join three tables, I can say, well, let me pick the join order for my first two tables and not worry yet about how I join the other one, whereas in the Volcano approach you can't do that. Again, it'll make more sense next class. I had you guys read that master's thesis because there is an original Cascades paper from 1995 and I think it's terrible, it's a bad paper, it doesn't really explain exactly how anything works. That master's thesis I had you read is, to me, the best explanation of how Cascades actually works in about 30 pages. That's why I had you read it.

All right, so let's stop here and we'll pick up next class. The paper you're gonna read over the weekend for Monday is an updated take on the System R dynamic programming search, the bottom-up approach, written by the Germans, the HyPer guys. They argue that Cascades is a bad idea and that the System R approach is actually the right way to go. I don't have a strong enough opinion yet to say whether they're right or not, although the Cascades guy himself says don't use Cascades, or at least don't use it the way we're using it, which is fine. All right, any questions?