All right, so for this part of the course, I talked a little bit about this in the class before we had spring break. Over the course of the semester, we've now gone back to the top of our conceptual database system, and in the lectures going forward we're going to go down the stack again. Now let's talk about how we actually execute queries. In the lectures before this, we were talking about how to actually store the database: storage models, compression, indexes. And now we're coming back around to talk about how queries are actually going to be executed on this data. And the reason why, again, I front-load the semester to talk about concurrency control and query compilation is because we want to be thinking about these things all throughout the semester as we talk about the various things we're doing.

So in this class, we're going to talk about how we actually build a query optimizer: how we take a SQL query from the application and generate a query plan for it that's efficient and that we can execute. We're going to start with some high-level background about the things we care about in a query optimizer and what query optimization actually means. Then we'll talk about the different design decisions, at a high level, that we're going to need to make when we implement our query optimizer. What things do we need to care about, and how can we actually implement them? And then we're going to spend most of our time talking about the search strategies.

So the paper I had you guys read was a master's thesis from the 1990s, which seems like kind of a weird thing for me to assign you. But that thing is actually the best description of the Cascades optimization framework that I can find anywhere. If you read the original Cascades paper, it's a bit dense and it's not really clear. So Cascades is a search strategy, but there's a bunch of other ones that we're going to talk about as well. I'll talk about the history of those strategies and what people have implemented, and that'll get you guys to understand why the Cascades model, the unified search model, is in my opinion one of the best approaches to do this, because you'll see why the other ones are insufficient. And then in Monday's class next week, we'll spend more time talking about how Cascades actually works and how you implement it, because this is what we use in our system.

All right, so I shouldn't have to define this. Everyone should understand what it means to do query optimization, but we can talk about it a little bit. The basic idea is, again, the application is giving us a query in a declarative language, meaning they're telling us what the result is that they want, and it's up to us to figure out how to generate a query plan that actually computes that. So for a given query, you want to find a correct execution plan that has the lowest cost. And in this opening sentence here, I'm emphasizing the two words correct and cost. Correct means that the database system produces the answer that the person wants with the query. It doesn't help us if we generate a query plan that may be the fastest thing in the world but comes back with wrong results; that's essentially useless for us. I talked a little bit about approximate query processing when we talked about compression earlier. It's a way to give an approximate answer rather than the exact one. That's a totally different area.
It's sort of related to this in that it's not correct, but the user is telling the database system, I don't care if you're not correct. And most of the time, when someone gives us a query, they want the correct answer. So we need to make sure we find a plan that generates one. And then I'm putting cost here in quotes because the cost is going to be a relative term that will be used internally by the database system to help us figure out what our execution strategies are. The operating environment that the system or the application is running in will determine what our cost actually is. If we're in a disk-based system, it's obviously how many pages or blocks we read from disk. If we're in an in-memory system, it might be the number of tuples we read. If we're in a distributed system, then it might be how much data we send or read over the network. And if we're running on a mobile phone, then it's how much energy the query is using. So the search strategies that we're going to talk about today are independent of what the cost actually measures. That's something you define in the cost model that the optimizer uses, which we'll talk about in more detail Wednesday next week.

So I will say that query optimization is widely regarded as the hardest thing to actually implement well in a database system. The problem itself, how you take a query and generate the optimal plan for it, is known to be NP-complete. Now, for real simple queries, it's not that big of a deal. But when you start talking about 75-way joins, joining 75 tables, then you start to have problems. So because the search space is non-polynomial, no optimizer is ever really going to produce the optimal query plan for a particular query. Instead, we're going to use a bunch of techniques that we'll talk about today that apply our domain knowledge about databases and applications to come up with ways to avoid doing an exhaustive search over everything. We want to try to limit the search space. And likewise, because we want to figure out what the cost is for a query plan before we actually start running the query, we have to come up with a way to estimate what the real cost is going to be. Again, this is the cost model stuff we'll talk about later.

So at a high level, query optimization sort of looks like this. You have your application. They submit a SQL query to the database system. In the first stage, you could do what's called SQL rewriting, where you basically have rules that take the SQL string that the application gave you and do some additional conversion on it to put it into a different form. This is optional. As far as I know, most systems don't actually do this, but it is something you could do if, say, you want to rewrite something based purely on the SQL. This appears mostly in middleware systems or proxies, where you want to rewrite a table name into a real table name, like mapping a logical name to a physical name. Then we take our SQL query and run it through our SQL parser, and this is going to spit out essentially a logical plan in the form of an abstract syntax tree. So it's basically taking all the tokens that are in your SQL string and converting them into a form that's decipherable by the database system, like a tree structure. And then we run this through the binder.
And the binder is basically now going to take all the strings, the names of things that you're referencing in your SQL query, and map them to the internal identifiers that the database system uses to keep track of tables and columns. If I'm doing a query, select star from table foo, the binder would look up in the catalog and say, give me the internal ID for the table with the name foo. And then it annotates the logical plan, the AST, with that table ID, which it can use to then look up in memory or find the actual object it wants to operate on. Then in the next phase, you have what's called a tree rewriter or an expression rewriter. And this is where you can do additional optimizations that are static, things like rules to rewrite the query plan by operating directly on the AST rather than the SQL statement. The team that presented their work on the optimizer last Monday talked about doing rewriting of expressions to extract out an AND clause that was redundant in a disjunction. So that's essentially where you would do that. But this is optional; you don't actually need to do this and you can still optimize the query.

And then the thing we're focusing on here is the query optimizer. This is gonna be a combination of accessing the catalog to get information about what your table looks like, and then it has a cost model to produce estimates about what it thinks the execution time or storage overhead of executing the query will be. It chooses the best plan and then spits that out for execution. So when we talked about code gen and the query compiler and stuff, that essentially would happen here: the query compiler takes in the physical plan that the optimizer spits out and then it can compile executable code. In this class, we're focusing on this piece here.

So the first thing we're gonna talk about is the distinction between logical plans and physical plans. This is gonna come up all throughout the lecture. You can think of the logical plan as the high-level operations, almost like relational algebra operators, that describe what the query is supposed to do. I wanna join table A and B. It doesn't say anything about how you actually do that join; the query just says I wanna do it. And then what's gonna happen is our optimizer is gonna try to generate an optimal equivalent physical plan for that logical plan. So if I wanna do a join on A and B, the logical plan says just join A and B, but the physical plan would say do a hash join on A and B, or do a hash join on B and A. Right? And so what'll happen is, sometimes a single logical operator could get exploded into multiple physical operators, and the reverse can happen too: multiple logical operators can get coalesced into a single physical operator. Say my query plan wants to do a join followed by an order-by on the key that I'm joining the two tables on. That would be two logical operators in relational algebra, but in my physical plan I could do a sort-merge join that combines those two logical operators into a single physical operator, and it would produce the same output. Right? So it's not always gonna be a one-to-one mapping from logical to physical. It depends on what we're actually trying to do.
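To make that logical-versus-physical distinction concrete, here is a minimal sketch in Python, with hypothetical class names that aren't from any particular system, showing how one logical join can be expanded into several candidate physical operators that the optimizer would later cost:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LogicalJoin:
        # Logical operator: says WHAT to do (join these two inputs), not HOW.
        left: str
        right: str
        predicate: str

    @dataclass
    class PhysicalJoin:
        # Physical operator: says HOW, and carries physical properties of its output.
        algorithm: str                      # "HashJoin" or "SortMergeJoin"
        left: str
        right: str
        predicate: str
        output_sorted_on: Optional[str] = None

    def expand(op: LogicalJoin) -> list:
        """One logical operator maps to several physical alternatives;
        the optimizer's job is to pick the cheapest correct one."""
        join_key = op.predicate.split("=")[0].strip()
        return [
            PhysicalJoin("HashJoin", op.left, op.right, op.predicate),
            PhysicalJoin("HashJoin", op.right, op.left, op.predicate),   # commuted variant
            # A sort-merge join produces output sorted on the join key as a side effect.
            PhysicalJoin("SortMergeJoin", op.left, op.right, op.predicate,
                         output_sorted_on=join_key),
        ]

    for alt in expand(LogicalJoin("A", "B", "A.id = B.id")):
        print(alt)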
And a key thing that's gonna show up later on, when we start motivating why we wanna use Cascades or the Starburst method, is that which physical operator you're allowed to convert a logical operator into could depend on what the data actually looks like. Right? So if the data needs to be sorted, or if we expect the data to be compressed, these are the kinds of things we need to be mindful of in our physical plan, but that we don't care about in the logical plan, because the logical plan doesn't know anything about how the data is actually laid out. Relational algebra doesn't deal with these things.

So one of the things we're gonna exploit to make our query optimization work is the fact that there are certain equivalences we can rely on in relational algebra that allow us to transform one logical plan into another logical plan and be guaranteed that they're always gonna produce the same result. Right? We can rely on the commutativity and associativity properties of relational algebra to say, if I join B and C and take the output of that join and join it with A, well, that's gonna be equivalent to joining A and B first and then taking that output and joining it with C. So we can rely on these properties, associativity and commutativity, of relational algebra to infer whether we're actually generating a correct result. Now, the physical operator stuff will be important later on: again, if my output needs to be sorted, I need to make sure my data ends up being sorted.

So real quick, before we start going into the weeds on these things, the one thing I wanna say is that for OLTP applications, query planning or query optimization is actually really easy. And this is because most of the queries in an OLTP application are what is known as sargable, which stands for "search argument able." Somebody made that up in the '80s, just go with it, okay? And so the reason why query planning is really easy, or much easier to do, in OLTP environments is because most of the time the queries are sargable, meaning you can figure out exactly the one index you need to use in order to execute that query. Most OLTP queries are gonna be something like select star from table where ID equals one, two, three. And there'll be an index on that ID column. So all you need to do is look in the where clause and say, oh, I'm doing an equality predicate on the ID column, and I have an index on the ID column, so that's the one index I wanna use. I don't need to do an exhaustive search to look at join orders and things like that. So these types of queries are much easier to deal with. In cases where you do have some joins, they're almost always gonna be on foreign key relationships. Select Andy's customer account and join it with the orders table to get all the orders he's purchased. There's gonna be an index on the foreign key inside the orders table, so I just pick that, do a nested-loop join, and it's really easy to do. And the way we can implement this is with really simple heuristics. This is what pretty much most people do when they start building a new database system; this is what everyone starts off with when they build a new query optimizer, right? Because it works reasonably well for OLTP.
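Just to illustrate how simple that OLTP case can be, here's a toy sketch of such a heuristic in Python, with a made-up catalog layout rather than any real system's code:

    # Hypothetical catalog: table name -> columns that have an index on them.
    CATALOG_INDEXES = {
        "customer": {"id"},
        "orders": {"id", "customer_id"},   # index on the foreign key
    }

    def choose_access_path(table, predicates):
        """Sargable heuristic: if there's an equality predicate on an indexed
        column, use that index; otherwise fall back to a sequential scan."""
        for column, op, _value in predicates:
            if op == "=" and column in CATALOG_INDEXES.get(table, set()):
                return f"IndexScan({table}.{column})"
        return f"SeqScan({table})"

    # SELECT * FROM customer WHERE id = 123  ->  IndexScan(customer.id)
    print(choose_access_path("customer", [("id", "=", 123)]))
    # SELECT * FROM customer WHERE name LIKE 'And%'  ->  SeqScan(customer)
    print(choose_access_path("customer", [("name", "LIKE", "And%")]))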
So in this lecture, the next lecture, and when we talk about cost models, we're really focusing on OLAP queries that are way more complex, where it's not super obvious that there's gonna be this one index I wanna use. Because, as we saw in the case of TPC-H, there aren't gonna be indexes that we can use for our queries. So now we gotta figure out other things, other tricks we can do to optimize our queries.

I talked about this a couple of times already, but cost estimation is gonna be a big, big part of making sure our query optimizer works. We can have the fanciest search strategy or search algorithm we could want, but if our cost model is crap, then it's gonna generate crap plans, right? And the reason why we have to estimate what the execution time is for queries is because it'd be too expensive for us to try out every single plan and then see which one actually works out the best. Now, this is actually what MongoDB does, at least as of last year; they might have changed it, but they didn't actually have a cost-based query optimizer. They would generate the query plans, execute one, see how long it took to come back, try it out a couple of times, then try out other ones, and whichever one was the fastest, they'd just end up using that. And for operational workloads, like OLTP stuff, that's probably good enough, because otherwise you need to have a cost model that allows you to estimate what the system thinks the cost of executing that query is gonna be. And that depends on a whole bunch of different things, right? It depends on what other queries or background jobs are running in the system at the time, the amount of data each operator is gonna spit out or emit as it computes, what algorithms or access methods like indexes you're gonna have, how many resources you're gonna use, and what the data actually looks like. You need to consider all these things potentially in order to build an accurate cost model.

Yes? Why would it be enough for MongoDB to just run several scans and choose the one with the lowest cost? So his question is, why is it okay for MongoDB to just do this sort of basic search algorithm where you just pick one query plan, run it, see how long it takes, keep track of that, and then maybe try another one and see if that gets any faster? Why is it okay for it to do that? So one, up until very recently, they didn't support joins. So it really just has to do with picking what index to use, and they didn't really do shuffles either. Index selection, right? That would be the simplest thing to do this for. Do you think their workloads generally are not really complicated? Yeah, so MongoDB's workloads, up until recently, were quite simple, right? Doing a lookup on a collection looking for a document. Rather than trying to figure out what the cardinality is for the indexes, just try each one and see which one is the fastest, right?

So again, we're gonna discuss more about this next week. The way we need to think about this is that the cost model is the objective function we're gonna have in our search strategy, and it'll define whatever it is we're trying to optimize for. For an in-memory database, it's typically gonna be the size of the intermediate results. That's what we're gonna depend on, right?
How much data are we actually processing? And that ends up being a good sort of high-level aggregation of possibly lower-level things like the amount of CPU or network I'm using, right? But certainly if you have a disk-based system, whether you're doing random reads or sequential reads can make a big difference. Again, we'll see this more next week.

All right, so now we wanna implement an optimizer. There are four things we have to consider, the things we need to care about when we're building this thing. So we'll talk about each of these: optimization granularity, timing, how to handle prepared statements, and plan stability.

So optimization granularity is essentially, what is it we're actually optimizing, right? In the first case here, it's a single query, meaning the application sends a single SQL statement, and I'm gonna optimize that SQL statement or generate a plan for that single SQL statement as if it's the only thing executing in my system. Meaning I don't consider any other query at the moment I'm doing my optimization, right? And the advantage of this is that it's a much smaller search space, because I only need to care about what I need to do to optimize my one query here. The downside is that you can't actually reuse any of the information or knowledge you gain about how you optimized that query; it's difficult to apply that knowledge to other queries running at the same time. Now, you can handle this with prepared statements, but that has other issues we'll see in a second. And if you wanna, as you generate the query plan for this single query, account for other queries that could possibly be running at the same time, you could have resource contention: if every query takes all the memory in the system when it runs, then you're gonna run out of memory. In order to handle this, your cost model needs to account for what other queries are running at the same time and what their resource demands are. You guys missed the Vertica talk on Monday after class, but one of the things they talked about is that in their query optimizer, they keep track of what other queries are running at the same time, and if they know that they'll run out of memory for the query that they're planning at that moment, maybe they'll allocate it fewer resources, and they can actually dynamically change that as the query executes.

The other choice for your granularity is to do multiple queries. And the idea here is that you're given a batch of queries all at once, and you generate a globally optimal plan for all those queries, right? In the single-query case you get a local optimum, and in this case you could generate a global optimum. As far as I know, no system actually really does this, except in the context of streaming systems or continuous query systems where you're given the queries ahead of time, like you wanna get an alert when the stock price dips below some amount, and that's always running in the background. So you provide these things ahead of time. It's challenging because the search space is much larger: you're not just optimizing one query, you're optimizing all the queries together, and your cost model needs to account for that.
The only example I can really think of outside of streaming where this is useful is a technique or optimization called scan sharing. Say you have a query that shows up and wants to do a sequential scan on an entire table. For this, assume we're doing a disk-based system. So the first query is gonna do a sequential scan and read every single block in your table, and say it gets about halfway through when another query comes along and wants to do the same sequential scan. Rather than that second query starting from the beginning and following along behind the other one, it instead jumps ahead to where the first guy is running, scans along and reads the same data that it's reading, and then just knows that when it reaches the bottom and the first query is done, the second one's gotta loop back around and get all the things that it missed the first time through. This is technically usually not done in the optimizer; it's typically done at runtime. But it could be done in the optimizer if you know what other things are running at the same time, and the idea is the same: trying to generate a query plan that can account for other things running at the same time.

The next consideration is the timing, essentially when do we actually run our optimizer? What most people think of is what's called static optimization. This is when the query shows up in the system, at that exact moment I'm gonna run my optimizer and generate a query plan for it. This is what pretty much everyone implements because it's the easiest thing for humans to reason about and write code for. The plan quality is obviously dependent on the cost model's accuracy. If your cost model is way off, as we'll see next week, then you're gonna generate bad plans. And this can be expensive to do, because we're trying to do this in real time: someone submits a query and we wanna generate the plan at that moment. To avoid this overhead, you can amortize the cost of running the optimizer for every single query by using prepared statements. But as we'll see in the next slide, this has other problems.

The next choice is to do dynamic optimization. This is where you generate an initial query plan that doesn't need to be super optimized, then you start executing it, and as you're executing it, you refine the plan further and try to optimize the operator you're actually executing on the fly. So if I'm doing a join and I'm gonna have to scan two tables, maybe I recognize halfway through that I really wanna be scanning this table using this other index instead of the one I'm using now. So I'll go ahead and restart the operator and use maybe a faster access method, right? This is really hard to implement. We'll see one example where you can do this in Ingres, but the way they did it is more or less a hack. As far as I know, this is not that common. I think Greenplum talked about doing something like this and possibly Vertica as well, right? It's really kind of hard to do, right? Because you kind of need to stop what you're doing as you're executing an operator and then change your strategy.

The last approach is to use hybrid optimization, right?
And this is basically where you compile using the static method first, and then as you're executing, if you recognize that what's actually happening at runtime is exceeding some threshold relative to your cost estimate, then you go back and invoke the query optimizer again, trying to rely on what you've already done so far, maybe keeping some portion of the query plan and re-optimizing the rest. Again, as far as I know, I don't know if anybody does this. Actually, I think this might be what Greenplum does.

All right, so as I said before, static optimization is the most common approach, and this is problematic for some things, because if we're trying to do this while someone is sitting at the terminal or the application is invoking the query, we can't start executing the query until we actually have the query plan, right? So if we have to take 30 seconds to generate a query plan for a query that's gonna run in one second, then that's not a good trade-off, right? Maybe it would have been better to do planning for five seconds and have a query that runs for 10 seconds, because in the aggregate we're still better off. One way to avoid this is to use prepared statements, but prepared statements have problems.

So let's say I have this really simple query here. I'm doing a three-way join between tables A, B, and C, and I have these predicates here where each of the A, B, C tables has a val column and I have some inequality or greater-than expression I wanna run. So every single time I invoke this query in my application, I gotta re-plan this entire thing, even if I execute it over and over again. The way to avoid this is to use prepared statements, where I can take that query and tell the database system, hey, I'm gonna execute this thing. It gives me back a handle for my query, and then whenever I wanna execute it again, I just say execute my query, right? What's the problem with this approach? I've hard-coded these values in here, right? So this works great if I'm just executing this one query over and over again. But let's say instead of executing with value greater than 100 or value greater than 99, I want something else. Well, now I can replace these with variables or placeholders where I can inject values at runtime. So now when I invoke this query, like an RPC call, I can pass in the values that I want to substitute in here. What's the problem with this now? What's that? Exactly, right? So he said the cost may change, right? At this point here, when I'm actually trying to generate the query plan, the big question I need to figure out is what's the join order I wanna use for this query? Before, when I had the literal values, I could figure out the actual optimal way to order these things. But when I have placeholders, I don't actually know what these values are, because I haven't been told; I just have placeholders in my query. So I don't know how to actually optimize this thing, right?

So there are three choices to handle this. The first is that you just run the optimizer again for every single invocation of your prepared statement, right? And the naive thing to do is to just run it from scratch every single time, but that essentially defeats the purpose of a prepared statement.
So you can try to reuse the existing plan, like start with an initial plan or the plan from the last time you ran the query as a starting point, and then try to refine it further for the exact invocation you're dealing with. But this is non-trivial to do; it's not easy. The alternative is to actually generate multiple plans for different possible values of these input parameters, like maybe take buckets, like deciles or quantiles, take the average value for each bucket, and generate a plan for that. Then when the application invokes that query, you look at the input parameters and say, well, these look like values I've seen before and I have pre-generated plans for them, so let me use those. But now the problem is the number of possible plans you have to generate could be exponential, right? Because the values may not line up with the same exact buckets for every single invocation. And then the last approach is essentially just to take an average value for every single parameter. You look in the catalog at the statistics you've generated for each column, take the average value, generate that one query plan, and hope that it's actually good enough for every single time you invoke it. Most systems actually do this last one. I think this is what Oracle and IBM do. I think SQL Server does the middle one, and I think Postgres probably does this last one as well.

Yes? Couldn't you just take the objective function from the optimizer and plug the values into it to get the output? And your statement is, can I just take the objective function? The cost model? Yeah, the cost model. And do what, sorry? Just plug the values in and get the results back. So the statement is: take my query, this is my query plan, I have a prepared statement, this is the query plan for the prepared statement, and then at runtime, when I invoke the query, take these parameters, substitute them in here, and then run this through the cost model? It's not gonna work. It won't work, right? Okay.

So the last thing is what's called plan stability. This is really important in enterprise environments, because the basic idea is that you don't want wild fluctuations in the performance of your database system or the performance of your query optimizer. Meaning, if I have a query and I run it through my optimizer today and it takes 10 seconds, the next day I don't want it to take one second and the day after that a thousand seconds, right? Ideally, I wanna have stable run times. And if we need to upgrade our system, say we wanna go from Oracle 10 to Oracle 11, we wanna make sure that, again, we don't have wild fluctuations in our query plans for some percentage of our application, right? It's okay to upgrade to a new system and have everything get faster, but if 75% of your queries get faster and 25% get much slower, people will notice that and people will complain. So we wanna have a way to make sure that our optimizer over time may not generate the most optimal plan, but at least it's stable.
So the first way to do this is to provide hints to the optimizer. Essentially the DBA can extract or export the query plan, think of it like a prepared statement, and then go in and annotate that query plan to say, oh, you should use this index for this table scan, use this join algorithm for that join, right? Then they load that back into the system, and now every time you invoke the prepared statement, the optimizer relies on those hints as it generates the query plan. The next choice is to support all the different old versions of your query optimizer and cost model in every single new iteration of the database system. Oracle is actually pretty famous for this. I don't know if the latest version of Oracle is 14 or 15, but if you buy Oracle 14, they actually include in the binary the optimizer for Oracle 13, 12, 11, 10, all these older versions. And as the administrator you can select, say, I wanna use Oracle 14 with the Oracle 12 query optimizer. And the reason is that the DBA would have vetted the query plans that the earlier version of Oracle generated, and they know that it's gonna produce reasonable results, and they don't wanna switch over to a new version without making sure that everything is tested. The last approach is to support backwards-compatible query plans. The basic idea here is that you can export your query plan as an XML file, and then when you upgrade to the new version of the system or make a change to the system, you can load those query plans back in and they'll be fixed for the prepared statements. That way you know they ran well before and they'll run the same on the newer version.

We're not gonna really worry about any of this in our discussion today, because today is really about how to actually build the optimizer, but these are things you need to keep in consideration if you wanna actually run this in a real system. And this is also the reason why the optimizer in most database systems is like the most untouched piece of code in the entire system: nobody wants to break something or have some weird regression in performance for some class of queries you didn't think of when you were trying to speed up something else. So plan stability is an important thing, and I think it sort of goes beyond just performance into software engineering.

All right, so now we can start talking about the search strategies. I'm gonna talk about five different approaches here: heuristics, heuristics plus a cost model, randomized algorithms, stratified search, and unified search. The two we're gonna focus on the most at the end are the stratified search and the unified search. And again, the Cascades model you read about is an example of the unified search model. So think of this as going through the history of query optimizers, and you'll see why we ended up where we're at today. Now, the Cascades model is from the 1990s, Starburst is from the late 1980s, but these are still considered to be state of the art. You'll see this in the paper you read for next Wednesday: SQL Server uses the Cascades model, at least it's purported to, and it actually is phenomenal; it is much better than everyone else, even though it's based on a search strategy from before you guys were born. It's sort of like quicksort, right? Quicksort, no one has come up with a better version; it's from the '60s, and it's good, all right?

So the first strategy is to use heuristics.
And as I said before, this is pretty much what everybody implements when they build a new database system and need to have a query optimizer. This is the approach everyone uses, and this is what we ended up using up until last year. The basic idea is that in the code itself, you define some static rules that transform the logical operators in a query plan into physical operators, right? As I said before, the logical operator would say, I wanna join table A and B; the physical operator would say, I'm gonna use a hash join algorithm to join table A and B. And there are sort of four standard things or rules that everyone's gonna implement. You have higher-level things like always trying to apply the most restrictive filter as early as possible in the query plan, or trying to do all your filters before you actually do joins. This is obvious because you wanna reduce the number of tuples you have to consider during the join. Then standard things like predicate pushdown, limit and projection pushdown, and deciding how to order the tables when you do joins based on cardinality. So the key thing to point out here is that there's no cost model in here, right? We can just have all of this code in the system that runs and applies these rules statically. We don't ever have to worry about comparing two different plans.

So this was the approach that was actually used in the first version of Ingres, which I'll talk about in the next slide, and it was actually used by Oracle up until the mid-1990s, which is pretty impressive. So think about how Oracle got so big and so popular and so wealthy when they were using the most naive version of a query optimizer, when I said the query optimizer was one of the most important parts of the database system. It's important to remember that back in the 1990s, applications were using something like SQL-92. SQL-92 has basic selects, joins, filters, predicates, aggregates. It doesn't have all the complex things that we have in SQL today, like window functions and CTEs. So the complexity of the queries that they had to support back in the day was nowhere near as challenging as it is now. And also the sizes of the databases were much smaller back then than they are now, right? The query optimizer really, really starts to matter when things get really big and more complex. So I've been told that Oracle got really far with their query optimizer. It was an impressive piece of code, but the issue was that it was near impossible to extend or modify, even though Oracle had a large team of people working on it. So they eventually abandoned it and switched over to the stratified search model, which we'll talk about in a bit.

So for this class, the running example I'm gonna use as we go along is from this sample database; it's sort of modeled after Spotify or iTunes, where you keep track of music. You have a table of artists and a table of albums, and then you have a cross-reference table called Appears that has foreign key references to both of these. The idea is that you wanna keep track of, for every album, which artists appear on it. So now, if we wanna go through an example of how the Ingres optimizer worked, let's say we're gonna use this sample query here, where we wanna do a three-way join between Artist, Appears, and Album, and we wanna get all the artists that appeared on the mixtape I dropped two years ago in the database group, right?
We wanna get them to show up through this join. So Ingres is gonna use heuristics to rewrite all the join queries that it has to execute into single-table selects. The reason they do this is that Ingres in the 1970s didn't support joins. So they would use heuristics to transform the query into multiple query plans that all do single-table selects and then mash those results together to produce the join, right? So in the first step, what they're gonna do is decompose these join queries into single-table, single-variable queries. Take, for example, the first part where we wanna join on Album. We're gonna extract out the lookup on Album, where we do the lookup on the name of my mixtape, and we're gonna have the output of this query be put into a temp table. And then we're gonna rewrite the second query so that instead of referencing the Album table, it references the temp table I'm generating here, right? And then they do this rewriting again: take this guy and rewrite it into two separate queries again, where the first one does the join with the temp table generated here and the third query does a join with the output of the second query.

So now they're gonna actually start to execute this. I said they don't support joins, so the way they handle this is that, as they execute going from top to bottom, they take the output of one query and feed it in as an input parameter to the next query, and so on, right? So say we execute this first query and it produces an album ID of 9999. In the second query, we're gonna take out the part where we do the join on the temp table and just replace that with "where album ID equals 9999." So think of this as a for loop over this table: for every single output of this query here, we're gonna modify or rewrite the next query and substitute the input parameter with whatever value we've generated above it, right? And we do the same thing for the next guy. So this guy will produce an output, say it's artist IDs 123 and 456. So again we for-loop over those results and generate two separate queries here that substitute the artist ID with the output of the query above, right? So they're essentially doing query planning on a per-tuple basis, right? Because every single time you invoke one of these, you can run it through their heuristics to do query optimization, like selecting what index to use, right? This seems super inefficient, right? But you have to understand that back in the 1970s the hardware was really limited, so they weren't really operating on complex databases. In fact, the Ingres paper where this example comes from talks about operating on a table with 300 tuples, right? That's nothing. So it'll work in that environment, but if you go to a billion tuples this is certainly gonna be slow.

Yes? So these two queries are like two scans over the table? So the question is, do these two queries represent two scans? Absolutely, yes. But why don't they just check both of these values at the same time? Yeah, so his statement is, couldn't I rewrite this to be "artist ID in (123, 456)"? Yes, you could do that. They probably didn't support IN back then, or disjunctions. And then I produce my output here, right?
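Here's a rough sketch of that tuple-substitution idea in Python, using a toy in-memory version of the Artist/Appears/Album example with made-up data (not the actual Ingres algorithm, just the for-loop intuition):

    # Toy tables modeled on the lecture example (values are made up).
    ALBUM   = [{"id": 9999, "name": "Andy's Mixtape"}]
    APPEARS = [{"album_id": 9999, "artist_id": 123},
               {"album_id": 9999, "artist_id": 456}]
    ARTIST  = [{"id": 123, "name": "O.G. Andy"},
               {"id": 456, "name": "Featured Artist"}]

    # Step 1: single-table select on ALBUM, materialized into a "temp table".
    temp1 = [row["id"] for row in ALBUM if row["name"] == "Andy's Mixtape"]

    # Step 2: for each value produced above, rewrite and run the next
    # single-table query with that value substituted in (the "for loop" idea).
    temp2 = []
    for album_id in temp1:
        temp2 += [row["artist_id"] for row in APPEARS if row["album_id"] == album_id]

    # Step 3: same substitution again for the final lookup on ARTIST.
    result = []
    for artist_id in temp2:
        result += [row["name"] for row in ARTIST if row["id"] == artist_id]

    print(result)   # ['O.G. Andy', 'Featured Artist']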
So the advantage of the heuristic-based approach is that it's super easy to implement, within reason, up to a certain point, because essentially it's almost always a straight mapping from the logical plan to the physical plan. I need to do a lookup; I have an index; use the index, right? It's also easier to debug because there's no complex search algorithm. You can almost read the code from top to bottom and know exactly what rules are being applied to generate the query plan. And for sargable queries and real simple things, this works reasonably well. It's also gonna be really fast, because there's no cost model; it's always just these static rules that transform our query plan. Of course, the downside, the reason why this is a bad idea, is that it's really hard to extend and maintain. And it's nearly impossible to generate good query plans when there are complex dependencies between different operators, right? Joins, for example, or nested queries, CTEs. There's no notion of cost in these rules, so there's no way to iterate over plans and figure out, oh, this plan is better than that other one. Another key aspect of this is that what I call magic constants have to be embedded inside these rules to help them make decisions, sort of hard-coded thresholds, right? When is the right time to use one index versus another?

So as I said, this approach was used by Oracle and Ingres, so that's two out of the three original relational database systems that were developed in the 1970s. The third system was IBM's System R. And you've heard me talk about how great System R was multiple times. This is because they were building a new system from scratch: they got a bunch of people with PhDs in a room, everyone carved off their own piece, and they built that part of the system. One of the major contributions that came out of the System R project was their cost-based query optimizer, and that effort was led by Pat Selinger. So this is the first example of a cost-based query optimizer or planner in a relational database system. And again, back then they were competing against IMS and IDMS from the CODASYL guys, who argued that a database system would never be able to generate a query plan that was better than one written by humans. Those other systems didn't have a declarative language like SQL or QUEL; the programmer had to write the algorithms for how to do things. There was no query optimizer at all. So this was the first attempt to show that, in the same way you don't wanna write assembly, you wanna use a compiler, you don't wanna write your data processing algorithms or query execution yourself, you wanna use a query optimizer.

So the key thing about how the System R approach works is that it uses what is called bottom-up planning. The basic idea is that we start with nothing and then we search up into the tree and build out our query plan to reach our final goal, which is the correct output that we want. If you come from an AI background, this is also called forward chaining, and they're gonna use a dynamic programming technique, divide and conquer, in order to do this search from the bottom up. So as I said, this is what was implemented in System R in the 1970s.
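To give a feel for what that bottom-up dynamic programming looks like, here's a heavily stripped-down Python sketch that enumerates left-deep join orders for three tables, with made-up cardinalities and a toy cost function (the real System R algorithm also tracks access paths and interesting orders, which this ignores):

    from itertools import combinations

    # Made-up base-table cardinalities (what catalog statistics would give us).
    CARD = {"artist": 10_000, "appears": 50_000, "album": 20_000}

    def join_cost(left_card, right_card):
        return left_card + right_card            # toy cost: size of the inputs

    def join_card(left_card, right_card):
        return max(left_card, right_card) // 10  # toy estimate: pretend joins are selective

    # best[set of tables joined so far] = (total cost, output cardinality, plan)
    best = {frozenset([t]): (0, CARD[t], f"Scan({t})") for t in CARD}

    tables = list(CARD)
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for right in subset:                 # left-deep: add one base table at a time
                left = subset - {right}
                lcost, lcard, lplan = best[left]
                cost = lcost + join_cost(lcard, CARD[right])
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, join_card(lcard, CARD[right]),
                                    f"({lplan} JOIN Scan({right}))")

    print(best[frozenset(tables)])   # cheapest left-deep plan found for the 3-way join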
And then the first version of DB2 borrowed some code from System R in the early 1980s, so it also used the same model. And this is also pretty much the approach that's used, as far as I know, in almost every major open-source database system today. This is essentially what Postgres does, what MySQL does, and as far as I know most of the other open-source systems as well. So again, the idea is that you're gonna have the same kind of static rules that we saw in the last example, the heuristics. You do some initial transformations, and then you have a search algorithm to try to figure out how to refine that query plan further. So what System R is gonna do is break up our query into blocks and then generate logical operators for each of those blocks. That's sort of the heuristic part. And then for each of the logical operators that we spit out, we have our search algorithm try to figure out the correct physical plan. So it's gonna iteratively construct a left-deep join tree, a left-deep query plan tree, that is designed to minimize the total amount of work it takes to execute the query. A left-deep tree is when you have all the joins occurring down the left side of the query tree. So say you join A and B, then the output of that join is fed into the next join operator that joins C. If the joins came down the right-hand side instead, that would be a right-deep tree; and a bushy tree would be where you join A and B, join C and D, and then the outputs of those two joins are joined together. Everything was always done as a left-deep tree because that was the optimal thing to do for the hardware they were dealing with back then, and it also cuts down the search space, the number of different query plans you have to consider.

So let's go through an example of what System R does. I'm gonna use that same query as before, where I was doing a lookup on my mixtape, but I'm gonna add a little extra bit here where I say I need my output to be sorted by the artist ID. This will matter when we actually see how the algorithm works, and you'll see how the Cascades model, or the unified search model, handles this in a better way. So the first thing to do is to look at all the tables I'm gonna access and choose the best access path or access method for them. In this example, for the Artist and Appears tables, assume I don't have an index, so I'm just gonna do a sequential scan, but for my Album table I wanna use an index for the lookup on the name column. And this selection here is based on heuristics. Then the next step is to enumerate all the different possible join orderings we can have for these tables. So you could join Artist and Appears followed by Album, or Appears and Album followed by Artist, basically every single possible permutation of this, doing natural joins as well as Cartesian products. Those are technically still valid join approaches at a logical level. In the case of System R, at least in the original version, they pruned out anything that was a Cartesian product because they said those were useless. That's actually not so true today, because there are some workloads where Cartesian products are useful, like doing deduplication, right?
If you're trying to do data cleaning, you actually may wanna do this, but for our purposes here assume that System R would prune them and we don't have to consider them. So now what you need to do is determine which of these is actually the right way to join these guys, based on what join algorithm you wanna use, right? So you can again think of this as a search tree. Say this is the bottom, right? This is where we're starting, where we have the three tables Artist, Appears, and Album. We haven't joined them yet. And over here is the goal where we wanna be: we wanna have the output of a query where Artist, Appears, and Album are all joined together. And so we're gonna go from this side to that side and try to figure out what the optimal path is, right? Also assume that all the other join orderings that I showed in the last slide are just listed down here; I'm not showing them to keep it simple. Each of these edges in our search represents a physical operator, a physical execution of a join. So we say we wanna do a hash join or a sort-merge join, and we specify how we're actually joining these together. So at this point here, we'll have a logical operator that says Album is joined with Appears and Artist still hasn't been joined yet, whereas in the top one, Artist is joined with Appears and we haven't joined Album yet. And we have sort-merge join edges and hash join edges for both of these. So now we use our cost model to say, for this operator, what is the cost of actually executing it? In the case of System R it was all about how much data you read from disk; in an in-memory system it could be how much data I'm actually processing, or how much data this thing is gonna spit out. So let's say we run the cost model for each of these approaches and these are the two edges that have the lowest cost. Then starting from these two operators here, we ask, what are the different ways we can do a join that gets us to our final output here? And same thing: we have different join algorithms we can use, and each of those has a different cost. So we end up picking the one that has the lowest cost, and then we can work backwards and say, what's the path that got us here?

So what's the problem with this example here? What's that? He says we'd have to search everything. No, you don't necessarily have to do that; it's a divide-and-conquer approach. The key thing that's missing here is that I said in the last slide that I added that order-by clause. That means my output needs to be sorted. But in the original System R implementation, and in what I'm showing here, there's no notion of sorting, right? These operators don't say anything about what the data needs to look like, right? So the only way I can still handle sorting in this environment is to put logic in my cost model that says, oh, by the way, the final output needs to be sorted on artist ID. So if an operator generates output that's sorted on artist ID, then I'm good to go; that's what I wanted, right? The problem with this approach is that now you have logic in the cost model that is reasoning about what the query plan actually is and what the data looks like and what it needs to look like, as well as logic in the search strategy to figure out whether something should be in that sorted order or not, right? So in this case here, the hash join may actually be faster.
This path along the top may actually be faster if I'm looking purely at the join itself. But the bottom one here will generate my data, after the sort-merge join, in the sorted order that I need for my query. So I don't have to do an extra quicksort over here, or an external merge sort, to sort the data after the hash join; I get sorted output directly from this. So this actually has the lowest cost. But again, in the search strategy itself, that information is not conveyed in any way. I have to bake it into my cost model. Yes? So in this case, what would be responsible for the ordering? So the question is, in this example here, what would be responsible for the ordering? I'd have to add an extra heuristic to add in an order-by operator here in order to produce that output. But it doesn't have any information about that beforehand. Correct. That's the whole point of this. I'm saying this is insufficient to do exactly what we want; based purely on the search strategy, I have to bake information about the physical properties of the data into my cost model. Okay.

So again, the advantage of this is that it usually finds a reasonable plan without having to do an exhaustive search. The downside is that we still have that initial phase where we're doing heuristics, so we have to hard-code our static rules, which can be difficult to maintain. In the case of the System R approach, left-deep join trees may not always be the optimal thing in an in-memory environment; that was okay in their world, not so much in ours. And then this last point here is that there's nothing in this implementation of the search model that can reason about the physical properties of the data that we expect in our output.

So the next class of algorithms is randomized algorithms. This is a bit of a deviation from the normal history of query optimizers. As we'll see in a second, as far as I know, the only system that actually implements this is Postgres, and it's a special-case scenario for them. The basic idea is that instead of having a branch-and-bound search or a divide-and-conquer search algorithm, we're essentially going to do a random walk over all possible query plans and just hope that after a certain amount of time we stumble upon a good one. So you keep doing this random walk until you reach some threshold and say, this is good enough, or you run out of time and say, I'll take the best one I've seen so far and execute that. So I'll talk about two approaches for how to do this. The first paper that talks about using a randomized algorithm for query optimization came out in SIGMOD 1987, where they were doing simulated annealing. The basic idea here is that you use your heuristics, as we saw before, to generate an initial plan. Then you have multiple rounds of taking random permutations of the query plan and switching something in it. Like, if I want to join A and B, I'll do a random permutation and decide to join B and A instead. And then, for your permutation, you ask the cost model what the cost of running this query is. If it's better, then you always take that change; if it's worse than what you've seen before, then you flip a coin and decide whether you actually want to keep that change and keep walking in that direction. And the idea there is that you can break out of local minima by accepting a change.
At that moment it looks like it's bad, but it actually may be better, because it lets you get to a query plan that you would not normally be able to reach. The tricky thing, though, is that, again, we have this problem where we have to bake rules into our implementation to make sure that we reject any query plan that would violate the correctness of the output. Sorting is the big example here. So if we do a permutation that produces an output that's not sorted, then we reject that and try another permutation.

Postgres has a genetic algorithm-based implementation that is used to select the join orderings for queries. So Postgres essentially has two optimizers. It has the regular one that you normally get, which is based on the System R approach, but it also has this genetic algorithm that gets invoked if you give it a query that has more than 12 tables; then they fall back and execute this thing. The basic idea here is that you generate an initial plan, and then you have multiple generations or rounds where you generate a bunch of random query plans. You pick, say, the top five or so that are the best for that generation, then you mash them together and trade traits or genes between them, and then do another round of measuring the cost of the new generation you generated. The idea is that by mashing together different query plans that seem to be good, there may be certain aspects of one that are really good and aspects of another that are really good, and you cross-breed them together and end up with one that's super good.

Yes. The question is, how do you mash together two query plans? Say I want to join three tables, A, B, and C. One query plan might say, hey, do a hash join on A and B. And another query plan might say, do a nested loop join when you join the output of A and B with C. And maybe that's the right thing for that second join, while the first plan uses the wrong algorithm for that second join but the right algorithm for the first one. You mash those two together and you may get the best of both of them. So it's basically randomized swapping of sub-plans? Yeah, so the statement is that it's randomized swapping of sub-plans. Yes, that's how genetic algorithms work. Yeah. So again, you just keep repeating this until you run out of time. And of course, you have to make sure that whatever you're using to mutate your different query plans only generates valid query plans that produce correct results.

So the advantage of randomized algorithms is that they allow you to jump around the search space without having to do an exhaustive search. They have the ability to break out of local minima, and they have low memory overhead, because you're not going to maintain the history of every single generation or every single permutation you ever tried. The only thing you need to keep track of is the best query plan that you've seen so far. This memory overhead will be a big issue in Cascades, because it's going to do a ton of transformations and needs to keep track of everything it's seen so far. The disadvantage is that, because it's random, it's now difficult to determine exactly why the optimizer chose a certain query plan. It may just seem like black magic that it spit something out.
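To give a feel for the random-walk idea, here's a tiny simulated-annealing-style sketch over join orders in Python, with a made-up cost function (note the seeded random number generator, which matters for the determinism point coming up next):

    import math
    import random

    # Made-up cardinalities for a handful of tables.
    CARD = {"artist": 10_000, "appears": 50_000, "album": 20_000,
            "genre": 100, "label": 500}

    def cost(order):
        """Toy cost model: accumulate fake intermediate-result sizes left to right."""
        total, running = 0, CARD[order[0]]
        for t in order[1:]:
            running = max(running, CARD[t]) // 10   # pretend every join is selective
            total += running
        return total

    rng = random.Random(42)            # seeded so the "optimizer" is deterministic
    current = list(CARD)               # initial plan, e.g. from a heuristic
    best = current[:]

    temperature = 100.0
    while temperature > 1.0:
        candidate = current[:]
        i, j = rng.sample(range(len(candidate)), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]   # random permutation
        delta = cost(candidate) - cost(current)
        # Always accept an improvement; sometimes accept a worse plan to escape local minima.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current[:]
        temperature *= 0.95

    print(best, cost(best))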
If we want to make sure we generate stable plans, as I talked about before, then we have to do extra work to make the randomized algorithm deterministic. Postgres does this: it seeds the random number generator, so if I run the genetic algorithm on a query today and then run it again tomorrow, assuming the database is still the same, it will generate the exact same query plan. And then, again, we have to implement our correctness rules to make sure we never produce a query plan that generates invalid results.

This last piece is something I keep bringing up across all three types of query optimizers we've seen, because it's a big overarching theme in how you actually implement these things, and it's hard. Writing these query transformations in a language like C++ or Java is hard to do correctly, and it's hard to reason about whether we're actually generating proper query plans, because there's no way to formally verify that the rules do what they should for a particular query. The only way to test it is to build a fuzz-test generator that randomly generates a ton of queries and keeps throwing them at your optimizer to see whether it ever produces invalid results. But even then you can't be certain it's doing the right thing.

So, as we'll see next, the better approach is to have a declarative domain-specific language, or DSL, where we write our rules at a high level of abstraction, and then have a rules engine, which we can verify and understand, apply those transformations and produce proper query plans. The way to think about this is that rather than writing C++ code for all these transformations, we write them in a sort of intermediate language and have a rules engine apply them. Because we can reason about the rules we write in that intermediate language, and we can reason about the rules engine being correct, we know the engine will generate query plans that are correct.

This is the category of optimizers that is the state of the art now; these are known as optimizer generators. You write your rules in a language, and there's a rules engine that invokes them and performs the transformations on your query plan for you. The difference now is that the constraints we have about the physical properties of the data each operator generates and ingests can be included in these rules, and that information can be embedded in the plan itself for the different operators. So we no longer have to put logic in the cost model that says the data needs to be sorted, make sure it is; that can be enforced directly in the plan.
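To give a feel for what this looks like, here's a minimal sketch of a rule interface and a generic engine that fires whatever rules match, with join commutativity as the example rule. All of the types and names are made up; real optimizer generators have far richer pattern languages, but the shape is the point: rules only describe matches and rewrites, and the engine owns everything else.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct PlanNode {
    std::string op;                                  // e.g. "JOIN", "FILTER", "SCAN"
    std::vector<std::shared_ptr<PlanNode>> children;
};
using Plan = std::shared_ptr<PlanNode>;

// A rule only says what pattern it matches and what it rewrites that pattern to.
// It never touches the rest of the engine.
struct Rule {
    virtual ~Rule() = default;
    virtual bool matches(const Plan& p) const = 0;
    virtual Plan apply(const Plan& p) const = 0;
};

// Example rule: join commutativity, JOIN(A, B) -> JOIN(B, A).
struct JoinCommutativity : Rule {
    bool matches(const Plan& p) const override {
        return p->op == "JOIN" && p->children.size() == 2;
    }
    Plan apply(const Plan& p) const override {
        auto out = std::make_shared<PlanNode>(*p);   // shallow copy is fine for the sketch
        std::swap(out->children[0], out->children[1]);
        return out;
    }
};

// The engine owns the search: given an expression, it produces the alternatives
// that the registered rules can derive from it. Correctness arguments live here,
// once, instead of being scattered across hand-written C++ transformations.
std::vector<Plan> expand(const Plan& p, const std::vector<std::unique_ptr<Rule>>& rules) {
    std::vector<Plan> alternatives;
    for (const auto& rule : rules)
        if (rule->matches(p)) alternatives.push_back(rule->apply(p));
    return alternatives;
}
```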
So there are two approaches to doing this. There's stratified search, which is essentially planning in multiple phases, and there's unified search, where a single search space can handle transformations from logical to logical as well as from logical to physical. In stratified search you do logical-to-logical in the first stage and logical-to-physical in the second stage.

Stratified search is essentially what I just said. The basic idea is that we do rewrites on the logical query plan using transformation rules, and for this first stage we don't need a cost model, because these are standard things like predicate pushdown, where you know it's always the right thing to do and you don't care whether there's some other plan without predicate pushdown that could possibly be faster. Then we do another stage, the cost-based search, which converts our logical plan into a physical plan. So the first stage is logical to logical, the second is logical to physical, and you have a set of rules for each.

The most well-known stratified-search optimizer is Starburst, which came out of IBM Research. As I said, when IBM first built DB2 they took the query optimizer from System R; that got them going for a couple of years, but after a while it became a pain to maintain and there were obvious limitations in what the System R optimizer could do. So a team led by Guy Lohman, who retired from IBM a year or two ago, came out with Starburst. The idea, again, is that you have two stages: in the first stage you do rewriting using rules that look like relational calculus, and in the second stage you go back and do the System R-style approach to come up with the proper join ordering. Knowledge about what the data looks like permeates through all of these stages. As far as I know, the current version of DB2 still uses the Starburst query optimizer they developed back then, and at a high level this is essentially what Oracle does today as well.

Yes? The question is whether that first rewrite stage operates on the full query or on a single SQL block. It operates on a logical plan. Think of a nested query: the outer query would be one SQL block and the inner query would be another SQL block. Technically it's still part of the same query plan; it's just how you demarcate it.

All right, so the advantage of something like Starburst is that it works well in practice and you get good performance. But at least in the original version of Starburst, as described in the papers from the 80s and 90s, there were limitations. It was hard to assign priorities to transformations, meaning the optimizer applied them in effectively arbitrary or sequential order, but there may be cases where, say, predicate pushdown is the thing you always want to do first and should get higher priority than some other optimization. The first stage also had the problem that some logical-to-logical transformations would really benefit from knowing the cost, which was hard to do in their approach, and the optimizer sometimes had to go back to the cost model multiple times, which got expensive. And, although this may have gotten better since then, the original paper talks about how it was really hard for humans to write these rules, because they were written in a relational-calculus-like language that is unnatural for a typical systems developer. I assume they fixed that, but I haven't talked to Guy about it; I haven't seen him in a while.
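To make the first-stage idea concrete, here's a rough sketch of a rewrite like predicate pushdown on a toy plan tree. The node layout, the single-table-predicate assumption, and the function names are all invented for illustration; this is not how Starburst or any real system represents plans, just the shape of an always-beneficial logical rewrite that never consults the cost model.

```cpp
#include <memory>
#include <string>
#include <vector>

struct PlanNode {
    std::string op;     // "FILTER", "JOIN", "SCAN"
    std::string table;  // for SCAN: which table; for FILTER: the one table the predicate touches
    std::vector<std::shared_ptr<PlanNode>> children;
};
using Plan = std::shared_ptr<PlanNode>;

// Does this subtree produce tuples from the given table? (toy check)
bool produces(const Plan& p, const std::string& table) {
    if (p->op == "SCAN") return p->table == table;
    for (const auto& c : p->children)
        if (produces(c, table)) return true;
    return false;
}

// FILTER(t) over JOIN(L, R) -> JOIN(FILTER(t) over L, R) when the predicate
// only references tables coming from L (and symmetrically for R).
Plan push_down_filters(Plan p) {
    for (auto& c : p->children) c = push_down_filters(c);  // rewrite bottom-up first
    if (p->op == "FILTER" && p->children.size() == 1 && p->children[0]->op == "JOIN") {
        Plan join = p->children[0];
        for (auto& side : join->children) {
            if (produces(side, p->table)) {
                p->children[0] = side;  // filter now sits over just that join input
                side = p;               // and that join input becomes the filter
                return join;            // the join is the new root of this subtree
            }
        }
    }
    return p;
}
```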
Okay, so in the remaining time I want to talk about the unified search model, which is what the Cascades approach you read about uses. The idea here is that rather than having two separate stages, we have a single stage that can do the logical-to-logical transformations as well as the logical-to-physical transformations, all together. Everything is just written as a transformation, and the rules engine knows how to process them as needed and how to make sure the results are always correct. The major downside of a unified search model is that it generates way more transformations, because it may go logical to logical, or logical to logical to physical; it may try all the different combinations of transformations, and a lot of them end up being redundant. So to avoid going to the cost model over and over again for the same repeated transformation, these systems use a memoization table that keeps track of the cost estimate from the last time a transformation was checked and consults that to speed things up.

Before Cascades there was another query optimizer called Volcano, written by Goetz Graefe. This is the same Goetz Graefe as the Volcano iterator model that comes up when we talk about query processing, and he wrote the locking and latching paper you read for B-trees. Same dude, yes.

The question is whether the memoization in the unified search model is across queries or per query. It's per query: for every single query we start with a fresh memoization table.

Volcano was actually the second query optimizer Graefe built. There was a system before it called Exodus, which I think took something like the System R approach; the lessons he learned from that led to Volcano, and the lessons he learned from Volcano are what became Cascades. The big thing in Volcano is that the physical properties of the data are treated as first-class entities in the query plan, so we make sure to consider them when we apply transformations. The other goal was to make it really easy to implement new transformations and new equivalence rules using a DSL, without having to go in and modify the core engine: you write the transformations in one place, and the rules engine is implemented separately.

The unified search in Volcano works top-down, which is different from the bottom-up approach we saw in System R. The idea is that you start with the goal, the correct output of the query, and then you work from the top down, adding in the different operators of the query plan and estimating the cost as you go, so you can avoid going down a branch that isn't going to produce a good result. If you're coming from an AI background, this is backward chaining. As far as I know, the Volcano optimizer was never really used outside of academia: they implemented Volcano, and the paper mentions a bunch of academic prototypes at other universities that used it, but beyond that nobody actually implements this. The Cascades model is what everyone cares about.
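Just to pin down the memoization idea before we move on: you can think of the memo as a per-query table keyed on the expression being optimized plus the physical properties the parent requires, storing the best plan and cost found so far. This is an invented sketch; the real memo structures in Volcano and Cascades are organized around groups and are considerably richer.

```cpp
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

struct MemoKey {
    std::string expression;      // canonical form of the (sub)expression, e.g. "JOIN(A,B,C)"
    std::string required_order;  // e.g. "ARTIST_ID ASC", or "" if the parent doesn't care
    bool operator<(const MemoKey& o) const {
        return std::tie(expression, required_order) < std::tie(o.expression, o.required_order);
    }
};

struct MemoEntry {
    std::string best_physical_plan;  // in a real system, a pointer into the plan arena
    double best_cost;
};

class Memo {
public:
    // Consult the memo before re-deriving or re-costing the same thing again.
    std::optional<MemoEntry> lookup(const MemoKey& key) const {
        auto it = entries_.find(key);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }
    // Only keep a new plan if it beats what we already have for this key.
    void offer(const MemoKey& key, MemoEntry candidate) {
        auto it = entries_.find(key);
        if (it == entries_.end() || candidate.best_cost < it->second.best_cost)
            entries_[key] = std::move(candidate);
    }
private:
    std::map<MemoKey, MemoEntry> entries_;  // fresh for every query
};
```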
So, real quickly, it's important to discuss the distinction between the top-down and bottom-up approaches, because this is not the same thing as whether the search is stratified or unified; they're independent dimensions. In the top-down approach you start with what you want and work down from there; Volcano and Cascades are top-down implementations that also happen to use a unified model. In the bottom-up approach you start at the bottom with nothing and build up the query plan until you reach the goal; this is what System R does and what Starburst does. But there's nothing about top-down versus bottom-up that says you couldn't use either one with a unified search model; as far as I know, top-down is just what they decided to go with.

All right, so let's see how Volcano works. This will look a lot like what Cascades does, but we'll go through Cascades in more detail next time. Again, at the very beginning you start out with what you want your result to be: I want a three-way join between Artist, Appears, and Album, and I want my data sorted on the artist ID. Then we start invoking rules to do transformations, going either from a logical operator to another logical operator, or from a logical operator to a physical one. Logical to logical would be "join A and B" becomes "join B and A"; logical to physical would be "join A and B" becomes "hash join A and B." So here are our logical operators: we have our join operators, and then we have the access methods, how we're going to read the tables. Now I start from the top and ask, what do I need to get down to those leaf operators? Say I want to do a sort-merge join between the Album table and the output of Artist joined with Appears; going down from there, I can do a hash join to produce that intermediate result. Now I have a full plan, I can invoke my cost model, and it tells me the actual cost of executing it. And likewise, for that inner join I could use a sort-merge join instead.

The other important thing about Volcano and the unified search model is that we have enforcer rules, which we can use to require that the input to an operator be in a certain form or have certain properties. So we have an enforcer rule that says any data arriving at this top operator, the root, has to be ordered by the artist ID. That means if I have a hash join operator right below it, I know its output is not going to be sorted on that key, so I can prune that branch immediately: it violates my enforcer rule, and I don't need to look at anything below it. Alternatively, I could place a quicksort there, invoking an order-by with a quicksort on the artist ID, and then any join algorithm could feed into that. And because I'm estimating the cost as I go, I can recognize that doing this quicksort on top of a hash join is more expensive than just doing a sort-merge join directly, so I can kill that branch right there and not look any further.

Yes? The question is how you do predicate pushdown here. That's part of a transformation rule; it's logical to logical, and you can push the predicate down within this same framework. The follow-up is: since you always want to do predicate pushdown, does it make sense to just do it as a heuristic first and then run the search? Let's discuss that in the context of Cascades; it'll make more sense once you've seen groups and things like that.
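To make the enforcer-and-pruning step concrete, here's a tiny sketch of the decision we just walked through: a physical alternative either satisfies the required sort order naturally or pays for an explicit sort, and we skip anything that can't beat the best we've already found. The costs, flags, and names are all invented for illustration.

```cpp
#include <limits>
#include <string>
#include <vector>

struct PhysicalChoice {
    std::string name;           // e.g. "SORT_MERGE_JOIN", "HASH_JOIN"
    bool preserves_sort_order;  // does its output already come out in the required order?
    double cost;                // cost-model estimate for the operator itself
};

const double kSortCost = 50.0;  // pretend cost of an explicit enforcer sort

// Returns the cheapest way to satisfy the parent's requirement, pruning
// alternatives that can't beat the best plan found so far (branch and bound).
double best_cost(const std::vector<PhysicalChoice>& choices, bool parent_needs_sorted) {
    double best = std::numeric_limits<double>::infinity();
    for (const auto& c : choices) {
        double total = c.cost;
        if (parent_needs_sorted && !c.preserves_sort_order)
            total += kSortCost;  // enforcer: inject a sort above this operator
        if (total >= best)
            continue;            // prune: this branch can't beat what we already have
        best = total;
    }
    return best;
}

// Example from the lecture: a sort-merge join whose output is already ordered on
// the artist ID can beat a cheaper hash join once the hash join has to pay for
// the extra sort that the root's enforcer rule requires.
```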
Okay, in the interest of time I'll go through this quickly. The advantages of the Volcano optimizer are that we use declarative rules to specify the transformations, and the rules engine doesn't care whether a rule is logical-to-logical or logical-to-physical; it's the same kind of rule written in the same DSL, and the engine just applies them as needed. And we can cut down on the extra invocations of the cost model by using the memoization table. The disadvantage of Volcano, and what Cascades solves, is that all the possible equivalence classes, meaning all the different ways you can permute a logical operator, are expanded up front at the very beginning, and this explodes your search space. What Cascades can do instead is determine those permutations on the fly as it goes along, so it doesn't have to expand and examine things that are never going to produce a good query plan. That will make more sense next time. And related to the earlier question, not exactly predicate pushdown but rewriting of predicates in general, that was difficult to do in Volcano.

All right, I'm rushing to the end here, but the idea I want you to come away with is an appreciation for the different ways you can implement a search strategy. On Monday we'll see, with Cascades and with Orca, which is an implementation of Cascades from Greenplum, how you actually apply this technique in a modern system.

So what are my main parting thoughts? It should be clear that I think query optimization is very hard. There's a running joke in databases, from David DeWitt, that if you fail at query optimization you can always fall back to rocket science, because rocket science is easier than query optimization. So this is hard, and doing it well and efficiently is hard. The Cascades model is just one way to simplify the development of these things and make them easier to maintain and extend as you want to support new features. I would also say this is the reason the NoSQL systems claimed they didn't want to do SQL: they knew that if they supported SQL they would have to implement a query optimizer, so they just skipped it entirely. Of course, now some of them are going back and adding SQL, and as far as I can tell most of them are doing heuristic-based query optimization, the very first approach we saw, because that's where everyone starts, because it's easy.

All right, so Monday's class is more optimizers: First Blood Part Two, which is a Rambo reference that no one gets at this point because it came out before you were born, and that's okay. We'll talk about Cascades, Orca, and Columbia. Columbia, the master's thesis you read, is an implementation of Cascades; Orca is another implementation of Cascades from the Greenplum guys. And in Peloton we use the Cascades model as well.

All right, any questions? You have a question, yes. Can you go back to slide 32? Very specific; which one do you want, 32? Yeah. All right, sorry about that. So the question is about this slide: what exactly is a logical-to-logical transformation here?
So again: do I join A and B, or do I join B and A? That's a logical-to-logical transformation. Does this phase produce multiple plans that get selected from in the next phase? In this phase here, predicate pushdown is an example of a logical transformation you do without needing the cost model, because you know you always want to do it. The join ordering stuff is where you need a cost model, and that's done in the second stage. Can I take the output of this stage as the most optimal logical plan? The question is whether this stage produces the optimal logical plan. No, because I can join A and B or join B and A, and at the logical level one of those is going to be better than the other; you can't tell which without the cost model. All right, any other questions? Enjoy the snow.