Yo, hey, yo, hey, yo, yo. Pack the chrome styles, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stones. I put heads to bed, lick shots and rap with flame. But also, too, I always say blame it on me. As long as it's not murder one or something illegal, you can blame whatever you want on me. Like, "Andy said to do this" if you want to do something crazy. Within reason, for data stuff. The other thing I'll say is, I'll send out an email and we'll do what we did last semester: everybody in the class puts their CV in a directory, we send it out to all the database companies, and you get a bunch of interviews over the summer. So that'll fix you up, OK?

All right, so query optimization. Super important, super hard. Let's jump into this. But quickly, for administrative stuff: project two, the first version is due this Saturday. On the website there's a link to a Google Form; just put in what you've done and a link to it, and we'll look at it and give you feedback. I did update the schedule, and it's live on the website now. The project updates will not be next Monday; they'll be next Wednesday, the 5th. So Monday will be the lecture on cost models, Wednesday we'll do the presentations from the groups, and then the following Monday we'll talk about the real systems. All right, cool. And if you want to talk about project three, please come to my office hours or send me an email, OK?

All right, query optimization. As I said before, this is the hardest part of database systems. I fully admit this is the part of database systems I know the least about, because it's so hard. So we won't go too deep into the nitty-gritty details, but I want to give you a quick survey of why this problem is hard, what people have done, and what real systems are doing, and then we can extrapolate from that whether one approach is potentially better than another.

At a high level, what we're trying to do here is this: because SQL is declarative, someone writes a query that says, this is the answer I want, and it's up to the database system to take that SQL query and convert it into a physical plan that it can execute. The goal of the query optimizer is to find a correct execution plan that has the lowest "cost". Correct is obviously emphasized, because it doesn't matter if we have the fastest query plan in the world if we produce incorrect results. I'm assuming we're not doing approximate query execution; if we produce incorrect results, then who cares how fast it is, because people are going to complain. And I'm putting the cost part in quotes because it's a subjective value that will differ from one system to the next, based on what the priorities of that system are and what the operating environment is. It depends on a bunch of factors, and it's typically not tied to a real-world cost like execution time. We use some kind of stand-in, like the number of tuples processed or the selectivity and cardinality of the operators, to determine whether one plan is better than another. So this is an internal metric used within the database system itself, not something we can compare across systems: we can't take the cost estimates from MySQL's query plans and compare them against Postgres. Some systems like DB2 can also spit out an estimated execution time.
But most systems do not do that. So the reason this is so hard is that it's an NP-complete problem. Just figuring out the join order within the query is NP-complete, and if you take the whole query plan, that's even harder. So the name is a bit of a misnomer: even though we call this query optimization, or the query optimizer, no database system produces the truly optimal plan for a query, because it would just take way too long to run. It would not be good if a query shows up, we spend 10 hours trying to figure out the optimal query plan, and then the query runs in two seconds, right? So we use a bunch of heuristics and estimation techniques, not only to guess what the estimated cost of each query plan is going to be, but also to cut down the search space and throw away things we know we don't even want to consider, to make the problem more tractable. Yes?

If you have a really small query, in terms of what? He said SELECT 1, yeah, sure. We'll see it in a second: there's a notion of sargable queries, where I have no joins and an index that exactly matches what's in my WHERE clause. Yeah, that'll be optimal, absolutely. Or SELECT 1: some systems don't even send that through the optimizer. So the question is, if the query is simple enough, can we do an exhaustive search? Yes. But oftentimes, I would say, it's not just whether the search algorithm is efficient enough. The cost model estimates are going to be a big problem as well, and we'll cover that next week. The challenge is that the errors you make about, say, the selectivity in the lower portion of the tree get magnified as you go up, because it's errors on top of errors on top of errors. So once you get beyond about five tables in your joins, it's a sh**, it's a complete mess. And we'll see a paper next week from the HyPer guys, who developed this thing called the Join Order Benchmark based on the IMDB dataset, where they do a bunch of joins and you see how bad the plans get as the number of joins goes up.

But the point I was going to make about not finding the optimal plan, even though we call this the optimizer: back in the old days, in the 1970s, they called this part of the system the compiler. I think Snowflake in its documentation even refers to it as the compiler. And that's because, think of the 1970s when they developed SQL: they were coming from the world of, oh, C is a high-level language, that's better than assembly, we'll have a compiler generate machine code for us. Same idea here. SQL is a high-level language, and we'll have a compiler that generates the physical plan to execute the query. So some people say query planner, some people say query optimizer (that's what I say), and some systems call it a query compiler. We're all talking about the same thing.

All right, for the next two weeks, this week and next, we're going to talk about different ways to implement optimizers. We'll talk about some methods next class on how to do static, rule-based rewriting.
Then we'll spend time on how we actually do plan enumeration, essentially how to traverse the solution space to try to find the best physical plan, and that's probably where we'll spend most of our time. Then we'll also finish up next class by going over a bunch of the newer systems, like Calcite, MemSQL, and SQL Server, and how they do certain things. And then on Monday next week we'll do cost models, okay? All right, so today we'll cover the background of what query optimization means at a high level, then the basics of implementing an optimizer and the key decisions we have to make as part of that process. And then we'll go through a chronological history of query optimizer implementations going back to the 1970s. It seems bizarre: this is 2023, why am I spending time talking about how they did it in the 1970s? Because that's pretty much how people still do some things today. The way IBM invented cost-based query optimization back in the 1970s in System R is more or less what Postgres does today. Yes?

Her statement is that someone at Berkeley (Sanjay Krishnan, I'm assuming; he's now at Chicago) had a PhD thesis on using machine learning for query optimization. Yes, there have been several approaches to this. We'll talk a little about it next class. But to get there, we've got to talk about this stuff first, okay?

All right. The paper you read for today is an old paper, something like 1998, from Surajit Chaudhuri, who's one of the top people in the database group at Microsoft Research. It's an old paper, but it's a good survey of how these things actually work. And even if you bring in machine learning, it's still the case that the way we do transformations, the way we do rewriting, the way we figure out join ordering, it's the same problem; it's just whether or not we use ML to solve it.

So here's a broad overview of what the database system does leading up to the query optimizer. A SQL query shows up. There's an optional step where you can do rewriting at the SQL level, literally looking for string matches and rewriting stuff. This is rare; you only see it in some middleware systems. Then the SQL query gets parsed into an abstract syntax tree, so now you have a bunch of tokens from the SQL string. You run it through the binder, which maps table names to object IDs, column names to object IDs, and so forth. This can get quite tricky when you start doing a bunch of nesting and lateral joins, as we talked about. Then there's a rewriter step, where the annotated tree, a logical plan, comes in and you do a bunch of rule-based transformations to rewrite things. This is what we'll see in the heuristic-based approach as we go along. Then you have your cost-based query optimizer, which uses a cost model to estimate the execution cost of one query plan compared against another, and at the end of the day it spits out a physical plan. The physical plan is what we can actually run. The logical plan would be: I want to do a join on A and B.
The physical plan would be: I want to do a hash join on A and B. And then, whether we're doing just-in-time compilation or transpilation, you can take this physical plan and generate machine code for it. The next slide is basically what I just said. So there's a difference between the logical plan and the physical plan. The logical plan represents the relational algebra expression for the SQL statement, and ideally we then produce a physical plan that converts the logical operators into physical operators we can use to actually execute in our engine. These physical operators specify exactly what algorithm we're going to use, with what parameters, on what access path, and so forth. It's not always a one-to-one mapping between a logical operator and a physical operator. If I want to do a nested loop join on tables A and B, there will be logical operators for the scans and a logical operator for the join, and I convert that into a physical form: a sequential scan for A, an index scan for B, and a nested loop join to join the two together. In some cases, though, operators can merge: if I have a logical join and a logical order-by, I can convert that into a single physical sort-merge join operator. And in other cases they can explode. But for simplicity, for today's class, we'll always go one-to-one.

Then for relational algebra equivalences, this is stuff we covered in the undergraduate course, but the basic idea is that we can rely on the commutativity and associativity properties of relational algebra to reorder things in our plan and still know we're producing the same result. If I'm joining A, B, and C, then joining B and C first and then A is equivalent to joining A and C first and then B. We can exploit that to figure out the best order for our joins, the plan with the lowest cost. I don't want to say the fastest plan, because it isn't always just about speed; there might be other considerations, like network traffic or memory usage.

To his point here: can it be guaranteed that you find the optimal physical plan? For OLTP workloads this is really easy, because most of the time the queries are going to be what is called sargable, a weird term they came up with in the 1980s that means "search argument-able." That just means there's a direct mapping from the access method we need for the table, and maybe the join method, to an index that already exists. So say I have a table foo with a primary key id and a name column, and my query is literally just SELECT name FROM foo WHERE id = 123. I don't have to do any cost-based search: I know immediately from my WHERE clause that I'm doing a lookup on id, that's the primary key, so I can just use that index. Even if there's an INCLUDE column in your index on id, where name was included in it, same thing: simple heuristics decide the best index to use for the query. In our world, for the OLAP stuff we're talking about this semester, that's almost never going to happen.
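For reference, here is a minimal sketch of that kind of sargable lookup. The foo schema and the index name are just illustrative placeholders, not taken from the slides, and the INCLUDE syntax shown is the Postgres/SQL Server flavor:

```sql
-- A sargable lookup: the WHERE clause maps directly to an existing index,
-- so a simple heuristic can pick the access path with no cost-based search.
CREATE TABLE foo (
    id   INT PRIMARY KEY,   -- the primary key implies an index on id
    name VARCHAR(64)
);

-- Equality predicate on the indexed primary key -> just use that index.
SELECT name FROM foo WHERE id = 123;

-- Still sargable with a covering index; an index-only scan becomes possible.
CREATE INDEX foo_id_incl_name ON foo (id) INCLUDE (name);
```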
Because in the OLAP world, once you start doing joins, you have to take cost into consideration. So we've been using the terms cost model and cost estimates: this is the backbone of how we decide whether one query plan is better than another, either at a logical level or a physical level (we can ignore that distinction for now). The basic idea is that there's statistical data we've collected or extracted from our databases, and we can build histograms, sketches, and additional metadata about min/max values, like the zone map stuff, or about the distribution of values, and use that to estimate the expected execution cost of either a single physical operator or the whole collection of them in a query plan. Every database system has its own way of doing these estimates. This is where things go really wrong, as I said; it's actually where I think machine learning is a better approach than hand-written statistics. But we'll cover all that next week. Just know that as we talk about these different optimization strategies, when we get to the cost-based ones, this is what they're relying on. You can see this in Postgres: if you log in, there's a catalog table called pg_statistic, and for every table they maintain a little histogram of the values. Every time the vacuum (their garbage collector) runs, they try to update it, and this is what they use as the basis for deciding the selectivity of operators.

All right, so I'm going to go through five different design decisions you have to make in a query optimizer implementation, just a quick smattering of the things you'd consider if you were building one from scratch, and then we'll jump into the history of the different strategies. We'll go through these one by one.

The first thing we have to deal with is what the optimizer is going to examine: what's the scope of its optimization search? The two choices are operating on a single query at a time or on multiple queries at a time. Most systems work the first way: you open up a terminal, you submit one query, the database system gets it, parses it, runs the optimizer, and spits out a single plan. 99.9% of the queries in the world work this way, because the system doesn't know what other queries it could combine it with; it's only looking at that one connection and that one query showing up. Typically it also won't account for other queries that could be running at the same time on different connections. There are some tricks you can do, like scan sharing, as we talked about in the intro class, where you know there's some cursor ripping through a table that you also need to read, and you can piggyback off of it. But that decision is made during execution, not in the query optimizer; by the time you get there, the query optimizer is done. Some systems can be very sophisticated in their cost models and account for other queries running at the same time during query optimization, I think DB2 can work like this, but at the end of the day they're still only doing optimization for your single query.
The alternative is that if you have multiple queries showing up at roughly the same time, and you know they'll execute within roughly the same time window, you could try to optimize them together, a global optimization. What might be the best plan for each query individually might globally be the worst choice, because they all contend with each other. And if you can do this, you can be more aggressive about sharing results. Again, I don't think anybody does this (DB2 might), but it does show up sometimes in stream processing systems, or continuous query systems, where you define queries you want to run all the time, and the optimizer can look at those and figure out how they compose and fit together.

Two years ago we had the folks from dbt come give a talk. It's super common in the real world now; it's what data people use for pipelines. It's basically Jinja templates over SQL queries that define how you do transformations in your data warehouse. And I asked them, since you have all the queries ahead of time, do you do any additional optimization globally? They don't. But that would actually be a case where you could do it, because the system would know all the queries ahead of time, know they're going to be executed in a certain order, and could optimize them together. Again, nobody does this. Yes? For dbt, or just in general? Yeah, nobody does it. Only academic systems. Yes? Do you mean running similar workloads, do they at least detect that? They're aware of it, but they don't do anything with it. Yeah. With dbt it would even be within the same company, where you know all the queries that will execute ahead of time, and still nobody does any global optimization. It's hard, right? And it would only work if you're given, hey, here's the batch of queries I'm going to execute at this time, in this interval, in this pipeline. Most of the time it's: I open up the terminal, I send one thing. That's why everyone builds it the first way.

All right. The next question is how the system actually decides to optimize the query. The first option is static optimization, which pretty much everyone does: a query shows up, you parse it, plan it, run it through the optimizer, generate a query plan, and then the optimizer is done. No matter what happens while you run the query, you stick with that plan. Some systems can do dynamic optimization, where they generate most of the query plan, but maybe not all of it, then start running it, and there are decision points where they can decide whether to enable certain optimizations. Snowflake can do this: they can decide at runtime whether to push down a group-by aggregation in your query plan. You run the query, and if some threshold is met, it does the group-by aggregation early; otherwise it does it up above, where it would normally be. They make this decision based on how much data is coming out of the operator. This is more common in the cloud systems or the data lake systems, where you maybe don't have good statistics, or any statistics at all, on the data sitting in your Parquet files in S3.
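As a rough sketch of what "pushing a group-by below a join" looks like at the SQL level: the fact/dim tables here are made up, and the rewrite is only valid because dim.id is assumed unique, so the join does not duplicate fact rows.

```sql
-- Original shape: join first, aggregate at the top.
SELECT d.region, SUM(f.sales) AS total
FROM fact f
JOIN dim d ON f.dim_id = d.id
GROUP BY d.region;

-- With the aggregation pushed below the join: pre-aggregate the fact table
-- per join key, then join the (hopefully much smaller) intermediate result.
SELECT d.region, SUM(pre.partial_sum) AS total
FROM (SELECT dim_id, SUM(sales) AS partial_sum
        FROM fact
       GROUP BY dim_id) AS pre
JOIN dim d ON pre.dim_id = d.id
GROUP BY d.region;
```

The dynamic optimization described above is essentially choosing between these two shapes at runtime, based on how much data actually comes out of the scan.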
Databricks can do something similar to this as well. Adaptive query optimization looks a lot like this, where the idea is you use your static optimizer first to generate the query plan, but you generate some parts of the plan, or multiple versions of it, and then decide as you're running: if the current query plan you chose turns out to be inefficient, you can maybe switch to another query plan tree while the query is running. That's rare; it's hard to do. The most common version is just some basic threshold that says: if my estimations are way off, kill the whole query, throw away any results I've already computed, go back to the optimizer and tell it, hey, you were wrong, here's why, and try to generate a new plan. The more sophisticated approaches, only in academia, try to keep some of the work you've already computed and reuse it when you come back. But as far as I know, no production system does that.

All right, the next issue we've got to deal with is prepared statements. I teach prepared statements in the intro class, so everyone should know what a prepared statement is, but I'll go through it real quickly. Say this is a query we're going to execute all the time, and it has some constants in it. Instead of sending this query every single time and having to parse, plan, and optimize it, I can declare it as a prepared statement; in this case I'm calling it my_query. Then whenever I want to execute it, I just use the EXECUTE command (in Postgres, and in the SQL standard), and that runs the query, reusing the cached query plan I generated when I first prepared it. Postgres doesn't work exactly that way (they do lazy binding before the query optimizer), but think of it as working the same way: I have a cached query plan, and I don't have to parse and plan the statement every single time; I cache it once and reuse it. Of course, in this example all the values are hard-coded. To make this more useful for other parts of the application that want to run the same query on different data, I can put in placeholders for variables. In Postgres it's $1, in JDBC it's a question mark, but they all work the same way.

So now the challenge is: what join order should the optimizer generate for this query over A, B, and C? In my plan here I join A and B first, followed by C. But how can I pick the optimal join order if these parameters aren't known when I prepare the statement and are only known at runtime? If I have to generate the query plan every single time I invoke the prepared statement, that defeats the whole purpose of having a prepared statement. For this example it seems trivial, but if it's a really complex query that I've defined as a prepared statement, and the optimizer takes 10 seconds to run for a query that takes one second to execute, that's a bad tradeoff: I'm not getting any benefit from prepared statements.

So there are four choices here. The first is that you just reuse the last plan you generated after you've run the statement some number of times. That's what Postgres does: for the first five invocations of a prepared statement, Postgres just optimizes it from scratch.
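Here is a minimal sketch of that flow in Postgres-style syntax; the A/B/C schema, the column names, and the placement of the $1 parameter are made up for illustration:

```sql
-- Prepare once: parse and bind a parameterized statement under a name.
PREPARE my_query (INT) AS
    SELECT a.name
      FROM A a
      JOIN B b ON a.id   = b.a_id
      JOIN C c ON b.c_id = c.id
     WHERE c.val = $1;

-- Execute many times with different parameters, ideally reusing a cached plan
-- instead of re-optimizing the join order on every invocation.
EXECUTE my_query(123);
EXECUTE my_query(456);
```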
Then, after those first five invocations, it caches that plan and reuses it going forward. I forget whether there's a threshold where it says, okay, let me go back and try to optimize it again.

The next approach is to be a bit more clever, as I was just saying: there could be a threshold that decides when to go back and re-optimize. Ideally I'd want my optimizer to be aware of the last plan in my cache and use it as the starting point for the search, because I can use it as an upper bound in the search process. I don't think anybody does this, because most query optimizer code is not re-entrant: you can't do a branch-and-bound search, pause it, and then come back and pick up where you left off, because they just don't maintain that history.

The next approach is to have multiple plans. I think this is what Oracle and SQL Server do. The idea is that if I have statistics, histograms of what the values look like for these different parameters (assuming the parameters map directly to columns and aren't derived values), then I can maybe keep different query plans for different buckets of the value range, so that when the prepared statement is invoked at runtime with some parameters, I look at those parameters and pick the query plan for the bucket they fall into. The problem is that the more parameters you have, the more possible combinations of buckets there are; it can explode and become really expensive. So I think SQL Server and Oracle have basic implementations of this.

Another easy alternative would be to take the average value of every parameter and use that as the input when you ask the optimizer to generate the plan for the prepared statement. I think prepared statements are super interesting. Everybody does something different, and I think there's actually a research paper in here, but we can take that offline.

All right, the next thing we've got to deal with is plan stability. In the database world, a faster database is nice, and everybody says they want that, but oftentimes what's more important than raw performance numbers is stability in your performance numbers. It doesn't help if we do all the things we talk about this semester and make execution really fast for 99% of the queries, but then there's some 1% of queries where some days they're super fast, the next day they're super slow, and then they switch back and they're fast again. That drives people crazy; nobody wants that. So when you see these database companies saying, here are my TPC-H numbers, my TPC-DS numbers, look how fast it is, the metric people often really care about is how stable those numbers are for their regular workload. People don't want these oscillations, because then it's hard to figure out: is my application slow because I changed something in my application, or is the database doing something weird? How do you debug that? So how do you maintain plan stability? Usually the question is whether there's a way to prevent the query optimizer from choosing a different query plan from one day to the next. I basically want to generate the physical plan and then freeze it for that query, whether it's a prepared statement or not; I care about both cases.
I want to freeze that query plan and be able to reuse it. The first way to do this is through hints. This is where, in the SQL statement itself, the developer or the DBA tells the optimizer what the query plan should look like: whatever you were going to do, ignore it, do what I want. Postgres famously does not support hints; that's an actual design choice in Postgres. Every other major database system supports hints. But you can get hints in Postgres with an extension called pg_hint_plan, where you put a little SQL comment up at the top of the query and tell it: use a nested loop join for T1 and T2, use a merge join for T1, T2, and T3, and this Leading hint means here's the join order I want you to consider, T1, then T2, then T3. You put this as a comment in the SQL string you pass to the system, but you have to have the pg_hint_plan extension enabled. Oracle has its own version, same idea: here's the order, use a nested loop join, use this index. In Oracle's case you put it after the SELECT, whereas with pg_hint_plan you put it before. All the major database systems have their own flavor of this.

The next choice is to pin the optimizer version. Say you've vetted all the query plans that your database system has generated for the queries in your application, and then you upgrade to a new version of the system, from Postgres 13 to 14, or Oracle 19 to Oracle 22, whatever. It may be that the new version starts generating bad query plans for some of your queries, maybe only 1% of them. So in some systems you can specify that you want the database to use an older version of the query optimizer to generate the plans for a particular query, because you know you'll get back the plan that was fast before. Oracle does this. Every version of Oracle they ship contains, in the binary, every previous version of the optimizer they've created within the last decade or so. So you can install the latest Oracle 22 but tell it, I want the Oracle 12 optimizer for these queries. Why are you shaking your head, as if it's terrible? He said: you need to maintain multiple versions of the optimizer. Yes, but again, he owns a Hawaiian island; he's making a lot of money. People pay a lot of money for this stability, for this feature. I wouldn't want to build and maintain that myself, but it matters, right?

The problem, as I always say, is that if you stick queries through the old optimizer, then as they put new features into the system, you won't be able to take advantage of them. We did a deployment of OtterTune, back when it was still the university project, for a French bank. They were running, I think, Oracle 18, and they had it set to the Oracle 11 optimizer for some queries, and OtterTune figured out that was just wrong, switched it, and things got faster. But again, this is something you mostly have to do manually; the system won't do it for you automatically.
The last one is to basically override the entire query plan itself, where, before I do my upgrade... sorry, yes? The statement is: since Oracle maintains different versions of the optimizer, does that mean they wouldn't want to make major changes to the optimizer from, say, version 11 to version 18? They could still make big changes, because anybody who wants the old behavior can just pin it to the version 11 optimizer. But if you're not making major changes, the code base stays similar enough that it's not a radical rewrite. Sure, that's a software engineering decision. Right, and they don't want to have to rewrite the query optimizer to account for that. Again, the query optimizer is the hardest part of the database system; I don't consider myself qualified to work on them, and it's really hard to find good people who can do this stuff. For better or worse, I've been told that some of these companies like hiring people who do programming languages research, because they can be good at query optimization, and high-energy physics is the other one. So if you don't make it in high-energy physics, or you find high-energy physics too easy, you can switch to query optimization. All right, so to your point: no major company is doing a complete rewrite of the query optimizer from one major version to the next. That would be suicide.

All right, so the last option is that you can override the query plan that the database system wants to generate with the one you want it to use. If you go from an old version of the system to a new version, you dump out the query plan from the old version that you liked, then you upgrade, and you tell the new version: don't do whatever you want to do, here's the query plan I want. SQL Server can do this, and it looks like this: here's my SQL query at the top, with USE PLAN, and then there's this giant XML thing, which I've cut off because it's long. It's literally the representation of a query plan in SQL Server in XML form, and you're telling SQL Server, use this. And it will.

All right, the last design decision we've got to deal with is when to stop searching. The most obvious approach is wall-clock time, and this is what most systems do. If you finish your search exhaustively, you're done; otherwise you could keep running forever, or for a really long time. So typically there's some knob you can set, say only run for 100 milliseconds, at which point you terminate the search and use the best query plan you've found, even though it may not be optimal. In other cases there's a cost threshold: if I find a plan with a cost lower than some value I specify, I just stop rather than searching exhaustively. Another one is: if I haven't found a better plan in a certain amount of time, or a certain number of iterations, I quit. These are classic search and optimization techniques from computer science; nothing here is specific to database systems. All right, so now let's go through a bunch of different approaches for how to build a query optimizer.
And as I said, we're going to go forward chronologically, starting from the 1970s and moving up to the 1990s, because that's when the database community figured out the stratified versus unified approaches. That will be the segue into the Cascades stuff we'll read and cover next class. The big debate is: do you want a stratified search going from the bottom to the top, or a unified search, like Cascades, going from the top to the bottom? That's the central question in how you build a query optimizer. And as we go forward in time, we're also going forward in complexity: we'll start with the simplest approach and finish with the more complex ones.

So the simplest way to build a query optimizer, and this is what pretty much any new database company starting from scratch will do first, is to use heuristics, or rules. You have hard-coded rules in the code that look for certain patterns in the logical query plan, and rules that convert them into the physical form. The most classic one is predicate pushdown: you identify that you have WHERE clause predicates, or filter operators, and you push them down to be inline with or right above the sequential scan. For joins, it's something really simple, like picking the join order based on the order the tables are referenced in the query, or doing a greedy search on cardinality estimates. This is what two out of the three major relational databases built in the 1970s did. The three were Ingres, Oracle, and System R, and two of the three, Ingres and Oracle, took this approach in the very beginning. Every new database system written from scratch by a startup is going to do this. MongoDB, I think, still does this, as far as I know.

So I want to go through what Ingres did. I'm not saying this is the way to do it; I actually think it's more of a historical curiosity, but it's really clever what they did, given the hardware limitations they had at the time, to figure out how to join things. And it will motivate why we want the more sophisticated methods. Going forward, I'm going to use a really simple three-table database: we have artist and album, and then appears, which is a cross-reference table of all the artists that appear on a given album. The query we're going to run is: get the names of all the artists that appear on my mixtape, ordered by their artist ID. So we want to do a three-way join.

The thing about Ingres is that the initial version in the 1970s couldn't actually execute joins. So in the query optimizer they had to rewrite queries that contained joins, because the only queries the system could execute were single-table selects. The first step is to decompose the original query into single-table queries.
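To make the walkthrough that follows easier to read, here is roughly what that query and the first decomposition steps look like in SQL. The schema, the temp-table names, and the album name literal are placeholders for what's on the slides, and Ingres actually did this at the logical level rather than as literal SQL:

```sql
-- The original three-way join: names of artists appearing on a given album.
SELECT artist.name
  FROM artist, appears, album
 WHERE artist.id = appears.artist_id
   AND appears.album_id = album.id
   AND album.name = 'Andy''s Drill Remix'
 ORDER BY artist.id;

-- Step 1: peel off the single-table lookup on album into a temp table.
CREATE TEMP TABLE temp1 AS
SELECT album.id AS album_id
  FROM album
 WHERE album.name = 'Andy''s Drill Remix';

-- Step 2: the rewritten query now references temp1 instead of album;
-- its appears-side lookup is peeled off next, into temp2.
CREATE TEMP TABLE temp2 AS
SELECT appears.artist_id
  FROM appears, temp1
 WHERE appears.album_id = temp1.album_id;

-- Step 3: the final lookup on artist, driven by temp2.
SELECT artist.name
  FROM artist, temp2
 WHERE artist.id = temp2.artist_id
 ORDER BY artist.id;

-- Ingres then substituted the scalar values from temp1/temp2 one tuple at a
-- time to remove even these remaining joins, as described next.
```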
So what I mean by that is: we take the first part of the query, the lookup on the album table with the predicate on the mixtape name, and write its result into a temp table. Then we rewrite the original query so that instead of doing a lookup on the album table, it does a lookup on the temp table we just defined. And then we do the same thing again: we decompose that query, because it still has a join, into two separate queries. The first one does a lookup on the appears table joined with temp1 and writes its output into temp2, and the second does a lookup on the artist table joined with temp2, which was defined by the previous query. Now, again, they don't support joins, so what they actually do is rewrite the queries by substituting the values generated by the previous queries in the dependency chain, injecting them so that each query does single lookups on a scalar value. In this case, say the name 'Andy's Drill Remix' is unique; when I execute the first query, I produce a single value. I can take that value and inject it into the query on the appears table: instead of doing the join against the temp table, I put in the album ID I got from the first query. I run that and produce two tuples, sorted by artist ID. Then I take the artist query and do the same thing, plug in the values to remove the join, run it, combine the results, and that's the answer to my query.

I'm showing this at the SQL level; I believe the original Ingres did it at the logical level. What's interesting is that this is a bit like the dynamic or adaptive optimization stuff we talked about before, because they're running the query optimizer, which is just a bunch of if-then-else statements, on a per-tuple basis. For every single value they generate a separate query, and for each of those queries they go to the optimizer and get a physical plan. So they're running the optimizer for every single query generated in all of this. I think this is cool. You would not want to do this in a real system today, but given what they were dealing with in terms of hardware at the time (if you read the paper, they talk about how great it would be to scan a table with 300 tuples, which is nothing in a modern system), it made sense.

So these heuristic-based approaches are nice because they're easy to implement; as I said, this is what BusTub does now, and it's what most systems start with. You can take the knowledge we have as humans about databases and codify it as rules that do these transformations, and there's a straightforward mapping from logical operators to physical operators. For simple queries, the sargable queries, this works great; it's fine. The problem is that you end up with magic constants all over the source code to handle the decisions the optimizer has to make. Predicate pushdown is an obvious one: always push that down.
But for something like join ordering, if you even want to consider what the order should be, there's going to be some magic constant in your heuristics. That's not to say the cost-based search we'll talk about later doesn't have magic constants too (Postgres has a ton of them; every system has a ton of them), but at least here it's very explicit in the code that you have them. The other challenge is that it's very hard to find the optimal plan for more complex queries, because there can be dependencies from one transformation to the next: you have to write in your code, if this transformation has been applied, then I can try this other transformation or this other optimization, otherwise I can't. And so the code becomes a big tangled jungle. This is actually what Postgres looks like: it has these if-then-elses where it applies transformation rules, a bunch before the cost-based join search and a bunch afterwards, and the code has to make sure the rules are applied in the right order, because otherwise you could end up in a funky state.

So again, if you're building a new system today, this is what everyone starts with. And this is what Oracle did for the longest time: Oracle started in the 1970s, and they didn't get rid of this until the 1990s. They got huge, made a ton of money, and basically dominated the database market by that time without anything beyond what I've described here. They didn't do it the Ingres way I just showed, but the basic idea is the same, with these if-then-else rules. If you go read the unauthorized (or authorized) biography of Larry Ellison, there's a page where Stonebraker talks about Ingres's query optimizer from the 1980s, which did cost-based search (they had improved on what I just showed), versus what Oracle had. Stonebraker complains that Oracle's optimizer was total shit, but Oracle's marketing people said, no, no, we don't have a heuristic-based optimizer, we have a "semantic optimizer," and that's somehow better than the System R approach. All they were really doing was defining the join order by the order the tables appear in the FROM clause. That was it. Then in the 1990s they rewrote everything. I've talked to people who actually worked on the Oracle query optimizer, at least in the 90s, and they said it was a lot of C code that was a total nightmare, because of all these if-then-elses for weird corner cases from all the different customers.

Okay, so as I said, Ingres from Berkeley and Oracle were doing heuristics at the very beginning. At IBM, in System R, they invented cost-based search. The technique I'm about to describe is the backbone for how pretty much every system does it today. Even if you use stratified search with a query optimizer generator, like in Starburst, at the end of the day it's still basically doing the same thing; it just has a cleaner API for defining the transformation rules. At a high level they all work the same. So the IBM System R project is where they took Ted Codd's paper and got a bunch of smart people in a room.
They hired a bunch of brand-new PhDs and said, let's go build this thing. And Pat Selinger was the one who took on the cost-based optimization problem and built the first query optimizer for relational databases. They do a bottom-up search: you start with nothing in the query plan, and then you iteratively, incrementally build up the join operators until you get to the end result, the query output you want. I'll show what that looks like in a few slides. System R does this, the older versions of DB2 do this, and most of the open-source systems do something very similar: SQLite, Postgres, MySQL, all of them.

So what they do is break the query plan up into blocks and generate all the logical operators for each block. Then, for each logical operator, they figure out the set of physical operators that could implement it. And then you need a bunch of heuristics to throw away the things you know you don't want to consider, to prune down the search space. Most famously, they only consider left-deep trees, not right-deep or bushy trees, even though the bushy tree might be the optimal query plan, because in the 1970s the hardware was so limited they had to make the problem more tractable. Even today it's complex. You throw away anything that's not left-deep, and you throw away anything that's a Cartesian product rather than an inner join.

So let's go back to the same query we had before. The first step is to choose the best access path for each table. We just look at what tables we need to access and what our WHERE clause looks like, and decide, okay, here's the best way to do a lookup on each of these tables; this is all decided by rules. Then we enumerate all the possible join orderings for these tables, in every combination, either as inner joins or Cartesian products, but they immediately throw away the Cartesian products, and this goes on for all possible join orders. The last step is a dynamic programming search to figure out, across all these possible choices, what the best join order is for the query plan, and what the best physical operator is for each logical join operator.

It looks something like this. We're doing a bottom-up search, so we start at the point where we have done zero joins, and the thing at the top is our goal; that's what we want to get to. In the first step we say, here are the different kinds of joins we can do for the different tables, either a hash join or a sort-merge join, and we have the logical operators represented along with the output each physical operator would produce. We do this for the first stage, and for each of these we use our cost model to estimate the cost of executing these physical operators. Then, for every path from one logical operator to the next logical operator in the next stage, we choose the physical operator that has the lowest cost; you throw away all the ones that don't have the lowest cost at this point. Then you do the same thing for the next stage going up.
Generate all possible physical join operators, choose the one with the lowest cost. And then you backtrack from the top going down and choose the path that has the overall lowest cost among all the paths. So it's a divide-and-conquer approach: instead of doing an exhaustive search over all possible combinations, we only look at a collection of paths going up.

One key problem with the System R approach is that there's no notion, in their representation of the query plan, of the physical properties of the data. Our query had an ORDER BY on the artist ID, but nothing in this query plan tree is aware of that. It's just joins; there's no order-by operator here, and it knows nothing about whether this hash join is going to produce sorted data or not. So after you generate the join order, you have to do extra work to figure out, okay, I need to tack on a sort here to put my data in sorted order. Another hack would be to embed in the cost model the consideration of whether the data needs to be sorted and whether this operator produces sorted data, and if I need sorted data and the operator doesn't produce it, I set the cost to infinity, so I make sure I don't choose it. But that's a hack, right? Because it's not just sorting: is the data compressed, is it sorted, could I sort it later on? If I'm only looking at the join ordering, I can't capture that. We'll see, when we get to Volcano and then next class with Cascades, that if we do a top-down approach and include these enforcer rules, these properties, in the physical nodes as we do the transformations, we can be aware that the data needs to be sorted, or needs to be of a certain type, or have a certain property, and we can discard things as we go along that don't guarantee those properties. Postgres handles this as an afterthought: you either embed it in the cost model or you do additional checks, whereas the top-down approach can look at these things holistically. That will make more sense as we go along.

Yes? What if there's a three-table join and an aggregation, and then a join with another table; would the optimizer optimize the three joins first, then the rest separately? So his question is: what if there's an aggregation in the query plan, how would it handle that? In the System R approach they would treat those as separate blocks: you optimize the first block, then optimize the second block, and then stitch them together. So it cannot find the optimal plan? In that case, yeah, it cannot. For this example it could not find the globally optimal plan if you have a pipeline breaker like that. Yes? My understanding was that the System R style does keep track of interesting orders. So he's saying the System R style does keep track of interesting orders, but that's more like a hint to say, consider these things first; it's guiding the search, not restricting what the search can do. That was my understanding of how it works. So I guess you keep one plan which produces the order you want versus one which doesn't, and then you have to account for the extra sort cost at the end. Yes. Yeah.
But I don't think the original join enumeration search does that; I think it's tacked on at the end. I'm trying to describe the original System R version: it got the basic idea right, but then had to tack on a bunch of other things to deal with its limitations. That's my understanding of how it works. Yes? Yeah, so his statement is that the bottom-up approach is a greedy algorithm, so it's not guaranteed to find the optimum. That is correct. But again, we'll see when we talk about top-down with Volcano and Cascades that you have to do a bunch of similar things to cut things off, because otherwise the search is exhaustive, and nobody wants that for a complex query. So there are basically similar rules where you cut things off to avoid searching forever. Yes? If you track the best plan which doesn't produce the ordering and the best plan which does produce the ordering... the ordering of the data, not the ordering of the joins... yeah. So he's saying you would generate the best plan that guarantees the ORDER BY and the best one that doesn't, and after you've figured out the best path through the query plan, you look to see whether it's better to tack on the extra sort for the ORDER BY versus having the ordering produced by a particular join. Yes, that makes sense. Thank you.

All right. So the bottom-up approach we've already talked about: you start with nothing and build up the query plan to get to the outcome you want. The top-down approach, which we'll see in a second with Volcano and then with Cascades next class, is where you start with the outcome you want and traverse down, adding operators to the query plan, until you get to a complete and valid plan. We'll cover that in more detail in a few slides.

For Postgres, everything I've described so far, the System R approach, is basically what they do if a query has fewer than 13 tables; I'll talk about what they do when you have 13 or more in a second. They do heuristics, followed by the cost-based join search, followed by some more heuristics to put things back together and make sure everything is set up correctly. As I said, the Postgres code is beautiful, at least as C goes, but the query optimizer is, I won't say the least sophisticated part, but it's the dodgiest part. It's been worked on iteratively, and it's a bunch of rules, followed by more rules, followed by the cost-based search, followed by more rules. And then they tacked on the genetic algorithm stuff, which we'll talk about in a second, almost as an afterthought. The big challenge of working on the Postgres query optimizer is that you have to make sure you apply the transformations for these static heuristics in the right order, otherwise the query plan could end up being incorrect. At the very beginning, when we started building our own system at CMU, my first or second year, we had lunch with one of the original people who worked on Postgres in the 1990s, one of the key contributors; he's actually a CMU alum, he lives in Pittsburgh.
We asked him about the query optimizer, and he kind of shook his head and said, yeah, that's the dodgiest part. Because this is just hard. And I'm not knocking Postgres; I think it's a fantastic system, of course. It's just that this one part is not as sophisticated as some of the other open-source implementations out there today, like CockroachDB.

All right, so the pros and cons of this approach. The positive is that it usually finds a pretty good query plan for most reasonable queries without having to do an exhaustive search. The problem is that it has all the problems of the heuristic-based approach: it's still a bunch of hand-written rules, and if you make the restrictions System R does, where you only consider left-deep trees, you're not guaranteed to find the optimal query plan. And at some point something has to reason about whether the query plan produces data with the right physical properties, either in the cost model or tacked on at the end.

Another approach you could take is a randomized algorithm. Basically, you do a random walk over all the possible plans for your query, making sure you only consider things that are actually valid and produce the correct result, and you keep track of the best query plan you've seen so far. At some point you terminate, after a timeout, and you hope you've found something that works reasonably well. Oftentimes, at least in our own research and in other papers, random search does pretty well against the machine learning stuff, so this is a reasonable approach. The only system I know of that actually does this in production is Postgres; they have a genetic optimizer, which we'll talk about in a second. There was also a paper in the 1980s on using simulated annealing. The basic idea is you do random swaps in your query plan to see whether you stumble upon a better plan: you flip a coin to decide what portion of the query plan to switch, you switch it, and you see whether that makes things better. If yes, you keep going with it; otherwise you throw it away and try again. Of course, you have to check at every step that you haven't generated an incorrect query plan, so you still have to write hard-coded rules to make sure the physical properties of the data are guaranteed.

So let me talk about the Postgres genetic optimizer. Again, this code path only gets triggered if you have 13 or more tables in your query; there's a configuration parameter that specifies the threshold, and by default it's 13. The way it basically works is: you generate a bunch of random query plans and start doing random flips like in the last slide. You figure out which one is the best and which one is the worst, then you take the best one, try to extract traits from it, and carry them over to the next generation, so that some aspect of the query plan keeps improving, because some portion of the plan turns out to be better than what you've seen before. And you keep going until you time out. So it works like this: say I'm joining three tables, R, S, and T. At the very beginning I generate a bunch of random orderings using hash joins versus nested loop joins.
So let me talk about the Postgres genetic query optimizer. Again, this code path only gets triggered if you have a lot of tables in your query, 13 or more; there's a configuration parameter (geqo_threshold) that controls the cutoff. The way it basically works is that you generate a bunch of random plans and start doing random flips like on the last slide. You figure out which one is the best and which is the worst, and then you take the best one, try to extract traits from it, and carry them over to the next generation, so that some aspect of the query plan keeps improving because some combination of plan fragments turns out to be better than what you've seen before. And you keep going until you time out.

So say I'm joining three tables, R, S, and T. At the very beginning I generate a bunch of random orderings, using hash joins versus nested-loop joins. Then I run them through my cost model and compute the cost for all of them. Say this one has the lowest cost, so I keep track of it as the best plan I've seen so far. I throw away the worst one, take portions of the remaining query plans, cross-breed and permute those traits together to generate a bunch of new query plans, and do the same thing again: compute the costs, the lowest-cost one becomes the best I've ever seen, throw away the worst, keep traits from the others and carry them forward. The idea is that hopefully you eventually stumble upon something better.

The advantage of this sort of randomized algorithm is that you jump around the search space, and sometimes it can work pretty well. It's also low overhead to maintain the state, because you don't have to keep the history of all the previous generations; you throw them away and only carry the best traits forward. Same thing for simulated annealing. The challenge, of course, is that figuring out why the database system chose a given query plan is hard, because it's a randomized algorithm. In the case of Postgres, they're careful to pass in a deterministic random seed for each query, so that if I run the same query through the genetic optimizer over and over, it's guaranteed to generate the same query plan every time; otherwise that would break the plan stability stuff we talked about before. And you still have to implement all those correctness rules we talked about, to make sure you don't generate query plans that are just way out of whack.

Two years ago we had one of the people who works on the Postgres query optimizer give a talk at CMU, over Zoom during the pandemic, and he explicitly said that the genetic part of the genetic optimizer doesn't really work that well. The way they carry traits forward doesn't really do anything; they think it's broken, and it's pretty much just a random walk. He seemed somewhat disappointed by that. But again, the code is there, and if you have 13 or more tables, you'll get it.
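And here is a minimal sketch of that genetic search, with the same caveats: this is not Postgres's actual GEQO code, the cost model is fake, and real GEQO uses a fancier crossover. It just shows the population, selection, and cross-breeding loop, plus the deterministic seed so the same query always gets the same plan.

```python
# Toy sketch of a GEQO-style genetic search over join orderings. The cost
# model, population size, and crossover are all simplified stand-ins.

import random

TABLES = ["R", "S", "T", "U", "V", "W"]

def fake_cost(order: tuple) -> float:
    # Stand-in for running the ordering through a real cost model.
    return (hash(order) % 10_000) / 100.0

def crossover(a: tuple, b: tuple) -> tuple:
    # Keep a random prefix of parent A, fill the rest in parent B's relative order.
    cut = random.randrange(1, len(a))
    prefix = a[:cut]
    return prefix + tuple(t for t in b if t not in prefix)

def genetic_search(generations: int = 50, pop_size: int = 8, seed: int = 42) -> tuple:
    random.seed(seed)  # deterministic seed, like Postgres does per query
    population = [tuple(random.sample(TABLES, len(TABLES))) for _ in range(pop_size)]
    best = min(population, key=fake_cost)
    for _ in range(generations):
        population.sort(key=fake_cost)            # cheapest first
        best = min(best, population[0], key=fake_cost)
        population = population[:-2]              # throw away the worst plans
        while len(population) < pop_size:         # breed replacements from survivors
            a, b = random.sample(population[: pop_size // 2], 2)
            population.append(crossover(a, b))
    return best

plan = genetic_search()
print(plan, fake_cost(plan))
```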
All right, so in all the different implementations we've talked about so far, it was humans writing the actual code to check for the rules and then apply the transformations. But writing these checks and transformations is difficult to do in a procedural language like C, C++, Java, Rust, whatever, because we're trying to operate over sets, over relational algebra, and that's hard to express in procedural code. So a better approach is to have a DSL, a higher-level language we define for our database system, where we declare what our transformation rules are: the patterns we're looking for and the transformations we want to apply. Then something takes that DSL and compiles it into procedural code, like C code, and we use that to do our rule checking and transformations. That's a better approach because the high-level DSL is easier to maintain, and it's easier to reason about what the transformations are and what the patterns are checking for, rather than reading if-then-else statements over and over again.

So people realized in the late 1980s that optimizers are hard, and writing all this procedural code for the transformations was a big pain. There was a movement to build what I call optimizer generators: you define the patterns and rules in a high-level language, run it through a specialized compiler, and it spits out the optimizer code for you. One of the first was IBM's Starburst. Starburst was a follow-up system, sort of a distributed version of DB2, and my understanding is that the core framework of its query optimizer eventually made it into regular DB2. Exodus was the precursor to Volcano, which was the precursor to Cascades; that's all the same German guy, Goetz Graefe. Again, the high-level idea is the same: it's an optimizer generator. What we used to call the Volcano model was an optimizer generator plus a system that implements what the generator spits out. And I would say this is how you would build a modern query optimizer today: define all the rules in a high-level language and have this thing generated for you, so you're not writing the tricky code that's in something like Postgres today.

So there are two ways to structure what the optimizer spits out, two overall approaches to query optimization. Either you do stratified search, the heuristics plus cost-based search we saw in System R, which is typically done bottom-up, or you do unified search like in Cascades, where you look at the entire query plan holistically and do all your transformations in one stage, without separating the rule-based part from the cost-based part; everything runs together.

Stratified search is what we've already talked about. The idea is that you have a bunch of rules that fire at the beginning, that don't require a cost model, to convert a logical plan into another logical plan. In the DSL we can define which transformations are allowed in which cases and what physical properties need to be guaranteed from one operator to the next, and the rules engine enforces that for us. You do all the heuristic stuff first, basically the same thing we did in System R, except instead of writing C code you write it in the DSL, and then you do the cost-based search to map the logical operators to physical operators. As I said, Starburst was the first system that did this at IBM; I'm not aware of anything that predates it. They represented the rules in a high-level language that converted the logical plan into something that looked like relational calculus, then did other transformations on that relational calculus plan, and then converted the relational calculus into the physical operators that the system could execute.
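To give a feel for what the rules-as-data idea buys you, here is a toy sketch. This is not Starburst's rule language or any real system's DSL; it just shows transformation rules expressed as pattern-plus-rewrite pairs that a single generic engine applies, instead of hand-written if-then-else code for every case.

```python
# Toy sketch of transformation rules as data: each rule is a (name, pattern
# check, rewrite) triple, and one generic engine applies whichever rules match.

from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalJoin:
    left: str
    right: str

@dataclass(frozen=True)
class HashJoin:
    left: str
    right: str

# Logical -> logical: join commutativity.
def commute_matches(op): return isinstance(op, LogicalJoin)
def commute_apply(op):   return LogicalJoin(op.right, op.left)

# Logical -> physical: implement a logical join as a hash join.
def to_hash_matches(op): return isinstance(op, LogicalJoin)
def to_hash_apply(op):   return HashJoin(op.left, op.right)

RULES = [
    ("JoinCommutativity", commute_matches, commute_apply),
    ("LogicalJoinToHashJoin", to_hash_matches, to_hash_apply),
]

def expand(op):
    """Generic rules engine: return every alternative the rules produce."""
    return [(name, rewrite(op)) for name, matches, rewrite in RULES if matches(op)]

for name, alternative in expand(LogicalJoin("Artist", "Appears")):
    print(name, "->", alternative)
```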
So this is the approach used in the later versions of DB2. My understanding is that, whatever portions of Starburst they kept or threw away, they did bring the query optimizer over into DB2. There's a great blog article I'll post on Piazza from James Hamilton. He worked at IBM when IBM was struggling to get DB2 to run well; it was a big blemish on IBM that they had a really weak database system, and they brought over the Starburst team to make the new version of DB2 work better. I think the initial Windows/Unix version of DB2 was based heavily on Starburst, and the query optimizer was a big part of what they brought over.

The great thing about Starburst is that it works well in practice: you get standard rules for the simple things, and you can come up with pretty good plans very quickly. The downside is that writing these rules is hard, because you have to learn relational calculus. I don't even really teach relational calculus anymore, because nobody ever needs it unless they go build something like this. So if CMU is generating a lot of database students and they don't know relational calculus, what hope does the world have? I don't know whether DB2 is still heavily based on this, but I suspect it is. The other big challenge, if I remember correctly, is that there's no way to set priorities across transformations. Because you do the heuristics first and then the cost-based transformations, there may be cases where you want certain cost-based transformations to run first, because they have higher priority and would give you a better plan. But my understanding is that because the stages are separate, it can't support that.

All right, the last one is unified search. This is Volcano now, and then we'll segue into Cascades next class. The idea here is that we don't want separate stages. We don't want to do logical-to-logical first and then logical-to-physical; we want to do everything all at once. That makes it easier to specify priorities as we apply transformations, to guide us toward the optimal query plan, or what we think is the optimal plan, more quickly than having separate stages would. The downside is that because we're going top-down, we may end up repeating certain transformations over and over again in the lower portions of the tree. So to avoid recomputing redundant information, we make heavy use of memoization: we retain any estimations we've done for lower portions of the tree, so that if we produce a transformation we've already seen, we don't have to go through the process of costing it again, because we already know what comes below it.

Volcano was the first real system to do this; Cascades came afterwards. Microsoft SQL Server is based on Cascades because they hired Goetz Graefe, the guy who wrote the paper, to rebuild their query optimizer using Cascades, and that's why they famously use it. I list academic prototypes here, at least for Volcano. I think Calcite says they're using Volcano, but as far as I can tell they're actually doing Cascades: the top-down rules-engine approach with memoization. So here's what it looks like when we go top-down. Again, you start with what you want the result to be.
Say I want to join Artist, Appears, and Album, and I want the output ordered by the artist ID. Now I have a rules engine that fires on different patterns in my logical operators and converts them into other logical operators, or into physical operators that can execute. So I expand this down: here are all the joins I could possibly do (I'm truncating the figure). And I have rules that go logical-to-logical and logical-to-physical. A logical-to-logical rule would be: join A and B can be converted to join B and A. A logical-to-physical rule would be: join A and B can be implemented as a hash join of A and B. You only go from logical to physical; it doesn't make sense to go from physical back to logical, so you don't do that.

So now I start at the top and look at what physical operator I'd need at the next stage to get me to the logical operator below. I could transform, say, the Album-Appears join into a merge join, come down, see how I got there, traverse down further, do another transformation to convert this join into a physical operator, come down and get the cost of the scans below it (which I'm not showing), then do the same thing for the other side and get the cost of those scans. Now I can figure out the best path I've seen so far, and I do branch and bound: if I find a path that's more expensive than something I've already seen, I can throw it away. Or, if I embed the physical properties in my query plan and I see that I'm violating a required physical property as I go down, I can stop right there, because I don't need to look at anything below that point. So coming back to the top: if I know the data at this node needs to be ordered on artist ID, then for this hash join I don't even need to consider anything below it, because I know it's not going to produce data sorted the way I need. But I could put a sort on top of it for the ORDER BY clause, and that's fine, I'm allowed to traverse down there. And if I look at the cost of everything below me in the tree and that's already more expensive than anything I've seen before, I can just truncate the tree there and not go down any further.

So again, this is a quick overview of the difference. We'll go into more detail next class when we talk about Cascades: how we do these transformations, how we maintain the memo table, how we do the expansions. All right, a personal take: I like Cascades a lot. The Germans use the Starburst-style approach, and they've complained to me about it, because when we were building our own database we said we were using Cascades. And from a pure algorithmic standpoint, the search algorithms for the bottom-up approach, the stratified search, perform better than Cascades; we'll see that next class. But not every system does it that way, and from a software engineering perspective I like the idea of having all the rules defined in one location with one single engine taking care of everything. That's more of a personal preference than anything backed by science. All right, we'll cover all of this in more detail next class.
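To tie that walk-through together, here is a minimal sketch of top-down search with memoization and a simple branch-and-bound cutoff. Everything here is made up for illustration (the scan costs, the join cost formula), and it leaves out physical properties entirely; it only shows the memo, here an lru_cache keyed by the set of tables in a sub-goal, and the pruning of sub-goals that are already too expensive.

```python
# Toy sketch of top-down plan search with memoization and branch and bound.
# Costs are fake; physical properties and real operator choices are omitted.

from functools import lru_cache

SCAN_COST = {"Artist": 10.0, "Appears": 40.0, "Album": 20.0}

def join_cost(left_cost: float, right_cost: float) -> float:
    # Stand-in for a real cost model.
    return left_cost + right_cost + 0.1 * left_cost * right_cost

def proper_subsets(tables: frozenset):
    # Every non-empty proper subset, used to split a goal into two sub-goals.
    items = sorted(tables)
    for mask in range(1, 2 ** len(items) - 1):
        yield frozenset(t for i, t in enumerate(items) if mask & (1 << i))

@lru_cache(maxsize=None)        # the "memo table": best plan per group of tables
def best_plan(tables: frozenset) -> tuple:
    """Return (cost, plan description) for joining this set of tables."""
    if len(tables) == 1:
        (t,) = tables
        return SCAN_COST[t], f"Scan({t})"
    best = (float("inf"), "")
    for left in proper_subsets(tables):
        right = tables - left
        lcost, lplan = best_plan(left)
        if lcost >= best[0]:    # branch and bound: this split is already too expensive
            continue
        rcost, rplan = best_plan(right)
        total = join_cost(lcost, rcost)
        if total < best[0]:
            best = (total, f"Join({lplan}, {rplan})")
    return best

print(best_plan(frozenset(SCAN_COST)))
```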
But again, the big thing to understand is this top-down versus bottom-up distinction. We'll debate that more on Wednesday. All right, so the main takeaway from this lecture is that query optimization is hard. This is why all the NoSQL systems at first didn't do any query optimization. They said, oh, you don't need SQL, SQL is slow. Well, it's slow if you have a bad query optimizer. So they just didn't do query optimization, and eventually they all realized that was a mistake and had to go back and add it. MongoDB did something that seems kind of stupid but is actually kind of clever when you think about it. They didn't have a query optimizer. What they would do is generate the possible query plans, run them all, see which one comes back first, and that's the one they'd pick. Someone asked whether that's basically brute force, whether they really just run them all. Yes, they run them all and see which one is fastest. And then you basically say, okay, after some number of executions, like 20, I'll run them all again and see which comes back first. And as far as I know, last I checked, Mongo still does not have a cost-based query optimizer.

So we've had a lot of speakers come to CMU in the last three or four years, in person or over Zoom, to talk about their query optimizers, because it's the part of the system that I know the least about and find the most interesting. So today I actually put together a YouTube playlist of all the talks from previous years where companies discuss their query optimizers. The one I recommend more than anything else, and I was actually debating whether to assign you to watch this video instead of reading the paper on Cascades, is the one about Cascades from Microsoft. Yeah, watch this, don't read the paper. I'm serious. The paper I assigned isn't even the original Cascades paper. The original Cascades paper is terrible; it's from the '90s, so it spends time talking about how object-oriented it is and stuff like that, which doesn't matter. You cannot take the original Cascades paper and actually implement it. But you can watch this talk; this talk does a better job. The paper I had you read is a master's thesis, not even from Goetz, from somebody at Portland State, and I only assigned the roughly 30 pages of the thesis that describe what Cascades actually does. There's a bunch of stuff at the end about how to optimize it, and we don't care about that. Just watch the video, that's better. I'll update it on Piazza. But still write the review, write the review for the video. Yes, for the video.

Okay, so Nico worked on the SQL Server optimizer for a long time. For the longest time, researchers would say Cascades, or SQL Server, has the best optimizer, but nobody knew how to actually implement it. This talk describes it, and there's a lot of cool stuff in it. And the other guy in the talk, his son went to CMU, but his son didn't work on query optimizers. Okay, all right. So again, this was a crash course on how to build a query optimizer. We'll go into more detail about top-down versus bottom-up next class, and I'll try to talk a little bit about adaptive query optimization. I think that's super useful; we used to have a whole lecture about it, but we skipped it.
And then I'll also spend more time talking about what real systems actually do. You'll see what SQL Server does in that video, but I'll cover things like Databricks, Calcite, and others. Okay? All right guys, enjoy the weather, see ya. That's my favorite all-rounder.