Let's get started. I got cold. I was all ready to get my warm weather thing going on, and now it's cold again, so we have to wear a flannel. OK, so today we're going to talk about query planning and optimization. As I posted on Piazza this weekend, I started working on these slides last week, and then I realized there's no way to cover everything I would want to cover in a single lecture. So we're going to split it up between today and Wednesday. We could probably spend another week on it as well, because there's a whole bunch of other, more advanced things that we're just not going to get to. For today's class, we're going to focus on how you build an optimizer and perform the search, and then on Wednesday we'll talk about how, in your optimizer, you predict the cost of executing a query. So for today, we'll provide some basic background on query optimization and query planning in a relational database system, then talk about some basics we need to be mindful of, and then discuss different strategies for actually performing the optimization search. We'll say a little bit at the end about adaptive query processing, but we'll probably run out of time.

OK. So the basic premise of what we're going to do with query optimization is that applications provide us with a declarative query. For our purposes it's going to be SQL, but it doesn't have to be. We want to be able to look at our database and figure out the execution plan for that query that's going to have the lowest "cost." I'm putting cost in quotes because it really depends on what the expected execution environment of the database is and what kind of workload you're trying to support. The simplest thing could be that I want to minimize the number of tuples I have to access or process during my query. But you can imagine that if you're running on a cell phone, you might want the lowest-cost plan to be the one that uses the least amount of energy. We'll talk more about cost estimation and the different factors you can care about in the next class. The main thing to be mindful of is that this idea of a cost is completely internal to the database system; it's not really useful to the outside world. What I mean by that is that the cost is not in terms of wall clock time, like "this query is going to take 10 milliseconds" or "100 milliseconds." It's going to be some internal number that allows the database system to compare the cost of one plan versus another and decide whether one is better than the other. It doesn't matter whether that number corresponds to any real metric in the outside world.

I would say that query optimization is probably the hardest thing in a database system to actually implement. Let me clarify: the hardest thing to actually implement well. You can come up with simple rules, as we'll see in a second, to approximate what a good query plan is. But if you want to support more complex things, especially analytical workloads, it's very, very hard to do, and it's been proven to be an NP-complete problem. Part of this is that you've got to think about what's going on: when the application submits a query, you need to do the planning right then and there, for the most part.
So that means if your query is only going to run for one second but your planning takes five seconds, that's not a good trade-off. But if we can spend a couple of milliseconds to figure out a good plan for a query that takes two seconds to run, we're OK with that. So we're not always going to find the optimal plan; we want to find one that's good enough. The way we're going to avoid doing an exhaustive search, which would be exponential, is to use estimation techniques to guess what the real cost of a plan would be. That means we're going to look at the cardinalities and selectivities of our queries and our database and make some approximation of what we think it will take to execute the query. Obviously we have to do this, because if we had to actually run the real query to find out how long it took, and we were going to look at thousands or millions of plans for one query, that would take forever. The other thing we're going to do is use heuristics to limit the search space. We'll talk a little bit about this as we go along, but the basic idea is that, yes, you could exhaustively look at all possible ways of executing the query plan, but instead we're going to use heuristics to prune things very easily. The easiest one to think about is that you almost never want to execute a Cartesian product join, so you just ignore all of those.

So the architecture that you learn when you take the intro database class looks something like this, and for the most part this is what we're going to follow. Your application sends in the query. The first thing it hits is the parser, which converts the string syntax of the SQL statement into an abstract syntax tree. That tree gets passed into what I'll call the planner, which transforms the syntax tree into a relational algebra query plan. That logical query plan can then be fed into an optimizer, which chugs over it and eventually produces a physical query plan that is passed to the executor. And it's here in the optimizer that the system interacts with what I call the cost model, to predict, for a particular physical query plan, how many resources it needs or what the cost of actually executing it is, and then to decide whether one plan is better than another.

For today's lecture and on Wednesday, we're really just going to be focusing on the planner and optimizer boxes. There's nothing really interesting to say about the parser. That's just taking something like Yacc or Bison, converting SQL into tokens, and converting the tokens into a tree. There's no way to really speed things up there. If you're going to write a database system from scratch, you almost never want to have to write this yourself; you would try to find one you can reuse. Like in Peloton, we're using Postgres' parser, and when we did H-Store, we used HSQLDB's. This part sucks, and it's always better to find an existing implementation. I will say, though, there are way more Java ones than there are C++ ones, so if you're writing a Java database, there's a ton of options out there. So we're really going to be focusing on these two boxes, and we'll be hand-wavy about how we do cost estimation for today; we'll pick up on that on Wednesday. A rough sketch of the whole flow is below.
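Just to make that flow concrete, here's a minimal, hypothetical sketch of the pipeline in Python. None of these names come from a real system, and each stage is a crude stand-in:

```python
# Hypothetical sketch: parser -> planner -> optimizer (with cost model) -> plan.

class CostModel:
    def estimate(self, physical_plan: list) -> float:
        # Stand-in: pretend the cost is just the number of operators.
        return float(len(physical_plan))

def parse(sql: str) -> list:
    # Parser: SQL string -> "syntax tree" (here, crudely, just tokens).
    return sql.replace(",", " ").upper().split()

def plan(tree: list) -> list:
    # Planner: syntax tree -> logical plan (a list of logical operators).
    return ["LogicalScan", "LogicalJoin"] if "JOIN" in tree else ["LogicalScan"]

def optimize(logical: list, cost_model: CostModel) -> list:
    # Optimizer: enumerate candidate physical plans, keep the cheapest one.
    candidates = [
        ["SeqScan" if op == "LogicalScan" else "HashJoin" for op in logical],
        ["IndexScan" if op == "LogicalScan" else "SortMergeJoin" for op in logical],
    ]
    return min(candidates, key=cost_model.estimate)

print(optimize(plan(parse("SELECT * FROM a JOIN b")), CostModel()))
```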
Yes, question? [Student asks why the diagram has an arrow going from the optimizer back to the planner.] Yeah, so his question is, why can the optimizer go back to the planner? It's sort of an iterative process. You could say, all right, I had a nested query and I converted it into two sub-queries, and maybe I want to go back and do additional planning. These are more like transformation rules, saying I can go from one logical plan to another logical plan, and you can do this multiple times if you want. I should have been clearer, sorry about that. But actually, for today, just assume it's a straight pipeline and we'll focus on that.

All right, so what's going to happen is that, again, we have the logical plan that we've generated from the parsed syntax tree, and we want to generate the physical plan. The physical plan is what is actually executed by the database system. The physical plan specifies, if you want to access this table, which access method you use, either a sequential scan or an index lookup. If it's an index lookup, you specify which index it's going to be, how you're going to scan it, and what you expect the output format or the physical format of the data to look like: whether it's compressed, whether it's sorted, all the low-level details of actually running the query itself. Whereas the logical plan says things like, I want to read this table. It doesn't specify how to read it; it just says you want to read it. Or, I want to do a join. It doesn't say how you should join the two tables; it just says that you want to do it. The physical operator would say, you want to use a hash join with this hash function and other such information.

The key thing to remember here is that there's not always a one-to-one mapping between logical plan operators and physical plan operators. If I want to join tables A and B, I can join them with a nested loop join, a hash join, a sort-merge join, and so on. So we'll have transformation rules that take the logical plan and generate multiple physical operators from each logical operator, and then we run our optimization search algorithm to find the plan with the lowest cost.

Yes? [Student asks where something like pushing selections to the bottom of the plan fits: is that the planner or the optimizer?] So his question is, what's the distinction between the planner and the optimizer? It doesn't really matter, I would say, where in the processing pipeline it happens. The planner could basically say, I have an abstract syntax tree, I want to make a relational algebra expression, and then I convert that relational algebra expression into a logical plan. When you generate the logical plan, you could, in the way it's implemented, say, well, I know that my selections or filters should always be pushed to the bottom, so I won't even bother making a bad logical plan; I'll just make a good one right then and there. Or you can generate the dumbest possible logical plan as a straight translation of the relational algebra expression and pass that to the optimizer, and your transformation rules do what you just talked about. Predicate pushdowns and other things could be done in either place; it doesn't really matter. So again, the main thing here is that the logical plan specifies what we want to happen based on the SQL statement we got. Here's a little sketch of that one-to-many mapping from logical operators to physical alternatives.
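This is a toy illustration with made-up operator names, not any particular system's rule set:

```python
# Hypothetical sketch: one logical operator maps to several physical
# alternatives, so transformation rules enumerate them all for the search.

LOGICAL_TO_PHYSICAL = {
    "LogicalJoin": ["NestedLoopJoin", "HashJoin", "SortMergeJoin"],
    "LogicalScan": ["SeqScan", "IndexScan"],
}

def physical_alternatives(logical_op: str) -> list:
    # The optimizer's search later picks the lowest-cost alternative.
    return LOGICAL_TO_PHYSICAL.get(logical_op, [logical_op])

print(physical_alternatives("LogicalJoin"))
# ['NestedLoopJoin', 'HashJoin', 'SortMergeJoin']
```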
One of the things related to what he was asking about is that when we do the translation from the logical plan to the physical plan, or from one logical plan to another logical plan, we're going to rely on the relational algebra equivalence rules that you learned in the intro class to make this all work. Basically, a relational algebra expression can be converted to another expression that is functionally equivalent. It may have different trade-offs when we actually run it, different performance characteristics, but the two end up producing the same answer. This is going to be really important when we do these transformation rules, because we want to be able to shuffle things around, move operators in different ways around the tree, and still make sure we produce the correct answer. It doesn't help us to find a really good query plan that ends up producing the wrong result. We're not going to discuss how you actually implement these rules or what all of them are; those are the sorts of things covered in the intro class. But the commutativity and associativity rules you already understand are the same kinds of things we're going to apply when we do transformations here.

OK. One quick observation I want to make is that for this lecture and on Wednesday, we're really only going to focus on OLAP queries. The reason is that query planning in an OLTP database system, if you're only supporting fast transactional workloads, is really easy, because most of the queries are going to be what is called sargable. This is a term that comes up often in the database optimization literature, and it basically means "search argument able." The idea is that the arguments being used in the query's WHERE clause, in the predicate, can be mapped directly as input to a lookup on an index, and therefore you don't need to do any complicated query planning. For example, if I have a query, SELECT * FROM table WHERE id equals some value, and id is the primary key and therefore unique, and you have an equality predicate, you know it's a one-to-one matching between the input and the attribute. That's considered sargable, because you know exactly what index you're going to use, exactly how you're going to do the lookup, and you're going to get exactly one tuple back. So when a query comes in that looks like this, you don't need any fancy search algorithm, none of the things we're talking about here. You can say: you want to access this table, and you're doing a lookup on exactly its primary key, so I know exactly what index you should use. This is super easy to do, and it's essentially what we did in H-Store. VoltDB has a little more; it can support the System R stuff that we'll talk about in a sec. But for 99% of the queries we dealt with in H-Store, this approach would work. And when you do have joins in an OLTP system, they're almost always going to be on foreign key relationships, so they're going to be sargable as well. Below is a tiny sketch of that sargable fast path.
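This is a hedged sketch with a made-up catalog of unique indexes; a real system would consult its actual catalog:

```python
# Hypothetical sketch of the "sargable" fast path: an equality predicate on a
# unique, indexed column maps directly to an index lookup, no search needed.

UNIQUE_INDEXES = {("users", "id")}   # assumption: the catalog tells us this

def choose_access_path(table: str, column: str, op: str) -> str:
    if op == "=" and (table, column) in UNIQUE_INDEXES:
        # Sargable: feed the predicate's argument straight into the index.
        return f"IndexLookup({table}.{column})"
    return f"SeqScan({table})"       # otherwise fall back to the optimizer

print(choose_access_path("users", "id", "="))   # IndexLookup(users.id)
```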
So for example, let's say I want to get all of the order items for things I bought from Amazon. I would do a lookup on my user account in that table, and then do a join with the order items table, which has a foreign key relationship with its parent table. So I'm, again, doing another index lookup to get the few tuples that I need. I don't need to do any complex query planning there; I can do an index nested loop join to get exactly what I want. All of this can be done with simple heuristics. It's really easy to do. That's why we're only going to focus on the OLAP stuff for today.

So I said a little bit about cost estimation in the beginning, and this is pretty much all I'll say about it for today. The basic idea is that, for a given physical query plan, we want to estimate how many resources it needs and what the cost of executing it at runtime will be. There are a bunch of different things we can consider to make the best choice. Obvious ones: how much data is it going to produce or read? How much CPU or I/O do we expect it to consume? There are other aspects, like the properties of the underlying data it's accessing and how those relate to what the output should be. If a table is already clustered on a particular key and my query wants to do an ORDER BY on that key, that would be a pretty fast query, because I don't need to do any extra sorting; my data is already sorted. One aspect that is very complicated, and that we may or may not discuss on Wednesday, is how a query will interact with other queries running at the same time. That is very hard to do. For example, if I have a really long-running query that's taking a minute, and a short query shows up, there may be a plan for that second query that would be fastest if it were running in isolation, without anybody else; but since my long-running query is still going on, maybe I want to choose another plan that interacts with that other query in a better way. These are all things we want to consider, but for now we can just assume something simple, like the fewest number of tuples we have to access, and that'll be enough to guide us through everything we're talking about.

OK, so before we jump into the actual search algorithms themselves, let's talk about some basic design or implementation choices we have when we create a query optimizer. We're going to talk about the granularity of the optimization, then when we actually perform the optimization, the timing of it, and then how to achieve what is called plan stability; I'll explain what that is in a second.

All right, so the first choice we have to deal with is what the scope of our optimizer should be: how many queries should we try to optimize at a time? The simplest way, and probably what everyone is familiar with, is to do it a single query at a time. An application submits a query request, it comes in, we ignore whatever else is going on at the same time, and our optimizer just spends its time trying to come up with the best plan for that one query. This is nice because the search space is much smaller than having to deal with multiple queries at the same time, but the disadvantage is that we're not going to be able to reuse results or piggyback our query on other queries that may be invoked at around the same time.
You would also have to account for what else is going on in the system if you want to play nicely in a multi-tenant environment with other applications or other queries running at the same time, and you'd have to bake that into the cost model rather than having it directly integrated into the actual search or optimization algorithm itself. The alternative is to optimize multiple queries at the same time. Say that within a five-millisecond batch window I get a bunch of queries; my query optimizer can plan all of them and take all of them into consideration at once. The advantage is that if there are queries in that batch doing exactly the same thing, maybe with a minor difference in some input parameter, then rather than having the database system execute the first query, get a result, then execute the second query and get the exact same result, I can piggyback them on each other, reuse the same query plan, and know how to combine their results or split them apart to produce the final answers at the end.

You'll see this more when we talk about scan sharing. This is actually something the commercial systems can do, but I don't think any of the open source guys can. The basic idea is that you have a query doing a long-running scan, and then the next query comes along and wants to do the same kind of scan. Rather than starting from the very beginning like the first guy did, it just jumps on the train, if you will, where the other guy is already running, and keeps scanning with it. It knows there's a part at the top that it missed, so it has to wrap back around and pick that up. Yes? [Student asks whether this matters most for disk I/O.] Exactly, yes. It makes a big difference in a disk-based system, because the major cost is fetching the pages you need from disk. Rather than having contention, where this guy needs this page and that guy needs that page, I'll just combine them. Maybe my query plan can apply the same predicates on the data as it comes in and reuse that work, or maybe we're doing completely separate things and I can just reuse the fact that we read the same page. So there are different granularities at which you can apply this multiple-query optimization technique. And like I said, you mostly see this in the commercial systems, certainly because building an optimizer that can support it is quite challenging. We'll see this later on; there'll be a lecture on scan sharing. There are also some specialized academic database systems that can support things like: give me all the queries you plan on executing, and I'll generate one giant query plan with different entry points to get to the data each one needs, rather than producing everything in a giant heap and picking apart what you actually need later on. Here's a toy sketch of the scan-sharing idea.
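This is a toy model, assuming a circular scan over a list of pages; everything here is made up for illustration:

```python
# Hypothetical sketch of scan sharing: a late-arriving query attaches to an
# in-progress scan, then wraps around to pick up the pages it missed.

def shared_scan(pages: list, start: int):
    n = len(pages)
    for i in range(n):
        yield pages[(start + i) % n]   # wrap past the end back to page 0

pages = ["p0", "p1", "p2", "p3"]
# Query 2 arrives while query 1 is already at page 2; it starts there and
# circles back for p0 and p1 instead of forcing a second scan from the start.
print(list(shared_scan(pages, start=2)))   # ['p2', 'p3', 'p0', 'p1']
```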
OK, the next issue we have to deal with is when we actually perform the optimization. The one you're probably most familiar with, the first thing you think of for query planning, is called static optimization; this is what we teach in the intro course. The basic idea is that when the query shows up, it goes into the parser, then the planner, then the optimizer, and you generate the query plan for that query and never revisit that decision for the lifetime of the query. You select the best plan you can find at the very beginning, that's what you go with, and you execute it to the end. The quality of this plan is going to be highly dependent on the accuracy of your cost model, because if at runtime you recognize you made a wrong choice, say the data you thought was laid out a certain way is actually laid out another way, with static optimization you have no way to react and change. So it's very important to have good accuracy, and we'll see in Wednesday's paper that cost models usually don't.

The other thing you can do is amortize the cost of static optimization over multiple queries if you support prepared statements. When you create a prepared statement, the system does the query planning, stores the plan in a cache, and gives you a handle, like a name for the prepared statement. Then the application can go back and re-invoke that same query using the plan that was compiled the first time, substituting in the parameter values used in the WHERE clauses so that you get the actual data you're looking for. Prepared statements have other disadvantages, though: since you don't know what values will be used in the WHERE clause, you don't know the selectivity of the predicates, so the system has to guess how good things are going to be. So prepared statements are nice because you avoid the overhead of parsing, planning, and optimizing every time, but for a very complex query the cached plan may not be the optimal plan, and what was the optimal plan yesterday may not be the optimal plan tomorrow.

Dynamic optimization, or adaptive query processing, is the idea that we select the physical operators we want to use on the fly, as we execute the query. As we process the data, if the assumptions we made at the beginning turn out to be wrong, we can switch to a different plan as needed. The basic constraint is that you want the adjustment cost to be less than the cost of just finishing the execution with the plan you already have. If I have 50 milliseconds to go in my query, and it's going to take me 100 milliseconds to readjust my plan, then even if that readjustment gets the remaining execution time down to 10 milliseconds, I'm better off just letting it go. We'll talk a little bit about this at the end, but most database systems don't actually implement it, because it's pretty difficult to do and also very difficult to debug. Where it's more useful is in what are called continuous queries, in stream processing systems, where you specify the query you want once and it fires whenever something satisfies it. That's a long-running query that's always executing, so you may want to do this adjustment, because however you planned things when you deployed the system may not be right later on.

And then the last approach is a hybrid combination of the two. You initially start off with a statically optimized plan, and then you check over time whether your expectations or assumptions are way off; if so, you just replan. And again, you want to make sure that the cost of redoing this is less than the cost of just letting the query finish. Here's a little sketch of that kind of replanning check.
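This is a minimal sketch with a made-up 10x error threshold; real systems would have their own policies:

```python
# Hypothetical sketch of the hybrid approach: keep the static plan unless the
# runtime observations are way off AND replanning is expected to pay for itself.

def should_replan(est_rows: int, seen_rows: int,
                  replan_cost: float, remaining_cost: float) -> bool:
    off_by = seen_rows / max(est_rows, 1)
    wildly_wrong = off_by > 10 or off_by < 0.1   # assumption: 10x threshold
    # Only replan if fixing the plan costs less than just letting it finish.
    return wildly_wrong and replan_cost < remaining_cost

print(should_replan(est_rows=100, seen_rows=5000,
                    replan_cost=20.0, remaining_cost=80.0))   # True
```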
OK, so the last issue we've got to deal with is what's called plan stability, and this is really important in enterprise settings. Plan stability basically means that I want my optimizer, my database system, to use the same query plan from one day to the next and get the same performance properties. You mostly see this when you upgrade between versions of a database system: the optimizer might be improved for some queries but actually made worse for others, and so you need a way to tell the database management system, hey, here's the query plan I really want you to use; make sure you pick this one and not some other one, so that my queries don't all of a sudden get super, super slow. It's mostly the commercial database systems that have to care about this, and they all do different things to enforce it.

The easiest way is to provide hints along with your query, to tell the optimizer certain things like: for this join, use this algorithm; for this table, use this index. In something like SQL Server you can specify this as an XML file that gets loaded into the database system when you create prepared statements. The idea is that we're specifying to the database system exactly how it should plan the query, or parts of it; it can still run its search algorithm for the rest, but for the most important parts, we as the humans are telling it what we want. The next choice is to tell the database system: here's my query set, and I want you to use the same optimization algorithm that you used in an earlier version of the database management system. Oracle does something like this. If you have Oracle 10 and you buy Oracle 12, the Oracle 12 binaries also include the code for the optimizer as it existed in the earlier versions, so you can tell the database system, here are my queries; make sure you use the version 10 optimizer, not version 12.

Yes? [Student asks whether there are systems where you completely specify the plan yourself.] Well, the NoSQL guys, right? The NoSQL guys don't have a declarative language; you have to do everything yourself. In a relational system, absolutely. You can create a prepared statement, get back a query plan, usually as an XML file or some other format, and they have tools that let you tweak it. I suppose if you knew the format of the XML file you could construct one by hand, but that would be kind of a waste of time. Usually what happens is that the DBA will create a bunch of prepared statements, get back all of the XML files for the plans, and tweak any that need it. Or if someone says, hey, this query I was running last week is now super, super slow, the DBA will dump the plan out and maybe tweak it by hand.

[Student asks whether, like the DSLs in high-performance computing, there could be a language where you explicitly specify exactly how the system should execute.] Yeah, sure. In high-performance computing there are DSLs, if you will, where you can specify exactly what the system should do. Would that be better here? I would say no, and the reason is that the database system is almost always going to know way more about what your data looks like than the user does. Now, that depends on the database system having really accurate statistics. But the average person is probably not going to be able to generate a plan as good as what the database system can come up with.
You can pay a lot of money for a DBA to do the hint stuff and tweak things manually, but in practice that's expensive. Yes? [Student asks whether you can query arbitrary statistics from the database.] Yes. Database systems have a catalog where they maintain all the statistics: the number of tables, the number of rows per table, and you can potentially get access to the cardinalities of different attributes or columns in a table. All of that should be exposed to you in what's called the information schema. That's an ANSI standard format for how you provide metadata about your database to the outside world. Now, all the different database systems have specific extensions to it that contain things like catalog statistics, and the statistics from one system to the next can be dramatically different. In the case of MySQL, for example, they keep statistics at a coarse-grained level, basically the database instance: it'll say, here's the number of pages I read. It doesn't tell you what table they were from or what database had the most reads; it's just the high-level stuff. Postgres gets more fine-grained: you can get statistics on an individual table basis within a database. The commercial guys have really low-level, fine-grained statistics. In DB2 you can even ask, how many times did I read into this one buffer pool instance? So they all maintain this kind of stuff. But again, it's expensive to compute at runtime, which is why they usually use approximations. You can also have cron jobs run at like 3 a.m. that tell the system to do a complete scan and update everything. We'll see more of this on Wednesday, but coming up with selectivities for complex predicates is damn near impossible, and they almost always get it wrong. So yes, the answer to the question is, this stuff is all exposed to you.

OK, so again, for this choice we're upgrading our database system from one version to the next, but we want to make sure it doesn't foul up any of our query plans. So we can tell the database system to use the optimizer as it existed in an earlier version. Now, there may be operators added in the new version that are much faster than the earlier ones, and the old query optimizer won't be able to select them, so you sacrifice that if you stay back. But again, plan stability is really important in a lot of cases. The last choice, which I think is what SQL Server does, is to allow backwards-compatible plans. You export the query plan that was generated on the older version, upgrade your database system to the new version, and then load the plan back in. The database system knows the state of the query plan as it existed back on the old version, and it can re-execute it and treat it as if it were still running on the older version. So again, plan stability is all about making sure that the query plan we generate today is roughly the same as the one we generated yesterday, with the same performance characteristics; we don't want wild variance in how it behaves.

OK, so now we can actually talk about how we implement a query optimizer. There are five basic categories of techniques that I'm going to go through. We'll start off with heuristics, which is the simplest one; if you're building a database from scratch, this is probably what you implement first.
Then we'll talk about IBM's System R search algorithm, some randomized search algorithms, and then what I think are the better ways to do all of this: stratified search and unified search. The main takeaway is that last one; the unified search of the Volcano/Cascades model is, in my opinion, the state of the art and the correct way to do this.

Now, I said at the beginning of this course that this class was not going to be about reading all the old papers on the traditional implementations of database systems. But in full disclosure, a lot of the work we're going to read about here was done before you were born. The heuristics stuff is from the 70s, so I wasn't even born yet. The randomized stuff is from the late 80s; I was around for that. Stratified search is from the late 80s, and unified search is from the mid-90s. The main takeaway is that the hardware has certainly gotten better, and some of the techniques for implementing these things have gotten better, but the basic ideas are still sound; the basic idea, I think, is the correct way to go. And there are two different dimensions we're going to consider as we go along, for all of these. One is how good the plans are that the algorithms generate, but the other is how easy it is to actually implement and maintain, the software engineering aspect of it. What we'll see is that the earlier ones are a huge pain in the ass if you want to start doing really complex things, whereas unified search, you have to implement it once, but in my opinion it's easier to maintain and easier to extend to support new operators, new query types, and things like that. That's just something to be mindful of as we go along.

OK, so let's start at the beginning. The easiest way to do query optimization is to use what are called heuristics. The basic idea is that we define static rules that transform logical operators into physical operators. He brought up the example earlier of predicate pushdown: I want to make sure I'm always doing selections at the bottom of the query plan. That's an example of a heuristic. And as I said to him, you could do this in the planner as you transform the syntax tree into a logical plan, or you could do it in the optimizer, but in the end it's still a rule, a heuristic, that you're applying to do that transformation. So here are the basic things you'd always want to do. You obviously want to perform the most restrictive selection as early as possible, because you're reducing the amount of data you actually have to look at. You always want to do selections before joins. You want to push down the predicates, the limits, the projections, and other things. And then you do basic join ordering based just on the raw cardinalities of the tables, deciding which table should be the outer relation and which the inner relation based on their sizes.

The two systems that implemented this in the early days, in the 1970s, were Ingres out of Berkeley, by Mike Stonebraker, and then Oracle as well. What's really surprising about Oracle is that this is actually the approach they used up until like the mid-90s. And you'd think, that's kind of crazy; how did they get so far and get so big using, again, the easiest approach?
Well, the complexity of the queries people were running back then, and the scale of the data they were dealing with, were certainly not at the level we deal with today. And they basically had a lot of people banging away, making the heuristic-based approach work well. They eventually switched to another approach, which we'll talk about later on.

So what I want to do is quickly show you an example of the kind of heuristic-based query optimization you can do, in the context of Ingres. Everyone should already know about predicate pushdown; that would have been covered in the intro class. But I want to show an example of how Ingres did joins, because it's also related to how you do adaptive query processing later on. For this, we'll have sort of a Spotify clone, some kind of music app, where we have albums and artists, and artists can appear on albums. The query we want to run is a three-way join between artist, appears, and album, and we want to get back the names of all the people that appeared on Joy's mixtape when he dropped it.

Now, Ingres is not going to do join ordering as we know it, with join operators having data fed into them from sequential scans. What they want to do is decompose the query into single-variable operations, each on a single table at a time. What I mean by that is, say here in our WHERE clause we have three different predicates with three different variables we're dealing with. In the first phase, we extract out the lookup on the album table so that we have a single-variable predicate: the album name equals Joy's mixtape. Then we rewrite the second query to use the temporary table emitted by that first query as part of the join. But we want to decompose this even further, because we want to get everything down to single-variable WHERE clauses. So that query gets split up into two separate queries, where one does a lookup on appears with the album id equal to what was produced by the first query, and the other uses the artist id that the appears query produces. Now the heuristics in Ingres will generate the physical operators that execute these one by one. I'm not showing you a relational algebra plan here; I'm showing it in terms of SQL statements, because I think it's a little easier to follow. So we execute each of these queries in a specific order and use the values generated as output from one query as the input to the next query. The first guy executes and we get back 9999. We then feed that as the input to the next query; so now we're not doing that join on a temporary table, we're actually just substituting the value in there. You can think of the physical plan as pushing the value up into the selection for that query. That query produces two different tuples, so to substitute them into the third query, we split it up into two separate queries again, execute those, and produce our final answer. So this is what Ingres did in the 1970s. Here's a little sketch of that decomposition idea.
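This is a rough sketch in Python, with made-up rows and names; the real Ingres obviously operated on stored relations, not Python lists:

```python
# Hypothetical sketch of Ingres-style decomposition: break a multi-variable
# query into single-variable queries, substituting one query's output values
# into the next query.

ALBUM   = [{"id": 9999, "name": "Joy's Mixtape"}]
APPEARS = [{"artist_id": 1, "album_id": 9999},
           {"artist_id": 2, "album_id": 9999}]
ARTIST  = [{"id": 1, "name": "Artist One"},     # made-up artist rows
           {"id": 2, "name": "Artist Two"}]

# Q1: single-variable lookup on ALBUM.
album_ids = [r["id"] for r in ALBUM if r["name"] == "Joy's Mixtape"]

# Q2: substitute the album id(s), another single-variable query.
artist_ids = [r["artist_id"] for r in APPEARS if r["album_id"] in album_ids]

# Q3: one query per substituted artist id, executed one by one.
names = [r["name"] for a_id in artist_ids for r in ARTIST if r["id"] == a_id]
print(names)   # ['Artist One', 'Artist Two']
```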
When you read their paper on this, it seems kind of crazy, especially given what we learned in the intro class: it looks very wasteful, because you're executing all these separate queries rather than just doing the join directly. But you have to understand that back in the 70s, databases were pretty small. In the paper, for this particular algorithm, they're talking about tables with a hundred tuples, three hundred tuples. They're not very big. And they were running on very limited hardware. So using simple heuristics for simple queries like this was good enough.

All right, so the advantages and disadvantages of using heuristics. The obvious advantage is that it's super easy to implement, and you can get up and running very quickly; it doesn't take a lot of work. And for really simple queries like the one we just showed, it works well enough. Obviously, now we have to support common table expressions, nested queries, UDFs, all these other things that make heuristics difficult, but back then that was OK. The problem is that it's hard to generate good plans when you have complex queries with interdependencies. And a lot of the decisions you make about what order to execute certain joins in are going to rely on magic constants. You don't have a cost model, you don't have an estimator, so you have to make an approximation based on what you think the table looks like, and it's not grounded in how the hardware actually performs in any way. So this has problems; again, for OLTP this is good enough, but for modern OLAP it will never work.

So Oracle and Ingres were around at the same time, and they both came up with heuristic-based optimizers. But the third database system coming out of the 1970s, System R, came up with a different approach. This was developed by Pat Selinger, who actually left IBM recently to do a startup. She was one of the people with PhDs at IBM Research who were all working on System R at the same time, and she picked query optimization. If you take an introduction to databases class and we teach you query planning with dynamic programming, it's the Selinger, System R approach that we're teaching you.

The basic idea is that there are two phases. In the first phase, we use the basic heuristics we talked about before to do the predicate pushdown and other things. Then we use a dynamic programming approach to search for the correct join order of our tables. For this, we use a bottom-up planning approach, a divide-and-conquer search. In AI terms, if you're familiar with them, this is called forward chaining, because we're not starting at the goal; we're starting from the ground, and we want to build our query plan up from the bottom to reach our final goal, which is the optimal query plan. So again, this is what System R did in the very beginning, but System R was never released. When IBM came out with DB2 in the 1980s, they used this. And I'll say, too, that most of the open source database systems that have a query optimizer do something similar to this; it's essentially what MySQL does. Postgres does something completely different, as we'll see in a second.
All right, so let's talk in a little more detail about what System R is doing. The first thing we do is split the query up into blocks and then generate the logical operators for those blocks. You can think of a block as one whole SELECT statement; if you have a nested query, that gets treated as a second block, and things like IN clauses and EXISTS clauses get split out in the same way. Then, for each logical operator we have in our block, we generate the set of all the physical operators that could implement it. Remember, a logical operator would be something like a join of two tables, and the physical operators would be all the algorithms you could use to implement that join. Then we construct, from the bottom up, a left-deep tree that has the minimal cost of computing that query. A left-deep tree is one where joins only go up the left side of the query plan tree, as opposed to a bushy tree, where you could have two joins computed separately here and there and then joined together. The reason they only consider left-deep joins is, again, to cut down the search space of the search algorithm; otherwise the search blows up and takes a long time.

In practice, though, the left-deep tree isn't always going to be the optimal tree, and certainly in an in-memory database system I don't think it always is. In the IBM System R days, they were worried about having to write the intermediate result of a join out to disk. If you only have left-deep trees, you can pipeline the output of one join as the input to the next join, whereas with a bushy tree you may not be able to do that. But in an in-memory database, where we assume everything's in memory, this is not really a restriction. And I think the Volcano/Cascades stuff we'll talk about in a second supports bushy trees without any problems.

All right, so let's go back to our lookup to find all the people on Joy's mixtape. This time we're going to change it a little bit by having the output be sorted by the artist id. The reason we want to do this, as we'll see in a second, is that the System R optimizer is not able to take the physical properties of the data into consideration in its search. What I mean by that is, it has no way to know that, oh, I have an ORDER BY in my query, therefore I may want to use a sort-merge join, because then the output of the merge will already be sorted and I don't need an additional sort operator. They couldn't support that in this approach. The way you have to handle it is to bake the idea that my data needs to be sorted into the actual cost model itself, so that the selection of operators takes into consideration that, oh, my output is already sorted, that's good, I'll give this a lower cost.

All right, so the first thing we do is choose the best access path for each table, and again, this is just done with heuristics. So we do sequential scans for these two tables and an index lookup for the album table. And then we're going to enumerate all the possible join orderings for our tables.
That would just be a giant table that looks like this; note that we have natural joins as well as Cartesian products in the table, because we could in principle do those, but in practice they get pruned right away, because you almost never want to do that. Then the last piece is the dynamic programming, divide-and-conquer search to find the path that gets us to the best query plan. It looks something like this. For simplicity, I'm only showing two of the join orderings here, and you can imagine this thing extending all the way down for all the other ones listed in our table. We start on this side, where we don't have anything joined yet; we just have the tables listed. Then we have paths that go from this node to other nodes where we have actually joined tables, and the edges are annotated with the algorithm that was used, in this case sort-merge join or hash join. These edges are also annotated with the cost of executing that particular physical operator to get to that logical operator. For each of these nodes, we pick the incoming edge with the lowest cost; then for each of those, we do the same thing again, computing the cost of going to the next logical operator along these edges with the different join algorithms, and picking which is the best way of going from one node to the next. Then we go in reverse order and follow the path that gives us the lowest overall cost, which would be down here at the bottom. And again, here we know the output needs to be sorted on artist id, and the sort-merge join does that for us, so in our cost model we take that into consideration and give it a lower cost than the hash join. But notice: none of the physical properties of the data are actually embedded in the search algorithm itself, and this search can only do join orderings; it can't do the other things we may want to do.

So the advantage is that in practice, this usually finds a reasonable plan, without having to do an exhaustive search over all possible join orders. The downside is that we have the same problem as at the very beginning: before we even get to that search graph I showed, we have to apply all the same heuristics as in the Ingres case to come up with a reasonable logical plan for all the other aspects of the query, the predicate pushdowns, limits, and other things. As we said before, left-deep join trees may not always be optimal; you could bring bushy trees back, but that explodes the search space even further. And again, nothing in this approach takes the physical properties of the data into consideration. Below is a small sketch of this bottom-up dynamic programming idea.
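This is a sketch of Selinger-style bottom-up join ordering, with made-up cardinalities and a fake selectivity estimate standing in for the cost model:

```python
# Hypothetical sketch: dynamic programming over subsets of tables, building
# the cheapest left-deep join plan bottom-up, Selinger / System R style.
from itertools import combinations

CARD = {"artist": 10_000, "appears": 50_000, "album": 5_000}  # made up

def join_rows(rows_left: int, table: str) -> int:
    # Fake selectivity estimate for the join's output cardinality.
    return max(rows_left * CARD[table] // 100_000, 1)

# best[subset] = (cost, rows, plan): cheapest left-deep plan for that subset,
# where cost is approximated by the total intermediate rows produced.
best = {frozenset([t]): (0.0, CARD[t], t) for t in CARD}
tables = list(CARD)
for size in range(2, len(tables) + 1):
    for combo in combinations(tables, size):
        s = frozenset(combo)
        for t in combo:                 # try t as the last-joined (inner) table
            cost, rows, plan = best[s - {t}]
            out_rows = join_rows(rows, t)
            cand = (cost + out_rows, out_rows, f"({plan} JOIN {t})")
            if s not in best or cand[0] < best[s][0]:
                best[s] = cand

print(best[frozenset(tables)])          # cheapest full left-deep plan
```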
All right, so next we have a class of search techniques that use randomized algorithms. The basic idea here is that rather than having a guided search that goes from one query plan to the next, we just generate a decent query plan at the very beginning using heuristics, and then we do a random walk over the search space of all possible query plans until we find one that looks good enough, or until we run out of time. The only database system I actually know of that implements this is Postgres, because Postgres has a genetic algorithm, which we'll talk about in a second, that kicks in when there are too many joins in your query.

In the late 1980s there was a proposal to use simulated annealing to do query optimization. The basic idea is that, again, you have a giant graph of all these different query plans. You jump to some location, compute the cost of that query plan, and then you come up with random permutations of that query plan, which correspond to the other nodes you're connected to in the graph. You try a random permutation and check whether the cost of that new query plan is less than the one you're at now. If it is, you definitely move to it. If not, you flip a coin and decide whether you're willing to accept a query plan with a higher cost in exchange for moving to another part of the space. The idea is that by sometimes taking a query plan that might actually be bad, you avoid getting caught in local minima. You always keep track of the best query plan you've ever seen, so that when you run out of time at the end, or hit some threshold, say after 10 milliseconds without seeing a better query plan I just give up, you always know what the best one is and you can choose that. So this was proposed in the late 1980s, again out of the University of Wisconsin, because that was a big database powerhouse back then. But I don't think anybody actually implements this in practice. Yes?

[Student observes that everything so far relies on the cost model being grounded in reality, and asks whether anyone just runs the plans and measures them.] So his statement is, everything we've said so far, including all the ones we'll keep talking about, relies on a cost model to decide whether a query plan is a good idea or not; and his question is, does anybody actually try running the plans and use that as the metric? Yes, it turns out that's exactly what MongoDB does. MongoDB doesn't have a query planner in this sense, doesn't have an optimizer, doesn't have a cost model. What they do is come up with all the possible query plans, fire them all off, see which one comes back first, then keep track of that and use it the next time they see the same query. They can get away with that because they're not doing really complex analytical queries. This is at least as of the last time I looked, about two years ago; they may support more complex things now. But think of it like this: I have a query, and there are two indexes I could use; whichever index comes back first, that's the one I'll pick. And then maybe I'll revisit it after a thousand queries, I think that was the threshold, and run the same race again.

[Student: so you're expecting the queries you see in the future to be very similar to the queries you've already seen.] Correct. And for MongoDB workloads, sort of operational database systems, they're not OLTP in the sense of doing transactions, but you're dealing with websites. In that case, absolutely, yes, you're seeing the same thing over and over again. The adaptive query processing stuff, which we're not going to have time for, I don't think anybody actually does. And nowadays, with main memory databases, it's getting kind of weird.
In the old days, you had queries that took days to run; that was certainly a problem people had. Nowadays, if you're in memory and you're using a column store, it's minutes. So it's not worth all that overhead, slowing down the regular operational queries just so the one query that takes five minutes takes four minutes instead. [Student asks: can you run the query on a sample of the database to figure out what's going to happen, and use that to guide the decision?] It's an interesting idea; I don't think anybody does exactly that. You can do sampling of the data to help guide your statistics calculations. The HyPer guys do this: they pick a thousand tuples, check what the selectivities of the predicates are on them, and use that to make decisions rather than relying on possibly stale statistics. But when you think about it, how would the full version actually work? The number of plans you can choose from is exponential. How many samples are you going to take? If you have a million different combinations of query plans, doing sampling for all one million of them would be pretty slow. Yes? [Student points out that some physical operators build structures, like a hash table, which you can't really sample.] Right, there are physical operators where you have to generate things like a hash table, and how would you sample that? Yeah, the sampling thing is an interesting idea; I just don't think it would work. I think it would be too slow. I don't know whether anybody has tried it in the literature. My general impression is that for simple queries, sampling isn't going to help you, because most of those are sargable and you just fire them off right away. And for more complex things, the number of different plans you'd have to consider is massive, and you'd have to sample-execute a bunch of them. [Student suggests partially executing some plans to get an estimate of the intermediate result sizes, then choosing the plan based on that estimate, if the estimates aren't good enough.] So your idea is, for some operators, estimate what you think the size of the intermediate output is going to be by actually running it, and use that to guide the search. That's like the HyPer sampling thing in some ways, except you're talking about actually executing the actual operator algorithms. I don't think anybody does that. We can talk offline about whether it makes sense; my initial reaction is no, but we should talk.

OK. All right, so we had simulated annealing. I'm not going to get through everything in the lecture, but that's OK. For Postgres, they have a genetic algorithm. My understanding from talking with a Postgres developer, who is actually a CMU alum, he did his PhD here in the 1990s, one of the main guys, still lives in Pittsburgh, we had lunch with him over the summer, is that the Postgres query optimizer is, he basically confessed, a mess: a lot of the simpler stuff is heuristic-based, and then if the query gets too complicated, they switch over to the genetic algorithm.
But they have to do certain things in certain stages; they don't have a unified, holistic view of the entire query plan. So what happens in the genetic algorithm is that you start with a basic query plan, something simple, and you come up with different permutations of it. You estimate the cost of all of them, pick the top K that are best according to your fitness function, which is just the cost estimate, and then you permute them or combine them with each other and do another round of cost estimation. This is a randomized algorithm because you're essentially flipping coins to decide how to permute things as you go from one generation to the next. The key thing is that the permutation function, the mutator function, has to make sure it produces plans that are valid. For example, if your plan has a hash join followed by a sort, and your output needs to be sorted, the mutator function can't switch the order of those two, because then the output would be incorrect. So again, you have to codify all the rules about how you do your permutations in there. And that's the recurring theme in everything we've talked about so far: we always have to write code to make sure we produce plans that are correct, and that's the really hard part of all this.

So for randomized algorithms, the advantages and disadvantages are these. By jumping around the search space, we get to explore plans that we might not otherwise explore in the dynamic programming, divide-and-conquer approach from System R. It can also have very low memory overhead if we don't keep any history: if you don't keep track of all the plans you've looked at before and what their costs were, the amount of memory you need during optimization is very low. But of course, that means you may end up looking at a plan you've seen before and wasting cycles recomputing its cost when you didn't need to; there's a trade-off there. Another issue, from a software engineering standpoint, is that it's difficult to look at a query plan and understand why exactly the database system chose it, because the process is sort of random, whereas in the heuristic case you can point to exactly which rules were applied and what they generated. In the case of Postgres, since Postgres is the only one I know of that actually implements this, they have to do a bunch of extra work to make sure the genetic algorithm is deterministic, meaning the same query should get back the same query plan from one day to the next, so they have to specify a seed for their random number generator to make it all work. And of course, we still have to implement all the correctness rules by hand in our mutator function to make sure we don't end up with invalid plans. Here's a small sketch of the simulated-annealing flavor of this random walk.
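This sketches the accept/reject random walk described above; the cooling schedule, step count, and toy usage at the bottom are all assumptions:

```python
# Hypothetical sketch of simulated-annealing-style plan search: always move to
# a cheaper neighboring plan, sometimes accept a worse one to escape local
# minima, and always remember the best plan seen so far.
import math
import random

def anneal(initial_plan, neighbors, cost, steps=1000, temp=100.0):
    current, best = initial_plan, initial_plan
    for _ in range(steps):
        candidate = random.choice(neighbors(current))  # random permutation
        delta = cost(candidate) - cost(current)
        # Accept worse plans with a probability that shrinks as temp cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        if cost(current) < cost(best):
            best = current            # never lose the best plan we've seen
        temp *= 0.99                  # cool down
    return best

# Toy usage: "plans" are integers, cost is distance from 42, neighbors are +/-1.
print(anneal(0, lambda p: [p - 1, p + 1], lambda p: abs(p - 42)))
```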
So this is how we're gonna transition to what I consider the better approaches. That last point I just made is the problem: in all three of the other implementations, we had to implement by hand the rules of what it means to transform a logical operator into either another logical operator or a physical operator in a correct manner, and make sure the resulting query plan is correct and that we don't reverse the order of things we shouldn't. And in all these other implementations, we've been writing those rules in a procedural language: Java, C, C++, Python, whatever. This is really hard and cumbersome to do, and it's hard to verify that the transformation rules you're writing in your optimizer are actually correct. If you just have a bunch of Java code that knows how to mutate query plans and change the order of things, how do you know that it's going to be okay for all possible queries? You could build a framework to do fuzz testing over all possible combinations of SQL operators, different data types, different join orders, things like that, and see whether your optimizer always produces a valid plan. But you still don't have a rigorous way to verify that it's actually correct, because the rules are buried in general-purpose procedural code.

So the better approach, used in the subsequent systems we'll talk about, is to use a declarative DSL, a domain-specific language just for your optimizer, to codify the rules for doing transformations, so that you can verify the rules are correct. Then you generate a rule engine that knows how to apply those rules at runtime to produce an optimized query plan. We still have a cost model, we still have a cost function to determine whether one plan is better than another, but how we transform or permute our plans as we go along is gonna be more robust and more rigid.

There are two ways to do this: what's called stratified search, or unified search. In stratified search, the planning is done in multiple phases, in the same way that we did in the System R case. I'll still have some basic heuristics, but the heuristics are written in our DSL, so we know that our transformations from logical operators to other logical or physical operators are correct. Then we do the dynamic programming search to find the optimal plan based on cost. The other approach is unified search, where all the transformations and all the query planning are done in a single phase. This is what I actually think is the best way to do it, but not all database systems do it. The other key thing about these modern query optimizers is that the physical properties of the data are embedded in the actual operators themselves. That means that if we have a sort operator that produces data in sorted order, the operators above it in the tree can know that their input is sorted. Whereas before, in the System R case, that had to be baked into the cost model and wasn't exposed to the search algorithm itself. So we'll talk about both approaches, and I'll say a bit about how you compile these rule specifications in a second.
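Just to illustrate the difference between rules buried in procedural code and rules codified declaratively, here is a minimal sketch where each transformation rule is a piece of data, a match condition plus a rewrite, and a tiny engine applies whatever rules match. The class names, the single rule, and the engine are all hypothetical and far simpler than what a real optimizer generator produces; the point is only that the rules live in one inspectable place.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Join:
    left: object
    right: object

# Logical-to-logical rule (join commutativity), expressed as data rather than
# as ad hoc mutation code scattered through the optimizer.
RULES = [
    {
        "name": "join_commutativity",
        "matches": lambda op: isinstance(op, Join),
        "rewrite": lambda op: Join(op.right, op.left),
    },
]

def alternatives(op):
    # Yield the operator itself plus every rewrite the rule set produces for it.
    yield op
    for rule in RULES:
        if rule["matches"](op):
            yield rule["rewrite"](op)

plan = Join(Scan("artist"), Scan("album"))
for alt in alternatives(plan):
    print(alt)    # prints the original join and its commuted form
```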
So stratified search was developed at IBM in the late 1980s. As I said, we're gonna have a DSL to specify our transformation rules, we apply those rules at the very beginning, and then we still use dynamic programming to figure out the correct join ordering and the other physical decisions, and we still use a cost model to do the mapping from the logical plan to the physical plan. The most famous system, one of the first ones to take this approach, was developed at IBM and called Starburst. It was never a standalone product, but a lot of its ideas found their way into DB2 later on. And this was led by Guy Lohman, who was one of the top people on the database research team at IBM Almaden. So again, the query rewrite phase converts the query into a relational-calculus-like representation and does the transformations directly on that, and then once we have a logical query plan, we can do our System R-style dynamic programming to find the optimal plan. This is actually what they do, some variation of this, in the latest versions of DB2. The Starburst optimizer has the advantage that it works really well in practice and has pretty good performance, and it's also easier to maintain because the rules are in a DSL. In theory it's easier to write the rules, but as you want to do more complex things, that becomes more challenging. And there are later papers that talk about how the calculus-style transformations they're doing are very cumbersome and make things hard. So again, DB2 does this; I don't know if anybody else actually does.

In the interest of time, I'll skip right to the one I think is the best. So in unified search, I'm gonna have a single stage where we do all our transformations and all our planning. The transformations can be very redundant because we're doing a lot of them, so we have to be careful about not wasting CPU cycles exploring plans that we've seen before. So in all these approaches, there's gonna be an internal hash table that lets you take a logical operator or a physical operator and keep track of whether we've seen it before and have already computed its cost, so we don't have to do it over again.

The first implementation of this is Volcano, and it was developed in the early 1990s by Goetz Graefe. If you remember the index locking and latching paper we read, it's the same guy; he's at HP Labs now, and before that I think he was at the University of Colorado. The basic idea is that we have a DSL in which we write all our rules for doing transformations, all our rules for relational algebra equivalences, and then we generate our query plan by doing a top-down search, or what an AI background would call backward chaining. Rather than starting at the very beginning without any query plan, without knowing where we wanna end up, like in the case of System R, we start at the very end with exactly the kind of query plan we want, and then the search algorithm works down from that and builds the plan from the top to the bottom. The Volcano approach was actually implemented in NonStop SQL from Tandem, and then HP bought it; I think it's still around. This was one of the earlier distributed, parallel database systems. It was called NonStop because the idea was super fail-safe hardware.

So let's look at an example of how this would work. Let's use that same query: we're gonna look up Joi's mixtape. The first thing we do is start with a logical plan of what we want the query to be. Here we just say that we have our logical join over our three tables, where the order doesn't matter, and then we specify, as a physical property of that logical plan, what we want our data to be sorted on. Then we apply all the rules from our engine, either logical-to-logical transformations or logical-to-physical ones.
So a logical-to-logical transformation would be something like reversing the order of a join. A logical-to-physical transformation says: you have a join, and here's the actual algorithm I want you to use to implement it. Again, we start from the top and work our way to the bottom. So here we apply the logical transformations to expand these out, and then we do the logical-to-physical ones and say, well, I know I could implement this with a sort-merge join, because I know I want my output to be sorted, and then I connect these logical operators into it to say this is how we're gonna compute it. We don't specify exactly how that lower join is carried out; we just know that we're gonna take whatever its output is and feed it in. And at this level we can specify the access method: is it an index lookup or a sequential scan? Then we keep going down the tree. For the next join operator, we can do it with a hash join or with a sort-merge join, and the same thing, all these operators feed into it. And as we go down, we're calculating the cost of the plan from wherever we are in the tree up to the very top, so each edge has a cost. If I'm ever pursuing a path whose cost is already greater than the lowest cost I've seen for a complete plan, then I don't need to go all the way down that path. This is called branch-and-bound search in the optimization world.

We can also have enforcers that require the output of one of these physical operators to match what the operator above it needs. For example, if we wanna do a hash join here, the enforcer would say, well, we can't do just that, because the data we're gonna get out of it is unsorted and our output needs to be sorted. So the enforcer cuts that option off right there without even bothering to go down; instead, we would have to insert a sort operator on top, and then we can do our hash join. Now say that at this point we know the cost of doing the hash join plus the sort is gonna be more expensive than the cheapest path we've seen so far. Then we can cut this branch off as well. And while we're doing all of this, there's a memoization table in the background, because these joins down here could be reused for other paths in the tree, and we don't want to recompute them. We can just look them up in the table, know exactly what the cost is, and avoid doing redundant calculations. So the Volcano optimizer has to enumerate all possible transformations without repeating work, and that's what the memoization table helps with, and then we prune things as we go down and avoid having to do an exhaustive search. In my opinion, you end up looking at fewer things than you would in the System R case.
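Here is a rough, toy sketch of that top-down search with a memo table and branch-and-bound pruning. The operators, cost numbers, and the shape of the search are all made up for illustration; this is not how Volcano or Cascades is actually structured internally. The "HashJoin+Sort" alternative is meant to stand in for an enforcer inserting a sort on top of a hash join so the required output ordering is satisfied.

```python
import functools

# logical expression -> list of (physical implementation, local cost, children)
PHYSICAL_CHOICES = {
    "JOIN(A,B,C)": [("SortMergeJoin", 40, ("JOIN(A,B)", "GET(C)")),
                    ("HashJoin+Sort", 70, ("JOIN(A,B)", "GET(C)"))],
    "JOIN(A,B)":   [("HashJoin", 20, ("GET(A)", "GET(B)")),
                    ("SortMergeJoin", 30, ("GET(A)", "GET(B)"))],
    "GET(A)":      [("SeqScan", 10, ()), ("IndexScan", 5, ())],
    "GET(B)":      [("SeqScan", 15, ())],
    "GET(C)":      [("SeqScan", 25, ())],
}

@functools.lru_cache(maxsize=None)   # the memoization table: cost each expression once
def best_plan(expr):
    best_cost, best = float("inf"), None
    for impl, local_cost, children in PHYSICAL_CHOICES[expr]:
        total, subplans = local_cost, []
        for child in children:
            if total >= best_cost:   # branch-and-bound: this alternative can no
                break                # longer beat the best one found so far
            child_cost, child_plan = best_plan(child)
            total += child_cost
            subplans.append(child_plan)
        if total < best_cost:
            best_cost, best = total, (impl, subplans)
    return best_cost, best

print(best_plan("JOIN(A,B,C)"))   # (105, ('SortMergeJoin', [...]))
```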
Okay, so real quick: the Volcano work came out in the early 1990s, and it's the same Volcano as when we talk about the Volcano iterator model. Same guy, same project. The follow-up to Volcano is a search optimizer called Cascades, and it does basically the same thing, with some minor fixes to anomalies the original had. It was also written to be more object-oriented, because that was all the rage in the '80s and '90s: you declare these operators as objects, and they can emit rules, so that when you fire one off it can generate more rules, and so on. But the same top-down approach is there. So what's really interesting about the Cascades model is that this is what's implemented in SQL Server, and as we'll see in Wednesday's paper, their query optimizer has the best predictions and produces the best plans. It's also implemented in Greenplum's Orca. And what's kind of cool about what Greenplum does is that they have a bunch of different products that all wanna do query planning. They have Greenplum itself, and they have this HAWQ thing that does SQL on top of Hadoop. So Orca is a Cascades-style query optimizer that's extensible enough to support all of that different stuff, stuff for Greenplum, stuff for HAWQ, for all these different storage systems or database systems, within a single code base. So the Cascades stuff is really extensible, and I think it's very interesting.

Okay, so we're being very hand-wavy in this, but I think you get the basic idea. So here's just a summary of everything that we talked about. We have the heuristics stuff, which was done by Oracle and Ingres in the 1970s. Heuristics with cost-based join ordering comes from System R. Randomized search is currently used in Postgres. Stratified search is what's used in the current DB2 and the current Oracle. And unified search is the Volcano and Cascades stuff from the '90s, which is now in SQL Server, in Greenplum, and rumored to be in Teradata as well, but I'm not sure about that. Are there any questions about these search strategies? In the interest of time, I'm gonna skip the adaptive query processing stuff, but we'll pick up on this on Wednesday. The basic idea, again, is that we can do dynamic optimization that allows us to change the query plan as we go along. We saw the example in Ingres where we were actually able to do this, but I don't think anybody else actually does this in practice. So we'll skip that.

Okay, so what should you remember from this? Query optimization is hard, very hard. And this is part of the reason why the NoSQL guys said, oh, we're not gonna support SQL: because then you need a query planner, you need an optimizer, and it's hard to implement. So instead they relied on you writing procedural code against some API that tells the database system how to execute your queries. And that's okay for gets and sets and easy things, but doing more complicated queries, as we saw even with nested queries, or combining this with materialized views and other things, is really hard. There are rumors that Oracle has an entire floor in their Crystal Palace, or Crystal City, that just does query optimization stuff. So it's hard to do. Although I am partial to the top-down approach, because it just seems cleaner in my opinion since you don't have to do the heuristics, the literature suggests there really isn't a difference in performance between the two. Both search approaches generate query plans that are equally good. But in my opinion, again from a software engineering standpoint, the top-down approach is better. So again, all of this, as we said in the beginning, and to his point from before, depends on having a good cost model. If your cost model generates shit estimates for what a query will cost, you're gonna get shit query plans. And the way you get a good cost model is to have really good statistics. That's what we're gonna focus on on Wednesday: how do you actually estimate, for a given query plan, how well it's gonna do?
And then use that to determine whether one plan is actually better than another. Okay, any questions? So again, on Wednesday we'll talk about cost models, and then I'll try to spend some time at the end talking about some tips for working in a large code base. Okay, awesome, guys. Thank you.