The types of optimizers we've talked about so far have been the classic query optimizers: the query shows up, we parse the SQL, we run it through the optimizer, and we generate a plan before we even begin executing the query. This is how most query optimizers work. Obviously you can't run the query unless you have a query plan, so you have to put it through the optimizer. But the challenge is that what we think is the best plan during this optimization phase, before we even start running it, may actually be incorrect. Since we can't actually run the query without a plan, there are some assumptions we have to make about what the database and our environment look like. But these things can change over time for various reasons, right? The physical design of the database can change because the administrator of the application could add and drop indexes or change the partitioning scheme. The database itself could get modified: people could insert or delete tuples, and that could change the distribution of values for our columns. If we're invoking our queries as prepared statements, then the behavior of the query for one set of parameters might be different for another set of parameters. And of course, every time we run ANALYZE, or whatever the command is in our database system to recollect the statistics that we use in our cost models, the decisions we'll make in our optimizer could be entirely different. So the focus today is to understand how we can potentially improve our optimizers, the efficacy or the quality of the plans that we're generating, by relaxing this requirement that we never revisit our assumptions once we generate the plan at the beginning. 
So to do this, we want to understand a little bit what a bad query plan looks like, or why query plans can be considered bad and end up with less than optimal performance. We'll cover this more on Wednesday when we discuss cost models, but at a high level, the biggest problem we're always going to have is getting the join order wrong. Joins are almost always the most expensive thing we're going to execute in an analytical workload, and if we get the ordering incorrect, that can lead to poor performance. And the reason we're going to select an incorrect ordering is because we're going to have inaccurate cardinality estimations, meaning we think that our join is going to produce X number of tuples, but it's really going to be some larger multiple of what we thought was going to happen. Again, we'll discuss why this occurs more in the next lecture, but this issue is going to be a recurring theme that we're going to have to overcome, and today's discussion is about techniques to do this. But since we know things are going to go bad, like we can just assume that our cost model is going to be inaccurate and our optimizer is going to make bad decisions, then ideally, if we can detect how bad a query plan is once we start running it, we can make a decision to adapt the plan, to potentially modify it to account for the differences between what we're seeing in the real data versus what we assumed we were going to see. And then we can try to convert our plan into something that's closer to the optimal plan. So, what I mean by that is, say we have a simple query like this: it's a four-way join between four tables, A, B, C, and D, and then we just have a simple WHERE clause for the B and D tables. So let's say that we run this query through our query optimizer and we generate this plan, right? 
It's a bunch of hash joins and nothing but sequential scans. But let's say that when we generate this plan, for this particular join we estimated the cardinality of that operator to be 1,000, right? This is an arbitrary number that I'm using for this illustration. The cardinality is the number of tuples that this operator will emit. But let's say when we actually start to run it, we see that we're actually generating 100,000 tuples. So our actual cardinality is two orders of magnitude greater than the estimated one. So the question we're trying to deal with today is: if we knew what the true cardinality was while we're executing the query, could we change some aspect of this query plan to get us closer to a more optimal plan? Right? Could we change the join ordering? Would we want to choose a different algorithm to do our join? Would we want to change the access methods that come below the join to use maybe an index or a different type of scan? So this is what we're focusing on today: how to adapt this kind of plan when we know something about what the data looks like once we start running it. So the high-level idea, again, is that we want to be able to estimate the behavior of a plan to determine its quality relative to other plans. This is what the cost model is doing. But the tricky thing is that back here, before I started executing this, I had to derive this cardinality from statistics that I maintain in my catalog about what my table looks like, or what these two tables look like when you join them together. So these stats are going to be based on histograms and possibly samples that we're collecting from the data. We can also make decisions about what the hardware looks like, what kind of cache sizes we have, what kind of algorithm we want to use, and what other queries are running at the same time. 
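To make the detection step concrete, here is a minimal sketch (the class name, threshold, and iterator-based operator model are all illustrative, not from any real DBMS) of an operator wrapper that counts the tuples it actually emits and flags the plan for re-optimization once the count blows past the optimizer's estimate:

```python
# Hypothetical sketch: wrap an operator, compare actual emitted tuples
# against the optimizer's cardinality estimate, and flag the plan for
# re-optimization once the error exceeds a threshold.

class MonitoredOperator:
    def __init__(self, child, estimated_cardinality, error_threshold=10.0):
        self.child = child                      # iterator of tuples
        self.estimated = estimated_cardinality  # optimizer's guess
        self.threshold = error_threshold        # e.g. 10x off -> replan
        self.actual = 0
        self.needs_replan = False

    def __iter__(self):
        for tup in self.child:
            self.actual += 1
            # An estimate of 1,000 vs. an actual 100,000 (like the
            # example above) trips this check long before the end.
            if self.actual > self.estimated * self.threshold:
                self.needs_replan = True
            yield tup

op = MonitoredOperator(iter(range(100_000)), estimated_cardinality=1_000)
consumed = sum(1 for _ in op)
print(consumed, op.needs_replan)   # 100000 True
```

A real system would of course hook this check into the executor rather than a Python iterator, but the shape of the idea, comparing running counts against estimates mid-execution, is the same.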
The cost model stuff we'll cover next class, but the main idea to think about for what we're talking about today is that before we run the query, we only have an estimation of what the data looks like and how our query will perform. And if we get that wrong, we want to be able to correct ourselves. So the technique we're talking about today is called adaptive query optimization, sometimes called adaptive query processing in the research literature; they're essentially synonymous. And it's everything I said so far: the idea is that this technique allows the database system to modify the query plan for a query to better fit what the actual underlying data looks like. We can modify the query plan by generating an entirely new one, throwing away the old plan and starting over. Or we could modify a subset of the query plan by introducing alternative sub-plans at different points where we have to materialize tuples, where we could potentially switch from one plan strategy to another. In the first case you basically go back to the optimizer and start over. In the second case, the optimizer only replans a portion of the plan, or provides these alternative strategies at the beginning. So the main takeaway of what we're doing here is that rather than just relying on our statistical models, which are estimations or approximations of what the data looks like, we're trying to use the data we collect while we actually execute the query to help us make a decision about what the right plan should be for our particular query. And this data we collect could be used to help our current query, or merged back into the statistics we've collected through our ANALYZE operation and used for other queries. 
So again, we'll cover the various ways you can do this, but when you think about what a query is actually doing, or what ANALYZE does: ANALYZE is doing a sequential scan to compute some statistical models about what the data looks like. So if we're doing a sequential scan on a table for a query, that's essentially the same thing as ANALYZE. And so rather than just evaluating predicates or using the tuples as we scan them to generate the result we need for that particular query, we can piggyback off of that scan operator and maintain or update the statistical models with new information. And the question here is whether we just update those models for ourselves to make our own query go better, or whether we can share this with other queries through the global catalog, so that other queries can benefit from the data we collected. So there are three broad categories of AQO, or adaptive query optimization, that I want to cover. One is that we can use AQO to benefit future invocations of our query. The second approach is to try to make the current invocation of our query better by starting over from scratch and running it through the optimizer all over again. The last one also helps the current query, but it works by adding locations in the query plan that allow you to switch from one strategy to another without having to go back to the optimizer. So we'll go through each of these one by one. The simplest form of adaptive query optimization is, as I said, where as we execute our query, we also collect some information about what the data looks like, and then we can use that information either to decide whether our query plan is wrong and we want to replan it, or to merge it back into the global catalog. Again, when you think about this, what are our optimizers actually doing? 
You have a bunch of histograms or statistical models about what your attributes look like. So for a given predicate in your WHERE clause, you want to estimate the selectivity of that predicate, because that will determine how many tuples your scan will emit, and you can use that to make decisions about join orderings and other things above it in the query plan. But as you execute the scan, you actually know the true selectivity, because you're applying the predicate on the tuples and you know the percentage of tuples that are going to match. So if the cost model estimated that my selectivity was 1%, but when I actually run the query and evaluate the predicate my selectivity is 99%, then I want to use that information to help me decide whether to replan my query, or so that future queries that come along can exploit the knowledge that I've gained. Again, one approach is to try to fix my current query, and the other is to merge it back into the overall statistical model in the catalog so that I can help queries in the future. So the most basic approach to do this is called reversion-based plan correction. And the idea here, as I said, is that every single time I invoke a query, I keep track of what query plan I generated for it, I keep track of the cost estimations I had for it, and then I have all my metrics of what happened when I ran it: how many tuples I selected and how much CPU or memory I used. And I maintain this history inside the database itself. So you'll see this in the commercial systems like DB2, Oracle, and SQL Server. They have this built-in repository of the history of every single query that ever got invoked, and they can use that information to help decide how to do query planning in the future. So let's say that we have a prepared statement, or we have a query that gets invoked all the time, and we have a cached query plan. 
So rather than running it through the optimizer every single time, we can just use the cached query plan that we've already generated from previous invocations. Now say there's a change in the statistics, or something about the database's physical design changes, and we recognize that for this particular query we keep invoking, we want to run it back through the optimizer and see if we generate a new plan. But when we run that new plan for this query, if we see that the performance of the query is worse than with the old plan we had before, then we just want to revert back to it. Right? If there's a regression in performance, we switch back to the plan that we know actually performed better for us, despite the change in the physical design or the statistical models. So let's use that simple query example from before, the four-way join. Say again this is my original plan, and I'm doing nothing but sequential scans and hash joins, and say that my estimated cost is a thousand, and my cost estimation is pretty good, so my actual cost matches up. These are just synthetic numbers here. So I'm going to store in my execution history that for this query, I generated this plan, and when I ran it I had this cost. And this is just another table in my database system, right? You're sort of eating your own dog food. Rather than having an auxiliary store, this is just another table where you record this information. All right, so now let's say there's a change in our database design. Say the DBA comes along and adds two indexes, on the B table and the D table, whose columns we're using in our WHERE clause. So now when we invoke the same query again, we would recognize that the design of the database has changed in such a way that we may now want to reconsider the query plan for this particular query. Right? 
So this query touches B.val and D.val, and I just happened to create indexes on those columns. So I want to run this through my optimizer again and see what plan I get. Let's say the new plan is completely different: now, instead of running hash joins, we're running index nested loop joins, and we're doing an index scan on B and D, which we can now do because we have the indexes, which we didn't have before. And so we're going to pick this plan for our query because the estimated cost is 800, which is less than the estimated cost we had over here. But when we actually run it, for whatever reason that we don't care about at this point, the actual cost is 1,200. This could be because we incorrectly estimated that these nested loop joins would be cheaper than the hash joins, so we picked the nested loop joins. So just as before, when we actually put this in our execution history, we would recognize that this plan performed worse than the other one. So the next time we invoke the query, we want to make sure that we revert back to the plan that we know performed better. So for this approach here, this is something that Microsoft has had in SQL Server, and I think Oracle has something similar, since maybe 2012 or 2013. And this is pretty coarse-grained; it's a pretty brain-dead heuristic. It's basically saying, oh, this query plan is bad, let me just switch back to this one. So it's an all-or-nothing thing. 
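The reversion logic above can be sketched in a few lines. This is a hedged illustration, not SQL Server's actual mechanism: the history table, function names, and costs (mirroring the lecture's 1,000 vs. 1,200 numbers) are all made up for demonstration:

```python
# Minimal sketch of reversion-based plan correction: keep a history of
# (plan, actual cost) per query, and if a newly chosen plan ran slower
# than the best plan already observed, revert to the old one.

history = {}   # query_id -> list of (plan_name, actual_cost)

def record_run(query_id, plan, actual_cost):
    history.setdefault(query_id, []).append((plan, actual_cost))

def choose_plan(query_id, candidate_plan):
    runs = history.get(query_id, [])
    if not runs:
        return candidate_plan
    best_plan, best_cost = min(runs, key=lambda r: r[1])
    # What did the candidate actually cost the last time we ran it?
    candidate_costs = [c for p, c in runs if p == candidate_plan]
    if candidate_costs and min(candidate_costs) > best_cost:
        return best_plan          # regression detected -> revert
    return candidate_plan

# Old hash-join plan actually cost 1000; the new index-nested-loop plan
# was estimated at 800 but actually cost 1200, so we revert.
record_run("q1", "hash_join_plan", 1000)
record_run("q1", "index_nlj_plan", 1200)
print(choose_plan("q1", "index_nlj_plan"))   # hash_join_plan
```

In a real system this history would live in a catalog table (eating your own dog food, as the lecture says), and the comparison would also account for parameter values and noise in the timings.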
So the paper you guys were assigned from Microsoft is called Plan Stitching, and the high-level idea is exactly the same: if we recognize that our query is running slower than query plans we saw in the past, rather than just throwing away the entire new query plan and reverting back to the old one, maybe there are elements or sub-plans within the newer plan that we actually would want to retain, because they'll help lead us towards a better, more optimal plan. The other thing to think about with plan stitching is that the sub-plans you're going to borrow don't actually need to come from plans for the same query. With the simple reversion approach, I can only swap between plans if they were generated for the exact same query. But with Plan Stitch, because I can excise sub-plans or portions of a query plan, as long as I know that they're logically equivalent, I can take bits and pieces from other queries. The other thing, too, is that if there is a change in the physical design that makes a query plan invalid, meaning it specified that it wanted to do an index scan but then I dropped that index, then rather than throwing away the query plan in its entirety, I can again pull out pieces of it. So the way they generate these stitched plans is a dynamic programming search using a bottom-up approach, like System R: going from one level to the next, you pick which sub-plan is the best, and once you reach the root, you find the cheapest path. This means it's not guaranteed to find a better plan than the best plan you have so far. And it's not guaranteed to always produce a valid plan, but there are some basic heuristics they use to make sure that happens. 
So going back to our example here, say that this is our new plan, and say it was working just fine, like it was actually faster, so we always want to use it. But now if I come along and drop one of the indexes that it's using, this plan becomes invalid, and under coarse-grained reversion I can't reuse it. But with plan stitching, I actually want to figure out which components or sub-plans of this query plan I may want to use in the new plan. Even though overall it's invalid, there are still portions that are usable. So in this case, say the execution cost of this portion of the query is 600, and we would know this because we keep track of the actual runtime cost of all the operators in our query. And this other sub-plan over here has a cost of 150. So now if I combine these together into a stitched plan, the total cost in this case would be 750, whereas before, if I ran the original, it was 1,000. So again, the idea is that we want to be able to borrow bits and pieces of different query plans to help us produce a more optimal plan. And this is being done separately from the regular optimizer. In the case of Microsoft SQL Server, they're running Cascades, so they're actually doing a top-down search; this is an auxiliary search running on the side, in the background, that tries to find a plan you can stitch together. So let's talk about how they actually do this. The first step is to identify which sub-plans in our queries are logically equivalent. We talked about this before with Cascades when we had multi-expression groups. We want to know that the output of a given sub-plan is the same as, or equivalent to, another sub-plan, right? And again, we have to rely on the rules of relational algebra to recognize which operations are commutative or associative. So in this case here, the output of this portion of the sub-plan is A join B join C. 
The output of this portion of another sub-plan is C join B join A. But since these inner joins are commutative, we know that they are logically equivalent. Now, as I said, one challenge with this is that determining whether arbitrary logical expressions or logical sub-plans are equivalent has been shown to be undecidable. Meaning the question "are these two sub-plans logically equivalent?" has a yes or no answer, but no algorithm exists that is guaranteed to always give the correct answer. So in Plan Stitch, they rely on some additional heuristics to identify things like: I know that these two sub-plans are accessing different tables, so therefore they can't be logically equivalent, right? You can obviously do more complicated things, and the optimizer itself in SQL Server also has those kinds of checks in place, so they rely on that as well. So they have their own heuristics to prune things that can never be logically equivalent, and they rely on the SQL Server optimizer to identify when a sub-plan they're trying to mash together into the stitched plan is invalid. So the heuristics give them a sweet spot balancing the difficulty of the implementation, the accuracy of the determination of whether sub-plans are equivalent, and the performance, right? It's not an exhaustive evaluation of all possible inputs to the different sub-plans; it's just rules based on the relational algebra. All right, so once we've identified that we have a bunch of equivalent sub-plans, we want to combine them together into one giant query plan, where we're going to add some additional operators so that we can have branches going down different paths among the sub-plans. 
So this is how they're going to encode all the different combinations of the sub-plans that you could stitch together. The way this works is that they introduce this new OR operator, which is not actually used for execution; it's just something for the search. The OR basically indicates that the sub-plans below it are logically equivalent, so we could choose either path. So starting from the top, we have an OR at the very beginning, and for this particular query we have two options: either the hash join or the nested loop join. And again, these are logically equivalent because this one is A join B join C join D and this one is C join B join A join D, and those joins are commutative. So then we go down; we'll do a depth-first search down this side. For this one here, same thing: we can do the hash join on A and B followed by C, or this nested loop join of C join B followed by A. Again, these are logically equivalent, so that's why we have the OR, and we can choose either one. And then we keep going down until we get to a leaf node, the sequential scan. Then here there's another option for us in this portion of the query plan, because the plan we stitched from only had a hash join. So in this case, below the hash join, for B we can either do a sequential scan, as we saw in the first plan, or the index scan on B, because that came from the second plan. And so we have an OR operator to express that. Going back up here, we can only do a sequential scan on C, so that's a straight path. Going back up here for the nested loop join, the only thing we can do below it is another nested loop join. And then for this, we can either do a sequential scan on A, or again, for B, we can do either the sequential scan or the index scan. 
Going back up here for the hash join, again, that's a sequential scan on D feeding in, and then we just complete the rest of the tree like this. So this is a bit more simplified than what they showed in the paper, but these are the possible options you can have. And if I remember the paper correctly, they reported that with this approach of doing the search within this encoded plan space, they're able to stitch about 75 to almost 100% of all the plans together for the workloads they looked at. All right, so now that we've encoded our search space, we actually want to do the search. And this just starts from the bottom and goes up, the same way we did with the System R dynamic programming search: for every single leaf node, we figure out the cost of going to the next operator and pick which one is best, and once we've done the search for all the nodes at the current level, we go up to the next level and repeat the process. So let's say we start with the sequential scan on A. It has two options: either the hash join on A and B, or the nested loop join over here. Say the hash join is cheaper, so we pick that. Now we do the sequential scan on B. This has an OR operator, so it's either the hash join or the nested loop join. Let's say the hash join is cheaper, so we pick that. Now we do this for the index scan on B. Again, there's an OR operator: either the hash join or the nested loop join. And because we have an index, the nested loop join would actually be cheaper here, so we pick that. We keep going down the line and do this for all our leaf nodes, and when we're done, we go up to the next level. And then, now that we have a cost for all these paths leading up, we just pick whichever one is cheapest for us. 
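The bottom-up search over this OR-encoded space can be sketched as a small recursion. To be clear, this is a toy reduction of the paper's algorithm: the node encoding, the costs (600, 150, matching the lecture's stitched-plan example), and the names are all illustrative, and the real Plan Stitch search also handles validity checks and required physical properties:

```python
# Sketch of the bottom-up stitching search: an OR node means "these
# sub-plans are logically equivalent, take the cheapest", while an
# operator node costs its own observed cost plus its chosen inputs.

def cheapest(node):
    """Return (total_cost, chosen_plan) for a node in the OR-encoded space."""
    if node[0] == "or":
        # Logically equivalent alternatives: pick the cheapest one.
        return min(cheapest(alt) for alt in node[1])
    # ("op", name, own_cost, children)
    _, name, own_cost, children = node
    total, chosen = own_cost, []
    for child in children:
        cost, plan = cheapest(child)
        total += cost
        chosen.append(plan)
    return total, (name, chosen)

# Toy version of the example: a 600-cost sub-plan and a 150-cost
# sub-plan, stitched under a top join, beat their 900/400 alternatives.
plan = ("op", "join", 0, [
    ("or", [("op", "subplanA", 600, []), ("op", "subplanA_alt", 900, [])]),
    ("or", [("op", "subplanB", 150, []), ("op", "subplanB_alt", 400, [])]),
])
cost, stitched = cheapest(plan)
print(cost)   # 750
```

The recursion is exactly the dynamic-programming shape the lecture describes: solve each level once, memoizing implicitly in the returned tuples, then read the cheapest stitched plan off at the root.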
And then we construct the stitched plan that way, right? So again, I think this is an interesting approach. I don't think Microsoft is actually running this in production; this was a research paper that was published in SIGMOD, and I don't know of any other system doing something similar. From an engineering standpoint, the fact that you have to run this separately from the query optimizer, with sort of separate search infrastructure, is unfortunate; if this were integrated into the query optimizer component itself, I think it would be a really interesting approach. So there's another system that does something similar to this plan stitching, but they're actually working at the codegen level rather than the physical query plan level. Amazon has their Redshift data warehouse service, which is based on ParAccel, and they use a transpilation engine: for a given physical plan, the database system generates C++ or C code, which they then compile, and then they invoke the shared object that comes out of the compiler. That's how they do query compilation. Obviously, the most expensive part of a codegen engine is the compilation, right? In their case, they're actually forking GCC, or whatever compiler they're using, to generate the machine code, so they want to avoid doing that for every single query. So what they can do is, say you want to compile the scan on B, where B.val equals some input parameter. They'll codegen that piece, run it through the compiler to generate x86 code, and then cache it. And now, any time you re-invoke this query, you can just reuse the compiled version of the scan on B. 
But similar to plan stitching, what they can also do is recognize that if you have another query with the same kind of predicate, B.val equals some parameter, it'll codegen the exact same thing. So rather than recompiling it, which is again the expensive part, they can identify that they have a cached plan fragment for this scan and reuse that. And they can actually do this across all their customers. Like, a scan on a table doing an equality predicate on a varchar column is going to generate the same code from one table to the next, because you're just ripping through the column. So they can share these little fragments and stitch these compiled query plan fragments together across all their customers. Now, for a given query they've never seen before, if it has the same pattern of access methods and joins and other things as queries from another customer, they just pull from the codegen cache and stitch it together. So that's kind of cool. All right, another interesting system to talk about is IBM's Leo, the learning optimizer. This is an example of a feedback loop being used to improve the accuracy of the cost models in the system. The idea is that, again, I keep track of what my cost model estimates were when I generated the query plan, and then when I run it, if I recognize that those estimates are way off, I start recording information about what I'm seeing in the real data. And then when my query completes, I return the result back to the user or the application that requested the query, but I also go update my cost model statistics with the new information that I've collected. Now, IBM's Leo was never actually shipped in production in DB2, but it's one of the early examples of applying one of these adaptive query processing techniques in a commercial system. 
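The Leo-style feedback loop can be sketched very simply. This is only the shape of the idea, with hypothetical names; the real system records adjustment factors in DB2's catalog and applies them in far more nuanced ways than a single per-predicate multiplier:

```python
# Hedged sketch of a learning-optimizer feedback loop: after execution,
# compare the observed selectivity of a predicate against the estimate,
# store a correction factor, and apply it to future estimates.

adjustments = {}   # predicate signature -> correction factor

def estimate(predicate, base_estimate):
    # Apply any previously learned correction to the raw estimate.
    return base_estimate * adjustments.get(predicate, 1.0)

def feedback(predicate, base_estimate, observed):
    # Record how far off the cost model was for this predicate.
    adjustments[predicate] = observed / base_estimate

# The lecture's scenario: the optimizer guessed 1% selectivity,
# but execution observed 99%.
feedback("B.val = ?", 0.01, 0.99)
corrected = estimate("B.val = ?", 0.01)
print(corrected)   # ~0.99
```

The key property is that the query's own execution, which had to evaluate the predicate anyway, produces the training signal for free; no extra ANALYZE pass is needed.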
All right, so the plan-stitching stuff that we talked about, or the reversion stuff, is about fixing future invocations of a query based on the results I see when I actually execute it. But now we want to talk about: how do I fix my current query? Like, if I invoke my SQL query and I determine that I have a bad plan, what do I do? How can I fix that? Because I don't want to wait for the next invocation; I want to fix the one I have right now. So I'm calling this replanning the current invocation. Again, the idea is that if I've determined that the observed behavior of the query plan, as I'm executing it, diverges from the estimated behavior that the cost model produced, then I can decide either to stop the query, go back, and generate a new plan, or to recognize that I've already produced some work, keep the portion of the data I've already processed, and go back to the optimizer and ask it to generate just a sub-plan. So again, one option is to start from scratch, deciding that continuing with the query plan I have now is going to be worse than starting over. Obviously, if you're at the last operator, replanning is a bad idea; just let it finish. Striking the right balance here is difficult. The other approach is determining that, while I'm doing 100 joins and I've already done one of them, let me keep that one, because it was expensive, and then replan the ordering for the other 99. Either way, the whole idea here is that you're going back to the optimizer and saying, hey, generate me a new plan. So let me give an example of something that does this. This is from Apache Quickstep. Quickstep was, or is, an embedded analytical engine, sort of similar to DuckDB, but I don't think it supported SQL. 
It came out of the University of Wisconsin and was later turned over to the Apache Foundation. I think it's been kicked out of the incubator program because they haven't updated it recently; I don't know what's going on with Jignesh and his team, but I haven't really seen any updates in a while. But they had this really interesting approach called lookahead information passing, where I can do some work at the beginning of my query and pass that along to other operators or other portions of my query plan, to help me decide the right ordering for things up ahead. So for this example, say we have a simple database that has three tables arranged as a star schema: a fact table in the middle and dimension tables coming out of it. This approach only works for star schemas like this; it's not for arbitrary schemas or arbitrary tree-based schemas. So the way this works is, say my SQL query looks like this: I'm doing a three-way join between the fact table and the two dimension tables. What I'm going to do is, before I start scanning the fact table and probing the hash tables for my join, I'm going to scan through the dimension tables and generate a Bloom filter for each. We saw a similar technique being used when we talked about joins, right? This idea came from Vectorwise: you can generate a Bloom filter and pass it to the other side of the query plan so that you can maybe avoid a hash table probe. So the joins on the dimension tables are going to build the hash tables, and the fact table is just going to do the probes. I want to generate the Bloom filter and then check it to see whether the key I'm looking for could even exist in the hash table, which is cheaper than doing the hash table probe. 
But what we're gonna do differently here is that when we pass these bloom filters over to the fact table side, we're gonna do some sampling to determine the selectivity of each bloom filter. And if we determine that, say, the second dimension table is actually more selective than the first one, then I wanna reshuffle my joins so that I probe that hash table first, because I'm gonna end up throwing away more tuples earlier. And we can do this before we actually start the probe phase, because we've already built the hash tables and generated the bloom filters, so we can make the decision before we start scanning the fact table and doing the probes. So I think this is a really interesting idea. As far as I know, Quickstep is the only one that does this, and I don't know whether it actually made it into the open source version. All right, the last category of adaptive query optimization techniques I wanna talk about is what I call plan pivot points. The idea here is that we wanna introduce additional sub-plans in our query, along with a special synthetic operator in the query plan that allows us to pivot or switch which path in the plan we take. We can put conditions in this switch operator, or choose-plan operator, such that if our data looks one way we go down one path, and if it looks another way we go down the other path. It doesn't have to be two paths, it can be multiple. The two most famous techniques for doing this are parametric optimization and proactive re-optimization. At a high level they work exactly the same way; it's just that the sophistication of the techniques differs. Parametric optimization was developed in the late 1980s, in 1989. It actually came out of the Volcano project.
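Backing up to the LIP part for a second, the sample-then-reorder step can be sketched like this. The predicate-style filters, column names, and sample are all invented for illustration; a real engine would sample fact-table blocks against the actual Bloom filters:

```python
# Sketch of adaptively ordering the LIP filter probes: sample some fact
# rows against each dimension's filter, then apply the most selective
# filter first so non-matching tuples are discarded earliest.

def order_filters_by_selectivity(sample_rows, filters):
    """`filters` maps a join column name to a might_contain-style
    predicate. Returns (column, predicate) pairs, most selective first."""
    stats = []
    for col, pred in filters.items():
        pass_rate = sum(pred(row[col]) for row in sample_rows) / len(sample_rows)
        stats.append((pass_rate, col, pred))
    stats.sort(key=lambda t: t[0])  # lowest pass rate = most selective
    return [(col, pred) for _, col, pred in stats]

# Made-up sample of fact-table rows and per-dimension filters:
sample = [{"d1_key": i, "d2_key": i} for i in range(10)]
filters = {
    "d1_key": lambda k: True,   # passes everything (unselective)
    "d2_key": lambda k: k < 2,  # passes 20% of the sample (selective)
}
ordered = order_filters_by_selectivity(sample, filters)
```

Here the second dimension's filter ends up first in the probe order, which is exactly the reshuffling described above.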
Again, the same project that did the Volcano query optimizer and the Volcano iterator model. They also did early work on adaptive query optimization, and again, that work is very influential. So as I said, the idea is that for each pipeline in our query where we think there are different alternatives that could make a big performance difference, we'll generate different sub-plans. And then in our query we'll have this choose-plan operator, which basically has an if clause that says: if the cardinality of the operator below me is of a certain size, choose the first plan; if it looks another way, choose the other plan. In this example, if I know that my data is really small, then maybe I wanna do a nested-loop join, because that's gonna be cheaper than having to build a hash table and probe it. But if my data is really big, then I wanna do the hash join. So again, I think this is actually an interesting idea. Of course, the tricky thing is determining what this condition should be, and it's sort of trial and error as you develop the thing. It's also very dependent on the hardware. But the nice thing is that we don't have to go back to the optimizer and replan everything, and we don't throw away any of the data we've collected, right? We just determine whether to go down one path versus another. A more recent, more sophisticated approach to this is called proactive re-optimization. This combines the ability to go back to the optimizer and generate a new plan with the ability to tweak the plan in the same way we saw in the previous example. So it can actually do both. At a high level it works like this: a query shows up, we run it through the optimizer, and we generate different switchable plans just like before.
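A minimal sketch of that choose-plan operator, assuming a toy row format and a made-up size threshold (the hard part, as noted above, is picking that threshold for real hardware):

```python
# Volcano-style "choose-plan" sketch: a synthetic operator that picks
# between pre-generated sub-plans at run time based on the observed
# size of its input. Rows are dicts; the threshold is illustrative.

def nested_loop_join(outer, inner, key):
    # Cheap for tiny inputs: no build phase at all.
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]

def hash_join(outer, inner, key):
    # Pays a build cost up front, wins on large inputs.
    table = {}
    for i in inner:
        table.setdefault(i[key], []).append(i)
    return [(o, i) for o in outer for i in table.get(o[key], [])]

def choose_plan(outer, inner, key, small_threshold=100):
    # The "if clause" baked into the plan at optimization time:
    # small input -> skip building a hash table; large -> hash join.
    if len(outer) <= small_threshold:
        return nested_loop_join(outer, inner, key)
    return hash_join(outer, inner, key)
```

Both sub-plans produce the same result, so the operator is free to pick whichever is cheaper for the cardinality it actually observes.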
But we're also now gonna generate bounding boxes that let us determine whether the assumptions behind our decision to go down one path versus another actually match up with reality, by basically trying to put a bound on the uncertainty we're seeing in the data as we run. So we start executing the query, and just like in LEO and the other techniques, we collect statistics about the data we're seeing for this particular query. Then we can switch the query plan, just as we saw before, if we determine that one plan path is better than another. But if we're exceeding the thresholds of our bounding boxes, if we see that we're way out of whack and our estimations are way off, then we just go back and re-optimize. And then you can decide whether to pin the portions of the query plan you've already executed, because you know they were expensive, or just throw everything away and start over. So this is, again, a crash course on adaptive query optimization. I actually really like these techniques, for obvious reasons, right? They don't rely on getting it right at the very beginning; you can correct yourself as you're actually running the query. Again, we'll see next class, when we talk about cost models, how bad things can actually get. But the way you implement this is super important: you don't want to implement your optimizer and your execution engine completely separately from each other. It's sort of a symbiotic relationship. You have to know what kinds of strategies can be employed by the execution engine, in terms of switching paths and throwing away intermediate results or not, and then you build your optimizer around what your execution engine can actually do.
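The bounding-box check can be sketched like this. The 50% box width and the names are made up; a real implementation would derive the box from the optimizer's own uncertainty estimates rather than a fixed percentage:

```python
# Sketch of proactive re-optimization's bounding boxes: the optimizer
# attaches a [low, high] cardinality interval to an operator at plan
# time; at run time, an observed count outside the box triggers a trip
# back to the optimizer. Values here are illustrative only.

def make_bounding_box(estimate, uncertainty=0.5):
    """Wider boxes for more uncertain estimates."""
    return (estimate * (1 - uncertainty), estimate * (1 + uncertainty))

def check_box(box, observed_rows):
    low, high = box
    if low <= observed_rows <= high:
        return "continue"      # reality matches the plan's assumptions
    return "reoptimize"        # way out of whack: go back to the optimizer

# Plan time: estimate 1000 rows with 50% uncertainty -> box (500, 1500).
box = make_bounding_box(1000, uncertainty=0.5)
```

On a re-optimize, the engine then makes the pin-or-discard decision described above about any intermediate results it has already materialized.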
So for this reason, I think using these techniques with an optimizer-as-a-service like Orca or Calcite could actually be tricky, because there are different approaches for how you can support adaptive query execution in the system itself. So in addition to having more sophisticated query optimizers than the open source systems, all the major database vendors now support this, mostly within the last three or four years. DB2 had the LEO thing in the early 2000s, but really in the last few years Oracle, SQL Server, and now Teradata have all added the ability to do adaptive query optimization. But to the best of my knowledge, Postgres and MySQL simply can't do this, and none of the newer open source systems that have come around in the last decade support anything like this either. So, all right. Again, this was just to show you that you don't have to build the optimizer the way we first described it, where you plan once and then run. There are techniques to modify the query while it's running and feed execution feedback back into the system. So next class we'll start discussing how cost models work, and we'll see why they're actually so bad. Okay?