The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. I know I say I'm excited every week, but this week I'm really excited. Today we have Nico Bruno and Cesar Galindo-Legaria from Microsoft, on the SQL Server team. They're here to talk about their query optimizer, which is awesome because, again, the research shows that Microsoft has the best query optimizer, so they're going to tell us all the secrets of why that is the case. Cesar has been on the query processor team at SQL Server — actually he's been on the SQL Server team since 1997 — and he did his PhD at Harvard. And Nico is also on the SQL Server team; you were at MSR before, right? Yes, I used to be at MSR. Right now you're on the SQL Server team, and you did your PhD at Columbia. Were you there for the release? Yes. I haven't seen him in so long — he's a Google guy. Anyway, as always, I want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring us. If you have questions for either Cesar or Nico, please unmute yourself, say who you are and where you're coming from, at any time — we want this to be a conversation. All right, guys, thank you so much for doing this. We really appreciate it. Thank you. Thank you.

Okay, sure. So let's start. Today is Cyber Monday, so you get two speakers for the price of one. We're going to be talking, together with Cesar, about the Cascades framework for query optimization at Microsoft. Cascades is a framework that started about 25 years ago, and several Microsoft products use it as a foundation for optimization, so we're going to talk a little bit about what we do there. The framework itself — Andy has been covering that in class; I've seen lectures that cover Cascades, and Rebecca also talked a little about how Cascades works internally, the memo and all those structures, in a previous Quarantine talk. So instead of repeating that, we thought we'd ask Cesar to give a very brief overview of Cascades and how it fits into different products, and then talk about ten things that probably don't appear in the original papers — things we do or don't do in the different products that use Cascades. With that, let's get started. Cesar, I think you're muted, but I'll pass it to you.

Okay, so this is the one slide we have about Cascades itself. As Nico mentioned, we're not going to go deep into it, but we'll mention a few things, looking back at what it has provided compared to other alternatives we could have used. One thing we have found really valuable in the Cascades framework is the algebraic representation — a uniform algebraic representation of everything. Instead of working off a parse tree or some other compact representation of what a query is, we use relational algebra, and the algebraic representation extends to the scalar operators. So a query is an operator tree that has either relational algebra operators or scalar operators, and this unified framework is just very convenient to build on top of. That is one thing we have found very valuable in Cascades. Another is the use of composable rules.
The basic contract is that there is a set of alternatives, and that set can be enhanced and extended — other things can be added — by using rules that can be examined and reasoned about in isolation. So having composable rules is one of the top things. One minor detail here: in the original Cascades paper, and also in earlier work on transformation-based enumeration, there was some thought put into what language to use to describe the pattern to match and the replacement. What we found is that most of the time there is additional logic in there — you need to check what the predicate looks like, what the properties are. So we're not using a high-level language to describe the rules. The pattern is a set of operators, and then a function generates the substitute, the replacement or alternative.

Is that for historical reasons, or, if you had to build it today, would you have a DSL — or do you simply see no need for it?

It was too much of a hassle. If all we're describing is commutativity, that's easy enough. But almost immediately you start having to check properties, and to check properties there is some data structure — do you extend the language to access that? It just becomes too much of a hassle. It was simpler to say: it's C++, you go write your thing, you build the new operator tree, and you pass it back. The more complicated rules would require a DSL as expressive as C++, so instead of going through that intermediate, we go directly to C++.

The other thing is integrated and uniform optimization. What we mean by that is you have one abstraction — this is the set of alternatives we will consider — and each of those alternatives gets costed, and we pick among the costs. Think of an optimizer like IBM's, which is split into query rewrite and then cost-based optimization, but for joins only; here it is uniform optimization from the start. You put in whatever alternatives you need, and then they get costed and picked. Anything else here, Nico? Okay.

And here is a quick history of Cascades at Microsoft. The work for SQL Server started around 1995. Something very important for this whole journey was that Goetz Graefe was at Microsoft at that time and was the architect of the query processor. That was very valuable, because he laid out principles that have endured. The first time it shipped at Microsoft was in 1998, with SQL Server 7.0: a new query processor, new query optimizer, and new query execution, based on the principles laid out in Volcano and Cascades, and that line of SQL Server continues. Then, about ten years later, Cascades was also used in Cosmos and Scope — there is published work, the Scope papers. This is a scale-out, highly scalable relational processing system, and that code was derived from the SQL Server QO: a snapshot was taken of the query optimizer and the code was forked. Then in 2010 another product derived from the same optimizer was the Parallel Data Warehouse, a Microsoft appliance that was also scale-out. Same thing: the optimizer was extended. It was separated into two parts and there were some quirks to it, but it was a very effective approach to get a query optimizer for the scale-out data warehouse. That is also published.
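To make the "pattern plus a C++ substitute function instead of a DSL" point concrete, here is a minimal, hypothetical sketch — not the actual SQL Server interface, just an illustration of the contract the speakers describe: a rule is a class with a pattern check and arbitrary C++ that builds the alternative tree.

```cpp
// Hypothetical sketch: a transformation rule as a C++ class — a pattern of
// operators plus a function that builds the substitute tree — rather than
// something written in a DSL. Names and structure are illustrative only.
#include <memory>
#include <utility>
#include <vector>

enum class Op { Join, Select, Get /* ... */ };

struct Expr {
    Op op;
    std::vector<std::shared_ptr<Expr>> children;
    // scalar payload (predicates, columns) omitted for brevity
};
using ExprPtr = std::shared_ptr<Expr>;

class Rule {
public:
    virtual ~Rule() = default;
    // Pattern side: does this expression match the shape the rule cares about?
    virtual bool Matches(const ExprPtr& e) const = 0;
    // Substitute side: arbitrary C++ that inspects properties and builds the
    // replacement operator tree. Returning nullptr means "no alternative".
    virtual ExprPtr Substitute(const ExprPtr& e) const = 0;
};

// Example: join commutativity, the "easy" case a DSL could also express.
class JoinCommutativity : public Rule {
public:
    bool Matches(const ExprPtr& e) const override {
        return e->op == Op::Join && e->children.size() == 2;
    }
    ExprPtr Substitute(const ExprPtr& e) const override {
        auto alt = std::make_shared<Expr>(*e);       // shallow copy of the node
        std::swap(alt->children[0], alt->children[1]);
        // A real rule would also check properties (outer-join sides, hints,
        // constraints) here — exactly the logic that made a DSL a hassle.
        return alt;
    }
};
```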
The references are below. After 2014 we have several other products derived from the Parallel Data Warehouse, in particular cloud services: there is one thing today called SQL on demand that you can get on Azure, and there is Synapse SQL. All of those are scale-out SQL-as-a-service in the cloud, and they also target data warehouse workloads.

Was there any discussion of doing it like Orca or Calcite and having an optimizer as a service, so that you're not maintaining four different code lines for essentially the same kind of optimizer?

Not really. Part of that is that all of these branches of the SQL Server optimizer were able to leverage a lot of the logic that was already coded there, so it was a good starting point. And several of these things branch off but later branch back in over time, so it made sense to keep them as related technologies. Yeah — I think Cosmos is the only one that branched off and never merged back; all the others exist in the same code base, basically targeting specific things to different engines. Right. And part of what we'll be talking about today applies to some of these different versions of the optimizer. The core is common to all of them, but we'll mention some things that exist in one and maybe not in the others yet.

Okay, cool. So let's start with the usual, simplified optimization pipeline. Similar to many other systems, we start with parsing. This is very simplified — there are a lot of special cases and more complexity — but you can see these as the main building blocks. We start with parsing, and then algebrization, which is the process where we bind the parse tree: semantic checks, types, and so on. Then we start the first phase, which is simplification and normalization. This is also done via transformation rules. Specifically, it is a combination of fixed-point application of transformations to each operator in the tree plus some top-down and bottom-up passes over the tree. This is a special engine that runs on trees, not on memos, and it tries both to simplify the initial query and to canonicalize the representation of queries so that they have a common form. There are several rule families here; I'll mention a few: subquery removal and decorrelation; empty-result propagation — if you have a join where one side is empty, we know the whole thing will be empty, so we propagate that emptiness up; transformation from outer to inner joins based on nullability constraints; a bunch of cube and rollup reductions; in group-bys with functional dependencies we remove group-by columns; pushing filters down is another example — all to get the tree into a canonical, simplified form.

The rules are matching patterns, like you would want in Cascades, right?

Yes, it is using the same basic building blocks of Cascades, but not in the presence of a memo — this is mostly tree-to-tree transformation. It's basically what people describe as "apply all the rules that are always beneficial" — though not only the beneficial rules, also the ones that bring the tree into a canonical form. Yes, it uses the same infrastructure of promises and patterns and substitutions and so on.
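As a rough illustration of what "fire the always-applied rules to a fixed point, tree-to-tree, with no memo" could look like, here is a minimal sketch. It is an assumption-laden simplification, not SQL Server's engine: node shapes, rule signature, and the example rule are all made up for the example.

```cpp
// Hypothetical sketch of a simplification/normalization pass: the same
// pattern-and-substitute idea as Cascades rules, but run directly on a tree
// (no memo), firing canonicalizing rewrites to a fixed point bottom-up.
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string op;                                   // e.g. "Join", "Filter", "Empty"
    std::vector<std::shared_ptr<Node>> children;
};
using NodePtr = std::shared_ptr<Node>;

// A simplification rule returns a replacement, or nullptr if it doesn't apply.
// Unlike exploration rules, these are assumed to be strictly "reducing", so
// the fixed-point loop terminates (e.g. empty-join propagation, filter pushdown).
using SimplRule = std::function<NodePtr(const NodePtr&)>;

NodePtr Simplify(NodePtr e, const std::vector<SimplRule>& rules) {
    // Bottom-up: simplify children first.
    for (auto& c : e->children) c = Simplify(c, rules);
    // Fixed point at this node: keep firing rules until nothing applies.
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& rule : rules) {
            if (NodePtr alt = rule(e)) { e = alt; changed = true; }
        }
    }
    return e;
}

// Example rule: a join with an empty input is itself empty.
NodePtr EmptyJoinRule(const NodePtr& e) {
    if (e->op != "Join") return nullptr;
    for (const auto& c : e->children)
        if (c->op == "Empty") return std::make_shared<Node>(Node{"Empty", {}});
    return nullptr;
}
```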
The interface is the same, so you could actually pick up any transformation rule and say this one is going to be simplification, or this one is going to be exploration. It's almost the same source code tree; sometimes the rules are cost-based, in the exploration phase of Cascades, and sometimes you always fire them. And these are always fired, up to a fixed point? Yes, these are always fired up to a fixed point. That's awesome.

Great. Once we have that, we have an initial tree, and there are a bunch of tasks in this pre-exploration phase before we do the actual cost-based optimization. I'll mention a couple of them. We do project normalization: basically pushing projects — the projected columns that create computed columns — down the tree as much as possible, so that we can match them against the table and identify whether we have computed columns in the table. If you have something like WHERE A + B < 3 and the table has a computed column defined as A + B, this pushing of projects down lets you recognize that the expression is something already computed as a column in the base table. We do those kinds of things, and then we pull projects up so that they are not intermingled with the rest of the expressions, because that would complicate a lot of the matching that happens later.

AutoStats here is, first, the identification of the columns we need statistics on to do proper cost-based optimization — columns that appear in selections, group-bys, joins, and so on. Then we identify whether we have such stats, and if we don't, we synchronously build the stats before actually optimizing the query. That, of course, adds some startup cost to the first queries that get executed, but because we use sampling and so on, it is a relatively fast step that brings the statistical information up to date with what the query requires.

So a query shows up, it's touching a table, you don't have stats for it — do you fire up a background job and start collecting those stats, hoping that by the time you get to the cost-based search the stats are available?

It's not a background job; it's a synchronous task. Everything here is simplified — you can configure whether you want stats to be automatically created, or disable that and create them manually — but in general, if this is enabled, AutoStats kicks in and creates the stats that are needed. AutoStats can also create and load statistics on computed columns: if you have WHERE A + B > something and there is a computed column on A + B, you get a histogram on it, and so on.

After we have that, we do the initial estimation, so we now have cardinality values and all the information about the initial tree. We do some join collapsing, which will become clearer when we talk in a later slide about building n-ary joins — it's more than canonicalization; it serves other purposes we'll get to. And then we try a trivial plan. Trivial plans are plans where we know that, independently of cardinality — independently of how big the tables are — we can produce the optimal plan. This is very restrictive: single table, doesn't require work tables, and so on. But SQL Server started primarily as an OLTP engine.
So a lot of plans are caught by this trivial-plan path, and then we don't have to continue if we can get a plan this way.

How trivial is trivial? SELECT 1, sure, you don't need Cascades for that, but would SELECT * from a table without a WHERE clause count as a trivial plan?

Yes — for instance, single tables that don't require work tables, based on the indexes you have; if you have a filter and there is an index that is perfect for that query, those types of things. Sargable queries. Only those where we can guarantee that, no matter what, the plan will be optimal. It is a small subset, but in many cases it is exactly what small OLTP queries do all the time. That's awesome.

Yes, great. After that, we get to exploration and cost-based optimization, and we divide it into stages. Again, this will become clearer in a later slide, but essentially each stage applies a subset of the exploration and implementation rules, under conditions that get larger and larger. This is different from other systems that say: let's do join reordering as one stage, and then pushdown of group-bys as a different stage. Each stage here does the whole thing, but with an expanding set of rules and timeouts and so on. This is mostly done so that we can get to good plans quickly enough — we'll talk more about that when we discuss optimizing the optimizer itself. In any case, when you see "full parallel" here, which is the latest stage, that green box is mostly what we know as Cascades. Everything else we've talked about so far are important things that help and complement the pure exploration and implementation of Cascades.

After that comes the post-optimization phase, which is engine-specific — all the engines Cesar mentioned do different things in this stage. For instance, common subexpression spools are treated in a particular way to share computation in the Cosmos optimizer. Or the placement of scalar evaluation: where do we evaluate scalar expressions that can be used in multiple places, and so on. Several things happen there in post-optimization. After optimization is done, we have the final plan, which gets compiled down into an executable set of iterators — and that's another story; that is the boundary between QO and execution.

Sorry, may I ask a question? I'm Lin, a grad student here. Are the plans for those different engines actually compatible — for SQL Server, Cosmos, and the others? Are the plans you generate compatible across engines, or is each engine a bit customized?

That's a good question. Cosmos is a little different. Cosmos has a lot of special things coming from the blend between MapReduce and relational systems: user-defined operators, producers, combiners, even different logical and physical operators. Those engine-specific things are not shared in general; Cosmos uses its own stuff. But all the others Cesar talked about — SQL Server, the data warehouse, on-demand, and so on — share the basic infrastructure, not only T-SQL, which is the variant of SQL that SQL Server supports, but also the logical and physical operators.
Most of the post-optimization on those engines, except for the Cosmos one, is done in the same codebase with the same set of assumptions and so on. I see. Thanks.

Sorry, just another quick question — it's Jonas. Those three stages under exploration, those are basically quick wins: if your costing meets a certain threshold you exit early, otherwise you do successively broader searches for a better-cost plan?

We have a slide about that. It's essentially trading off the time you spend optimizing against the time you will spend executing the actual query — you don't want to optimize longer than you're going to be executing, and those things. These are basically trade-offs to quickly get to something that is good enough. Perfect, thank you.

Okay, Cesar, anything else, or do we move to the next one? You're muted, I think. Sorry — just a couple of quick comments. This slide is interesting because it puts cost-based optimization in the context of a larger process that is part of compilation and part of the overall quality your query optimizer is going to have. The simplification and normalization stage makes a big difference, because it is very common for query generators to be used, and they can generate all kinds of things that are subject to simplification. So if you look at the plan quality you're getting — yes, cost-based optimization is a huge thing, but simplification and normalization also make a big difference in practice.

And then this slide could be expanded into multiple slides, but we kept only one to make room for other things. Of course, a big part of the quality of your query optimizer is that you have to have a framework that is flexible and powerful and can support the work; but then you have to go do the specific, domain-specific logic. Over time more rules have been added — we have over 400 transformation rules now. One comment here is that these rules are about relational algebra and relational operators. Things we do with scalar trees — constant expression evaluation, detection of contradictions, normalization of predicates — are in general not transformation rules; they are just code, just logic. Here's a list of some of the areas touched by these rules: join reordering, for sure — outer joins, subqueries, stars and snowflakes. Join elimination, for example, comes in as a simplification: if you know you're doing a join and you have a foreign key such that the join is not necessary, it will be eliminated. Materialized views are also handled through transformation rules, so the set of alternatives will have options with different materialized views, or with no materialized view at all. Same with the use of indexes, or large IN lists.

How does that work? It depends on what's in the catalog, right? Are your rules dynamically looking up the catalog to say, oh, I can switch to this materialized view?

Yes. At the start of optimization we load useful information from the catalog. This goes from nullability of columns, to check constraints (this column is always greater than 100, whatever it is), to foreign keys.
And materialized views are just part of this metadata that is loaded at the beginning. The materialized view definitions are there, we have a list, and we have a lookup structure so that, given a table, I can quickly find whether there are materialized views defined over it, and what the list is.

And when do you rewrite regular, non-materialized views? Is that another rule, or does that happen in the pre-processing stage?

That happens in the front end. By the time we get to the query optimizer, views have been expanded already — in the previous slide, the algebrizer is expanding the views. Okay.

Right. Another interesting area is update plans. You may be familiar with the Halloween protection problem: you're doing an update, and an index scan is feeding you the rows to update, but as you change them they may appear again later in the same index, so you get into a kind of infinite loop. The way we model Halloween protection is as a physical property. An update to a table requests a physical property that says: I need Halloween protection on this particular column, or on this particular set of columns. Different implementations then either provide that physical property or not. The general way to prevent the Halloween problem is by spooling the set of rows you're going to change. But if you have a hash join, and the table you're updating is on the build side, then that already provides Halloween protection; a nested loops join, on the other hand, doesn't provide it. Also, depending on the specific access structure you pick, it may give you Halloween protection on some columns and not on others. So this is just one example of how we use physical properties.

I've never heard this. This is amazing.

The thing is that we have rules and we have properties, and every little thing we want to do, we try to cast as a problem of: what properties should we require, what properties should we derive, what rules will satisfy this property. Halloween protection is a great example, because in principle it sounds like something alien to all of this, but it can be cast in the same framework.

Out of curiosity, how far back does that go — when was Halloween protection added to the optimizer? Presumably whatever the original Sybase code base had handled it somewhere in the execution engine. When did you introduce it into the optimizer as a property you can enforce? Was that at the very beginning, or more recently, as an "oh, we can also do this" thing?

No, this was part of the initial work, really. We got the Sybase code; it had a query processor; it got replaced between '95 and '98, and that's when we put Cascades in place. This was part of the first set of functionality, because we had to deal with it.
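To show how "Halloween protection as a physical property" could look mechanically, here is a small, hypothetical sketch. The class names, the idea of a "delivered protection" column set, and the spool enforcer shape are all illustrative assumptions, not SQL Server's actual representation; the point is only that the update's requirement is checked like any other required physical property.

```cpp
// Hypothetical sketch: Halloween protection modeled as a physical property
// that an update requires and that some physical operators happen to deliver
// (e.g. the build side of a hash join, or an explicit spool enforcer).
#include <set>
#include <string>

struct HalloweenProtection {
    std::set<std::string> protected_columns;   // columns that must not be re-read mid-update
};

struct PhysicalOp {
    virtual ~PhysicalOp() = default;
    // Which columns does this subtree fully materialize before producing rows,
    // thereby protecting them from being revisited during the update?
    virtual std::set<std::string> DeliveredProtection() const = 0;
};

struct HashJoin : PhysicalOp {
    std::set<std::string> build_side_cols;      // columns of the updated table if it
                                                // sits on the (fully consumed) build side
    std::set<std::string> DeliveredProtection() const override { return build_side_cols; }
};

struct NestedLoopsJoin : PhysicalOp {           // pipelined: protects nothing by itself
    std::set<std::string> DeliveredProtection() const override { return {}; }
};

struct Spool : PhysicalOp {                     // the generic enforcer
    std::set<std::string> spooled_cols;
    std::set<std::string> DeliveredProtection() const override { return spooled_cols; }
};

// Property satisfaction check, done the same way as for any other required
// physical property (sort order, distribution, ...): either the plan already
// delivers it, or the optimizer must insert and cost an enforcer (a spool).
bool Satisfies(const HalloweenProtection& required, const PhysicalOp& plan) {
    const auto delivered = plan.DeliveredProtection();
    for (const auto& col : required.protected_columns)
        if (delivered.count(col) == 0) return false;
    return true;
}
```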
Right — and here is an example of a query; I think this is one of the TPC-H queries. We have a join of two tables, lineitem and part, and a subquery. In this case, suppose you have a materialized view that gives you, for lineitem grouped by partkey, the sum of the quantity, which is related to this query — but the query needs to be transformed so that the view can be matched. This is an example of a query plan that would be generated from it, and it color-codes some of the transformation rules that kick in. The yellow part is subqueries: one of the things that happens in the simplification/normalization stage is that the subquery gets converted into a join — it becomes a join with a group-by, and the group-by can also move around the join based on the join conditions and all that. So there is subquery logic that kicks in here. Then there is a reordering that takes place, because now we have three joins over three tables — that's the green part. There are the aggregation transformation rules, because of course this has aggregation. And then there is materialized view matching, which in the final plan is used instead of doing the bottom aggregation over lineitem. One other thing done here is bitmaps: there is a filter right above the materialized view, and that is a bitmap filter set up from the higher join, based on the join conditions, so that we only pass a subset of the rows, based on that later elimination. We'll chat about bitmaps, especially bitmap costing, a little later.

So this is an example of rules and properties. Maybe one more comment: something that is very convenient and useful is that, as part of the logical properties, we have what we call constraint properties, which keep track of things like equivalent columns, so that when we look at property satisfaction we can use column equivalence to see whether one particular sort satisfies another, and that sort of thing. Anything else here? We call this "rules and properties" because there's an interplay between them. A lot of these properties are properties on scalars, like constraints, and because we have them, we don't need many, many rules saying things like "NOT of NOT of x is equal to x" — all of that is encoded in the properties themselves, we can normalize there, and we don't explode the set of rules even further.

Okay. So when we are doing cost-based optimization, costs basically come from cardinality estimation, and cardinality estimates are based on statistics. So I thought I would talk a little bit about the statistics we use. At the bottom, the most common statistic is the single-column histogram, which derives from the old maxdiff histograms: histograms where the bucket boundaries are placed where the largest deviation of some metric — like frequency or density — happens. The figure at the right is an example.
Typically there are at most 200 buckets (these details change over time), and each bucket contains information about not just the endpoints but also what happens within the bucket: the number of distinct values in the bucket, and from that the average density — the average frequency of a value in the bucket. There is also other information, like whether the histogram keeps growing to the right — an ascending key, like a date column that keeps getting inserts — so that for the right-most region you should assume there will be rows even if the histogram says there is nothing, because updates happen. So there is a lot of additional information stored with the histograms, but the histogram is the most important piece.

There is also multi-column density information, which essentially says: for a set of columns, how many distinct value combinations there are. There is average column length for the variable-length columns like strings and varchars. The perhaps less well-known ones are tries: tries are used for substring predicates — LIKE, CONTAINS, starts-with, and so on — and they are built on those varchar columns. And in the distributed setting, like the data warehouse, or even in SQL Server for partitioned tables, there is the notion of having multiple smaller histograms, one per partition, that get merged to form the overall histogram. Because you can do this hierarchically, it helps in different cases: with partitioned tables you can swap partitions in and out and still get the overall histogram without reconstructing the whole thing. The same happens in DW, where you have different distributions and you want information about the global table based on histograms from the smaller pieces. These merges are sometimes lossy, because it's difficult to merge histograms that have different bucket boundaries, so some of the systems have transitioned to sketches that merge cleanly — Count-Min sketches, HyperLogLog, heavy hitters — to get the basic information on top of which the histograms are built, while still being mergeable without loss.

Other systems, like Cosmos, have very large data — at some point it was modeling the data that powers the entire Bing search engine — and there you have a lot of skew: web pages that are linked by everybody, and so on. So there is also a metric of skew: how much each partition deviates from the average, for the case where some elements are too common. Those are the types of statistics that exist. The statistics are built on data sources, which can be base tables — including the computed columns those tables have — filtered indexes (you can create an index built on top of a selection, say over A = 10, and then you have statistics over that), and materialized views as well.
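For readers who want the per-bucket layout pinned down, here is a rough sketch, loosely modeled on the kind of per-step information SQL Server exposes (e.g. via DBCC SHOW_STATISTICS). Field names and the estimation logic are illustrative, not the engine's exact formulas.

```cpp
// Rough sketch of a maxdiff-style histogram and the per-bucket information
// described above. Buckets are assumed sorted by ascending high key.
#include <vector>

struct HistogramBucket {
    double range_high_key;      // upper boundary value of this bucket
    double equal_rows;          // rows equal to range_high_key itself
    double range_rows;          // rows strictly inside the bucket range
    double distinct_range_rows; // distinct values strictly inside the range
    // Average frequency of a value inside the bucket ("density").
    double AvgRangeRows() const {
        return distinct_range_rows > 0 ? range_rows / distinct_range_rows : 0.0;
    }
};

struct Histogram {
    std::vector<HistogramBucket> buckets;   // at most ~200 in practice

    // Estimate rows for "col = v": exact boundary frequency if v is a bucket
    // boundary, otherwise the average in-bucket frequency.
    double EstimateEquals(double v) const {
        for (const auto& b : buckets) {
            if (v == b.range_high_key) return b.equal_rows;
            if (v < b.range_high_key)  return b.AvgRangeRows();
        }
        return 1.0;   // beyond the last bucket: assume something is there
                      // (the ascending-key adjustment mentioned above)
    }
};
```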
Talking a little about the mechanics: the creation of these statistics can be manual — customers can say CREATE STATISTICS on this column, or on these multi-column densities, because it is much harder to figure out automatically which column subsets to create them on. Creation can also be implicit: every time you create an index, by construction you also have the information to create statistics on the leading column, and multi-column densities on every prefix of the index. And you have the option of automatic creation, which is the AutoStats thing we talked about earlier. Updates can also happen automatically via ModCounter: every time you have updates, per-column counters keep getting updated until a threshold is reached; after that an asynchronous update of the statistics happens, and you keep the statistics up to date. Again, everything here is a simplification. When I say block-level sampling: there is an option of sampling at the page level, so we don't sample every row, we sample pages. Some systems also have optional cross-validation to determine when to stop doing page-level sampling while still getting something representative for creating the stats. The single-column maxdiff histograms are sort-based — you need to sort the input to construct them — whereas the sketches, like HLL and heavy hitters, don't require sorting, which is an improvement that lets you scale better with less memory consumption.

So, moving a little higher in the stack: once you have statistics, we get to cardinality estimation, and essentially the systems implement an algebra of histograms. Histograms are defined over base tables and loaded from base tables, but then they are propagated through operators, which is what allows you to do cardinality estimation. This propagation is important, and it goes from simple propagations — when you say A + 2 > 5, A + 2 is a scalar operator that propagates a change to the histogram, because you are basically shifting the histogram to the right — to the fact that every operator is able to propagate histograms, including histograms not directly related to the operator itself. For instance, with WHERE A + 2 > 5 there might be a histogram on B that is later used for a group-by, and that histogram needs to be adjusted by the selectivity of the predicate. There are a bunch of techniques for adjusting distinct values based on selectivities and so on. And it happens all the way up, even for more complex constructs like HAVING SUM(A) > 10: if you have a histogram on A, we'll create a histogram on SUM(A) — appealing to the law of large numbers, assuming the sums will be roughly Gaussian — and then you can compare that histogram with "> 10" and get estimates, bottom-up, for cardinality estimation.
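Here is a small, self-contained sketch of the "algebra of histograms" idea above: a scalar expression shifts a histogram, and a filter's selectivity scales every histogram flowing through it, including columns the predicate never mentions. The distinct-value adjustment is a common textbook formula chosen for illustration, not necessarily the one the engine uses.

```cpp
// Minimal sketch: operators transform the histograms that flow through them.
#include <cmath>
#include <vector>

struct Bucket {                 // same shape as the histogram sketch above
    double high_key;
    double equal_rows;
    double range_rows;
    double distinct_range_rows;
};
using Hist = std::vector<Bucket>;

// A scalar expression like "A + 2" simply shifts the histogram on A.
Hist ShiftByConstant(Hist h, double c) {
    for (auto& b : h) b.high_key += c;
    return h;
}

// A filter with selectivity s reduces the row counts of *every* histogram
// flowing through it, e.g. a histogram on B that a later group-by will use.
Hist ApplySelectivity(Hist h, double s) {
    for (auto& b : h) {
        double rows_per_value =
            b.distinct_range_rows > 0 ? b.range_rows / b.distinct_range_rows : 0.0;
        b.equal_rows *= s;
        b.range_rows *= s;
        // Distinct values shrink more slowly than rows: keeping a fraction s of
        // the rows keeps roughly d * (1 - (1 - s)^(rows per value)) values.
        if (b.distinct_range_rows > 0)
            b.distinct_range_rows *= 1.0 - std::pow(1.0 - s, rows_per_value);
    }
    return h;
}
```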
Now, the next question is: sure, you have all these histograms, but there is no unique, optimal way of estimating cardinalities. Here is a very simple example. If you have a WHERE clause A = 10 AND B = 12, there are many different ways to estimate its cardinality. You can use the histogram on A and the histogram on B, assuming independence, or not assuming independence, and so on. Or, if you have a multi-column density on (A, B), the number of distinct value combinations on (A, B) can be the estimator for that specific predicate, because you have equalities on both columns. Or you can have a filtered index on A, filtered by B = 12, and that can give you a better estimate of A = 10 given B = 12. So there are many different options, and which is best depends on various factors. Over time the different engines implemented more and more sophisticated ways to deal with this explosion of ways to estimate cardinality. Initially there was another property, the quality of estimation, which tried to give a numeric value for the certainty that an estimate was accurate enough; it degraded as more operators were crossed, and so on. That culminated a few years ago in a full rewrite of the cardinality estimation framework, which now uses holistic calculators for estimating query fragments and tries to normalize the process as well — because, as you may know, if you estimate a join followed by a selection, versus the selection first and then the join, the results are not always the same and have different properties. This abstracted away a lot of the mechanics of cardinality estimation and made explicit the assumptions made during estimation — uniformity, independence, the containment assumption for joins, and so on. And these things can now be more cleanly overridden via hints: you can have hints that say, tweak the model so that something different happens. Okay. There are a bunch of other topics, so we should probably move faster. Memory grants are also based on cardinality estimation. And there is some recent work on using cardinality feedback, especially in Cosmos, to learn cardinality models based on actuals and then feed them back to override the estimates. So that covers cardinality estimation; we can move on to costing.

Yes — just one quick thing to mention on cardinality estimation. As you know, there is a lot of published work on frameworks for plan generation — generating and maintaining a set of equivalent alternatives for cost estimation — but very little published on what a good framework for cardinality estimation looks like. You typically find models and formulas, that kind of thing. But it deserves a framework; we have a little bit of one, and it would be good to have more.

Maybe a quick question on that. In academia there have recently been quite a few discussions on learned cardinality estimation techniques. As far as I know there is some work from Microsoft Research, and some work in collaboration between the Cosmos team and Berkeley, that essentially does online adjustment or learned cardinality estimation. We're curious what your take on it is — have those things been integrated, are you still looking at them, or is that further down the road?

It depends. In Cosmos this has been integrated and has been in production for a while now. SQL Server requires a little more care because of the existing customer base and the inability to regress anything — even if you have better statistics, you can regress plans. So there is much more careful analysis there, but it is making its way into the system in some form or another. Got it, thanks.
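Returning to the A = 10 AND B = 12 example above, here is a purely illustrative sketch of the competing estimators the speakers describe. The numbers, function name, and parameter conventions are assumptions made for the example, not the engine's actual calculator.

```cpp
// Sketch: the same conjunctive predicate A = 10 AND B = 12 admits several
// estimators, and the engine has to choose among them.
#include <algorithm>

double EstimateConjunction(double table_rows,
                           double sel_a,                 // from histogram on A
                           double sel_b,                 // from histogram on B
                           double distinct_ab,           // multi-column density on (A,B); 0 if absent
                           double filtered_stats_rows) { // rows from stats on an index filtered
                                                         // by B = 12; negative if absent
    // Option 1: single-column histograms + independence assumption.
    double est = table_rows * sel_a * sel_b;

    // Option 2: equality on both columns lets the multi-column density act as
    // the estimator: roughly one "average-sized" (A,B) group.
    if (distinct_ab > 0)
        est = table_rows / distinct_ab;

    // Option 3: a filtered index/statistic on A WHERE B = 12 gives a direct
    // estimate of A = 10 within the B = 12 slice -- usually the best source.
    if (filtered_stats_rows >= 0)
        est = filtered_stats_rows;

    return std::max(est, 1.0);   // never estimate below one row
}
```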
Okay, now quickly on costing — just a couple of points to make here. One is that when we talk about the cost of a subtree, a physical tree, the cost is not a single value but a line. It's illustrated here: the line is the cost as a function of the number of rows to produce. So there is a cost to produce the first row and a cost to produce the last row, and we currently assume this is a linear function — it could be changed, but that is the assumption. And there are actually three lines. One is the cost of the first execution of a subtree; then there is a cost to rebind and a cost to rewind. Rebind means the subtree has an outer reference, a parameter — it sits underneath a nested loops join that is passing a value from the outer side, for example — and rebind is the cost of running the same subtree with a different parameter value, while rewind is the cost of executing it again with the same parameter values. This is used for two things. One: rebind and rewind are where we account for the effect of caching — data you got from disk that is now in memory. The other: costing subtrees that return only partial result sets — you're going to consume something, but not the entire set; we'll see an example in a minute.

The other things to take into account here are bitmap filters. As you know, a common technique in hash joins is that on the build side you create a bitmap, maybe a Bloom filter, that you later push down on the probe side to eliminate early the rows flowing from it. Information about bitmap filters is passed down in what we call the cost context. The cost context is part of the optimization goal, and it's somewhat comparable to required properties, in the sense that it's information provided by your context about how the result will be used or what properties it should satisfy. Row goals are part of the optimization goal: frequently the goal is all rows, but sometimes it may be "I just want one row" or "I want 100 rows", that sort of thing.

Here's an example with two plans, a join of two tables. If you want all the rows, maybe the best way to get them is a hash join. However, if you want just the top 10 rows: with a hash join you have to build the hash table entirely — you must consume the build side completely — so the cost of the first row is high even though the cost of the last row may be lower. We may be better off using a nested loops join to get just those 10 rows, even if going through the entire result set that way would be more expensive. Under a cost context of 10 rows, the nested loops plan will be cheaper than the hash join. So it is through these cost lines that we account for partial row goals, and also for rewinds and rebinds. Anything else, Nico? No, I think those are the main points on costing.
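A tiny sketch of the "cost is a line, not a number" idea, showing why the TOP 10 example flips which plan is cheaper. The structure and the numbers are illustrative only; they are not SQL Server's cost formulas.

```cpp
// Sketch: first-row and last-row costs define a linear function of how many
// rows the parent actually pulls, which is what makes row goals possible.
struct CostLine {
    double first_row;   // cost incurred before the first row can be produced
    double last_row;    // cost to produce the complete result
    double total_rows;  // estimated cardinality of the full result

    // Cost of pulling only `k` rows, interpolating linearly along the line.
    double CostForRows(double k) const {
        if (total_rows <= 1.0 || k >= total_rows) return last_row;
        double frac = k / total_rows;
        return first_row + frac * (last_row - first_row);
    }
};

// Example (made-up numbers): hash join has a high first-row cost because the
// whole hash table must be built; nested loops has a tiny one. With a row
// goal of 10 out of 100,000 rows, the cheaper plan flips:
//   CostLine hash   {1000.0, 1500.0, 100000.0};  hash.CostForRows(10)   ~= 1000.05
//   CostLine nloops {   1.0, 9000.0, 100000.0};  nloops.CostForRows(10) ~= 1.9
```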
We're about halfway through the material, so we probably have to speed up a little for the remaining minutes. So, on performance. Many design choices matter here, because Cascades itself assumes a full search of the space of plans: you say, this is commutativity, this is associativity, let it run as long as needed to explore the whole thing. That is not true in practice, for two reasons. First — with all due respect to our customers — we've seen customers with 100 or 150 joins in a single query; clearly you cannot have an exponential blow-up of alternatives for such a big search space. Second, even when there aren't that many joins and you could do the full exploration, in some cases you want to balance the time spent optimizing against the time it will take to execute the query. So there are a bunch of things we do to address these trade-offs.

The first simple thing is plan caching — we have several layers of caching in the system. The first time we get a query, we look it up in the plan cache to see whether it has been optimized before, because the fastest optimization is the one that already happened. If the plan cache contains the query, we then have to check whether the plan is still valid: it might have been optimized when an index was present that has since been dropped, or before a column was dropped from the table, and so on. There are also a lot of SET options that can affect plan choice, so embedded in the cache key are all these SET options, like how we treat arithmetic errors and so on. If we find the plan and it's valid, we just skip optimization — which is the ultimate optimization, because you get the actual plan right away. Otherwise we optimize and replace the entry in the plan cache. The plan cache has some fairly sophisticated mechanisms for evicting older plans, plans that have become invalid, and so on.
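Here is an illustrative sketch of the plan-cache contract just described: the key includes the query text plus the SET options that can change plan choice, and a hit is only usable if the plan's dependencies are still valid. All names and fields are made up for the example; this is not the actual cache design.

```cpp
// Hypothetical plan-cache sketch: key = query text + plan-affecting SET
// options; lookup validates schema dependencies before reusing a plan.
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct SetOptions { bool ansi_nulls; bool arithabort; /* ... */ };

struct CacheKey {
    std::string query_text;
    SetOptions  options;
    bool operator==(const CacheKey& o) const {
        return query_text == o.query_text &&
               options.ansi_nulls == o.options.ansi_nulls &&
               options.arithabort == o.options.arithabort;
    }
};
struct CacheKeyHash {
    size_t operator()(const CacheKey& k) const {
        return std::hash<std::string>{}(k.query_text) ^
               (k.options.ansi_nulls ? 0x1u : 0x0u) ^
               (k.options.arithabort ? 0x2u : 0x0u);
    }
};

struct CachedPlan {
    std::string plan;                          // compiled plan, stand-in
    std::vector<int> schema_versions_used;     // versions of objects it depends on
};

struct PlanCache {
    std::unordered_map<CacheKey, CachedPlan, CacheKeyHash> entries;

    // Return the plan only if present *and* still valid against the current
    // schema versions (index dropped, column dropped => recompile).
    std::optional<std::string> Lookup(const CacheKey& key,
                                      const std::vector<int>& current_versions) {
        auto it = entries.find(key);
        if (it == entries.end()) return std::nullopt;
        if (it->second.schema_versions_used != current_versions) {
            entries.erase(it);                 // stale: evict and force reoptimization
            return std::nullopt;
        }
        return it->second.plan;
    }
};
```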
Trivial plans we explained before — another technique to avoid doing the whole optimization when certain properties hold. Optimization stages with timeouts is an important one: each optimization stage applies a subset of the transformations (for instance, materialized views are only considered in later stages, not the earlier ones, and things of that nature), and at the end of each stage we look at the estimated cost of the best plan so far and decide whether it is worth continuing to explore more plans, or whether this is already a good trade-off between the time you spend optimizing and the time you will spend executing the plan.

Another important thing is memo seeding. Because we are not doing an exhaustive search of the space, it is sometimes good to have a good starting point and do a localized search around it, and join reordering is one good example. Instead of putting in a random join order and then transforming until we hit a timeout, remember from the first slide that we do join collapsing — we have the notion of an n-ary join — and then, based on that n-ary join, on cardinalities, and on properties, we can seed the memo with a bunch of good starting points. The starting points come from different heuristics: some are good for analytic queries, some for OLTP-type queries, one tries to mimic the original syntactic order, and there are several others. By seeding the memo you can get to the better areas of the plan space faster, without doing a full exploration. There are also various approaches to gradual optimization, such as having a temperature that keeps decreasing so that at some point you stop applying rules — essentially ways to localize the search to a good subset. All of these things can be overridden; you can say no, just go do the full optimization. But these are typical things we do to improve the performance of optimization itself.

Sorry, a quick question about those timeouts at different stages. Are the timeouts hard-coded for each stage, or is there some way to set those numbers?

Right — these timeouts are similar to the cost model, where you calibrate how much it costs to read a page, how much this costs, how much that costs. These are also calibrated: how long does it take to run each rule, how many rules do we want. It's basically a cost model for the optimizer itself. The stages can be changed — in fact we introduce a new stage every now and then — but they are fixed; they are the stages optimization goes through.

Right, but for the timeout number of each stage — is that set dynamically at runtime, or preset when the system starts?

That's a very good and interesting point. You may ask: why don't we just time out after three seconds? We don't do physical timeouts, we do logical timeouts: you time out after N rule applications. The reason is that we don't want a more heavily loaded system to produce a different plan just because you spent more wall-clock time executing some rule. It is an art, basically — there is no bulletproof way of doing it, so there is calibration and so on — but the important detail is that all these timeouts are in the logical space. I see, that's interesting. Thanks.
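To tie the stages and logical timeouts together, here is a short sketch of a staged search loop where each stage gets a budget counted in rule applications (not wall-clock time) and a "good enough" threshold. The structure, thresholds, and names are assumptions for illustration, not SQL Server's actual stage machinery.

```cpp
// Sketch: staged cost-based search with logical timeouts. `runStage` stands in
// for the Cascades task loop over the memo; it stops on its own once
// `rule_budget` rule applications have fired.
#include <functional>
#include <limits>
#include <vector>

struct Stage {
    std::vector<int> enabled_rules;   // subset of rules this stage may fire
    int  rule_budget;                 // logical timeout: max rule applications
    double good_enough_cost;          // stop early if the best plan is below this
};

struct SearchResult { double best_cost; /* plan omitted */ };

SearchResult Optimize(const std::vector<Stage>& stages,
                      const std::function<SearchResult(const Stage&)>& runStage) {
    SearchResult best{std::numeric_limits<double>::infinity()};
    for (const auto& stage : stages) {
        SearchResult r = runStage(stage);
        if (r.best_cost < best.best_cost) best = r;
        // Trade-off check: if the plan is already cheap, spending more time
        // optimizing would likely exceed the execution time it could save.
        if (best.best_cost <= stage.good_enough_cost) break;
    }
    return best;
}
```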
Okay, one quick comment here, and then there was a question. This memo seeding doesn't have to be a single point — we actually seed the space with several starting points that are the results of different heuristics, and those results go into a uniform, cost-based space where, based on cost, we will pick one or another and explore the neighborhood. So we can seed multiple points. And one more thing on this problem: when you do exhaustive search you can optimize a number of things — with dynamic programming you solve each subproblem once and reuse it later — but you don't have a complete plan until late in the process. Finding a good practical approach in between, where you use heuristics but still do a real plan-space search, is an important practical problem.

So, the question is — I think we're at the hour. Yes, we are at the hour. We have three more topics: one is optimizer functionality for common subexpressions, and two are a little more peripheral, supportability and testing. Do we go for another ten minutes, do we stop here, do we pick a subset — what's the right thing? Actually, unfortunately Andy had to go set up another talk right now, so he's not going to wrap up. I'm a PhD student working with Andy, so I'm going to wrap up this talk for him. I think we can probably go another five or ten minutes. If somebody has to leave, this talk is being recorded anyway, so they can catch up later; for the people who have time, I think we can go a little over, about ten minutes. Okay, sounds good — we'll zoom through these last couple of slides.

So this one is about supportability. The point here is that the optimizer doesn't always do what you would like it to do, or what you expect it to do, so there are controls you need to put in place. On one hand there is visualization, and then there are controls. The bottom picture here shows live query statistics. SQL Server has showplan, which shows you what is going to be executed, or what was executed; it also has functionality that tells you how things are executing as they go, so you can see the number of rows processed and the time taken by every operator. That helps you see how plans are executing in reality. Then, in terms of inputs to the QO — ways you can control its behavior — there are query-level hints, join hints, and table hints: things like "use this index", or "use a hash join here", or "use a merge join there". There is one other thing, the USE PLAN hint, or plan guides. The result of the query optimizer is a plan, which happens to be in XML format, and you can feed it back to the query optimizer alongside a query and say: for this query, use this plan. The way that is implemented is not by using the plan directly as-is; instead, we look for that plan within the memo, and there is some work there so that we avoid having to do an exhaustive
search of the memo for the plan you want to get. That is part of the query hints, and it is also done via a required property — another example, like the Halloween protection we talked about: this search for the plan in the memo is done using structural properties on the type of plan you want to get.

One useful knob we have here is that we can associate hints with queries. The simplest, standard way to apply a hint is to write your query and, at the end, say OPTION and then whatever you need to put there — every database has comparable functionality. The other thing you can do is use a spot of metadata: a catalog table that maps queries to the hint to apply for each one. The value of this, again, is that with query generators it's not always possible to go change a query to use a particular hint, but with this lookup-table approach we can do it. And — next animation — another thing we do there is insert hints based on the observed behavior of a query. There is a feature called Query Store that keeps track of all the queries that have been issued to the system, what the plan was, what the execution was, and information about runtime. We can process that and, based on it, associate hints that correct issues in the queries of the workload without having to change the workload itself.

Okay, now we'll get progressively faster. Common subexpressions — this is something interesting because it kind of stretches the limits of what Cascades was originally meant to do. Suppose you have two tables, a fact table and a dimension table, and a query like this: a CTE — a common subexpression — that is a select from fact joined with dimension, and then you do SELECT ... UNION ALL SELECT ..., selecting from the CTE twice. There are many plans, but two families of plans that can be useful. The first is: just replace the CTE at each of its mentions with the whole subquery and optimize the thing. This is important because, especially in OLTP scenarios, this is the plan that wins: on one side of the UNION ALL, because you're selecting with a specific predicate and you have a foreign key, you can eliminate the whole join and do an index lookup on the fact table, while on the other side you do a hash join of the two tables. So you duplicate the work, but each copy is so efficient — because you're leveraging indexes and so on — that the result is better than doing the global thing once. That's very good in some cases, not great in others. Specifically, you can have an alternative plan like the one below, where you compute the common join once, spool the result, then read it twice and do the filtering and the concatenation. Of course, this is not just naively optimizing the CTE in isolation: you'll see that the predicate pushed into the common part is f.val > 5 OR d.val < 6, which is basically the union of the predicates fed in by the two consumers of the common expression.
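A small sketch of that last point: the filter pushed below a common-subexpression spool has to be the disjunction of the consumers' predicates, because the spool is read by all of them. Plain strings stand in for real predicate trees; this is an illustration of the idea, not the optimizer's representation.

```cpp
// Sketch: build the predicate that may be pushed below a shared spool from
// the residual predicates of its consumers, e.g. {"f.val > 5", "d.val < 6"}
// for the UNION ALL example in the slide.
#include <string>
#include <vector>

std::string PredicateToPushBelowSpool(const std::vector<std::string>& consumer_predicates) {
    if (consumer_predicates.empty()) return "TRUE";
    std::string result = "(" + consumer_predicates[0] + ")";
    for (size_t i = 1; i < consumer_predicates.size(); ++i)
        result += " OR (" + consumer_predicates[i] + ")";   // disjunction, never AND:
    return result;                                          // every consumer must still
}                                                           // see all the rows it needs
// Each consumer then re-applies its own, tighter predicate above the spool.
```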
So in many cases doing this type of work is more efficient than duplicating the work and getting the results twice. However, there are a bunch of problems to solve to do that in a principled way. You have to identify what the common expressions are and, in general, the topological order of all the different common subexpressions, and do some optimizations across those components. In general you cannot just push a filter below a common subexpression spool — by definition it is read by all the consumers, so you cannot push the filter from just one consumer — but it's also not clear how to obtain the union of the consumers' predicates if all you're doing is transformations on trees. So there is a lot of work needed to adapt Cascades to work off graphs instead of trees. We've been doing some work; it is not done, but it's a fascinating topic that I think would be fruitful to work on a lot more. There's a lot more to say, but we are very short on time, so let's go. You're muted, Cesar.

Okay — this one is about testing. Testing query optimization is hard because there are many, many possibilities: how do you know that all those alternatives really give you the same result? That's one dimension. Another dimension is that, even if you're getting the same result, what you want is the best plan, or at least a very good plan, and showing or validating that that is the case is difficult. Part of the work we have been doing has to do with probabilistic testing: you can generate random queries and at least validate that the system doesn't crash, and you can generate random execution plans for the same query, within the same optimization, and validate them against each other. Those are two places where we do randomized testing. Then there are the replays: we capture playbacks from customers, in different forms and for different systems, and it's important to keep running those over time. There are micro-benchmarks that we need to evaluate and keep running. There is also work on automatically creating minimal repros: once you find a query that has a problem, but the query is two pages long, what's a good, automated way of reducing the repro to something more manageable? So testing is hard, and it's an important part of what needs to be done in industry.

Last one, I promise. The idea here is to show that, by having a robust query optimizer, you can start building other meta-optimization engines on top of it that use the whole optimizer as a small building block — and because what you're building upon is robust, you get good tools that work off the optimizer. Two very quick examples. We have the Database Tuning Advisor, which solves a meta-optimization problem: given a workload, what set of indexes and views will make this workload as fast as possible? There is a whole search space of configurations, and in fact it is structured in a similar way: you have a search space over indexes — transformation rules, let's say, that merge two indexes, that remove an index, and so on — but the cost of a configuration is
Another example is the parallel data warehouse engine. When we built that, we didn't want to do the two-step optimization in which you find the best serial plan and then, on top of that serial plan, you start sprinkling distribution or parallelism, because that is known to be suboptimal. Instead, what we ended up doing was to call SQL Server and do the optimization, but instead of returning one plan, we return the whole memo, that is, the whole search space of plans, which are essentially centralized plans, and then we do a second optimization over all those plans for the distributed operators. That has been the original optimizer for the PDW appliance, but it also serves to show that, with small extensions to the engine itself, instead of outputting a plan you output the memo internally, and we can build more complex engines on top of the contract that the original optimizer is giving. So again, unfortunately, we could say much more, I went too fast on that, but we reached the question slide and we are out of time. Any other questions?

Yeah, this is great. I'll clap for everyone. It's a great talk, a lot of interesting and useful information for us. So I guess we'll first open up the floor to the audience. I think we probably have time for maybe one or, at most, two questions, so please, any questions?

I have a question. This is Andy Kimball from Cockroach Labs. As you know, we worked together years ago. I had a question for you: you said that framework for cardinality estimation, that we should publish more on that, or more should be published on that. Are there any plans to publish on that from within Microsoft?

Plans is too strong a word. Yeah, I think it would be good to at least show what we have, because honestly I think there is more work that could be done there, that should be done there. I would like to at least show the framework that we have, but no immediate plans, just too much going on right now.

I think it'd be very interesting. That's one thing I think we struggled with, because there is so little published about it that you're sort of starting from scratch, almost.

I'm from CMU, I'm a first-year student with Andy. I was kind of wondering: you said 400-plus rules. How much effort is it to add a new rule within the constraints of your existing system? Can you give some kind of estimate, like one person, a few weeks? To give some context, sorry, some systems, I think Cockroach for example, have a DSL. Do you do something like that to make it easier, or is it not necessary?

No, we went with rules that are coded in C++. Mostly, as I explained, DSLs are good for some constrained cases, but when you go into very complex scenarios you need such a powerful DSL that it just becomes an intermediate for the real thing. So you create a new rule by basically creating a class, setting some properties like what is the pattern and what is the set of substitutes for that rule, you may override some promise to control the order in which rules are applied, and then you basically just write a bunch of code.
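As a rough illustration of that contract, and with the caveat that the class shape here is a stand-in rather than the real SQL Server hierarchy, a Cascades-style rule written directly in C++ might look like this: a pattern of operators to match, a promise value used to order rule applications, and a substitute function that builds the alternative expression.

```cpp
// Sketch only: an illustrative Cascades-style transformation rule in plain
// C++. Names and shapes are made up; this is not the SQL Server code base.
#include <memory>
#include <vector>

struct Operator;
using OpPtr = std::shared_ptr<Operator>;

struct Operator {
    enum Kind { Join, Filter, Get, Any };
    Kind kind = Any;
    std::vector<OpPtr> children;
};

class Rule {
public:
    virtual ~Rule() = default;
    virtual OpPtr Pattern() const = 0;         // operator shape to match
    virtual int Promise() const { return 1; }  // hint to order rule firing
    // Arbitrary C++ instead of a DSL: free to inspect predicates, properties,
    // and metadata before producing the alternative expression.
    virtual OpPtr Substitute(const OpPtr& matched) const = 0;
};

// Example: join commutativity, Join(A, B) -> Join(B, A).
class JoinCommutativity : public Rule {
public:
    OpPtr Pattern() const override {
        auto join = std::make_shared<Operator>();
        join->kind = Operator::Join;
        join->children = {std::make_shared<Operator>(),   // matches any input
                          std::make_shared<Operator>()};
        return join;
    }
    OpPtr Substitute(const OpPtr& matched) const override {
        auto swapped = std::make_shared<Operator>();
        swapped->kind = Operator::Join;
        swapped->children = {matched->children[1], matched->children[0]};
        return swapped;
    }
};
```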
Probably the most complex part of writing a new rule is to test the rule and do all the necessary work to make sure that it works in all circumstances, in all the engines, in all the flavors, and so on. But I remember one of the last things that I wrote was a new way of doing cubes. I remember that because it was a completely new rule, and it took, let's say, a week to do, or a few days plus a few more days to test this and that. So it is something that can happen relatively efficiently if you want.

Yeah, and maybe just to add to that, to complement it: just writing the code for a new transformation rule would take one to two days. But then you have to test it, and there are some standard questions: does this interact with other transformation rules, or how does this interact with other transformation rules? Not so much in terms of correctness, because by design, as long as the transformation rule is correct from input to output, that is not a problem. But one question is: does it blow up your search space? That is really the big question, and if it does, then you have much more thinking and art going on. In fact, it doesn't even need to blow up a lot of the space, because we talked about timeouts and how timeouts are tied to the number of rules that have been applied, so by adding one rule and getting one more transformation here, you might affect the timeout that happens and the quality of the plan. So that's why there is a lot of test collateral there trying to understand this. Another thing is compatibility level: when you add new things, you add them under a compatibility level so that customers opt in. There is a lot of infrastructure there to make this process a little more robust.

Cool. So lastly, I'll ask one final question that we often ask our guests, which is: thinking a little bit into the future, for example on a scope of five to ten years, what would you guys think would be one important or significant thing that you would want to add, change, or extend in the optimizer at that scope, let's say five or ten years out? Cesar, please.

Well, I think an important thing to work on is cardinality estimation, doing cardinality estimation better. Once you get to the next level of detail there are many important paths to follow, and one of them, as you mentioned, was feedback and observation of execution and those kinds of things. Yeah, I think improving cardinality estimation is a big one. There are also a lot of mid-term problems that are juicy enough to keep us from thinking ten years out, because, for instance, the common subexpressions and all of that: it is a very important problem that can result in a lot of performance improvements in real queries, and it is not clear at all how you tackle it. Do we have to go to a Cascades++? Do we have to do something radically different to have DAGs as a first-class citizen, or not? Costing, cardinality estimation, and all the supportability, how to adapt to your environments: when we had to go to distributed systems like PDW and Cosmos we did different things. I mean, can we have something that is more modular, like when you talk about Calcite? There are a lot of these things that are mid-term.
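As one small example of the execution-feedback idea mentioned above, and with the caveat that this is only a sketch with made-up names rather than anything SQL Server actually does, feedback can be as simple as remembering the actual row count observed for a predicate and preferring it over the model's estimate the next time the query is compiled.

```cpp
// Sketch only: cardinality estimation feedback from execution. The class and
// its interface are illustrative assumptions, not a real product API.
#include <string>
#include <unordered_map>

class CardinalityFeedback {
public:
    // Called after execution with what actually happened for a predicate.
    void Record(const std::string& predicateKey, double actualRows) {
        observed_[predicateKey] = actualRows;
    }

    // Called during compilation: fall back to the model's estimate when no
    // feedback has been collected yet for this predicate.
    double Estimate(const std::string& predicateKey,
                    double modelEstimate) const {
        auto it = observed_.find(predicateKey);
        return it != observed_.end() ? it->second : modelEstimate;
    }

private:
    std::unordered_map<std::string, double> observed_;
};
```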
Yeah, cool, very interesting. Of course, you have to throw in the ML, okay, doing ML everywhere, which is the thing to do these days. This is something interesting: it remains to be seen how much you can push those types of techniques into QO and how much you cannot. I mean, it is to be decided, but it is a fascinating thing on the five-year horizon. Sure, sure, I know. I also think you need to be a little bit careful, like you say, about where to apply them and where it doesn't really make sense to apply them, and so on. But very interesting. All right, thanks a lot. Again, it's a great talk, and thanks a lot for coming, Nico and Cesar. I think that's it. I hope everyone stays healthy and stays safe, and stay tuned for next week. Okay, all right, thank you. Bye bye.