So today's focus is on how to build the second major component of a query optimizer: how to do estimations of what we think the system is going to do when it executes a query, like how much it is going to cost us to execute the query. The reason why we have to do this should be sort of obvious: when we talked about the cost-based search models from the last two classes, I said there's this sort of magic black-box thing that's going to tell us what the expected cost of executing a particular query is. And again, it's an internal cost. It's not something that's meaningful outside of the database system. You can't take MySQL's cost model, have it spit out a number, then take the number from Postgres' cost model and make a comparison, right? It's completely dependent on the implementation of the system, and it's really just meant for us to be able to say, this query plan is better than this other query plan for some reason, which we'll describe going forward in this lecture, and therefore that's the one I want to use, right? The other important thing to understand is that this cost model is independent of the search strategies we talked about last time, meaning whether we're doing top-down or bottom-up, it doesn't matter. At the end of the day, we still need a cost model that can predict whether one plan is better than another. So if you're going to build a cost model, how would you actually do it? Well, there's a couple of different things you can include in it. The first is the physical costs. These would be what the hardware is actually going to do when it executes the query, right? How many CPU cycles? How many cache misses? How much data are you going to read from disk? This is obviously dependent on the machine that you're running on and its hardware configuration, because as we saw in a couple of different examples, whether you're using a Xeon CPU or a Xeon Phi, those CPUs have very different characteristics, and therefore the performance of your algorithms or your query could vary depending on what hardware it is. So this is usually pretty tricky to do for that reason, because it's hard to build models for all the different possible hardware configurations that could exist. The next approach is to use what I'll call logical costs, and these are where we base our cost estimates on what the operators in a query plan are going to do at a logical level. So a physical cost would be, again, I read this many blocks from disk. A logical cost would be, I'm going to read this many tuples from the table, or my join operator is going to spit out this many tuples after the join, or my scan is going to filter out so many tuples. These are independent of what algorithm we're actually using in our query plan; we're only looking at the logical operators, not the physical operators. This should be sort of obvious: if I'm doing a nested loop join versus a hash join, at the end of the day both algorithms should generate the exact same result, the same number of tuples, because otherwise I have problems in my implementation. And so therefore, for logical estimates, we don't care whether it's one algorithm versus the other.
So the tricky thing, and this was in the paper that you guys read and we'll talk about going forward, is that you're obviously going to need to be able to accurately estimate what the output of an operator is going to be, because that's going to be fed as the input to the next operator. So in order to say, I'm going to read this many tuples into this operator and therefore generate this much output, I need to know what came below me in the query plan. And again, that's going to be the hardest thing we have to do. The last category is sort of algorithmic costs, the asymptotic complexity of the operators. This is where it actually matters whether we're doing a hash join or a nested loop join, or an index scan versus a sequential scan. And for these, we can just have weights that say, oh, the hash join is X times better than a nested loop join, and therefore prefer that. But there are obviously some scenarios where the nested loop join would be better. So as we'll see as we go along, and as I'll show in the next couple of slides, what we're primarily going to do for an in-memory database is a combination of these last two. The physical costs up here, you can do it, but it's hard. And this is another good example of the difference between the commercial database systems and the open source database systems. At the Redshift talk on Monday, if you went to that, Ippokratis mentioned that the query optimizer in their system is notoriously difficult to get right and they've spent a lot of money improving it at Amazon. And it's way better than any of the open source ones that are out there; it's commercial grade or enterprise grade. The enterprise systems are going to include all of these things; the open source systems probably just include these last two. So for a disk-based database system, the most obvious thing we care about is the disk, right? At the end of the day, that's the most expensive thing: getting things off of a spinning-disk hard drive or an SSD. In this world, the CPU costs are not entirely negligible, in the sense that we can just ignore them, but if your cost model only includes how many blocks of data I read and write from disk, that's probably going to get you 90% of the way there, right? Who cares what you actually do once you bring the data into memory? Getting it from disk is the most expensive thing. Now, obviously, it's less of an issue on modern SSDs. But definitely, if you have a spinning-disk hard drive, accounting for the difference between sequential IO and random IO makes a big difference. And a lot of times, again, you see in disk-based database systems, especially older ones, algorithms that are designed to maximize the amount of sequential IO they're doing, right? So the important thing to understand about a disk-based system is that the database system has complete control over what's in memory, over its buffer pool management. Assuming you're not using mmap, which we can discuss later, we know exactly, in our databases, how we're bringing blocks in, how we're writing blocks out, and what algorithm we're using to decide what data is cold and what data we want to evict, right? So we have complete control over this, and we can include this in the calculations in our cost model.
We know that if we're going to give this query a certain amount of memory in the buffer pool, and it therefore has to read this amount of data, whether that's sequential or random IO, we can take that into account in our cost model to make estimations of what we think the query is actually going to do, all right? What I'm describing here is how you would do this on a single-node disk-based database system; if you're a distributed database, just replace the word disk with network, right? It's basically the same thing, and you have the same issues. So I want to give a quick example of what Postgres does. The reason why I always like to use Postgres as an example for describing how a real system actually implements this is that, in my opinion, it's almost like a textbook implementation of a database system. If you take any introductory class and look at the textbook we use, the way they describe the algorithms in the textbook is exactly how it's actually implemented in Postgres. So for Postgres, they use a combination of CPU and IO costs that are weighted by what are called magic constants, magic weights. And the reason why you do this is because the actual costs depend on what the hardware can do. So they'll say sequential IO is X times better than random IO, and working on data in memory is X times faster than sequential IO, right? With the default configuration, they are obviously targeting a database that's on disk, where you don't have a lot of memory. So again, they want to use that in their cost model to account for what's actually going on. But the interesting thing is that they expose this to you as an administrator; you can actually tune these weights yourself, right? The defaults assume that operating on data in memory is 400x faster than reading from disk, and that sequential IO is 4x faster than random IO. You can go into the documentation to see how to change these costs. But then they have this nice little warning that basically says, if you start mucking around with these, you could have problems, right? Because again, these are highly dependent on what the actual workload is, what the data looks like, and what your hardware actually looks like. It's really hard to get this right. So most people don't tune these things, because they don't know what they're doing and don't want to cause regressions on some portion of their workload. To see what a commercial system does, we can look at what DB2 does. Of the three major systems, DB2, SQL Server, and Oracle, DB2 is actually the most open about discussing what their query optimizer does and what their cost model actually does, right? As far as I know, there are no major publications from Oracle or SQL Server that talk about what they do; we just sort of know some things from talking to people there. But IBM is actually pretty good about discussing this. And this all comes from a presentation from Guy Lohman, who was the guy behind the Starburst stuff that we talked about two classes ago. So the DB2 cost model is a combination of all of these things. First of all, they look at what the database actually looks like: the schema, and the statistics they collect on the tables, the columns, and any indexes.
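Just to make those Postgres weights a little more concrete before we get into DB2, here's a minimal sketch of how this kind of weighted cost formula works. The constant names and defaults mirror Postgres's planner settings (seq_page_cost, random_page_cost, cpu_tuple_cost), but the formulas here are simplified for illustration; this is not the planner's actual code.

```python
# A minimal sketch of a Postgres-style weighted cost model. The constant
# names mirror Postgres's planner settings and their default values; the
# formulas are simplified for illustration, not the planner's actual code.
SEQ_PAGE_COST = 1.0     # cost of reading one page sequentially (the baseline)
RANDOM_PAGE_COST = 4.0  # a randomly-read page is assumed to be 4x more expensive
CPU_TUPLE_COST = 0.01   # processing one tuple in memory (~400x cheaper than random IO)

def seq_scan_cost(num_pages: int, num_tuples: int) -> float:
    # Read every page sequentially, then pay a CPU cost for every tuple examined.
    return num_pages * SEQ_PAGE_COST + num_tuples * CPU_TUPLE_COST

def index_scan_cost(matching_pages: int, matching_tuples: int) -> float:
    # Assume every matching heap page is fetched with a random read.
    return matching_pages * RANDOM_PAGE_COST + matching_tuples * CPU_TUPLE_COST

# The numbers only mean something relative to each other, not in seconds:
# the optimizer just picks whichever plan has the lower total.
print(seq_scan_cost(num_pages=10_000, num_tuples=1_000_000))       # 20000.0
print(index_scan_cost(matching_pages=500, matching_tuples=5_000))  # 2050.0
```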
But then DB2 also, when you turn the system on, runs these little micro-benchmarks that stress test the CPU, the disk, memory, and the network if it's a distributed system, and they use that to essentially generate the weights that I was showing you on the last slide for Postgres. So in Postgres you have to set these weights manually; DB2 tries to figure this out for you, and tries to be really fine-grained based on what your hardware can actually do. But then also, when they do cost estimation for the query that shows up, they don't just look at your query in isolation, by itself. They also account for what other things are running at the same time, and they use that to determine what effect those concurrent operations have on your particular query. Like, if you're running by yourself, then you can take all the memory in the world and you run really fast. But if a bunch of other queries are running at the same time, you're not gonna get all the memory, so they take that into consideration when they estimate the cost for your query plan. Right, so again, this is way more sophisticated than what any of the open source systems do. And the Postgres one, as far as I know, at least the last time I looked, is much better than the MySQL one. It's probably the best open source query optimizer and cost model that's out there today, at least as of a year ago when I looked; I can't think of anything else that's better. All right, but we care about in-memory databases. So what do they do? From looking at the documentation and the manuals of the major in-memory database systems, as far as I can tell, everyone does what I'm describing here. Basically, the disk is gone, so you don't care about disk IO. Yes, you have to write to the log, but that's incidental, right? That doesn't affect whether you choose one particular query plan versus another; like, if I update a table, who cares, the disk is not considered in the cost of writing out that log record, right? So at the end of the day, all of the in-memory databases use a combination of primarily how many tuples are being processed by an operator and how many tuples it's generating, plus some basic weights to say hash joins are better than nested loop joins. But the number of tuples in and out is the major factor. The reason why you can't do anything more fine-grained, or account for how much memory I'm going to be able to use for my query the same way you could in a disk-based system, is that we have no control over the cache management in our systems. That's all done by the CPU. We can provide hints to it, we can prefetch things, we can try to pin things in our caches, but at the end of the day, the CPU decides on its own what gets moved in and out of the cache. So we don't even bother with any of that. We just try to estimate how much data my operators are going to read and write, and it turns out that ends up being a, we'll say, reasonable estimation of the CPU resources I'm going to use to run my query. And this is also not accounting for whether I want to run with one thread versus a hundred threads, right? That's a separate policy that's different from what we're trying to do here; it's almost like admission control or resource management for the system.
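To make the tuples-in/tuples-out idea concrete, here's a minimal sketch of how an in-memory cost model could roll up a plan. This is my own toy illustration, not any particular system's code; the operator weights, selectivities, and table sizes are all made up.

```python
# Toy in-memory cost model: cost is roughly the number of tuples an operator
# touches, scaled by a per-operator weight. Weights and numbers are made up.
OP_WEIGHTS = {"seq_scan": 1.0, "hash_join": 2.0, "nested_loop_join": 4.0}

class Op:
    def __init__(self, name, selectivity=1.0, base_tuples=0, children=()):
        self.name, self.selectivity = name, selectivity
        self.base_tuples, self.children = base_tuples, children

def estimate(op):
    """Returns (total plan cost, estimated output cardinality)."""
    tuples_in, child_cost = op.base_tuples, 0.0
    for child in op.children:
        cost, out = estimate(child)
        child_cost += cost
        tuples_in += out                      # child output feeds this operator
    tuples_out = tuples_in * op.selectivity   # the estimate everything hinges on
    my_cost = OP_WEIGHTS[op.name] * (tuples_in + tuples_out)
    return child_cost + my_cost, tuples_out

# Scan two tables (one with a filter), then hash join them.
plan = Op("hash_join", selectivity=0.05, children=(
    Op("seq_scan", selectivity=0.10, base_tuples=1_000_000),  # filtered scan of A
    Op("seq_scan", selectivity=1.00, base_tuples=50_000),     # full scan of B
))
print(estimate(plan))  # if the 0.10 selectivity guess is wrong, everything above it is too
```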
So typically, as far as I know, you don't see that degree-of-parallelism decision accounted for in the cost model; that sort of happens after the fact. And even if you're an in-memory system, if you're something like MemSQL, since they're distributed, go back to the last slide: they care about the network, so they have to account for that. The other stuff still matters, but the network is the major factor. All right, so I want to give an example of what was a sophisticated cost model from the 1990s, done in Smallbase. It actually does some of the things that we talked about with DB2. They had this sort of two-phase approach where, as the developers at the company were building Smallbase, they would try to identify all the low-level primitives that occur during query execution. These aren't primitives in the same sense as the vectorized primitives we talked about before; it's low-level things like I read an index entry or I write to a tuple. So they enumerate these low-level operations, and then they create micro-benchmarks that let them sort of simulate each of these different operations. Then at runtime, when you deploy Smallbase, it takes all these micro-benchmarks, runs them when you turn the system on, and collects some profile information about how fast your machine is. And then in the cost model, when you look at a query plan, you count up the number of these low-level operations or primitives that it's doing, multiply by the micro-benchmark results that you collected, and that's how they determine the cost of executing a query plan. So again, Smallbase was this early in-memory database system prototype out of HP Labs. HP Labs then spun it off as a separate company called TimesTen, and then TimesTen got bought by Oracle in 2006. It still exists today, you can download it. Our old system used to beat it; we haven't tried it against our new system yet. Oracle primarily sells it as, like, a CPU cache, or sorry, an in-memory cache for the main Oracle flagship database, but you can still run it as a standalone system. But as far as I know, again from looking at the documentation, what they did in this paper in the 1990s, they don't do in the real system today. They do what I described before: they just estimate the number of tuples in and out for an operator. This approach is similar to BogoMIPS, if you know what that is in Linux. When you boot up Linux, it runs this little micro-benchmark called BogoMIPS that tries to approximate how fast your CPU is, and it uses that for timing decisions for interrupts and scheduling. If you ever look at /proc/cpuinfo, you'll see, here's my Intel CPU, here's the model number, here's how many cores I have, and there'll be a little entry called BogoMIPS. That's used as an approximation of how fast your CPU is. So I said that the most important thing we're gonna care about is the number of tuples in and the number of tuples out, or processed, by an operator. So now we gotta figure out how we're actually gonna estimate that. The way we're gonna do this is to try to estimate the selectivity of an operator, which is the percentage of the tuples that are fed into it that will then be emitted as output.
So if I'm given 100 tuples and my selectivity is 10%, then I'm gonna emit 10 tuples. The way you traditionally do this in a database system is to model it as the probability that a predicate will be satisfied for a particular tuple. And the way we can generate these probability estimations is through a combination of different techniques. We have domain constraints: this would be something like, if we know the value range of a particular attribute ahead of time, like if it's an enum field, we know that the cardinality is fixed. We can rely on pre-computed statistics that we generate and put into our data blocks and our tables; we talked about these before when we talked about compression, these are like zone maps. So I can pre-compute aggregations for different columns in a block, and then when I wanna figure out the likelihood that my predicate's gonna match a tuple within a block, I can look at the zone maps and maybe derive some early information, like a min/max value. We can also use histograms or approximations, which I'll talk about in a second, and then we can also do sampling. No one of these is strictly better than another; you kinda wanna use a combination of all these things if you can to try to estimate selectivity. The more information you can get, the better. So the number of tuples that we're gonna process will be dependent on three different things. First is obviously the access method that we're using to read tuples from the table. If it's a sequential scan, then we're gonna potentially look at everything, unless we can push down a limit clause. If it's an index scan, then we can be more fine-grained and look at a subset of the total key space. Then there's the actual distribution of the values in the database's attributes. This is where the estimation stuff comes into play, to tell us, for a given predicate, what's the likelihood or probability that there are values that would satisfy it. And of course, there are also the actual predicates themselves, right? If I'm doing an equality predicate on a unique column, then I know my selectivity is gonna be one over the number of tuples that I have, because only one tuple can match. But if I start doing a range scan, then this becomes more problematic to estimate, right? So the main takeaway is that for simple predicates, like something equals something, we can do a reasonably good job of estimating that. If you start throwing in inequalities, range scans, disjunctions, then it becomes really hard. And again, the more data you can get, the better. So one approach, again, in the intro class we teach you how to generate histograms, right? Basically, you run the analyze operation in your database system, and that does a sequential scan of your table and computes some kind of histogram based on the values it actually sees. So that's the standard technique. What's become more prevalent in recent years is, instead of exact histograms, to generate what are called sketches, which are approximate data structures that can give you a hint about what is contained in the data. Think of how a Bloom filter is an approximate data structure: it can give you false positives, but it'll never give you false negatives.
And it won't tell you what values or keys are actually in the Bloom filter; if you just ask it whether something exists, it'll give you a true or false. These sketches are a little bit more complicated: rather than giving you a simple true or false, they can actually tell you things like what values could exist, right? But again, you could have false positives. Actually, for some of these I don't know whether you can get false negatives as well, right? So there's a bunch of different kinds of sketches that people can use. The reason why I bring this up is that a few years ago we had the CEO of Splice Machine come and give a talk in the intro class. He's a CMU alum, he's on the Board of Advisors for the Dean, and he was here in February. When he was talking about their query optimizer and their cost model, he made this comment that they did it the textbook way initially, where they just built histograms, and they used the standard equations to try to estimate the selectivity of the predicates in their operators. But then they ended up using this sketching library from Yahoo that does these approximations, right? So you get error-bounded estimates that are able to say, this is what I think it's gonna be, and here's my confidence in what I'm telling you. And they said that when they switched to using these sketches instead of histograms, the difference in the accuracy of their predictions and the robustness of the query optimizer, primarily because of the better cost model, was quite significant. I think they said that in the old system, when they were using histograms, the cost model would be okay up to around 10 tables per query, like a 10-table join. But when they added this, I think they could go up to something like 75 tables. Again, I'm not saying that's the right metric to use to determine the quality of your cost model, but that was sort of anecdotal evidence suggesting that using sketches was better than histograms. And in our old optimizer, in the old system, we ended up using some of these sketches as well. But we never vetted it; we never actually measured how much better it actually was. And again, this approach I think is also used in commercial systems. The other main technique to generate estimations of selectivity is to do sampling. The basic idea here is that rather than look at these sketches or histograms and try to approximate what the selectivity is, let me actually just take a subset of the tables I'm accessing, run my query on that, or run my scan with predicates on that, and then determine what my selectivity is. And then assume that my sample is a good approximation of what the total table actually looks like, so now I have better knowledge for making my choices in my cost model. Right, so there's basically two approaches to doing this: online versus offline. The offline approach is that in the background, every so often, you generate a read-only copy of the table that you sort of put aside, and then you measure against that sample and use that in your cost model to predict the selectivity of your predicates, right? And then you have this background job that looks and says, well, how much of my data has changed? And if it's above some threshold, then I go back and refresh my sample.
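Here's a rough sketch of that offline flavor, just to show the mechanics. The sample size, refresh threshold, and table contents are all made-up numbers for illustration, not anything a real system prescribes.

```python
import random

# Offline sampling sketch: keep a read-only sample of the table on the side,
# evaluate a predicate against the sample, and use that ratio as the
# selectivity estimate. Sizes and thresholds here are made up.
SAMPLE_SIZE = 1_000
REFRESH_THRESHOLD = 0.10  # rebuild the sample once >10% of the table has changed

class TableSample:
    def __init__(self, table_rows):
        self.table_size = len(table_rows)
        self.sample = random.sample(table_rows, min(SAMPLE_SIZE, self.table_size))
        self.rows_changed = 0  # bumped by the DML path (not shown here)

    def estimate_selectivity(self, predicate) -> float:
        matches = sum(1 for row in self.sample if predicate(row))
        return matches / len(self.sample)

    def maybe_refresh(self, table_rows):
        # Background job: rebuild the sample if enough of the table changed.
        if self.rows_changed / self.table_size > REFRESH_THRESHOLD:
            self.__init__(table_rows)

# Estimate the selectivity of "price > 400" without scanning the whole table.
rows = [{"price": random.uniform(0, 500)} for _ in range(100_000)]
sample = TableSample(rows)
print(sample.estimate_selectivity(lambda r: r["price"] > 400))  # roughly 0.2
```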
The other approach is to do this sampling in an online fashion: when the query shows up and you go into your cost model, you actually go to the real table, run a little bit of your query right then, and determine what the selectivity of your predicates is. And obviously this becomes tricky, because you're trying to do a cost estimation for running your query, but now you actually run some subset of the query on the table, and that could end up being slow. You also don't want to slow down other transactions or queries that are running at the same time, so you want to make sure you run this at read uncommitted, without acquiring any latches or locks, to avoid interfering with anybody else. The way to think about the sampling approach is: rather than doing any joins, I just pick out the predicates from my where clauses or my join clauses and run those on the tables themselves to compute this. Of course, this becomes hard if you have joins, because the selectivity of the join is tough to compute unless you actually start joining things. The extent to which different systems use sampling aggressively will vary. And this is where we get into the fact that the commercial systems are doing things they don't really talk about publicly, so it's hard to know what they're actually doing. But I know that in the case of the paper you guys read, SQL Server performs the best, and SQL Server is doing a combination of histograms or approximations plus sampling. Okay, so any questions about this? So we have to use either histograms, sketches, or sampling to figure out the selectivity of these predicates, and then we can use that to compute the cardinality of our operators, which is the amount of data being generated as output. And the reason why I want to know what that is, is because it's then being fed as the input to the next operator. The problem is gonna be that the more inaccurate our estimates are for the selectivity of the predicates and the cardinality of the operators at the lower parts of the tree, the more that gets amplified as you go up. Because if I'm off at the bottom leaf node, as I go up, that error gets carried over and becomes even more off. Again, this is what you saw in the paper you guys read. So the textbook way you would actually compute the cardinality is by modeling the selectivity, as I said, as a probability, and then you make the following three assumptions about those probabilities to be able to compute what the cardinality and the selectivity are gonna be, right? And I would say what I'm describing here is what we teach you in the intro class. If you go read any database textbook, this is what they would tell you, and hopefully they would add the caveat that this goes real wrong real quick. But people still do this, because there's not much else, right? The only way to get the exact cost of a query is to actually run the query, but that's obviously super slow, so you can't do that. So you make these assumptions, these trade-offs in accuracy, just to get something that works reasonably well. The first assumption we're gonna make, in many cases, is that we have a uniform distribution of values in our attributes, meaning the probability that a given value appears in my column is the same for all values.
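Concretely, the textbook formulas under these assumptions look something like the sketch below. The notation follows the usual intro-class treatment, where V is the number of distinct values of an attribute from the catalog; the numbers plugged in at the end are made up for illustration.

```python
# Textbook selectivity/cardinality estimation under the three standard
# assumptions. V = number of distinct values of an attribute (from the catalog).

def eq_selectivity(distinct_values: int) -> float:
    # Uniformity: every value is equally likely, so sel(A = c) = 1 / V(A).
    return 1.0 / distinct_values

def conjunction_selectivity(*sels: float) -> float:
    # Independence: sel(P1 AND P2) = sel(P1) * sel(P2).
    result = 1.0
    for s in sels:
        result *= s
    return result

def join_cardinality(card_a: int, card_b: int,
                     distinct_key_a: int, distinct_key_b: int) -> float:
    # Inclusion: every join key on one side has a match on the other, giving
    # |A join B| = (|A| * |B|) / max(V(A.key), V(B.key)).
    return card_a * card_b / max(distinct_key_a, distinct_key_b)

print(eq_selectivity(10_000))              # 0.0001
print(conjunction_selectivity(0.1, 0.01))  # 0.001
print(join_cardinality(1_000_000, 50_000, 10_000, 50_000))  # 1,000,000.0
```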
Of course, we know that this uniformity assumption is not true, right? More people live in New York City than in Montana, so I can't assume that a Montana zip code has the same probability of occurring in a people database as a New York City zip code. The way you typically get around this is to maintain a separate data structure for what are called heavy hitters. Think of this as just a little hash table on the side that says, here are the top 10 or top 20 values that occur in my column, because more often than not, I'm more likely to query those values, and therefore I can do a lookup in this heavy hitter table and get more accurate estimates. But everything else is assumed to be uniform. The next assumption is that all our predicates are independent. Again, we're modeling these as probabilities, so that means we can just multiply them together and that's gonna produce the true selectivity of our predicates. Of course, that's not gonna work; we'll see that in the next slide. And the last one is that we're gonna assume that the join keys in the inner relation that we're trying to join on will always exist in the outer relation, right? This obviously doesn't always hold, and if you start doing left outer joins or right outer joins, this becomes problematic. But again, this is another major assumption that people have in their cost model. So let me show an example of how this all goes bad, with this great little vignette of a simple problem that was in a blog post from Guy Lohman. I like this because it illustrates exactly the problems that you have when you make all those assumptions. Say you have a simple database keeping track of cars. In your database, you have ten different, sorry, ten makes: Tesla, Ford, Honda, Toyota, and so forth. And then you have a hundred different models, so a Ford Escort is a model, a Toyota Corolla is a model; you have a hundred of those. Now say you have a query that says where make equals Honda and model equals Accord, right? So you have a conjunction of two equality predicates. If you make the independence and uniformity assumptions from the last slide when trying to estimate the selectivity of this predicate, then you're gonna end up with one over ten, because we have ten makes: it's an equality predicate, Honda is one make, so it's one out of ten. And we multiply that by one over a hundred, because again, we have a hundred different models and Accord is one model, so it's one over a hundred. So in this case, the selectivity is estimated to be 0.001. But in reality, we know that Honda is the only one that makes the Accord. So it's not one over ten times one over a hundred; these columns are actually correlated. If you have an Accord, then you have a Honda Accord, right? So the true selectivity is one over a hundred. Again, that independence assumption from the last slide makes us an order of magnitude off in our estimation. And now, if I'm an order of magnitude off in the selectivity of some lower operator, and I feed that into another operator that's also gonna be another order of magnitude off, then I'm really screwed, right? So the way to solve this problem, and as far as I know this feature only exists in the commercial systems, is what are called column group statistics, meaning statistics maintained on combinations of columns rather than on individual columns.
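Here's that arithmetic again in the same toy style as before, just to show what keeping a distinct count for the column pair buys you. The counts are the ones from the example; the code itself is only an illustration, not how any particular system stores these statistics.

```python
# The Honda/Accord example: 10 distinct makes, 100 distinct models, and
# (since each model belongs to exactly one make) 100 distinct (make, model)
# pairs. Counts are from the example above; the code is just an illustration.
V_MAKE, V_MODEL, V_MAKE_MODEL = 10, 100, 100

# Independence assumption: multiply the per-column selectivities.
independent_estimate = (1 / V_MAKE) * (1 / V_MODEL)  # 0.001 -- 10x too low

# Column group statistics: keep a distinct count for the column *pair* and
# use it directly, so correlated columns aren't double-counted.
column_group_estimate = 1 / V_MAKE_MODEL              # 0.01 -- the true selectivity

print(independent_estimate, column_group_estimate)
```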
With column group statistics, you're basically telling the database system that these columns are correlated, and therefore it should maintain statistics about them together. And when it computes the selectivity during its cost model estimates, it should treat them as being correlated and not assume that they're independent, right? So basically, the way it works is that the DBA has to come in and tell the database system, hey, these two columns are correlated. And then, if the system supports that, it can update its cost model appropriately. Again, as far as I know, only DB2 and Oracle do this; MySQL and PostgreSQL don't do this, and none of the other major open source systems do either. Now, you may be thinking, why is this manual? Why does the DBA have to tell the database system? Why can't we do this automatically? Because it's hard, right? Think about how you'd figure this out. In this example here, it's super easy, because I only have 10 makes and 100 models. But say I have a billion tuples and a really wide table with a thousand columns. Now I gotta go look at every single combination of different columns and all the different possible values that can occur. This thing's gonna blow up; the search space is just massive. So that's why you can't do this automatically. Now, there may be ways to approximate this automatically, but again, as far as I know, no major system can do this automatically for you; you have to tell it ahead of time. We can talk offline about whether deep nets help with this. There are people looking into this now, but nothing exists yet. All right, so let's look now at how these estimates can really foul us up when we start throwing joins into our query plan. Here we're doing a simple three-way join on A, B, and C, where A.id equals B.id and A.id equals C.id, and then B has a filter where the id is greater than 100, all right? Assume here that there's a filter predicate, a filter operator, in between the scan of B and the join; for simplicity I'm just showing the filter being done down here at the scan, right? So the first thing we need to do is compute the cardinality of the access methods that are retrieving tuples from the tables. For A and C, there's no filter, so the cardinality of the operators scanning A and C is just the number of tuples in each table. That's easy to compute; I can get it from the catalog. For B, it's gonna be the number of tuples in B multiplied by the selectivity of our predicate here, all right? So now we feed the output of these scan operators into our join, and we need to compute the cardinality of our join operators. A and C are simple, because there's no filter, but if the estimate for B is wrong, then I'm feeding an incorrect number of tuples into this join operator. And then I gotta figure out what the selectivity of the join on these two attributes is. Assume I don't have a foreign key, assume these columns aren't correlated, like I know nothing about them. So now I gotta figure out, for every single value that exists in A.id, how many values are gonna match it in B.id? I try to figure this out with my histograms, but that's gonna be hard. And then I take the output of this estimate, and that gets fed into the estimate for the next join up.
The scan on C is just reading the table, so that's super simple. But now I have the same problem as before: I gotta figure out, well, what's the likelihood that a tuple that matched A.id equals B.id down here will get fed up into this operator and then also match on C.id, right? So again, if I get that wrong, this gets wrong, and then this gets even more wrong, right? This is what we're dealing with; the math is just not in your favor here. Regardless of whether we're treating these things as probabilities or not, these estimates are always gonna be wrong, and it gets worse as we go up. And again, you saw this in the join order benchmark in the paper you guys read: the more tables you add, the worse it gets, for exactly this reason, because we're just piling errors on top of errors up the query plan. All right, so the paper I had you guys read came from the HyPer team in Germany. It's an evaluation of how accurate the cardinality estimates for operators are across a variety of different database systems, and they use that to figure out just how bad things can actually get, how far off you are from the real values. The paper does not tell you how to build a cost model; it roughly sketches out how to do that, but it's really showing you what happens when the estimates in your cost model go wrong. And then they propose some design principles for building a database system, about what you should focus on to make the system more robust to deviations or errors in your cost model. So in the paper, they propose a new benchmark called JOB, the join order benchmark. This one is actually based on, I think, the IMDB data set, from the movie website. So this is a real data set, and the tables are actually going to have skew based on a real-world distribution, right? Unlike TPC-H and TPC-C, which have uniform distributions, where the likelihood of one value occurring for a given column is the same for all values, the join order benchmark has real skew in it. They generate a bunch of different queries that join more and more tables, and then they measure what the cost model thinks the cardinality of the operators is going to be and compare that with what the real data actually looks like, right? For this, the way they set it up is that they load the data in once and then they run Analyze, which, again, fires off the background job to go scan all the tables and let the system compute whatever statistics it wants to compute. That's the best-case scenario: there are no updates, I load the data in, and I run Analyze. And they want to see, in this best-case scenario, how wrong things get. So I want to show this one graph here, because it's the most important one. The way to understand it is that the y-axis is how far away you are from being exactly accurate in the cardinality estimations. This middle line here is when you're perfectly accurate; if you're above it, you're overestimating, and if you're below it, you're underestimating, right? And the x-axis is the number of tables they're joining in the query, right?
So the first thing to point out is that the overall trend seems to be that everyone is underestimating, right? As you add more tables, everyone sort of starts to go down. All right, so where does that come from? Well, it comes from those assumptions we talked about: if you assume that things are independent, then you assume your selectivity is going to be way less than it actually is. So that's why they're all underestimating from what the actual value should be. The first one to point out is this one system here, which actually does reasonably well up to about three or four tables, right? Across all of these, it had the tightest bounds around being 100% accurate versus all the other ones. These three systems here essentially all have the same trend, but the bounds vary. Right, and again, as you add more tables, the estimation gets worse, and at six tables, it's really bad. Then you have this middle one here that's just way off right away. It's okay with two tables, and then after that, it's underestimating way more than everyone else. So I think in the paper, they tell you that the first one and the last one are Postgres and HyPer, right? And I think I said last time who the best one here is, right? Take a guess what this one is. Microsoft SQL Server, okay. Take a guess what these other two are. Which one's Oracle? This one here? You said it was this one? All right, raise your hand if you think Oracle is this one. Raise your hand if you think Oracle is this one, okay? And then we need a last system, DB2, okay. So you think this is DB2 and this is Oracle? Okay, there you go. So, let's focus on SQL Server. SQL Server is Cascades, but again, the cost model is independent of whether you're doing Cascades or bottom-up, right? But they're doing sampling, they're doing histograms, all right? And SQL Server is just really good. Of these three systems, in my opinion, at least based on what they've talked about publicly, it's the most state-of-the-art, leading-edge system. Oracle and DB2 are good, they're not bad systems; it's just that SQL Server, based on what they've talked about publicly, seems to be way farther ahead than everyone else. Again, I would say that of all database systems, open source and commercial, SQL Server probably has the best query optimizer, and it shows here. But let's now look in a little more detail at what actually happens when you get correct estimates. For this one, they instrumented Postgres 9.4. They modified the system so that rather than using histograms to estimate the cardinality of an operator, whatever function says, go estimate what the cardinality is, is replaced with a magic oracle that always gives back exactly the correct answer every time, right? So the way to read this graph is: how does the performance of a query using the built-in cost model, with all its approximations, compare against the performance of the same query when you have the exact cardinality estimates?
So if you're in this band here, 0.9 to 1.1, it means that the performance of the query using the approximations is roughly the same as what you get when you have the exact cardinalities. Over here, you're getting way, way slower as you go in this direction, and the height of each bar is what percentage of the total queries they gave to the system fall into these different buckets, right? So this shows you that when you're using estimations, 60% of the queries are 1.1x or more slower than the same queries when you have the true cardinalities, the exact estimates, right? And what's going on is that Postgres is underestimating the cardinality of its operators. It thinks that the number of tuples it's going to pass from one operator to the next is lower than it actually is. And they use these cardinality estimates to size the hash table when you do a hash join, right? Or to decide whether you even want to do a hash join at all. If you're only going to access a small number of tuples in a join, then a nested loop join is super fast, because you don't have to set up a hash table, you don't have to build it, you don't have to probe it. You just do these two little for loops over, like, four tuples in the outer table and the inner table; that's as fast as you're ever going to get. So Postgres, because it's underestimating the cardinality of its operators, ends up saying, this operator only has ten tuples, use a nested loop join, right? And then it starts running and it says, shit, this is not ten tuples. But by that point, the query plan is baked, right? You can't switch back to the hash join. Why not? For that you'd have to do adaptive query processing, right? You have to be able to say, my query optimizer made a mistake; go back and rerun the optimizer and get a new plan based on what data I've seen so far. But the nested loop join and hash join get the same result. Say it again? The nested loop join and hash join get the same result. His statement is that the nested loop join and the hash join get the same result. Right, so one way you could do this is, if I see that I thought I was going to have, like, ten tuples, and now I have one billion tuples coming in, then before I even start the nested loop join, I could say, don't do the nested loop join, switch over, do a hash join. You could do that in some cases; it depends on how they're pipelining tuples. But they don't. It's an engineering thing, right? Once the optimizer makes the decision, they go with it. Now, adaptive query processing, or adaptive query optimization, is when you say, I think the optimizer made a mistake, go back and re-generate the query plan, and now you have to get into this world of, do I throw away all the data I've processed so far and start from scratch, or can I come up with a new query plan that can use some of the data I've already processed and just continue with that? That would essentially do what you're proposing. But as far as I know, Postgres doesn't do that. I don't actually know what SQL Server and Oracle do; I think they throw everything away and start over. You primarily do that for join ordering. For switching between a nested loop join and a hash join, there's no reason you couldn't do it; Postgres just doesn't do it, I think for engineering reasons. Okay, so in this case here, again:
Things are running slower because a lot of queries end up being nested loop joins when they really should be hash joins. So the next thing they did was, in Postgres, you can pass a flag to say, sorry, yes, question? The question is that in the leftmost part of this graph, some queries actually run faster with the estimates than with the true cardinalities. Yes, but I don't know why. We can ask Viktor. The suggestion is that the error went one way in one place and the other way in another, and two wrongs made a right: the error from the first join, which overestimated, fed into the second join, which underestimated, and it came out roughly correct. Now, why this runs faster, around 0.3x, I don't know, but it's a pretty small amount, one or two percent of the queries. All right, so what I'm saying here is that the first reason why things are running slower is that it's picking a nested loop join when it should be picking a hash join. So in Postgres, you can pass a flag in your session, like you open up the terminal and you say something like set enable_nestloop to off, and you can tell the optimizer to not even consider a nested loop join for your query. So in this case, it only considers a hash join. But now you see that you're still getting worse performance than where you should be if you have the true cardinalities, right? It did help: fewer queries are getting picked as nested loop joins, but some queries are still running slower. In this case, it turns out that in this version of Postgres, they were using the cardinality estimates from the operators to decide how to size the hash table when you do the hash join, right? You say, my cardinality estimate is 100, so make sure I have a hash table that has, say, 200 slots to put things in. And if you undersize the hash table, what happens is, as we talked about, I think they were using a chained bucket hash table, so you end up with these really long bucket chains, and it essentially becomes a sequential scan every single time you do a probe, right? So again, if you had correct estimates about what the cardinality was, then you could say, all right, my hash table needs to be this big, because the data set coming in is gonna be this big, and then you don't worry about these long sequential scans. So to prove that this was actually the case, okay, this was Postgres 9.4; in Postgres 9.5, they added the ability to dynamically resize the hash table if your estimates are wrong. So they back-ported that feature from Postgres 9.5 into 9.4 and re-ran the same experiment. And now you see that there are more queries that are getting closer to where you should be if you have the true cardinalities, right? So this is a good example showing that it's not just about picking whether one operator should be a hash join versus a nested loop join; what you actually do inside that operator can be greatly affected by the estimates that your cost model is making, right? So Viktor, he and I were talking about this, and he sent me a synopsis of what he thought were the most important things that came out of this experiment, or this study that he did.
The first is that they felt that having an accurate cost model in your query optimizer ended up, in some cases, being more important than having the fastest engine in the world, right? Like, who cares that we can do SIMD processing or query compilation and all the other tricks we've talked about the entire semester? If our optimizer is total shit and we're picking the worst join orders and we're always doing nested loop joins, then who cares how fast the engine is on simple examples? When we actually throw real-world data at it, we're just gonna get crushed. So being able to pick the right join order is probably the most important thing out of all of this. The other thing they found was that the cardinality estimates are always gonna be wrong. So when you actually implement the operators in your system, you wanna make them adaptive, so they don't have to rely on the estimations coming from the query optimizer. You just assume those are gonna be wrong, and as the data comes in, you wanna be able to adapt whatever your system is doing to account for the actual data it's seeing. So you'd resize the hash table automatically, rather than taking whatever the cost model gives you and fixing the size to be exactly that. And we're doing this now in our own system. We haven't pushed it to the master branch yet; this is in the separate LLVM branch. But we can reorder predicates: some predicates are more selective than others, and the cost model might think that you wanna execute your predicates in one order, but we can shuffle things on the fly as we see real data. The other thing he said, and you sort of see this in the design of HyPer, is that having an engine that can do fast sequential scans and fast hash joins may actually end up being better than having all of these fancy indexes to do joins and lookups, right? Because the more indexes you have, the more complex the estimations have to be in your cost model. So rather than try to account for any of this, you just say, screw it, I'm gonna sequentially scan everything and just rip through the data really fast, and who cares whether an index would have helped me here. For analytical queries, I maybe agree with this for some things, maybe not for others; I have mixed feelings about this one. The last one he said was that trying to have a more fine-grained cost model, like using the micro-benchmark stuff that I talked about at the very beginning, trying to profile the hardware and include that in your estimations, they felt was a waste of time. Just having better cardinality estimations is the most important thing. That means having better statistics, better sketches, whether you're throwing deep nets in there or not; we can talk about that separately. But getting this right is more important than having more fine-grained accuracy about what the hardware can actually do, okay? So I thought these were pretty useful, right? Again, some of these we've adopted in the design of our system, others not so much. All right, so the last thing I wanna talk about, and this will lead us into what we'll talk about on Monday, is this project from IBM from, I guess at this point, 2001, so 18 years ago, this thing called LEO.
So you may be thinking, and it's sort of what was being suggested earlier, that if my cost model is gonna make these estimates and they're gonna be wrong, then as I run my query and I see that the data is actually completely different, can't the system just fix itself? Can't I also go fix the estimates, right? And this is what IBM was trying to do with this thing called LEO, the learning optimizer. The idea is that my query shows up, I run it through my query optimizer, and the cost model generates estimates about what it thinks the data looks like. I then run that query, and I observe whether the estimates of the cardinality and the selectivity of my predicates match what I'm actually seeing in the real data. And when the query finishes, they check whether the estimates differ from the real data, and if they do, they feed the real data back into the optimizer, so that when the next query shows up, it can rely on the data they've already collected from the previous query, right? So it seems like this would solve the problem, right? Our estimates are wrong, we run on the real data, we see what actually happens, and we correct ourselves. But I would say that every DB2 administrator I've ever talked to and asked about this always says that the first thing they do when they install DB2 is turn all this off, because it never worked. And Oracle has something similar for memory management, like an automatic memory manager for the buffer pool, and we talked to some DBAs a few weeks ago, and they're like, yeah, we turn that off immediately. All this automated stuff, no thanks, right? I never found out exactly why; I've heard there were some engineering difficulties at IBM in getting this thing to actually work the way they wanted it to work and fit into the rest of the system correctly. I think this didn't work because of engineering reasons, and it's a shame, because it seems like this would solve our problems. But because of this, I think a lot of commercial systems have been hesitant to adopt similar techniques, just because this thing sort of fell apart. But this leads us into what we're going to talk about on Monday, the sort of revival, or not a revival, but a continuation of a long history of trying to do automated tuning in database systems to alleviate some of these difficulties, right? Now what's in vogue is machine learning, and people are trying to apply machine learning to these problems. But it actually goes back to the 1970s; people have been working on this problem for a long time. So this is just one example of it that didn't work. Okay, so the main takeaway today is that for an in-memory database, being able to estimate the number of tuples that we're going to process per operator is going to be a reasonable approximation of what the execution cost of a query is going to be. Of course, that means that in order to estimate the number of tuples you're going to process per operator, you've got to know the cardinality of your child operators. And as we showed, things can go bad pretty quickly. I think the way Microsoft does its cardinality estimation, using a combination of histograms, sampling, and sketches, is the right way to do this. And we tried doing this in our own database system, but we never vetted it to determine whether it was actually accurate.
What will happen is that in the summer, when we bring over the old optimizer from Peloton into the new system, we're probably not going to bring over the cost model. We'll start with a simple cost model that basically says, if I have an index, my cost is zero; if I have a sequential scan, my cost is one. That's the simplest cost model you could ever have. It won't handle join ordering for us, but at least it'll get us started on being able to pick indexes, okay?
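In sketch form, that starting point is nothing more than something like this; the operator names here are placeholders for illustration, not our actual code.

```python
# About the simplest cost model possible: an index scan is free, a sequential
# scan costs one. Enough to drive index selection, useless for join ordering.
# Operator names are placeholders, not any real system's.
ACCESS_COSTS = {"index_scan": 0, "seq_scan": 1}

def simple_plan_cost(access_methods) -> int:
    # A "plan" here is just the list of access methods it uses.
    return sum(ACCESS_COSTS[m] for m in access_methods)

print(simple_plan_cost(["index_scan", "seq_scan"]))  # 1
```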