All right, so this is the third lecture focusing on query optimizers. This is sort of the last lecture for the core topics of the semester. At this point, everything I've covered is pretty much what you need to build a full database system from scratch. We've covered the execution engine, we've covered networking, we've covered indexes, we've covered concurrency control, and now we're finishing up with query optimizers. Like I said, this is it. You can obviously go into more detail on different parts of the system, but this is the core stuff that you would actually need if you want to build a real database system. What I'll teach on Monday is sort of an exploratory topic on automated databases, which is stuff that we do here at CMU, and then on Wednesday we'll have the guest speaker. So I'm sort of going through this list now, but this is the remaining agenda for you guys coming up. On Monday next week, I'm gonna give out the final exam, the take-home final exam. I used to give it out on the last day of class, which would be Wednesday, but the problem is that you turn it in when you do the final presentation, so if I handed it out then, you'd only have five days to do it, and that's not enough time. But please still come to that lecture. Then on Wednesday we're having Anil Goel from SAP HANA come to give his guest lecture. And again, what I like about having people from industry come and give talks is that they basically say, here are all the problems that we're facing, and it covers all the stuff we covered in the semester, which is nice. Then the second code review will be due May 4th. We've already graded half of the code reviews; I'm behind on the other half. I hope to send out emails tonight so everything is fully graded; we'll post the grades on Canvas and send you guys feedback. The final presentation is May 6th at 8:30 a.m.; we'll start at nine and we'll just come here. Then you have to do your second code review by May 11th, and the final code drop will be May 14th, and that'll include the write-up about the design document describing what you've actually done. So we're almost there, we're almost done for the semester. Any questions? Okay. All right, so today's focus is cost models. Today's focus is on how to actually build the second major component of a query optimizer: how to do estimations of what we think the system is actually gonna do when it executes a query, like how much it's gonna cost us to execute the query. The reason why we have to do this is sort of obvious. When we talked about the plan space search strategies in the last two classes, I said that there's this sort of black magic box that's gonna tell us what the expected cost of executing a particular query is. And again, it's an internal cost. It's not something that's meaningful outside of the database system. You can't take MySQL's cost model, have it spit out a number, and then take Postgres' cost model, take that number, and make a comparison. It's completely dependent on the implementation of the system, and it's really just meant for us to be able to say this query plan is better than this other query plan for some reason, which we'll describe going forward in this lecture.
So the other important thing to understand is that this cost model is independent of the search strategies we talked about last time, meaning whether we're doing top-down or bottom-up doesn't matter. At the end of the day, we still need a cost model that can make predictions about whether one plan is better than another. So if you're gonna build a cost model, how would you actually do it? Well, there's a couple of different things that you can include in it. The first is physical costs: what is the hardware actually gonna do when it executes the query? How many CPU cycles, how many cache misses, how much data are you gonna read from disk? This is obviously dependent on the machine that you're running on and its hardware configuration, because as we saw in a couple of different examples, whether you're using a Xeon CPU or a Xeon Phi, those CPUs have very different characteristics, and therefore the performance of your algorithms or your query could vary depending on what the hardware is. So this is usually pretty tricky to do for that reason: it's hard to build models based on all the possible hardware configurations that could exist. The next approach is what I'll call logical costs, where we base our cost estimates on what the operators in a query plan are going to do at a logical level. A physical-level estimate would be, again, I read this many blocks from disk; a logical-level estimate would be, I'm gonna read this many tuples from the table, or my join operator is gonna spit out this many tuples after the join, or my scan is gonna filter out so many tuples. These are independent of what algorithm we're actually using in our query plan; we're only looking at the logical operators, not the physical operators. And this should be sort of obvious, right? If I'm doing a nested loop join versus a hash join, at the end of the day both algorithms should generate the exact same result, the same number of tuples, because otherwise I have problems in my implementation. So for logical estimates we don't care whether it's one algorithm versus the other. The tricky thing, and what was in the paper you guys read and what we'll talk about going forward, is that you obviously need to be able to accurately estimate what the output of an operator is gonna be, because that's gonna be fed as the input to the next operator. In order to say this operator reads this many tuples and therefore generates this much output, you need to know what came below it in the query plan. And that, again, is gonna be the hardest thing we have to do. The last one is the algorithmic cost, the cost or asymptotic complexity of the operators themselves. This is where it actually matters whether we're doing a hash join or a nested loop join, or an index scan versus a sequential scan. And for these we can just sort of have weights to be able to say, oh, the hash join is X times better than a nested loop join and therefore prefer that. Obviously there are some scenarios where a nested loop join would be better. So as we'll see as we go along, and as I'll show in the next couple of slides, what we're primarily gonna do for an in-memory database is a combination of these last two. For the first one up here, you can do it, but it's hard.
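Just to make that logical-cost idea concrete, here's a minimal sketch of the kind of thing I mean; the operator names and weights are made up for illustration and aren't from any real system:

```python
# Minimal sketch: a "logical" cost is driven by how many tuples flow in and
# out of each operator, optionally scaled by a per-algorithm weight.
# The operator names and weights here are invented for illustration.
ALGO_WEIGHT = {
    "seq_scan": 1.0,
    "hash_join": 1.5,         # touch each input tuple roughly once (build + probe)
    "nested_loop_join": 5.0,  # penalized unless the inputs are tiny
}

def logical_cost(algo: str, tuples_in: int, tuples_out: int) -> float:
    """Cost grows with the tuples an operator reads and emits."""
    return ALGO_WEIGHT[algo] * (tuples_in + tuples_out)

# Same logical work, different algorithm weight:
print(logical_cost("hash_join", 1_000_000, 50_000))         # 1575000.0
print(logical_cost("nested_loop_join", 1_000_000, 50_000))  # 5250000.0
```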
And this is another good example of the difference between the commercial database systems and the open source database systems. At the Redshift talk on Monday, if you went to that, Ippokratis mentioned that the query optimizer in their system is notoriously difficult and they've spent a lot of money fixing it up at Amazon, and it's way better than any of the open source ones that are out there. It's commercial grade or enterprise grade, because the enterprise guys are gonna include all of these things; the open source guys probably just include the last two. So for a disk-based database system, the most obvious thing that we care about is the disk. At the end of the day, that's the most expensive thing: getting things off of a spinning disk hard drive or an SSD. In this world, the CPU costs are not zero, but they're negligible in the sense that if your cost model only includes how many blocks of data you read and write from disk, that's probably gonna get you 90% of the way there. Who cares what you actually do once you bring the data into memory? Getting it from disk is the most expensive thing. Now, obviously it's less of an issue on modern SSDs, but definitely if you have a spinning disk hard drive, taking into account the difference between sequential IO versus random IO is a big deal. And a lot of times you see that a disk-based database system, especially an older one, uses algorithms that are designed to maximize the amount of sequential IO it's doing. The important thing to understand about a disk-based system is that the database system has complete control over what's in memory, over its buffer pool management. Assuming you're not using mmap, which we can discuss later, we know exactly how we're bringing blocks in, how we're writing blocks out, and what algorithm we're using to decide what data is cold and what data we wanna evict. So we have complete control over this and we can include it in the calculations in our cost model. If we're gonna give this amount of memory in the buffer pool to run this query, and therefore it has to read this amount of data, and we know whether that's sequential versus random IO, we can take that into account in our cost model to make estimations of what we think the query is actually gonna do. So what I'm describing here is how you would do this in a single-node disk-based database system. If you're a distributed database, just replace the word disk with network; it's basically the same thing and you have the same issues. So I wanna give a quick example of what Postgres does. The reason why I always like to use Postgres as an example for describing how a real system actually implements this is that, in my opinion, it's almost like a textbook implementation of a database system. If you take any introduction class and the textbook we use in it, the way the textbook describes the algorithms is almost exactly how they're actually implemented in Postgres. So Postgres uses a combination of CPU and IO costs that are weighted by what I call magic constant factors, magic weights. And the reason why you do this is basically because it's dependent on what the hardware can actually do.
So they'll say sequential IO is X times faster, X times better, than random IO, and memory IO is X times faster than sequential IO. In this environment they are obviously targeting, in the default configuration, a database that's on disk, where you don't have a lot of memory, so they wanna account for what's actually going on in their costs. But the tricky thing is, they expose this to you as an administrator, so you can actually tune these weights yourself. The default is that memory is 400x faster than reading from disk, and sequential IO is 4x faster than random IO. You can go look in the documentation to see how you change these costs, but then they have this nice little warning that basically says if you start mucking around with these, you could have problems, because these are highly dependent on what the actual workload is, what the data looks like, and what your hardware actually looks like. It's really hard to get this right, so most people don't tune these things, because they don't know what they're doing and they don't wanna cause regressions on some portion of their workload. To give an idea of what a commercial system does, we can look at what DB2 does. Of the three major systems, DB2, SQL Server and Oracle, DB2 is actually the most open about discussing what their query optimizer does and what their cost model actually does. As far as I know, there are no major publications from Oracle or SQL Server that talk about what they do; we just sort of know some things from talking to people there. But IBM is actually pretty good about discussing this, and this comes from a presentation by Guy Lohman, who was the guy behind the Starburst stuff that we talked about two classes ago. So the DB2 cost model is a combination of all of these things. First of all, they look at what the database actually looks like: the schema and the statistics they collect on the tables, the columns, and any indexes. But then also, when you turn the system on, they run these little micro-benchmarks that stress test the CPU, the disk, memory, and the network if it's a distributed system, and they use that to essentially generate the weights that I was showing you on the last slide for Postgres. So in Postgres you have to set these weights manually; DB2 tries to figure this out for you and tries to be really fine-grained based on what your hardware can actually do. But then, when they actually start doing cost estimation for the query that shows up, they don't just look at your query in isolation by itself. They also account for what other things are running at the same time, and use that to determine what the effect of those concurrent operations is on your particular query. If you're running by yourself, then you can take all the memory in the world and you run really fast; but if a bunch of other queries are running at the same time, you're not gonna get all the memory, so they take that into consideration when they estimate the cost of your query plan. So again, this is way more sophisticated than what any of the open source guys do. And the Postgres one, as far as I know, at least the last time I looked, is much better than the MySQL one; it's probably the best open source query optimizer cost model that's out there today.
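To give a rough sense of how those weights get used, here's a minimal sketch of a Postgres-style cost formula. The constants mirror Postgres's default planner settings (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01), but the formulas and the example numbers are simplified illustrations, not Postgres's actual code:

```python
# Sketch of a Postgres-style weighted cost formula (illustrative only).
# The weights mirror Postgres's default planner constants:
#   seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01
SEQ_PAGE_COST = 1.0     # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0  # random I/O assumed 4x more expensive than sequential
CPU_TUPLE_COST = 0.01   # touching a tuple in memory is ~100x cheaper than a page read

def seq_scan_cost(num_pages: int, num_tuples: int) -> float:
    """Read every page sequentially, then process every tuple."""
    return num_pages * SEQ_PAGE_COST + num_tuples * CPU_TUPLE_COST

def index_scan_cost(matching_pages: int, matching_tuples: int) -> float:
    """Assume each matching page is fetched with a random I/O."""
    return matching_pages * RANDOM_PAGE_COST + matching_tuples * CPU_TUPLE_COST

# A full scan of a 10,000-page, 1M-tuple table vs. an index scan touching 500 pages:
print(seq_scan_cost(10_000, 1_000_000))   # 20000.0
print(index_scan_cost(500, 50_000))       # 2500.0
```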
At least a year ago when I looked, I couldn't think of anyone else that was better. All right, but we care about in-memory databases, so what do they do? From looking at the documentation and manuals of the major in-memory database systems, as far as I can tell, everyone does what I'm describing here. Basically, the disk is gone, so you don't care about disk IO. Yes, you have to write to the log, but that's incidental; it doesn't affect whether you choose one particular query plan versus another. If I update a table, the disk is not considered in the cost of writing out that log record. So at the end of the day, all the in-memory databases do a combination of primarily how many tuples are being processed by an operator, how many tuples it's generating, and some basic weights to say hash joins are better than nested loop joins. But the number of tuples in and out is the major factor. The reason why you can't do anything more fine-grained, or account for how much memory I'm gonna be able to use for my query the same way you could in a disk-based system, is that we have no control over the cache management in our systems. That's all done by the CPU. We can provide hints to it, we can prefetch things, we can try to pin things in our caches, but at the end of the day the CPU decides on its own what gets moved in and out of the cache. So we don't even bother with any of that; we just try to estimate how much data my operator is gonna read and write, and it turns out that ends up being a reasonable estimation of the CPU resources I'm gonna use to run my query. This is also not accounting for whether I wanna run on one thread versus 100 threads; that's a separate policy that's different from what we're trying to do here, and it's almost like admission control or resource management for the system. So typically, as far as I know, you don't see that accounted for in the cost model; that sort of happens after the fact. And again, even if you're an in-memory system like MemSQL, since they're distributed, go back to the last slide: they care about the network, so they have to account for that. The tuple counts still matter, but the network is the major factor. All right, so I wanna give an example of what was a sophisticated cost model from the 1990s, done in Smallbase, that actually does some of the things that we talked about with DB2. They had this sort of two-phase approach where, as the database developers were building the system, at the company that was building Smallbase, they would try to identify all of these low-level primitives that occur during query execution. They're not really primitives in the same way as the Vectorwise primitives; think of low-level things like I read from an index or I write to a tuple. So they enumerate these low-level operations and then they create these micro-benchmarks that allow them to sort of simulate each of these different operations.
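Here's a minimal sketch of where this ends up going; the runtime part is described next, and the primitive names, timings, and counts are all invented for illustration:

```python
# Sketch of the Smallbase-style approach: calibrate low-level primitives with
# micro-benchmarks at startup, then cost a plan by counting how many of each
# primitive it performs. Primitive names and timings are invented here.
import time

def calibrate_index_probe() -> float:
    """Pretend micro-benchmark: time a batch of lookups, return seconds per op."""
    start = time.perf_counter()
    d = {i: i for i in range(100_000)}
    for i in range(100_000):
        _ = d[i]
    return (time.perf_counter() - start) / 100_000

# Run the micro-benchmarks once when the system starts up.
PRIMITIVE_COST = {
    "index_probe": calibrate_index_probe(),
    "tuple_write": 1e-7,   # pretend we measured these too
    "tuple_scan": 5e-8,
}

def plan_cost(op_counts: dict[str, int]) -> float:
    """Estimated cost = sum over primitives of (count * measured time)."""
    return sum(PRIMITIVE_COST[p] * n for p, n in op_counts.items())

# A hypothetical plan that scans 1M tuples and probes an index 10K times:
print(plan_cost({"tuple_scan": 1_000_000, "index_probe": 10_000}))
```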
So then at runtime, when you deploy Smallbase, it takes all these micro-benchmarks, runs them when you turn the system on, and collects some profile information about how fast your machine is. Then in the cost model, when you look at a query plan, you count up the number of these low-level operations or primitives that it's doing, multiply by the micro-benchmark results that you collected, and that's how they determine what the cost of the query plan actually is. Again, Smallbase was this early in-memory database system prototype out of HP Labs. HP Labs then spun it off as a separate company called TimesTen, and then TimesTen got bought by Oracle in like 2006, and it still exists today; you can download it. We compared against it with our old system, but we haven't tried it against our new system yet. Oracle primarily sells it as an in-memory cache for the main Oracle flagship database, but you can still run it as a standalone system. But as far as I know, again just by looking at the documentation, what they described in that paper from the 1990s they don't do in the real system today. They do what I've described before: they just estimate the number of tuples in and out for an operator. This approach is similar to BogoMIPS, if you know what that is in Linux. When you boot up Linux, it runs this little micro-benchmark called BogoMIPS that tries to approximate how fast your CPU is, and it uses that for timing decisions for things like interrupts and scheduling. If you ever look at /proc/cpuinfo, you'll see here's my Intel CPU, here's the model number, here's how many cores I have, and there'll be a little entry called BogoMIPS; Linux uses that as an approximation of how fast your CPU is. So I said that the most important thing we're gonna care about is the number of tuples in and the number of tuples out, or processed, by an operator. Now we gotta figure out how we're actually gonna estimate that. The way we're gonna do this is by trying to estimate the selectivity of an operator, which is the percentage of the tuples fed into it that will then be emitted as output. So if I'm given 100 tuples and my selectivity is 10%, then I'm gonna emit 10 tuples. The way you traditionally do this in a database system is to model this as the probability that a predicate will be satisfied for a particular tuple. And the way we can generate these probability estimates is through a combination of different techniques. We can have domain constraints: this would be something like, if we know the value range of a particular attribute ahead of time, like if it's an enum field, we know that the cardinality is fixed. We can rely on precomputed statistics that we generate and put into our data blocks and our tables; we talked about this before when we talked about compression, these are like zone maps. I can precompute aggregations for different columns in a block, and then when I wanna figure out the likelihood that my predicate is gonna match a tuple within a block, I can look at the zone maps and maybe derive some early information, like the min/max values. We can also use histograms or approximations, which I'll talk about in a second, and then we can also do sampling. No one of these is better than another; you kinda wanna use a combination of all of them if you can to try to estimate selectivity. The more information you can get, the better.
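Here's a minimal sketch of the histogram option from that list: build an equi-width histogram by scanning the column, then use it to estimate the selectivity of a range predicate. The bucketing and interpolation are the simplest possible versions, just for illustration:

```python
# Sketch: build an equi-width histogram over a column, then estimate the
# selectivity of a predicate like "val <= x". Deliberately simplistic.
def build_histogram(values, num_buckets=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

def estimate_leq_selectivity(hist, x, total):
    """Fraction of tuples estimated to satisfy val <= x."""
    lo, width, counts = hist
    matched = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        if x >= b_hi:
            matched += c                       # whole bucket qualifies
        elif x > b_lo:
            matched += c * (x - b_lo) / width  # assume uniform inside the bucket
    return matched / total

values = list(range(1000))                     # stand-in column data
hist = build_histogram(values)
print(estimate_leq_selectivity(hist, 250, len(values)))  # ~0.25
```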
So the number of tuples that we're gonna process will be dependent on three different things. First is obviously the access method we're using to read tuples from the table. If it's a sequential scan, then we're potentially gonna look at everything, unless we push down the limit clause. If it's an index scan, then we can be more fine-grained and look at a subset of the total key space. Then we have the actual distribution of the values in the database's attributes; this is where the estimation stuff comes into play. For a given predicate, what's the likelihood or probability that there are values that would satisfy it? And then of course there are the actual predicates themselves. If I'm doing an equality predicate on a unique column, then I know my selectivity is gonna be one over the number of tuples that I have, because only one tuple can match. But if I start doing a range scan, then it becomes more problematic to estimate this. The main takeaway is that for simple queries, like something equals something, we can do a reasonably good job of estimating that. If you start throwing in inequalities, range scans, disjunctions, then it becomes really hard. And again, the more data you can get, the better. So one approach, and in the intro class we teach you how to do this, is to generate histograms: basically you run the analyze operation in your database system, and that does a sequential scan of your table and computes some kind of histogram based on the values it actually sees. That's the standard technique. What's become more prevalent in recent years is to generate, instead of exact histograms, what are called sketches, which are approximate data structures that can give you a hint about what is contained in the data. Think of how a Bloom filter is an approximate data structure: it can give you false positives, but it'll never give you false negatives, and it won't tell you what values or keys are actually in it; if you ask it whether something exists, it gives you a true or false. These sketches are a little more complicated: rather than giving you a simple true or false, they can tell you things like what values could exist and roughly how often. But again, you could have false positives, and for some of these I actually don't know whether you can get false negatives as well. There's a bunch of different kinds of sketches that people can use. The reason why I bring this up is that a few years ago we had the CEO of Splice Machine come and give a talk in the intro class; he's a CMU alum, he's on the Board of Advisors for the Dean, and he was here in February. When he was talking about their query optimizer and their cost model, he made this comment that they initially did it the textbook way, where they just built histograms and used the standard algorithms and equations to estimate the selectivity of the predicates in their operators. But then they ended up using the sketching library from Yahoo that does these approximations.
So you get error-bounded estimates, being able to say, this is what I think it's gonna be and here's my confidence in what I'm telling you. And they said that when they switched to using these sketches instead of histograms, the difference in the accuracy of their predictions and the robustness of their query optimizer, primarily because of the better cost model, was quite significant. I think they said that in the old system, when they were using histograms, the cost model would be okay up to like 10 tables per query, like a 10-table join, but when they added this they could go up to like 75 tables. Again, I'm not saying that's the right metric to use to determine the quality of your cost model, but it was anecdotal evidence suggesting that using sketches was better than histograms. And in our old optimizer, in the old system, we ended up using some of these sketches as well, but we never vetted it; we never actually measured how much better it actually was. And again, this approach I think is also used in commercial systems. The other main technique to generate estimations of selectivity is to do sampling. The basic idea here is that rather than look at these sketches or histograms and try to approximate what the selectivity is, let me actually just take a subset of the table I'm accessing, run my scan with predicates on that, determine what my selectivity is, and then assume that my sample is a good approximation of what the total table actually looks like; now I have better knowledge for making my choices in my cost model. There are basically two approaches to do this: online versus offline. The offline approach is that in the background, every so often, you generate this read-only copy of the table that you put aside as a sample, and you use that in your cost model to predict the selectivity of your predicates. Then you have a background job that looks at how much of the data has changed, and if it's above some threshold, you go back and refresh your sample. The other approach is to do this in an online fashion: when the query shows up, in your cost model you actually go to the real table, run a little bit of your query right then, and determine what the selectivity of your predicates is. Obviously this becomes tricky, because you're trying to do a cost estimation for running your query, but now you're actually running some subset of the query on the table, and that could end up being slow. You also don't wanna slow down other transactions or queries that are running at the same time, so you wanna make sure you run this at read uncommitted, without acquiring any latches or locks, to avoid interfering with anybody else. So the way to think about the sampling thing is: I have my WHERE clause, and rather than doing any joins, I just pick out the predicates from my WHERE clauses or join clauses and run those on the tables themselves to compute this. Of course this becomes hard if you have joins, because the selectivity of the join is tough to compute unless you actually start joining things. So the extent to which different systems use sampling aggressively will vary.
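Here's a minimal sketch of the sampling idea: pull a random sample of the table, apply the predicate to the sample, and treat the sample's hit rate as the table's selectivity. The data and predicate are made up:

```python
# Sketch: estimate a predicate's selectivity by evaluating it on a random
# sample of the table instead of the whole thing. Illustrative only.
import random

def estimate_selectivity(table, predicate, sample_size=1000):
    sample = random.sample(table, min(sample_size, len(table)))
    hits = sum(1 for row in sample if predicate(row))
    return hits / len(sample)

# Pretend table of 100K rows with a skewed "city" column.
table = [{"city": "NYC" if random.random() < 0.3 else "Helena"} for _ in range(100_000)]
sel = estimate_selectivity(table, lambda r: r["city"] == "NYC")
print(sel)                  # close to 0.30
print(sel * len(table))     # estimated output cardinality of the scan
```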
And this is again where we get into the issue that the commercial systems are doing things they don't really talk about publicly, so it's hard to know what they're actually doing. But I know that in the case of the paper you guys read, where SQL Server performs the best, SQL Server is doing a combination of histograms or approximations plus sampling. Okay, so any questions about this? So we use either histograms, sketches, or sampling to figure out the selectivity of these predicates, and then we can use that to compute the cardinality of our operators, which is the amount of data being generated as output. The reason why I wanna know that is because it's then being fed as the input to the next operator. And the problem is gonna be that the more inaccurate our estimates are for the selectivity of the predicates in the operators at the lower parts of the tree, the more that gets amplified as you go up: if I'm off at the bottom leaf node, then as I go up that error gets carried over, and now I become even more off. Again, this is what you saw in the paper you guys read. All right, so the textbook way you would actually compute the cardinality is by modeling the selectivity, as I said, as a probability, and then you make the following three assumptions about those probabilities to be able to compute what the cardinality or the selectivity is gonna be. What I'm describing here is what we teach you in the introduction class; if you go read any database textbook, this is what it would tell you, and hopefully it would add the caveat that this goes real wrong real quick, but people still do this because there's nothing else. The only way to get the exact cost of a query is to actually run the query, but that's super slow, so you can't do that. So you make these assumptions, these trade-offs in accuracy, just to get something that works reasonably well. The first assumption we're gonna make in many cases is that we have a uniform distribution of values in our attributes, meaning the probability that a given value appears in my column is the same for all values. Of course we know that this is not true: more people live in New York City than in Montana, so I can't assume that a zip code in Montana has the same probability of occurring in a people database as a zip code in New York City. The way you typically get around this is to maintain a separate data structure for what are called heavy hitters. Think of this as just a little hash table on the side that stores the top 10 or top 20 values that occur in my column, because more often than not I'm more likely to query those values, so I can go do a lookup in this heavy-hitter table and get more accurate estimates, but everything else is just assumed to be uniform. The next assumption is that all our predicates are independent. Again, we're modeling these as probabilities, so that means we can just multiply them together and that's supposed to produce the true selectivity of our predicates; of course that's not gonna work, and we'll see that in the next slide. And the last one is that we're gonna assume inclusion: the join keys of the inner relation that we're trying to join on will always exist in the outer relation. That obviously doesn't hold in general, and if you start doing left or right outer joins this becomes problematic, but again, this is another major assumption that people have in their cost models.
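Here's a minimal sketch of what those textbook assumptions look like when you write them down: a uniform one-over-distinct-values guess for equality predicates, a little heavy-hitters table that overrides it for the most common values, and independence handled by just multiplying selectivities together. All the numbers are invented:

```python
# Sketch of the textbook selectivity assumptions (illustrative numbers only).
NUM_TUPLES = 1_000_000
DISTINCT = {"zip": 40_000, "status": 5}        # distinct values per column
HEAVY_HITTERS = {("status", "active"): 0.60}   # top values tracked separately

def equality_selectivity(col, value):
    """Heavy hitter if we know it, otherwise assume a uniform distribution."""
    if (col, value) in HEAVY_HITTERS:
        return HEAVY_HITTERS[(col, value)]
    return 1.0 / DISTINCT[col]

def conjunction_selectivity(sels):
    """Independence assumption: just multiply the individual selectivities."""
    result = 1.0
    for s in sels:
        result *= s
    return result

s1 = equality_selectivity("zip", "15213")        # 1/40000, uniform guess
s2 = equality_selectivity("status", "active")    # 0.60 from the heavy-hitter table
print(conjunction_selectivity([s1, s2]) * NUM_TUPLES)  # estimated output tuples
```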
So let me show you an example of how this all goes bad, with this great little vignette, this simple problem that was in a blog article from Guy Lohman. I like it because it illustrates exactly all the problems that you have when you make all those assumptions. Say you have a simple database keeping track of cars, and in your database you have 10 different makes, like Tesla, Ford, Honda, Toyota and so forth, and then you have a hundred different models, so a Ford Escort is a model or a Toyota Corolla is a model; you have a hundred of those. Now if you have a query that says WHERE make equals Honda AND model equals Accord, so you have a conjunction of two equality predicates, and you make the independence and uniformity assumptions from the last slide when trying to estimate the selectivity of this predicate, then you're gonna end up with one over 10, because we have 10 makes and it's an equality predicate so Honda is one out of 10, and we multiply that by one over a hundred, because we have a hundred different models and Accord is one of them. So in this case the selectivity is estimated to be 0.001. But in reality, we know that Honda is the only one that makes the Accord, so it's not one over 10 times one over a hundred; these columns are actually correlated. If you have an Accord, then you have a Honda Accord, so the true selectivity is one over a hundred. That independence assumption from the last slide makes us an order of magnitude off in our estimation. And now if I'm an order of magnitude off on the selectivity of some lower operator and I feed that into another operator that's also gonna be another order of magnitude off, then I'm really screwed. So the way to solve this problem, and as far as I know this feature only exists in the commercial systems, is to do what are called column group statistics, where you're basically telling the database system that these columns are correlated and therefore it should maintain statistics about them together, and when it computes the selectivity during its cost model estimation it should treat them as being correlated and not assume they're independent. Basically the way it works is that the DBA has to come in and tell the database system, hey, these two columns are correlated, and if the system supports that, then it can update its cost model appropriately. And again, only DB2 and Oracle do this as far as I know; MySQL and Postgres don't do this, and none of the other major open source systems do this. Now you may be thinking, why is this manual? Why does the DBA have to tell the database system? Why can't we do this automatically? Because it's hard. Think about how you'd figure this out: in this example here it's super easy, because I only have 10 makes and 100 models. But if I have a billion tuples and a really wide table with a thousand columns, now I gotta go look at every single combination of different columns and all the different possible values that can occur; the search space is just massive. So this is why you can't do this automatically. Now, there may be ways to approximate this automatically, but again, as far as I know, no major system can do this automatically for you; you have to tell it ahead of time. We can talk offline about whether deep nets help with this; there are people looking into it now, but nothing exists yet.
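Here's the arithmetic from that make/model example written out:

```python
# The make/model example: independence + uniformity vs. reality.
NUM_MAKES = 10      # Honda, Ford, Toyota, ...
NUM_MODELS = 100    # Accord, Escort, Corolla, ...

# What the optimizer estimates for: make = 'Honda' AND model = 'Accord'
estimated_sel = (1 / NUM_MAKES) * (1 / NUM_MODELS)   # 0.001

# What's actually true: only Honda makes the Accord, so the two predicates are
# perfectly correlated and the first one adds no extra filtering.
true_sel = 1 / NUM_MODELS                            # 0.01

print(estimated_sel, true_sel, true_sel / estimated_sel)  # off by 10x
```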
All right, so let's look now at how these simple estimates can really foul us up when we start throwing joins into our query plan. Here we're doing a simple three-way join on A, B, C: A.id equals B.id, A.id equals C.id, and then there's a filter on B where the id is greater than 100. So assume there's a filter operator in between the scan of B and the join; for simplicity I'm just showing the filter pushed down here. The first thing we need to do is compute the cardinality of the access methods that are retrieving tuples from the tables. For A and C there's no filter, so the cardinality of the operators scanning A and C is just the number of tuples in those tables, and that's easy to compute; I can get it from the catalog. For B it's gonna be the number of tuples in B multiplied by the selectivity of our predicate. Then we feed the output of these scan operators into our join, so now we need to compute the cardinality of our join operators. What's gonna happen is that, while A and C are simple because there's no filter, if the estimate for B is wrong, then I'm feeding the incorrect number of tuples into this join operator. And now I gotta figure out what the selectivity of the join predicate is. Assume I don't have a foreign key, and assume these attributes aren't correlated; I know nothing about them. So I gotta figure out, for every single value of A.id, how many values are gonna match it in B.id, and I try to figure that out with my histograms, but that's gonna be hard. Then I take the output of that estimate, and that gets fed into the estimate for the next join; the other input to that join is just reading table C, so that part's super simple. Now I have the same problem I had before: I gotta figure out the likelihood that a tuple which matched A.id equals B.id will get fed up into this operator and also match on C.id. So if I get the lower one wrong, this one gets wrong, and then the one above it gets even more wrong. This is what we're dealing with; the math is just not in your favor here. Regardless of how we're treating these probabilities, these estimates are gonna be wrong, and it gets worse as we go up. And again, you see this in the join order benchmark in the paper you guys read: the more tables you add, the worse it gets, for exactly this reason; we're just stacking errors on top of errors up the query plan.
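To show what the join estimation step looks like, here's a minimal sketch using the standard textbook join-cardinality formula, |R join S| roughly equal to |R| times |S| divided by max(V(R.key), V(S.key)), where V is the number of distinct join-key values. The table sizes, distinct counts, and filter selectivity are invented:

```python
# Sketch: textbook join cardinality estimate, applied bottom-up so that any
# error in a lower estimate feeds directly into the one above it.
def join_cardinality(card_r, card_s, distinct_r, distinct_s):
    """|R JOIN S| ~= |R| * |S| / max(V(R.key), V(S.key))."""
    return card_r * card_s / max(distinct_r, distinct_s)

CARD_A, CARD_B, CARD_C = 1_000_000, 500_000, 200_000   # invented table sizes
sel_b_filter = 0.05                                     # pretend estimate for the filter on B

card_b_scan = CARD_B * sel_b_filter                     # scan of B with the filter applied
card_ab = join_cardinality(CARD_A, card_b_scan, 100_000, 25_000)   # A.id = B.id
card_abc = join_cardinality(card_ab, CARD_C, 100_000, 200_000)     # A.id = C.id
print(card_ab, card_abc)  # if card_b_scan is wrong, card_ab and card_abc are worse
```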
All right, so the paper I had you guys read came from the HyPer team in Germany, and it was an evaluation of how accurate the cardinality estimates in the operators are for a variety of different database systems. They wanted to use that to figure out just how bad things can actually get, how far off you are from the true values. The paper doesn't tell you how to build a cost model; it roughly sketches out how you'd do that, but it's really showing you what happens when the estimates in your cost model go wrong, and then it proposes some design principles for building a database system, what you should focus on to try to make the system more robust to deviations or errors in your cost model. So in the paper they propose a new benchmark called JOB, the Join Order Benchmark. It's based on the IMDB data set, the movie website, and this is a real data set, so the tables actually have skew based on a real-world distribution, unlike TPC-H and TPC-C, which have uniform distributions, where the likelihood of one value occurring for a given column is the same for all values. The Join Order Benchmark has real skew in it. So they generate a bunch of different queries that join more and more tables, and then they measure what the cost model thinks the cardinality of the operators is gonna be and compare that with what the real data actually looks like. The way they use this is that they load the data in once and then run ANALYZE, which fires off the background job to go scan all the tables and lets the system compute whatever statistics it wants to compute. That's the best-case scenario: there are no updates, I just load the data in and run ANALYZE, and they wanna see how wrong things get even in this best case. So I wanna show this one graph here, because this is really the most important one. The way to understand it is that the y-axis is how far away you are from being exactly accurate in the cardinality estimation: this middle line here is when you're perfectly accurate, if you're above it you're overestimating, and if you're below it you're underestimating. The x-axis is the number of tables they're joining in the query. So the first thing to point out: the overall trend seems to be that everyone is underestimating, and as you add more tables everyone sort of starts to go down. What is that from?
Well, as we showed when we talked about those assumptions: if you assume that things are independent, then I assume my selectivity is gonna be way, way less than it actually is, and that's why they're all underestimating relative to what the actual value should be. So the first one to point out is this guy here, which actually does reasonably well up to about three or four tables, and across all of these it has the tightest bounds around being 100% accurate versus all the other ones. These three systems here essentially all have the same trends, but the bounds vary; again, as you add more tables the estimation gets worse, and at six tables it's really bad. And then you have this middle guy here who's just way off right away: it's okay with two tables, and after that it's underestimating way more than everyone else. In the paper they tell you that the first one and the last one are PostgreSQL and HyPer, and I think I spoiled last time who the best one here is; take a guess what this one is — Microsoft SQL Server, okay. Then take a guess at these other two. Which one's Oracle, this one here? You said it was this one? All right, raise your hand if you think Oracle is this one; raise your hand if you think Oracle is this one; and then the last system would have to be DB2. Okay, so you think this is DB2 and this is Oracle. There you go. So let's focus on SQL Server. SQL Server does Cascades, but again, the cost model is independent of whether you're doing Cascades or bottom-up. They're doing sampling, they're doing histograms, and SQL Server is just really good. In my opinion, of these three systems, based on at least what they've talked about publicly, it's the most state-of-the-art, leading-edge system. Oracle and DB2 are good, they're not bad systems; it's just that, based on what they've talked about publicly, it seems like SQL Server is way farther ahead than everyone else. I would say that of all database systems, open source and commercial, SQL Server probably has the best query optimizer, and it shows here. But let's now look in a little more detail at what actually happens when you get correct estimates. For this one they instrument Postgres 9.4: they modified the system so that rather than using a histogram to estimate the cardinality of an operator, whatever function says go estimate the cardinality is replaced with a magic oracle that always gives back exactly the correct answer every time. So the way to read this graph is that it compares the query performance you get using the built-in cost model with its approximations against the query performance you'd get if you had the exact estimates. If you're in this band here, 0.9 to 1.1, it means the performance of the query using the approximations is roughly the same as what you get when you have the exact cardinalities, and over here means you're getting way, way slower as you go in that direction.
And then the other axis is what percentage of the total queries they ran into the system fall into each of these buckets. So this shows you that when you're using the estimations, 60 percent of the queries are 1.1 times or more slower than the same queries run with the true cardinalities, the exact estimates. What's going on is that because Postgres is underestimating the cardinality of its operators, it thinks the number of tuples it's going to pass from one operator to the next is lower than it actually is, and it uses these cardinality estimates to size the hash table when you do a hash join, or to decide whether you want to do a hash join at all. If you're only going to access a small number of tuples in a join, then a nested loop join is super fast, because you don't have to set up a hash table, you don't have to build it, you don't have to probe it; you just do two little for loops over like four tuples in the outer table and the inner table, and that's as fast as you're ever going to get. So Postgres, because it's underestimating the cardinality of its operators, says oh, you're only operating on 10 tuples, do a nested loop join. Then it starts running and it says oh shit, this is not 10 tuples, but by that point the query plan is baked; you can't switch back to do the hash join. Why not? Because for that you'd have to do adaptive query processing: you have to be able to say my query optimizer made a mistake, go back and rerun the optimizer, and get a new plan based on the data I've seen so far. Someone's pointing out that the nested loop join and the hash join get the same result — say it again — right, the statement is that the nested loop join and the hash join get the same result, so you could do it. Actually, one way to do this is that if I see, before I even start the nested loop join, that I thought I was going to have 10 tuples but I actually have a billion tuples coming in, I could say don't do the nested loop join, switch over and do a hash join. You could do that in some cases; it depends on how they're pipelining tuples. But they don't; it's an engineering thing. Once the optimizer makes the decision, they go with it. Adaptive query processing, or adaptive optimization, is when you say I think the optimizer made a mistake, go back and regenerate a query plan, and now you get into this world of: do I throw away all the data I've processed so far and start from scratch, or can I come up with a new query plan that uses some of the data I've already processed and just continue with that? That would essentially do what you're proposing, but as far as I know, Postgres doesn't do that. I don't actually know what SQL Server and Oracle do; I think they throw everything away and start over, and you primarily do that for join ordering rather than for switching between a nested loop and a hash join. There's no reason you couldn't do it; Postgres doesn't, I think for engineering reasons. Okay, so in this case, again, things are running slower because a lot of queries end up being nested loop joins when they really should be hash joins.
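Here's a minimal sketch of the kind of decision that goes wrong here: the physical join algorithm gets locked in off the estimated cardinality, so an underestimate bakes in a nested loop join. The threshold and numbers are invented; real optimizers compare estimated costs rather than using a hard cutoff:

```python
# Sketch: the physical join algorithm gets picked off the *estimated*
# cardinality, so a bad underestimate bakes in a nested loop join.
NESTED_LOOP_THRESHOLD = 100   # invented cutoff for "small enough" inputs

def choose_join(est_outer_card, est_inner_card):
    if est_outer_card <= NESTED_LOOP_THRESHOLD and est_inner_card <= NESTED_LOOP_THRESHOLD:
        return "nested_loop_join"   # cheap only when both sides really are tiny
    return "hash_join"

# The optimizer thinks 10 tuples are coming; in reality it's a million.
print(choose_join(est_outer_card=10, est_inner_card=50))  # nested_loop_join
# Once execution starts, the plan is already fixed, so the real input size
# never gets a chance to change this decision (without adaptive processing).
```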
So the next thing they did was: in Postgres you can pass a flag to say — sorry, yes? The question is about the leftmost part of the graph: some queries actually run faster with the estimates than with the true cardinalities. Yes, but I don't know why; we can ask Viktor. The suggestion is that the errors cancel out: the error from the first join operator, which overestimated, fed into the second join, which underestimated, and that came out to be roughly correct. Now, why a query would actually run at 0.3, faster than with the true cardinalities, I don't know, but it's a pretty small amount, like one or two percent of the queries. All right, so what I was saying here is that the first assessment of why things are running slower is that Postgres is picking a nested loop join when it should pick a hash join. So in Postgres you can pass a flag in your session, when you open up the terminal, to say something like no merge join or no nested loop join; you can tell the optimizer to not even consider a nested loop join for your query. So in this case it only considers hash joins, but you still see that you're getting worse performance than where you should be if you had the true cardinalities. It did help: fewer queries are getting picked as nested loop joins, but some queries are still running slower. It turns out that in this version of Postgres they were using the cardinality estimates from the operators to decide how much space to allocate for the hash table when you do the hash join. You say my cardinality is 100, so make sure I have a hash table that has, say, 200 slots to put things in. And if you undersize the hash table, what happens, as we've talked about — I think they were using a bucket-chain hash table — is that you end up with these really long bucket chains, which essentially become a sequential scan every single time you do a probe. If you had correct estimates of the cardinality, you could say my hash table needs to be this big because the data set coming in is going to be this big, and then you don't have to worry about these long sequential scans. To prove that this was actually the case — again, this was Postgres 9.4, and Postgres 9.5 added the ability to dynamically resize the hash table if the estimates are wrong — they backported that feature from Postgres 9.5 into 9.4 and re-ran the same experiment, and now you see there are more queries that are getting closer to where you should be if you had the true cardinalities. So this is a good example showing that it's not just about picking whether one operator should be a hash join versus a nested loop join; what you're actually doing inside that operator can be greatly affected by the estimates your cost model is making.
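Here's a minimal sketch of the idea behind that fix: size the join hash table from the estimate, but grow it when the real input blows past the estimate so the bucket chains stay short. This is just the concept, not Postgres's actual implementation:

```python
# Sketch: a join hash table that is sized from the cardinality estimate but
# doubles itself when the actual input exceeds the estimate (idea only).
class ResizableHashTable:
    def __init__(self, estimated_cardinality):
        self.num_buckets = max(16, 2 * estimated_cardinality)
        self.buckets = [[] for _ in range(self.num_buckets)]
        self.count = 0

    def insert(self, key, value):
        if self.count >= self.num_buckets:        # estimate was too low: grow
            self._resize(self.num_buckets * 2)
        self.buckets[hash(key) % self.num_buckets].append((key, value))
        self.count += 1

    def probe(self, key):
        return [v for k, v in self.buckets[hash(key) % self.num_buckets] if k == key]

    def _resize(self, new_size):
        old_items = [kv for bucket in self.buckets for kv in bucket]
        self.num_buckets = new_size
        self.buckets = [[] for _ in range(new_size)]
        for k, v in old_items:
            self.buckets[hash(k) % new_size].append((k, v))

# Optimizer predicted ~10 build tuples; the real input is much larger,
# but the table grows instead of degenerating into long bucket chains.
ht = ResizableHashTable(estimated_cardinality=10)
for i in range(100_000):
    ht.insert(i % 1000, i)
print(len(ht.probe(42)))   # 100
```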
All right, so Viktor and I were talking about this, and he sent me a synopsis of what he thought were the most important things that came out of this experiment, this study that he did. The first is that they felt that having an accurate cost model in your query optimizer ended up being, in some cases, more important than having the fastest engine in the world. Who cares that we can do SIMD processing or query compilation and all the other tricks we talked about the entire semester, if our optimizer is total shit and we're picking the worst join orders and always doing a nested loop join? Who cares how fast the engine is on simple examples, if when we throw real-world data at it we're just going to get crushed? So being able to pick the right join order is probably the most important thing out of all of this. The other thing they found was basically that the cardinality estimates are always going to be wrong, so when you implement the operators in your system, you want them to be adaptive, so they don't have to rely on estimations coming from the query optimizer, because you should just assume those are going to be wrong. As the data comes in, you want to be able to adapt whatever your system is doing to account for the actual data it's seeing: you should be able to resize the hash table automatically, rather than taking what the cost model gives you and fixing the size to be exactly that. We're doing this now in our own system — we haven't pushed it to the master branch, it's in a separate LLVM branch — but we can reorder predicates: some predicates are more selective than others, the cost model might think you want to execute your predicates in one order, and we can shuffle things on the fly as we see real data. The other thing he said, and you sort of see this in the design of HyPer, is that having an engine that can do fast sequential scans and fast hash joins may actually end up being better than having all these fancy indexes to do joins and lookups, because the more indexes you have, the more complex the estimations have to be in your cost model. So rather than trying to account for any of this, you just say screw it, I'm going to sequentially scan everything and just rip through the data really fast, and who cares whether an index would have helped me here. For analytical queries I maybe agree with this for some things, maybe not for others; I have mixed feelings about this one. And the last one he says is that trying to have a more accurate cost model, like using the micro-benchmark stuff I talked about at the very beginning, where you profile the hardware and include that in your estimations, they felt was a waste of time. Just having better cardinality estimations is the most important thing. That means having better statistics and better sketches — whether you throw deep nets in there or not, we can talk about that separately — but getting this right is more important than having more fine-grained accuracy about what the hardware can actually do. Okay, so I thought these were pretty useful; some of these we've adopted in the design of our system, others not so much. All right, so the last thing I want to talk about, and this will lead us into what we'll talk about on Monday, is this project from IBM from, I guess at this point, 2000 or 2001, so 18 years ago: this thing called LEO.
So you may be thinking, and it's sort of what Viktor was suggesting earlier: if my cost model is going to make these estimates and they're going to be wrong, then as I run my query and I see that the data is actually completely different, can I just fix myself, or can I possibly also go fix the estimates? That's what IBM was trying to do with this thing called LEO, the learning optimizer. The idea is that my query shows up, I run it through my query optimizer, the cost model generates estimates about what it thinks the data looks like, I then run that query, and I observe whether the estimates of the cardinality and the selectivity of my predicates match what I'm actually seeing in the real data. When the query finishes, they check whether the estimates differ from the real data, and if they do, they try to feed the real data back into the optimizer so that when the next query shows up, it can rely on the data already collected from the previous query. So this seems like it would solve the problem: our estimates went wrong, we ran on real data, we saw what actually happened. But I would say that every DB2 administrator I've ever talked to, when I ask them about this, says that the first thing they do when they install DB2 is turn all this off, because it never worked. Oracle has something similar for memory management, an automatic memory manager for the buffer pool, and we talked to some DBAs a few weeks ago and they said yeah, we turn that off immediately; all this automated stuff never works right. I never found out exactly why; I've heard there were some engineering difficulties at IBM in getting this thing to work the way they wanted it to and to fit into the rest of the system correctly, so I think this didn't work because of engineering reasons. And it's a shame, because it seems like this would solve our problems, but because of this I think a lot of commercial systems have been hesitant to adopt similar techniques, just because this thing sort of fell apart. But this leads us into what we're going to talk about on Monday: the sort of revival — or not a revival, but a continuation — of a long history of trying to do automated tuning in database systems to alleviate some of these difficulties. Right now what's in vogue is machine learning, and people are trying to apply machine learning to these problems, but this actually goes back to the 1970s; people have been working on this problem for a long time. So this is just one example of something like this that didn't work.
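Here's a minimal sketch of the LEO-style feedback idea: record how far off the estimate was for a predicate when the query actually ran, and apply that as a correction factor the next time the optimizer estimates the same predicate. This is just the concept, not how LEO was actually implemented:

```python
# Sketch of the LEO-style feedback loop: remember how wrong past estimates
# were and fold that back into future estimates (concept only).
adjustments: dict[str, float] = {}   # predicate signature -> correction factor

def adjusted_estimate(predicate_sig: str, optimizer_estimate: float) -> float:
    """Scale the optimizer's estimate by what we learned from past executions."""
    return optimizer_estimate * adjustments.get(predicate_sig, 1.0)

def record_feedback(predicate_sig: str, estimated: float, actual: float) -> None:
    """After the query runs, compare estimate vs. observed and store the ratio."""
    if estimated > 0:
        adjustments[predicate_sig] = actual / estimated

# First run: the optimizer guessed 1,000 tuples, the operator actually saw 50,000.
record_feedback("orders.status = 'shipped'", estimated=1_000, actual=50_000)
# Next time the same predicate shows up, the estimate gets corrected.
print(adjusted_estimate("orders.status = 'shipped'", 1_000))   # 50000.0
```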
Okay, so the main takeaway today is that for an in-memory database, being able to estimate the number of tuples that are going to be processed per operator is, for us, a reasonable approximation of what the execution cost of a query is going to be. Of course, that means that in order to estimate the number of tuples processed per operator, you've got to know what the cardinality of your children operators is, and as we showed, things can go bad pretty quickly. I think the way Microsoft does cost estimation, using a combination of sampling and sketches, is the right way to do this, and we tried doing this in our own database system, but we never vetted it to determine whether it was actually accurate. What will happen is that in the summer, when we bring over the old optimizer from Peloton into the new system, we're probably not going to bring over the cost model. We'll start with a simple cost model that basically says: if I have an index, my cost is zero; if I have a sequential scan, my cost is one. That's the simplest cost model you could ever have. It won't handle join ordering for us, but at least it'll get us started on being able to pick indexes. Okay, any questions? Right, next class, as I said, we're going to talk about self-driving databases, which is a term that I've been using for our system; Oracle's been calling their system this as well. I'll try not to have to cut things out, but they've made some claims about the autonomous nature of their database system that I disagree with. And this is what I'm really focused on in our system: part of the reason why we decided to build a database system from scratch at Carnegie Mellon is because I want to be able to put all these automated features directly into the system as we're actually building it, rather than trying to graft something on after the fact, which is how most people try to do these things. Okay.