So this is the last lecture we're going to do on query optimization. And this is actually a very important one, because the cost model is what you're going to use to figure out whether one plan is better than another. The way to think about how this one relates to the last two lectures is that the previous two lectures were about doing the search and the transformations to find different query plan alternatives, these different physical and logical query plans that are equivalent to the original plan you were given. And then the cost model is necessary to figure out whether one plan is better than another, so the optimizer can choose from all the different alternatives. So for today, we're going to talk about how you actually build a cost model, what components you can use to approximate the cost of running queries. And then we'll talk about how to actually do cost estimation based on this. And this ties into the paper that you guys read, which showed that if you get your selectivity estimates wrong when you do query planning, you get bad performance. And then I'll finish up, as I promised last class, with a small vignette of tips that I've accumulated over the years about how to go about working on a large code base independently. And this will be important for you guys as you get started on project three. If you haven't started yet, you should start very soon. But how do you take an existing code base that you didn't write, that may not be documented completely or at all, and where the person that wrote it is dead or gone? How do you actually go about understanding the system? And this will be important when you go out into your career. So as I said, today's lecture is really about this idea of doing cost-based query planning. In the very first lecture when we talked about query optimization, we talked about how the very first systems, INGRES and Oracle from the 1970s, were based on doing static transformations or using heuristics to decide how to generate an optimized query plan. And as we said, for some simple things like doing predicate pushdown or selecting the right index for a one-table query, these are good enough. But when you want to start doing more complicated things, n-way joins, CTEs, window functions, all sorts of more advanced SQL things, then heuristics are not going to be sufficient, because it's going to be really hard to come up with rules that you could apply to transform a really complex query. And so one of the major contributions of the System R project was that they said you want to use a cost-based query optimizer that can enumerate a bunch of different alternatives for how to actually execute your query plan and then choose the best one based on this estimate. So this is essentially what we're talking about today: how do you actually build that cost model and do that estimate. The important thing to understand is that these estimates are only meaningful, only make any sense, internally to the database system, because they're not tied to wall-clock time or some real-world metric. That means you can't take the cost model estimate for MySQL and do some kind of comparison against the cost model estimate for Postgres. They're completely different systems, and they're going to use completely different metrics to make these estimations.
So this only allows the system to internally decide that one query plan is relatively better than another. And the other important thing to note is that everything we're talking about today is independent of the search strategies that we talked about in the last two classes. So it doesn't matter whether you're doing a unified search or a stratified search, or whether you're going top-down or bottom-up; the cost models we'll talk about today can be used with any of them. So the question is how you're actually going to build a cost model. There are essentially three different components you could include in it. The first, and most obvious one, is to estimate the physical costs of executing the query. What are the resources the database system is going to use in order to execute the query? The number of CPU instructions or cycles, the amount of IO you're going to read and write, whether you're going to have cache misses or not, how much DRAM you're going to use. All of these things are essentially what the system is going to do when it actually runs your query. So you would think these would be a good way to estimate whether one query is going to be better than another. The problem is that these metrics, with the exception of maybe IO, depend heavily on the hardware, and also on the storage model you're using for the database system. The most common one people use in a disk-based system is disk IO, which we'll see in a second. In the introduction class, when we talked about comparing join algorithms, the way we estimated whether one was better than another was based on the amount of disk IO you would have to incur when you execute it. But for other things like CPU cycles and cache misses, this is really hard to approximate because, again, it depends on what CPU you're actually using. In the case of disk IO, it depends on whether you're using a spinning-disk hard drive or an SSD or the new non-volatile memory stuff. This can change from one installation to the next. So this can be difficult to get right. The next approach is to use the logical costs of executing the query. These would be things like the amount of data the operator is going to spit out, the number of tuples the operator is going to emit up into the query plan. In this case, this is done at the logical level, so it's independent of the actual algorithm you're using for the physical operator. For a logical cost, it doesn't matter whether you're using a hash join or a sort-merge join or a nested-loop join; the number of tuples you'll spit out for all three different algorithms will always be the same. The way you need to make this work, though, is that you need a way to estimate, based on the input you're given and whatever predicates in the query apply to your operator, how much data you're actually going to spit out. Which again, as we'll see as we go along, is not always easy to do. And the last one is important, and it's tied mostly to this one here; it's not something you could use independently. This is just taking the computational complexity of the algorithms you're using for the particular operator. So this is where it matters whether you're using a hash join or a nested-loop join or a sort-merge join. Those different algorithms have different cost complexities that can vary a lot.
But this is sort of the abstract, highest-level way to consider this. You can do big-O analysis or take the worst-case analysis. But typically you tie this together with the logical costs, so now you include the logical cost of how much data you think one operator is going to spit out, and then you can tie that with the algorithmic cost to say, all right, I have this much data coming into my hash join, or I have this much data coming into my sort-merge join, and use that to estimate whether one choice is better than another. So as we'll see as we go along, I guess the spoiler is that the in-memory guys will mostly use this, and the disk-based guys will mostly use either the top two or a combination of all three. In the case of a disk-based system, again, as I said a few seconds ago, the primary cost or hardware resource they have to deal with is always going to be disk IO, because that's going to be the dominating factor of executing the query; the disk is simply just so slow. And so in this environment, oftentimes you can just ignore the execution costs or the CPU overhead of executing an operator, because you know that if you ever have to go read anything from disk, that takes super, super long, especially if it's a spinning-disk hard drive. And in any system that's been built, even in the last five years, you always have to consider the difference between sequential IO and random IO. In a spinning-disk hard drive this matters a lot, because as it spins around, if you can drop the arm down and read a track continuously, you're going to be much faster than if the arm has to jump around and do random IO. So your cost model may consider that some algorithms can do sequential IO and that will be better than doing random IO. This all has to be included in your cost model, along with understanding how the data is actually laid out and what the operator is trying to do on it. Disk IO is nice because in a database system, since we're not using mmap, we're not relying on the operating system to manage our memory for us. We have complete control over this. So we know what's going to be in memory versus out of memory, or at least we have an approximation of how the buffer pool manager is going to work. So we can include things like the replacement strategy and pinning and other things that occur inside the system when we decide how data is moved in and out in our cost model, and then use that to get a more accurate estimation of what we think the query is going to do when it actually runs. If you have a distributed database system, like in the case of MemSQL from the paper you guys read in the last class, you can essentially just replace disk IO with network IO, because that's just as bad. A distributed database system knows how the data is placed across multiple machines and can include that in its cost estimates, essentially doing more or less the same thing as you would with disk IO. So one interesting example I always like to look at when we talk about cost models is how Postgres does this. The reason I always use Postgres as an example is because, in my opinion, it's a great example of a textbook implementation of a disk-based database system. Take any introduction database class or any database systems textbook, and how it's described in the textbook is almost exactly how Postgres implements a lot of things.
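Before looking at what Postgres actually does, here is a minimal sketch of the kind of textbook disk-IO accounting I'm referring to from the introduction class: rough page-IO counts for the common join algorithms, where M and N are the number of pages in the two tables and B is the number of buffer pool pages. The function names and formulas are the standard classroom approximations, not any real system's code.

```cpp
#include <cmath>

// Rough textbook IO counts (in page reads/writes) for two-table joins.
// M, N = pages in the outer/inner table; B = buffer pool pages available.
// CPU cost is ignored entirely, the usual simplification for disk-based
// systems where IO dominates everything else.

double blockNestedLoopJoinIO(double M, double N, double B) {
  // Scan the outer once; for each chunk of (B - 2) outer pages, scan the inner.
  return M + std::ceil(M / (B - 2)) * N;
}

double graceHashJoinIO(double M, double N) {
  // Partition both tables (read + write each page once), then one probe pass.
  return 3 * (M + N);
}

double sortMergeJoinIO(double M, double N, double B) {
  // External-sort both sides (2 IOs per page per pass), then one merge pass.
  auto sortPasses = [B](double pages) {
    return 1 + std::ceil(std::log(std::ceil(pages / B)) / std::log(B - 1));
  };
  return 2 * M * sortPasses(M) + 2 * N * sortPasses(N) + (M + N);
}
```

Plugging numbers into formulas like these is what an IO-only cost model boils down to: it will prefer the hash join for two large tables, but a nested loop when one side fits in the buffer pool.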
So their cost model is a combination of the CPU and IO costs, added together in a formula, and then these different elements are weighted by these magic constants that say what the relative cost of one particular type of operation is versus another. So for example, by default it will say that if you process a tuple that's in memory, that's 400x faster than reading the tuple from disk, and that doing sequential IO is 4x faster than doing random IO. These weighting factors here are just flags you can set in Postgres's config file that it will use when it does its estimations after you boot it up, and these are the defaults. And when you look at the Postgres manual for all these different components, they say that they weight everything based on the sequential scan cost. So the sequential page cost is essentially set to one, and then all the other cost variables are relative to that. And they have this warning here that says that the way these default parameters are set, they're basically trying to represent an average of what they think hardware looks like. So that means every single installation of Postgres, if you don't change this config file, is assuming the hardware matches these few magic constants it uses for its cost estimates, and depending on what your hardware actually looks like, these may not be correct. If you have a really fast disk like an NVMe drive, then weighting random IO to be 4x slower than sequential IO may not be the right parameter. But this warning here is telling you that changing these values can be really risky, and you may end up getting worse performance if you don't know what you're doing. So this is what an open-source system does. MySQL is a little bit less sophisticated, but at a high level it's essentially doing the same thing. When you start looking at the commercial guys, this is where you see things get way more complex, because the big three vendors, IBM, Oracle, and Microsoft, have a lot of money and can pay lots of people to come up with really complex cost models. For Postgres, are there benchmarks that would spit out what these values should be based on your system? So the question is, in the case of Postgres, with these constants, do they run microbenchmarks when you boot up to try to figure these things out? Postgres doesn't; DB2 does, and we'll see in a second that they do it as well.
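To make those weighting factors concrete, here is a minimal sketch of how a Postgres-style formula combines them, using the default constants documented in the Postgres manual (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01, cpu_index_tuple_cost = 0.005, cpu_operator_cost = 0.0025). This is a simplified caricature of the idea, not Postgres's actual costing code, and the function names are made up.

```cpp
// Cost is a unitless number where one sequential page read = 1.0.
// The defaults below are the documented postgresql.conf defaults.
struct CostConstants {
  double seq_page_cost = 1.0;        // sequential page fetch (the baseline)
  double random_page_cost = 4.0;     // random page fetch, 4x the baseline
  double cpu_tuple_cost = 0.01;      // handling one tuple, 400x cheaper than a random page
  double cpu_index_tuple_cost = 0.005;
  double cpu_operator_cost = 0.0025;
};

// Sequential scan over `pages` pages, applying one predicate to `tuples` rows.
double seqScanCost(const CostConstants& c, double pages, double tuples) {
  return pages * c.seq_page_cost +
         tuples * (c.cpu_tuple_cost + c.cpu_operator_cost);
}

// Index scan that reads `indexPages` index pages and then, in the worst case,
// one random heap page per matching tuple.
double indexScanCost(const CostConstants& c, double indexPages, double matches) {
  return indexPages * c.random_page_cost +
         matches * (c.random_page_cost + c.cpu_index_tuple_cost + c.cpu_tuple_cost);
}
```

The point of the weights is exactly the trade-off the planner has to make: with these defaults, an index scan only wins when the predicate is selective enough that the random page fetches it causes come out cheaper than scanning every page sequentially.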
So this is the DB2 cost model, and this is from a presentation by Guy Lohman, the guy behind Starburst that we talked about two classes ago, from IBM. He lays out all the different things that IBM includes in their cost model to try to come up with more accurate estimations of what they think a query is going to do. This shows that you have to include things like what the database actually looks like, so this includes the statistics you have about tables and columns and indexes, as well as constraints you may have on attributes: if you have a check clause that says something can't be greater than this or less than that, you can use that to estimate the cardinality or selectivity more accurately. They also do the thing that Matt asked about in the beginning: they actually run some microbenchmarks when you boot the system up, basically just running something almost like BogoMIPS in Linux, where you do some quick IO or compute the millionth digit of pi and use that as an estimation of what you think the hardware is capable of doing when it actually executes queries, and you can use that as a weighting factor in your estimations. If you have a distributed deployment, you can do the same thing: you can run microbenchmarks over the network and see how fast that actually is. They also keep track of parameters you would set in the system, like how much memory you have for your buffer pools or the sort heaps. But what is really interesting, and a good example to bring up, is that they actually have a dynamic component in their cost model that is based on what queries are running at the moment you're trying to do this planning. So if they know that there are a bunch of queries running at the same time, then one particular plan might be more expensive than another plan, because you end up having more contention with the other things running at the same time; whereas if you're running completely alone, maybe you can be more aggressive with how much memory or how much disk you're using. So when I say BogoMIPS, does everyone know what I'm talking about? Raise your hand if you know what BogoMIPS is. Alright, very few. If you ever look in Linux, when you log in, if you look at /proc/cpuinfo, it'll have all the information about your CPU: it'll tell you your cache sizes, the clock frequency, the vendor and the model. But then there'll be a little entry in there called BogoMIPS, as in bogus millions of instructions per second. This is a little microbenchmark that Linux runs when you boot the system just to see how fast it thinks your CPU is. It's not a super accurate estimation, but it's good enough for what they need, and they use it to help make decisions about how they do scheduling or handle interrupts. So it gives you a quick and dirty way to say, here's what we think the hardware capabilities are for the CPU. It's the same kind of microbenchmark that DB2 is running to say what they think the hardware can do, and they can use that to decide what will happen when they actually execute the query. Yes? So the question is, is there any substantial difference between what I'll call a microbenchmark and a regular benchmark? There's no scientific definition of a microbenchmark; I just mean something like computing the millionth digit of pi in a little
for loop that just does this; that's a microbenchmark. Running TPC-C would be a full benchmark. But again, it's not a scientific definition where clearly this is a microbenchmark and that isn't; it's just something we say. So again, the main takeaway here is that the commercial guys do way more sophisticated things than the open-source guys, and the idea is that they want to have better cost estimations. I would say in general there are a lot more academic papers and research materials available about what the IBM guys do than about the other two major commercial database systems. IBM is pretty open about what they do; Oracle and Microsoft less so. But if IBM is doing this, then you can bet Oracle and Microsoft are doing similar things. So now we can talk about what we want to do in an in-memory database system. As I said at the very beginning of this class, the major thing about an in-memory database is that the disk is gone, so we don't need to measure disk IO for our queries; that's not something we care about anymore. But now the cost of execution is shifted down, or becomes more fine-grained, and we have to consider what's actually going on in the CPU, and we also have to consider our memory access costs, because now cache misses and instructions per cycle are the main things we need to care about; that's going to be a better approximation of what we think the query is actually going to do when it runs. The tricky thing is that, unlike with the disk, our database system, because it's a user-level process, doesn't have the same kind of control over what's in our CPU caches as we had with our buffer pool manager, because the hardware is what's managing our CPU caches. In the case of the buffer pool manager, we know whether we're doing LRU or CLOCK or ARC or whatever, and we can use that in our cost model to estimate what's going to get evicted or not, but in the case of the CPU this is opaque to us; it's left to the hardware, and the hardware can do whatever it wants. Then there are a whole bunch of other things, like dealing with shared caches, and the fact that we can't actually pin things in our CPU caches. And then, we won't talk about this so much in this class, we'll start talking about it next week, but we may also now need to account for NUMA, or non-uniform memory access, where we can have two sockets or four sockets, and the data that one thread running on one socket needs might be sitting on another socket, and then the communication to get that data becomes more expensive. So this is all really tricky to do. When you think about it, Intel spends millions of dollars to build simulators before they put out a new Xeon or a new Core i7, to get a good estimation of what the performance of the chips is going to be. We would essentially need to replicate the same thing inside of our database system, and it would also have to be really, really fast, because we want to do this at the moment someone executes a query; we want to be able to get these estimates right away. So doing this low-level approximation of what the CPU actually does while we execute a query is super hard, and as far as I know nobody actually does it. Instead, what we're going to do is use the number of tuples that we're going to process per operator as a good enough estimation of what the CPU is actually going to do.
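Here is a minimal sketch of what that boils down to: the "cost" of a plan is just the sum of the estimated tuple counts flowing through its operators, which folds all of the CPU and memory effects into a single number. The plan-node struct and traversal are made up for illustration; they're not from any particular system.

```cpp
#include <memory>
#include <vector>

// A toy plan node: every operator carries an estimated output cardinality.
struct PlanNode {
  double estimatedOutputTuples = 0.0;
  std::vector<std::unique_ptr<PlanNode>> children;
};

// In-memory-style cost model: total tuples flowing through the plan tree.
// There is no IO term and no per-instruction modeling; the tuple count
// stands in for all of the CPU and memory work the operators will do.
double planCost(const PlanNode& node) {
  double cost = node.estimatedOutputTuples;
  for (const auto& child : node.children) {
    cost += planCost(*child);
  }
  return cost;
}
```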
Right, so all these things can be encapsulated in this one metric. What I'll also say is that there are going to be some additional optimizations we can apply that actually matter, that can make a big difference, but a lot of times we can make those decisions at run time. As we're reading the data we can decide that we maybe want to run this thread over here and that thread over there; that doesn't need to be included in our cost model at this point, and we can just handle it as a scheduling problem later on, which we'll cover next class. And again, if you're a distributed in-memory database like the MemSQL guys, you'd have to include the network IO in this. So as far as I can tell from reading the manuals of all the in-memory databases that are out there, at least the ones that describe what they're doing, everyone does this. I want to do one example, though, that again is exactly what was asked before, from Smallbase, where they actually do run these microbenchmarks to try to figure out what the hardware can actually do. Remember, Smallbase was a research project at HP Labs; it was one of the first in-memory databases, from the 1990s. Back then these systems were sometimes called real-time databases because they were meant to run in telco installations where you have to run a query or a transaction at the moment someone places a call, so you need really short latency in the millisecond range. Now we just call these in-memory databases. HP then spun off Smallbase as a start-up, and that became TimesTen, and then Oracle bought TimesTen in 2006. It's more or less in maintenance mode; they don't really update it as far as I can tell, and Oracle mostly tries to sell it as a cache in front of the main database server where they make all the money. But like I said, it was one of the first in-memory databases. The way the Smallbase cost model works is that it basically has two phases. The first phase is back at the development shop: when people actually build the database system, they identify a bunch of low-level execution primitives that the database system is going to execute when it processes a query. Don't think of these primitives as being at the operator level, like a hash join or a sort-merge join; think of them as things like evaluating a predicate on a single tuple or doing a lookup inside an index. These are the low-level things you have to do as part of executing a query plan or an operator. So they generate a bunch of these low-level primitives, and then they create micro-benchmarks that approximate roughly what a query will do when it executes those primitives. Then, when you boot the system up, so this is now when you actually deploy Smallbase, the system runs a bunch of these primitive test micro-benchmarks a couple million times and tries to estimate what the CPU and the memory can actually do. And then, when you compute your execution cost or operator cost in your cost model, instead of using those magic weighting factors that we saw in the Postgres documentation, you use the results from these micro-benchmarks to help you come up with more accurate estimations. As far as I can tell this is not used at all anymore in TimesTen; I don't know whether they used it in the 1990s and eventually abandoned it, but right now, as far as I can tell, TimesTen is using the same approach that I said before, estimating the number of tuples you process per operator. But again, this is a good example of the microbenchmark approach.
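Here is a rough sketch of what that boot-time calibration idea looks like: time one low-level primitive (here, a cheap predicate evaluated over an array) and turn it into a per-tuple cost weight that the cost model can use instead of a hard-coded constant. This is only an illustration of the approach, not Smallbase's or TimesTen's actual code.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time one execution primitive (a simple predicate over a column of ints)
// and return the average nanoseconds per tuple. At boot you would run a
// handful of these for different primitives and store the results as the
// weights your cost model uses, instead of hard-coded magic constants.
double calibratePredicateCostNs(std::size_t numTuples, int iterations) {
  std::vector<int32_t> column(numTuples);
  for (std::size_t i = 0; i < numTuples; i++) {
    column[i] = static_cast<int32_t>(i);
  }

  volatile int64_t matches = 0;  // volatile so the loop isn't optimized away
  auto start = std::chrono::steady_clock::now();
  for (int it = 0; it < iterations; it++) {
    for (int32_t v : column) {
      if (v % 7 == 0) {          // the "primitive": evaluate a cheap predicate
        matches = matches + 1;
      }
    }
  }
  auto end = std::chrono::steady_clock::now();

  double totalNs = static_cast<double>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
  return totalNs / (static_cast<double>(numTuples) * iterations);
}
```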
Alright, so if I said that estimating the number of tuples processed per operator is the most important thing, the question is how we're actually going to do this. Getting this estimation is based on three factors. The first is what access methods we have available for each table we need to access in our query: whether we're doing a sequential scan, how the data is sorted, if it's sorted at all, how it's compressed, or whether we're doing an index scan and what indexes are available to us. Then we have to deal with the distribution of values in the attributes in the database; we need to know this in order to figure out how many tuples we think we're going to get out of an access. And then of course there are the predicates we use in our query. For simple things, like a WHERE clause where something equals something and it's doing a lookup on the primary key, we know it's going to return a single tuple, and that's why these simple queries are easy for us to estimate. But if it's a bunch of disjunctions or a bunch of regular expressions or LIKE clauses, then it becomes more problematic, and using real simple heuristics to figure out the number of tuples we're going to process per operator becomes more tricky and more error-prone. So the main construct, the main factor we're going to use to figure out how much data we're going to have to process, is called the selectivity. The selectivity of an operator is essentially what percentage of the data that's being fed into it will be emitted out of it, will be generated as its output. Typically you model this as a probability, where the probability is based on what percentage of the tuples in the input will satisfy the predicate and therefore be emitted as the output. So if I have 10 tuples and one of those 10 tuples will match my predicate, I would say my selectivity is 10%: 10% of the tuples will match based on my predicate, and the operator will spit that out. There are a bunch of different ways we can actually compute this selectivity, but it's going to be the main thing we use to figure out how much data we're going to have to process. The first thing is that you can just rely on the domain constraints that are specified in the schema. If you have an enum, for example, you know it can only have certain values, or if you have check clauses in your CREATE TABLE statement, you know roughly what the range of values is. You can also rely on pre-computed statistics. In the case of Oracle they call these zone maps; if it's IBM they call them synopsis tables. Remember, I said before when we talked about compression that zone maps are pre-computed aggregations or statistics about individual columns or blocks of data in the database. So for a given block you can pre-compute the min, the max, the sum, the average, and the count of the values in the block, and you can use this as a way to say, based on my predicate, am I going to find a match in each block. You can also use histograms, which is the traditional way we teach this in the introduction class; you can use equi-width histograms to say, here's the number of occurrences of a particular value for a column. And you can also use approximations, or sketches.
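Here is a minimal sketch of what those first two kinds of pre-computed statistics might look like: a per-block zone map that lets you prune blocks that can't contain a match, and an equi-width histogram you can use to estimate the selectivity of a range predicate. The struct layouts and names are hypothetical, not any particular system's catalog format.

```cpp
#include <cstdint>
#include <vector>

// Per-block zone map: min/max/count are enough to prune blocks for a
// simple range predicate like "val > bound".
struct ZoneMap {
  int64_t min;
  int64_t max;
  uint64_t count;
};

bool blockMayMatchGreaterThan(const ZoneMap& zm, int64_t bound) {
  return zm.max > bound;  // if even the block's max fails, skip the block
}

// Equi-width histogram: counts[i] holds how many values landed in bucket i.
struct EquiWidthHistogram {
  int64_t lo;               // inclusive lower bound of the first bucket
  int64_t bucketWidth;
  std::vector<uint64_t> counts;
  uint64_t totalCount;
};

// Rough selectivity of "val > bound": whole buckets above the bound count
// fully; the bucket containing the bound is assumed uniform inside.
double selectivityGreaterThan(const EquiWidthHistogram& h, int64_t bound) {
  if (h.totalCount == 0) return 0.0;
  double matching = 0.0;
  for (std::size_t i = 0; i < h.counts.size(); i++) {
    int64_t bucketLo = h.lo + static_cast<int64_t>(i) * h.bucketWidth;
    int64_t bucketHi = bucketLo + h.bucketWidth;
    if (bucketLo > bound) {
      matching += static_cast<double>(h.counts[i]);
    } else if (bucketHi > bound) {
      double fraction = static_cast<double>(bucketHi - bound) / h.bucketWidth;
      matching += static_cast<double>(h.counts[i]) * fraction;
    }
  }
  return matching / static_cast<double>(h.totalCount);
}
```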
A sketch approximates what the histogram tries to compute exactly. And then the last approach is to use sampling, where you collect some representative tuples from your tables, run the predicates over them, and use that to compute the selectivity. I'll go through these last two in a bit more detail. At a high level, what happens in the case of histograms is that you run the ANALYZE command in SQL, which does a sequential scan on your tables and, based on that, spits out some statistics that say this is what my data looks like. Obviously you don't want to be doing this all the time, so you run it every so often, and then as your table changes, as you update things, insert things, and delete things, your histograms become stale or inaccurate, and then you run ANALYZE again to refresh them and get more up-to-date statistics. The way to think about this, though, is that when you execute queries, a lot of them could be sequential scans, which is essentially doing the same thing as your ANALYZE function. So is there a way to piggyback the ANALYZE work you would normally do periodically onto your query execution? That way you don't have to make sure you run ANALYZE as a cron job; you incrementally refresh your statistics over time and get more accurate results. This is what IBM tried in the early 2000s with this thing called LEO, the learning optimizer. Basically what happens is your query shows up, you do the parsing and planning and run it through the optimizer, the optimizer uses the cost model, the cost model says here's the estimate of the selectivity of your predicates, and you record that and put it into the query plan. Then, when you execute the query and you're actually doing the scan on the base tables, you check whether the estimates that were computed in the cost model during the planning phase match up with the actual selectivity you're seeing in the real table while you're scanning it. And if you notice that you're off, you send changes or deltas back to the cost model and say, here's where you were off and here's what I actually saw when I ran the query. This seems like an awesome idea; it seems to exactly solve the problem we're going to have of inaccurate estimations due to stale histograms. I don't know what the hell happened, but apparently this never worked. Every DB2 DBA I've talked to over the last five years tells me that every time they install DB2 they turn this thing off, because it never worked and made things worse. So I need to go back and maybe email Guy Lohman and figure out what actually went wrong, why this thing was always inaccurate, because it seems like the right thing to do, a simple thing to do, but something happened and it didn't work. What I'll say, though, is that the way they were doing this was just based on statistics; it wasn't anything like machine learning or reinforcement learning. But at a high level it is kind of doing the same thing: you make an estimate, you take an action, you see what the actual correct result is, and you feed that back into your model and try to improve things. This is actually something we're trying here at CMU; the group that is doing the cost model stuff for project three wants to try putting a deep net in our cost model. But I also know that Microsoft is doing this in their system as well; the paper is not published yet, but they're essentially putting a deep net in the cost model.
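To make that feedback-loop idea concrete, whether you drive it with simple statistics like LEO did or with a deep net, here is a minimal sketch of the mechanism: the executor reports how many rows an operator actually produced versus what the planner estimated, and the ratio is kept as a correction factor for the next time that predicate gets planned. This is just the shape of the idea, not IBM's or Microsoft's implementation, and all the names are made up.

```cpp
#include <string>
#include <unordered_map>

// Correction factors learned from executed queries, keyed by a normalized
// predicate string (e.g. "cars.make = ?"). A factor of 1.0 means the
// estimate matched reality; > 1.0 means we have been underestimating.
class EstimateFeedback {
 public:
  // Called by the executor after an operator finishes.
  void record(const std::string& predicateKey,
              double estimatedRows, double actualRows) {
    if (estimatedRows <= 0.0) estimatedRows = 1.0;
    double observed = actualRows / estimatedRows;
    double& factor = factors_[predicateKey];
    if (factor == 0.0) factor = 1.0;            // first observation
    factor = 0.8 * factor + 0.2 * observed;     // smooth out one-off outliers
  }

  // Called by the cost model the next time it estimates this predicate.
  double adjust(const std::string& predicateKey, double estimatedRows) const {
    auto it = factors_.find(predicateKey);
    return it == factors_.end() ? estimatedRows : estimatedRows * it->second;
  }

 private:
  std::unordered_map<std::string, double> factors_;
};
```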
Microsoft is essentially doing the same thing that LEO tried to do. So I think at a high level this is what you want to do; I just don't know what happened with the implementation. The alternative to histograms is to use approximations. For this, again, think of it as something you still have to run ANALYZE to compute; this is not the learning stuff I just mentioned. Instead of storing accurate histograms, which can be difficult to get right and can take a lot of space, you can use what are called sketches, which allow you to approximate the same information you would get from a histogram but with an error bound: the estimate is going to be wrong, but it gives you a bound that says how wrong it thinks it actually is. You can do this for things like count-distinct, counting the number of distinct elements, and frequent items, the heavy hitters, the things you're seeing most often. The reason I bring this up is that two years ago we had the CEO of Splice Machine, which is a distributed database targeting HTAP workloads; it's basically HBase plus Spark mashed together to do hybrid transactional and analytical processing. The CEO is Monte Zweben; he's a CMU alum and he's actually on the board of advisors for SCS. He came and gave a talk in the introduction database class two years ago, and he talked about how they were having a lot of problems getting good results out of their cost model because they were using histograms the way everyone else does, but when they switched over to sketches, it turned out to be way less work to maintain these things, and they ended up generating more accurate results than they did with the regular histograms. And they talked about how, when they added this in, they could go from supporting 10 tables in a join up to, I think, 75 or something like that. So he was a big fan of this, and we actually tried doing this in Peloton; I think the code is still there, but we're in the process of verifying whether our approximations are actually correct. I like how every time I talk to a database company about query optimizers, the metric they always use to say how good their optimizer is is how many tables they can support in a join. The Greenplum guys told me Orca could do 35-table joins, then the Splice Machine guy told me they could do 75, then another vendor told me they could do 135. I don't know how realistic that actually is. I'm sure there are queries doing joins over those crazy numbers of tables, but I think it's the wrong way to measure how good your query optimizer is. As we saw in the paper you guys read, it's not just the number of tables you can join, because anybody can do that; it's whether you're actually getting the proper ordering. So the last way to compute your selectivity is to do sampling, and the idea here is that rather than trying to derive the selectivity from histograms, you take a representative subset of the data, the tuples from the tables you're trying to scan, run your predicate on this sample, and then use that to compute the selectivity. Of course, the number of tuples you need to examine in your sample depends on a lot of factors, especially the number of tuples you have in your table: if you have a billion tuples in your table and you take a sample of 10, that's probably going to be inaccurate.
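Here is a minimal sketch of that sampling approach: keep a small read-only sample drawn from the table, evaluate the predicate against it, and use the matching fraction as the selectivity. The sample here is just a vector of values from one column and the predicate is a callback; a real system would sample whole tuples and refresh the sample as the table changes.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Estimate a predicate's selectivity by running it over a cached sample.
// `sample` is a read-only copy of one column drawn from the base table.
double sampledSelectivity(const std::vector<int64_t>& sample,
                          const std::function<bool(int64_t)>& predicate) {
  if (sample.empty()) {
    return 1.0;  // no statistics at all: fall back to "everything matches"
  }
  std::size_t matches = 0;
  for (int64_t value : sample) {
    if (predicate(value)) {
      matches++;
    }
  }
  return static_cast<double>(matches) / static_cast<double>(sample.size());
}

// Usage: cardinality estimate = table row count * sampled selectivity, e.g.
//   double est = tableRows * sampledSelectivity(sample,
//                    [](int64_t v) { return v > 100; });
```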
There are two ways to actually maintain your samples. The first is that you can have an internal table that's a read-only copy of the base table, which the query optimizer and the cost model can go against without worrying about interfering with any other query running at the same time. Every so often you periodically go through and refresh your sample so that you keep getting accurate estimates, and typically the way you decide to do this is something like: if my table has changed by 10%, if 10% of the table has been updated or deleted or inserted, then I'll go through and re-compute my sample. The other approach is to actually run the sampling lookup directly on the real data. To do this you basically run your sampling query under READ UNCOMMITTED, because you don't want it to be a transaction that interferes with other queries running at the same time; you're doing this inside the cost model, inside the query optimizer, so it's not actually running user queries, but you don't want the cost model to interfere with or slow down the real queries that are running. The challenge with this is that you may end up reading multiple physical versions of the same logical tuple, because if you're just scanning through and you don't really care whether something's actually visible or not, within the same block you may read multiple versions of the same logical tuple, and that'll screw up your estimates. As far as I know, I think this is what HyPer does and this is what SQL Server does, and for our system we ended up implementing it this way as well. So whether we're using sampling or approximations or histograms, once we have some data structure we can look at to compute our selectivity, we can generate the cardinality, which again is going to be the number of tuples that we're going to spit out. We compute this cardinality as the selectivity of our predicates multiplied by the number of tuples being fed in as input. In the case of accessing a base table with a sequential scan, this is just the total number of tuples in the table times the selectivity of my predicate. Now, in the introduction class, the way we tell you to compute cardinalities, and the way any textbook tells you to compute cardinalities, is always based on three major assumptions that we use to make these estimations easier. The first assumption is that the data we're accessing has a uniform distribution of values: if a column has a hundred distinct values, I'll say that every possible value occurs one out of a hundred times, and that's its selectivity. This is obviously not true, because real data is usually heavily skewed; we talked about the Brown corpus when we talked about compression, and a lot of times there will be a small number of values that account for most of the data, so the data will be really skewed and this assumption won't work out. The way to handle this is that you can maintain a separate data structure for what are called your heavy hitters, the most frequent items, and maintain accurate estimations of the cardinality for just those elements, the top ten values that appear the most, and then everything else can just use the uniform data assumption. The next issue is that we assume our predicates are independent. That means, by basic probability, if ten percent of my tuples match one predicate and ten percent of my tuples match another predicate, you multiply those together and now you say one percent of your tuples will match.
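Written down, those two assumptions are just this; a minimal sketch of the default estimator a textbook would give you, not any particular system's code.

```cpp
// Textbook defaults:
//  - uniformity:   sel(col = value) = 1 / distinctValues(col)
//  - independence: selectivities of ANDed predicates multiply
double uniformEqualitySelectivity(double distinctValues) {
  return distinctValues > 0.0 ? 1.0 / distinctValues : 1.0;
}

double conjunctionCardinality(double tableRows, double sel1, double sel2) {
  return tableRows * sel1 * sel2;  // only valid if the predicates really are independent
}
```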
But as we'll see in a second, this is not always the case, because a lot of times your values are correlated with each other, your attributes are correlated with each other, and therefore you end up with an inaccurate estimate. And the last assumption is what's called the inclusion principle when we join tables: it basically says that for every key in your inner table there will be a corresponding key in the outer table, like with foreign keys. But a lot of times in OLAP queries they're not joining on foreign keys, they're joining on arbitrary attributes, and therefore this doesn't always apply. To show you how the first two assumptions can screw you up, there's another example I like to use, again from Guy Lohman, where he has a simple database of automobiles. For a really simple table of cars, say the number of different makes of cars is ten, so this is like Ford, Toyota, Tesla, and the number of different models in your table is a hundred, so this would be like Corolla, Camry, Accord. If you have a really simple query like this, WHERE make equals Honda AND model equals Accord, and you make the independence and uniformity assumptions from the last slide, the cost model will compute the selectivity as 1/10 times 1/100, which is 0.001. But as humans we know that these values are actually correlated, because we know that only Honda makes the Accord, so the true selectivity is 1/100. And you can see from this that with what the independence and uniformity assumptions give us, we're an order of magnitude off: if we make these assumptions, we're going to compute a selectivity that generates an order of magnitude fewer tuples than what we'll get in the real system. And you saw this in the paper you read: the cost models for all the different database systems are underestimating the amount of data they're generating, and it's because of these things here. One way to solve this is to tell the database system that some columns are correlated, and it can use that when it computes the selectivity to treat them as a group rather than computing them individually, and this can help you get more accurate estimations so you don't underestimate things as much. As far as I know, this is only supported in the commercial systems; it first appeared in DB2, and I think both Microsoft and Oracle have this now. But as far as I know this is a manual process, so the DBA has to come in and say, like my last example, make and model are correlated with each other, so don't treat them as independent predicates, treat them as a single group. Because if you think about doing this automatically, it's an exponential problem: you would need to scan every column and compare it with every other column, and that can be really, really slow in a large database. So this helps, but it's a manual thing; a human has to tell us this information. To give you an idea of how things can go wrong, I'm going to use a really simple example where bad estimations can foul us up, and you'll see how these things amplify as you go up the plan. We have a really simple query here: we're joining A, B, and C, we're going to join A and B on their IDs and A.ID and C.ID together, and then we have a single greater-than predicate on B.ID. For this plan here, it doesn't matter whether it's a logical plan or a physical plan, we don't care about that, and we don't
care whether it's a hash join or a sort-merge join; again, we're just talking about selectivities here. At the base tables, at the bottom of the query plan, we need to compute the cardinality of accessing those tables. In the case of A and C, since there are no predicates on their attributes other than the join clauses, we just say that their selectivity is 100%, the total number of tuples that they have. In the case of B, there's this predicate on B.ID, so its cardinality is going to be the number of tuples in the table multiplied by the selectivity of this predicate, and we can use the statistics in our catalogs to compute that. Then we start computing the cardinality of the join results. For the first join on A and B, we take the cardinalities of A and B multiplied together, and we scale that using the selectivity of the join clause of A.ID with B.ID, together with the selectivity of the greater-than predicate on B.ID. Then, as we feed this estimation from this operator up into the next operator, we see that we're including this math from down here in our estimation for the three-way join of A-B joined with C. And the thing to point out is that if our estimation of the selectivity of this predicate down here is wrong, then this estimate is wrong, and then it gets even more wrong when you feed it into this one. These problems start off super small, just a little bit off, but then that little bit off gets passed on to the next guy, which is now a little bit off of something that was already a little bit off, and that gets amplified going up and up and up; as you add more tables, things get worse and worse. We'll see this in the join order benchmark from the HyPer guys: if your selectivities down low are wrong, then you get really wrong as you go up; things don't get better, they get worse. So this is exactly what I said here. This paper I had you guys read doesn't describe how to actually build a cost model; as I said, for most in-memory databases it's going to be based on the number of tuples that each operator is going to process. The reason I had you guys read this paper is because it's a scientific evaluation, a scientific measurement, showing you what happens, how wrong these cost model estimates can actually get. This is from the HyPer guys. What they did was develop a new benchmark called the Join Order Benchmark, or JOB. They load in some tables for the database that are skewed based on a real-world distribution, they generate different queries with different join orderings, and they measure the difference between what the cost model estimates as the number of tuples or the selectivity of each operator versus what the actual data looks like. The way this works is: they load the database into each of these systems, then they run ANALYZE so the system can go look over all the data and compute the best statistics that it can, then they run their queries, and then they instrument the system to get out the query plans and look at what the cost model is actually estimating for each of the join operators. This is entirely a read-only workload; this is the best-case scenario for these systems. You load a static database, you run ANALYZE, then you run your experiments. They don't have to deal with inserts, updates, and deletes that could be modifying or changing the base tables.
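For reference, the textbook form of that join estimate, which matches how we compute it in the introduction class, is card(A join B) = card(A) * card(B) / max(V(A.id), V(B.id)), where V(.) is the number of distinct join-key values, times the selectivity of any extra predicates. The little sketch below plugs in made-up numbers to show how an error in the bottom-level selectivity carries all the way up the plan; none of the values come from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// Textbook join cardinality estimate under the uniformity / independence /
// inclusion assumptions.
double joinCardinality(double cardOuter, double cardInner,
                       double distinctOuterKeys, double distinctInnerKeys) {
  return (cardOuter * cardInner) / std::max(distinctOuterKeys, distinctInnerKeys);
}

int main() {
  // Hypothetical sizes: three 1M-row tables, 100k distinct join keys each.
  double cardA = 1e6, cardB = 1e6, cardC = 1e6, distinctKeys = 1e5;

  // Suppose the predicate on B really keeps 10% of its rows, but the cost
  // model estimates 1%: a 10x underestimate at the very bottom of the plan.
  double estAB  = joinCardinality(cardA, cardB * 0.01, distinctKeys, distinctKeys);
  double trueAB = joinCardinality(cardA, cardB * 0.10, distinctKeys, distinctKeys);

  double estABC  = joinCardinality(estAB,  cardC, distinctKeys, distinctKeys);
  double trueABC = joinCardinality(trueAB, cardC, distinctKeys, distinctKeys);

  // The 10x error at the leaf is still there two joins up, and any extra
  // mis-estimation made at each level multiplies on top of it.
  std::printf("A join B:          est=%.0f  true=%.0f\n", estAB,  trueAB);
  std::printf("(A join B) join C: est=%.0f  true=%.0f\n", estABC, trueABC);
  return 0;
}
```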
Again, this read-only setup is really the best these systems can do, given that they have all the information ahead of time. So there are two sets of graphs I want to show you. This first one here measures the estimator quality for five different database systems as you increase the number of joins. Each of these boxes represents a different database system, and then within each box the x-axis is the number of joins happening in the query, so you're basically going from zero up to around seven tables being accessed. This middle line here represents where the error is zero: if you're above it, you're overestimating the selectivity, saying more tuples are coming out, and if you're below it, you're underestimating the selectivity, saying fewer tuples come out than actually do. The major trend across all these systems is that as you add more tables, as you do more joins, things just die, things just get worse. Again, this is what I said before: we don't know whether the commercial systems are making the independence assumption or the uniform data assumption, and I suspect they're not for everything, but this is representative of what you would see if you had these problems, where things just get worse as you add more tables. The first thing to point out is this one here, which actually does the best: you see it's okay up to about three tables, and then around four tables it gets worse, but of all the different systems it does the best; its bounds are a little bit larger than this one, but they stay closest to that middle line compared to everyone else, so it's doing reasonably well. These three here almost look exactly the same; they're all trending down at the same rate, with the bounds a bit tighter in this one and a bit wider in that one. (What's that? Yeah, we'll get there.) So again, these look roughly the same, and then this one here is the worst one: as you see, after just about three tables it completely falls apart and its estimations are way off. Alright, a question? Since these error bounds seem very systematic, why can't you just correct for them after a certain amount? So his statement is: since these errors look almost programmatic, meaning they occur at the same rate in the same way, can't you just put a little fudge factor into your cost model to shift it back up? Yes, you could do that, and this is essentially what we want to try to do with a deep net in there. I don't know why they don't do that; again, this is what LEO was supposed to solve, they're like, oh, I'm wrong, let me correct myself, but it doesn't work. Yes? Did the paper seem to imply that if you're uniformly estimating badly at a certain number of joins, you're probably still going to end up selecting a plan that's okay, since you're estimating all of them the same way? So his statement is: if you're always estimating uniformly badly, does that mean you'll still choose the best plan, because even though the estimation is wrong, relative to all the other plans it's still doing okay? Yeah, they all have the same bias. The issue with that, and we'll see this on the next slide, is that if you use these estimations to preallocate
things like hash table sizes or buffer sizes, then you're going to realize, oh, I don't have enough space, I've got to resize my stuff, and that becomes problematic. This is highly dependent on the benchmark, right? They mentioned that for TPC-H there is not much difference, but for their join order benchmark there is. So his statement is that this depends on the benchmark: in TPC-H there wasn't a really big difference. TPC-H has zero skew, so that's why these systems work out just fine there. So there were murmurs, people trying to figure out which system was which. The paper tells you the first one and the last one: the first one was Postgres, the last one was HyPer. So we're going to take a guess at what these middle three are. He says start with the bad one, right, the worst one here. Who thinks it's Oracle? Alright, who thinks IBM? Microsoft? Any other system? You think it's another system? What do you think it is? So his statement is that a commercial system can't be that bad. Okay, so this is the best one: what do you guys think this is? Microsoft, right? Okay, you're right, SQL Server; then Oracle and DB2. Alright, SQL Server is really good. I say this every year, I say this all the time: I consider SQL Server to be the leading commercial system, I think it's the most state-of-the-art, and it does some really interesting things. They actually use Cascades, they also do sampling the way the HyPer guys do, and I think the quality of their code really shows here; it's very impressive, believe it or not. Okay, so the next experiment is going to show you the improvement you can get when you have the correct estimates. What they did was take Postgres 9.4 and add hooks into the cost model so that when the system goes to its catalogs to get statistics to estimate the selectivity of a predicate, they intercept that and inject the real values. Then they ran their benchmark with all the queries running with the real estimates, and they compare what the default cost model does versus what happens when you have the real estimates. The way to read this graph is that it's the percentage of the queries in their benchmark, grouped by how far off they are from when you run the query with the exact estimates. So this box here is saying that 60% of the queries are 1.1x to 2x slower than what those queries would do when you have the totally accurate estimations. What's going on here is that because we're underestimating the amount of data the operators are going to spit out, Postgres is saying, well, it doesn't look like you're going to have to process much data, let me use a nested loop join instead of a hash join. Now, in the intro class we always say the hash join is faster than the nested loop join from a pure algorithmic standpoint, but there's a bunch of setup you have to do when you want to do a hash join that we don't really cover, like allocating enough space to build the hash table and then hashing your data and putting it into the hash table. With a nested loop join you don't do that; you just do your loops and spit data out. So what's happening here is that we're underestimating the amount of data the operators have to process, and the system falls back to using nested loop joins. So what they did then was pass a flag into Postgres to tell it not to use
nested loop joins at all, and now you see that a bunch of the queries shift closer to the performance you get with 100% accurate estimates. The way to think about this is that the closer you are to this side of the graph, the more your queries are performing at the same speed as they would when you have the true cardinalities. So now the question is what's going on with these other ones here, where we're still running slower even though we're using hash joins the way we should. The issue is that in Postgres 9.4 they didn't support dynamic or incremental resizing of the hash table. Again, we're underestimating the amount of data our operators are going to spit out, so the system says, alright, I need to handle 100 tuples, I'm going to allocate a hash table that can support 100 tuples. Then, when you're way off and you start inserting more than 100 tuples, you start to have collisions in your hash table and you get these long chains, so now every single time you do a lookup or probe in the hash table to do your join, you're walking and reading sequentially through these long bucket chains. They fixed this in Postgres 9.5, where they added support for incremental hash table resizing, and what the authors did was backport that feature from 9.5 into 9.4. Now you see that when you force it not to use nested loop joins and you support dynamic resizing, a large percentage of the queries match up with what you would get when you're 100% accurate. So this is showing you the performance degradation that follows from the last slide, where I showed you how far off their estimates are; this is showing how that affects you when you actually look at runtime performance. And again, it's not just choosing bad plans; it's also choosing bad data structure sizes and other runtime selections that you have to make in your query plan. So this is a summary of some high-level findings or conclusions that the Germans made. This was sent to me by Viktor Leis when we were talking about this paper a while ago, and these are the main takeaways of all this. The first thing he points out is that getting good query optimization is often more important than having a really fast execution engine. You can have the fastest execution engine in the world, but if you're always picking crappy plans, then it doesn't matter how great your system is. The example he brought up was TPC-H: the numbers look fine on TPC-H, but TPC-H doesn't actually look like a lot of things in the real world, because it assumes a uniform distribution, whereas real data is heavily skewed, and so making sure you choose the correct orderings of things can matter a lot. The second is that cardinality estimates are routinely wrong; we saw that in the graphs. So ideally it would be nice if you could have operators in your database system that don't rely on having super accurate estimates, because your estimates are essentially going to be wrong. The hash join in Postgres is a good example of this: in 9.4 and prior versions, it relied on having accurate estimations of the output, the amount of data it was going to have to process, and if you got that wrong, performance fell apart. So if you can have these operators be self-managing and incrementally adapt themselves when they recognize that the data doesn't look like what they thought it was going to look like, then they can handle this more smoothly.
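Here is a minimal sketch of that kind of self-managing behavior: a join hash table that is sized from the optimizer's estimate but doubles itself whenever the build side blows past that estimate, so a bad cardinality guess costs a few rehashes instead of long collision chains. This is just an illustration of the idea, not Postgres's hash join code.

```cpp
#include <algorithm>
#include <cstdint>
#include <list>
#include <utility>
#include <vector>

// Toy chained hash table for a join build side. It is pre-sized from the
// optimizer's cardinality estimate, but grows itself whenever the load
// factor gets too high, so an underestimate degrades gracefully.
class JoinHashTable {
 public:
  explicit JoinHashTable(std::size_t estimatedRows)
      : buckets_(std::max<std::size_t>(16, estimatedRows * 2)) {}

  void insert(int64_t key, int64_t payload) {
    if (size_ + 1 > buckets_.size()) {
      grow();  // keep the load factor at or below 1, even if the estimate was wrong
    }
    buckets_[bucketFor(key, buckets_.size())].emplace_back(key, payload);
    size_++;
  }

  // Probe side: invoke fn on every payload stored under a matching key.
  template <typename Fn>
  void probe(int64_t key, Fn&& fn) const {
    for (const auto& entry : buckets_[bucketFor(key, buckets_.size())]) {
      if (entry.first == key) fn(entry.second);
    }
  }

 private:
  static std::size_t bucketFor(int64_t key, std::size_t numBuckets) {
    return static_cast<std::size_t>(key) % numBuckets;
  }

  void grow() {
    std::vector<std::list<std::pair<int64_t, int64_t>>> bigger(buckets_.size() * 2);
    for (auto& bucket : buckets_) {
      for (auto& entry : bucket) {
        bigger[bucketFor(entry.first, bigger.size())].push_back(std::move(entry));
      }
    }
    buckets_ = std::move(bigger);
  }

  std::vector<std::list<std::pair<int64_t, int64_t>>> buckets_;
  std::size_t size_ = 0;
};
```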
They also make the argument that in the case of OLAP queries, just having a good hash join implementation and doing fast scans might actually be good enough, or better than doing anything more sophisticated. Even if you have indexes, it may be better to just blaze through the table with a parallel scan and vectorized predicate evaluation, which I'll talk about in a few more lectures; doing the combination of those two efficiently may actually be better than using indexes. And the last one is that trying to build a really accurate cost model, like the thing I talked about in Smallbase where you're doing those microbenchmarks, does not actually help you. You're better off making sure your cardinality estimations are accurate, because getting the join ordering right is what matters the most. It doesn't matter that you compute, at a fine grain, the CPU cost of doing an index scan versus a sequential scan; if your join ordering is crappy, then the whole query plan is going to fall apart. This is the long pole in the tent, the thing we need to worry about the most, and super fine-grained or super accurate CPU estimations are not something we should be concerned about. So, some parting thoughts on all of this. As I said, computing the number of tuples processed by the operators in our query plan is a reasonable cost model for in-memory databases, because it encapsulates all the various things that could be going on in the system. But as we saw, getting your selectivity estimations and cardinality estimations correct is actually a difficult thing to do, and I think a combination of sampling and sketching, possibly also with deep nets as a feedback loop, is the right way to achieve accurate estimations. And again, this is independent of whether you're using Cascades, Volcano, stratified search, Starburst; it doesn't matter, you need this in order to make the other stuff work well. Are there any questions about optimization or cost models? You're all dying to go build your own cost model, right? Alright, so I now want to talk about some high-level things you should keep in the back of your mind as you get started working on your project, as a way to help you navigate the code and understand what's going on. As a disclaimer for this, I'll say that I'm not perfect; as much as I read reviews where the students write that I'm godlike, I'm not. I would say that I've worked on two different database systems in my life: I've spent some time in Postgres, and I worked on a distributed batch processing system before I started grad school, so I have spent some time reading and working on other people's code. I've also done some legal work where they give you a big source code dump and you're required to sit down in a room and try to figure out what the thing actually does, to help somebody testify and say whether they violated a patent or not, which I can talk about later; that's actually always really fun. You could call this enterprise code, and enterprise code is just as bad as research code; I don't want you to say research code is total crap. So I'm not claiming that I know everything about software engineering or how to get started on a complex code base, but I've done it enough that I can talk about the things that helped me and use them as a way to help guide you guys. The first thing I'll say, a major observation, is that not only is it
Are there any questions about optimization or cost models? You're all dying to go build your own cost model, right? Alright, so now I want to talk about some high-level things you should think about as you get started working on your project, things to keep in the back of your mind to help you navigate the code and understand what's going on.

As a disclaimer, I'll say that I'm not perfect. As much as I read reviews where students write that I'm godlike, I'm not. I would say that I've worked on two different database systems in my life: I've spent some time in Postgres, and I worked on a distributed batch-processing system before I started grad school, so I have spent some time reading code and working on code. I've also done some legal work where they give you a big source code dump and you're required to sit in a room and figure out what the thing is actually doing, to help somebody testify about whether they violated a patent or not, which I can talk about later; that stuff is actually always really fun. You could call that enterprise code, and enterprise code is just as bad as research code, so I don't want you to say research code is total crap. So I'm not claiming that I know everything about software engineering or how to get started on a complex code base, but I've done it enough that I can talk about the things that helped me, and use them to help guide you guys.

The first thing I'll say, a major observation, is that not only in this class but for the rest of your career, if you stay in software development or computer science, it's almost never the case that you're going to be working on a project from scratch. It's 2018: nobody writes stdio.h from scratch. Everyone is using existing code; that's how we get things done much more quickly. So you should expect that throughout your career you're going to have to sit down at a desk or a terminal and start looking at code that you didn't write. And as I said, the person who wrote it might be dead, or the person might have hygiene problems so you don't want to go talk to them, so you don't have a way to go talk to the person, and you need to figure out on your own how to handle this. I'll also say that in talking with my friends at various database companies and startups, the number one skill they always tell me they want in new hires is people who can sit down and work independently and efficiently on these existing code bases. When I first started planning to teach this course a few years ago, I asked all the different database companies: hey, I'm teaching an advanced PhD-level course on databases, what skills do you want me to teach the students? What's the one thing you want new students to know? Do you want them to know indexes? What do you guys want? Without me prompting them, and without telling them what the other companies said, they all basically said the same thing: they want people who can work on large code bases, and it's really hard to do.

So what not to do to get started on this: passive reading. If you're going to sit down and read code just for the sake of reading code, you're going to have a bad time. This often happens when I have students who say they want to start on a project, we give them a small task to help them get their feet wet in the system, and when I go check on them a few days later their response is, "I haven't started writing any code, I've just been reading the code." That's always a bad sign for me, and it tells me the student is not going to work out, because if you're just reading the code without any purpose and you don't understand it, you're not going to internalize everything, you're not going to absorb what you're actually reading, and it's not going to make any sense. The best thing you can do to get started is to actually try to do something in the code. We can talk about what that doing could be, but that's the right way to get started. If you just sit down and say, "I'm going to read this file and this function," it's never going to happen. I know this because, across various systems, I personally find that if I just read code for the sake of reading it, it takes way longer to get started on anything.

So what are the things you can actually start doing? Well, the first thing is to start writing test cases: adding new test cases or improving existing ones. This is going to force you to understand what the code is actually doing in order to make your test case perform the thing you want it to perform, and again this builds a mental model of what's going on and can help you figure things out much more quickly. These should be unit tests, not regression tests: in the context of a database system, if you just write test cases that send SQL through JDBC and get back a response, as much as I love that, it doesn't require you to actually go into the system and see what's going on, because you're not calling individual components. So writing low-level unit tests is the way to get started. In the case of our system we have GTest, the C++ test framework, and when you call make check this is what gets fired off. Nobody will complain that you wrote test cases, unless you write bad test cases that just assert one equals one. If you write something substantial, no one will ever say, "you spent your first week writing test cases, how dare you." This helps you understand the code and helps make the system more reliable.
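As a rough illustration of what a "substantial" low-level unit test might look like, here is a sketch in the GTest style that a make check run would pick up. The component under test, a LockTable with AcquireShared/AcquireExclusive methods, and the header path shown are entirely hypothetical and made up for this example; the point is only that the test drives an internal C++ interface directly instead of going through SQL.

```cpp
#include <gtest/gtest.h>

// Hypothetical component: substitute whatever internal class you are
// actually trying to understand. This header and class do not exist
// in any real codebase; they are stand-ins for illustration.
#include "concurrency/lock_table.h"

TEST(LockTableTest, SharedLocksDoNotConflict) {
  LockTable table;
  // Two transactions should both be granted a shared lock on the same tuple.
  EXPECT_TRUE(table.AcquireShared(/*txn_id=*/1, /*tuple_id=*/42));
  EXPECT_TRUE(table.AcquireShared(/*txn_id=*/2, /*tuple_id=*/42));
}

TEST(LockTableTest, ExclusiveLockBlocksOtherTransactions) {
  LockTable table;
  EXPECT_TRUE(table.AcquireExclusive(/*txn_id=*/1, /*tuple_id=*/42));
  // A second transaction must not be granted any lock on the same tuple.
  EXPECT_FALSE(table.AcquireShared(/*txn_id=*/2, /*tuple_id=*/42));
  EXPECT_FALSE(table.AcquireExclusive(/*txn_id=*/2, /*tuple_id=*/42));
}
```

Writing even a couple of tests like this forces you to learn how the component is constructed, what its call paths are, and what invariants the rest of the system expects from it, which is exactly the mental model you can't get from JDBC-level tests.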
The next thing you can do is refactoring. This is where you look at a bunch of code, you think it's kind of crappy, and you think there's a better way to do it; you're not changing the high-level functionality, you're just reorganizing what calls what. A good example would be a giant function that should be modularized: you can break it out into different pieces. Other things you can do are adding comments, or updating comments that may be out of date, and if you have really ugly code that's difficult to read, you can go clean that up. The one danger here, of course, is that since you're new to the system and don't fully understand what's going on, you don't want to make too many dramatic changes, because you may not understand some of the assumptions that other parts of the system are making about the code you're modifying, and you don't want to end up breaking a bunch of things. Hopefully there are test cases to tell you whether you've broken something, but to get started you may not want to be making dramatic changes, because you're not entirely familiar with everything yet.

The last thing you can do is go through the full process of actually building the software. All the different companies have different ways they manage their source code and their build processes, and you should try to figure out and learn that as soon as possible as well. Google is famous for having one giant source code repository that everyone builds off of; they have their own version of make, their own build and test system. By going through this process you'll understand how the different parts fit together, and it can help you be more agile in making changes, testing them, and making sure everything works. If there's documentation that tells you how to do this, go ahead and just follow it, and if that documentation doesn't exist, then again, no one will complain if you come in and write it. Even if they didn't hire you to write documentation, if your first week is spent writing documentation that helps you understand what's going on, you'll make friends very quickly.

Any questions about any of this? Again, passive reading of source code is often fruitless and is not the best way to get started. I think writing test cases is always the best way to go, or refactoring, and the last one, going through the build process, is something you should do anyway.

So why am I bringing this up to you guys now? Well, we have some deadlines coming up for project three. On Wednesday, April 11th, we're not going to have class; instead there'll be a sign-up sheet and you'll come meet me in my office during the day. It has to be all your team members, and you'll tell me what you're working on, what problems you're facing, what you need help with, and the overall status of the project. Then on the following Monday, after the status meetings, we'll have the status update presentations in class.
Just like the proposal, everybody gets five minutes; Gus will go first because he got cut last time. Again, you tell everyone else what's going on in your code. Hopefully, as I said at the beginning when we did the proposals, you saw that a bunch of you need a lot of the same features: there were a bunch of people that needed a lock table or table locks. So you should be coordinating now that you're actually writing this code, so you can reuse it from each other and send patches to each other. That way you don't show up on presentation day and say, "oh, we wrote the table locks," and have someone else come up and say, "we wrote table locks too."

Then you also need to submit your first code review: you need to submit your PR on GitHub to another group for code review on Wednesday, April 11th. On the sign-up sheet I'll pair you up with another group; you submit your PR to them, they submit their PR to you, and you have a week to do the code review and write feedback to each other. Then you, as a group, should take that feedback into consideration and use it to help you update your code, and there will be another code review later in the semester as we get closer to the deadline. The first time I taught this course we did the code reviews at the very end, and it was super rushed, and I really wished we had done it earlier, so this is a good way to get a checkpoint on your code and get feedback sooner rather than later. Alright, any questions about any of this? Again, I'll send notices on Piazza to help schedule things as we get closer.

Alright, so that's it for query execution, sorry, query optimization, and now we're going down lower into the stack. We have an optimized query plan, and presumably now we need to execute it and schedule it, so that's the next thing we're talking about. For most of the rest of the semester we're going to talk about how we take that query plan and actually execute it efficiently, and this will all be in the context of a system doing query compilation. Any questions?