Okay, I was about to say... right, we're not going to do that; I'm going to try to avoid that. Okay, so today we're going to pick up where we left off on Monday, talking about query planning and query optimization. Monday's lecture was really about how you search for a good plan, and we ignored how to calculate the cost of a plan, so that's what we're going to focus on today. We'll talk about how to build cost models and how to use them to compute an estimate of what we think the quality of a plan is. Then I'll spend a little time at the end talking about some things you should consider when you work in a large code base. These are mostly things that I've found useful myself, so we can talk a bit about why this matters and what you need to worry about.

So again, Monday was really about enumerating over as many plans as we can, using heuristics and transformation rules to convert a logical plan that was generated by our parser and planner, and then coming up with what we think is the optimal execution strategy for a particular query. And we said we were going to ignore the OLTP stuff, because for the most part heuristics are good enough for those systems: it's a really simple transformation where you just find the index on the lookup attribute with the best selectivity, you choose that, and chances are that's good enough. What we're really talking about here is how to deal with complex queries: things with joins, nested subqueries, and so on.

The important thing to remember is that all the estimates the optimizer generates for its various query plans are only useful internally. They're not based on something like wall-clock time or anything else external to the system; they're only useful inside the optimizer to say that this particular query plan is better than another one. And we'll see how we can extrapolate from metrics we capture about the execution of queries to produce these estimates.

Everything we'll talk about today for cost models is completely independent of which search strategy we'd use from Monday's lecture. Obviously the pure heuristic approach is just using rewrite rules to transform a query, so it doesn't really have a cost model; but whether we're doing the System R style, the genetic algorithm, simulated annealing, or the Volcano/Cascades style optimizer, all of those can use the cost models we're going to talk about today. They just have different properties and characteristics in how they search over plans.

So the high-level idea of a cost model is that it's a way to compute an analytical model of what we think the database system will experience, what resources it will use, as it executes a particular query.
There are three different components we can include in our cost model to let us make this decision. The first is the obvious one: the physical costs, the consumption of resources like CPU cycles or memory that we're going to use to execute the query. This means our model wants to predict how many cycles we think we'll use and how many cache misses we'll take, and if we were a disk-oriented system, we'd estimate the number of blocks we have to read from disk. The reason we want an analytical model that predicts this is that if we actually had to execute every single plan to see how fast it was, it would take forever, because we may be looking at thousands or possibly millions of different plans during the optimization phase.

Physical costs are nice because they capture fairly directly how long we think the query is going to take, but the problem is that they depend heavily on the underlying hardware. What I mean by that is: if you have an older CPU that maybe doesn't have all the optimizations newer ones have, then even if the cycle counts are the same, it won't have the same cache sizes or the same replacement policy, and therefore the newer hardware may have a dramatically different cost model than the older hardware.

The next approach is to calculate the logical cost of executing the query. This is usually encapsulated in estimating the size of the result, the number of tuples in the result, that each operator is going to generate. This is usually independent of the actual algorithm you're using: if I do a hash join versus a nested-loop join, one may be faster than the other, but they always produce the same answer; the tuples just may not come out in the same order.
We don't care about ordering here. But in order to do this, we need estimations of the amount of data that's going to come out of each operator, which we can then use as the input for the next operator, and that determines what we think is going to happen. As we'll see as we go along, this is heavily dependent on what the data looks like and what our predicates look like. This is what most people do, though it has a lot of problems.

The last component is computing, in big-O terms, what we think the execution cost will be for a particular algorithm. In that case we know a nested-loop join is probably going to be slower than a hash join, so we'd always want to choose the hash join. But this doesn't help us with things like what join order we want; it only tells us which algorithm we may want to choose.

In a disk-based system, the cost model is super easy: it's always going to be disk I/O. When you take an intro course and we talk about cost models, we always assume the cost is I/O, because we're assuming a disk-based system. That's easy to do in a database system where you have complete control over buffer management: the disk is so slow, even if it's an SSD, that any CPU or memory overhead is negligible compared to the I/Os, which are the main bottleneck. And because we have full control of the buffer manager, we know the database system's page placement strategy, its pinning rules, and all these other things; we know how it's going to move data in and out of memory in a predictable way, so we can include that in our cost model when we make calculations. Most of the time you also assume you have exclusive access to the disk, so you don't have to worry about things you put in the buffer pool getting swapped out; you wouldn't fetch a page and then be surprised that it's not there, because you're controlling what comes in and out.

I'll also say that in a distributed database system, disk I/O isn't as big of an issue; their problem is network I/O. In a distributed system you can basically substitute network I/O for disk I/O and you get the same result: you try to reduce the amount of network traffic. So in a disk-based system and in a distributed system, it's really easy.
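Just to make the disk-based idea concrete, here's a toy sketch of an I/O-only cost model. The formulas and numbers here are illustrative assumptions (including the worst-case unclustered-index behavior), not any real system's model:

```python
# A toy disk-I/O cost model: the cost of a plan is just the number of
# pages we expect to read.

def seq_scan_cost(num_pages: int) -> float:
    # A sequential scan reads every page of the table exactly once.
    return float(num_pages)

def index_scan_cost(btree_height: int, selectivity: float,
                    num_tuples: int) -> float:
    # Traverse the B+tree, then (worst case, unclustered index) pay one
    # page read per matching tuple.
    return btree_height + selectivity * num_tuples

# 1M tuples at 100 tuples/page -> 10,000 pages; index of height 3.
print(seq_scan_cost(10_000))                 # 10000.0
print(index_scan_cost(3, 0.001, 1_000_000))  # 1003.0   -> index wins
print(index_scan_cost(3, 0.5,   1_000_000))  # 500003.0 -> seq scan wins
```

The point is that the unit of cost is pages read, and which access method looks cheaper depends entirely on the selectivity estimate, which is where we're headed.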
Now, in an in-memory database it's a lot more complicated, because we don't have a disk anymore. We have to account for CPU and memory costs, both in terms of the number of cycles we're spending and in terms of the number of fetches we get from the memory controller. The difficulty is that, unlike a disk-based system where we were controlling our own cache, now we're relying on the CPU's caches, and that's completely outside of our control: the CPU decides how things move between L1, L2, and L3. We can touch pages and try to prefetch, but we don't have the fine-grained control we had in the disk-based world. So that means there's a cache replacement strategy we don't control, and we can't pin things in the low-level caches.

We also have to deal with non-uniform memory access: if a thread accesses something in a different NUMA region, that's going to be much slower than something local to it. Now, this last one we could in principle account for, but it's typically not done in the cost model. We saw in the case of the HyPer system with morsels that when it computed a plan that had to touch data spread across different NUMA regions, it wasn't the cost model figuring out "I'll have this cost because I know I can have these threads execute in these NUMA regions." It was after the plan was generated, in the scheduling phase, that they figured out how to account for this. So typically you just ignore NUMA in the cost model and handle it later on.

For most in-memory databases that I'm aware of, the way they handle all this is to just estimate the number of tuples processed per operator, and that's the cost model; that's the metric they care about. It encapsulates both the memory and the CPU overhead without having to take really fine-grained measurements of all these different parts. This is what Hekaton does, what MemSQL does, what Altibase does, what VoltDB does: they all basically estimate how many tuples each operator will have to process. For pretty much every operator you can produce a decent estimate: given my input and a certain selectivity, here's the number of tuples I plan to process and here's what I plan to emit, and that output becomes the input estimate for the next operator.

One more complex cost model that deviates from the disk-I/O one and the in-memory one, and that I think is kind of interesting, comes from Smallbase. Smallbase was one of the first in-memory databases, from the 1990s. It was originally developed at HP Labs, then spun off into a separate company that became TimesTen, and Oracle bought TimesTen in 2005. What they did was a two-phase approach. First, the database developer, the person actually building the database system, identifies all the low-level primitives or operations that the execution engine performs as it processes a query: things like, what's the cost of evaluating a predicate? What's the cost of probing an index? They come up with all these low-level pieces. Then, when the system starts up, it runs a bunch of sample queries that exercise all the things identified in the first phase, to get a rough measured estimate of what each of them costs.
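Something like this sketch, a minimal illustration of the calibration idea with made-up primitives and names; this is not how TimesTen actually implements it:

```python
import time

# Phase 1: the developer enumerates low-level primitives the engine uses.
def eval_predicate(val, const):
    return val > const

def probe_hash_index(index, key):
    return index.get(key)

# Phase 2: at startup, micro-benchmark each primitive on sample inputs to
# get a measured per-invocation cost on *this* hardware.
def calibrate(fn, args, iters=100_000):
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

primitive_costs = {
    "eval_predicate":   calibrate(eval_predicate, (42, 100)),
    "probe_hash_index": calibrate(probe_hash_index, ({1: "a"}, 1)),
}

# The optimizer's formulas then charge each operator by how many times it
# invokes each primitive, e.g. a filter over N tuples:
def filter_cost(num_tuples):
    return num_tuples * primitive_costs["eval_predicate"]

print(filter_cost(1_000_000))  # estimated seconds of CPU for the scan
```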
It's like doing a micro-benchmark for each of them, sort of like BogoMIPS when you start up Linux. The system then uses the measurements collected during this profiling phase to feed the formulas it uses to estimate the cost of a query. So we'd say: my table size is this, the selectivity of my predicate is this, therefore I anticipate processing this many tuples; and for each step of the processing I know I'm going to execute X invocations of these primitives, and since I benchmarked them, I know the CPU cost and have a good estimate of the execution cost for the query. I kind of like this because it ties in what the hardware can actually do without baking in constants from the developer. But my sense is that it's overblown, and nobody actually does this; I'm not sure whether TimesTen still does.

Yes? So the question is: the in-memory databases are just going to estimate the number of tuples processed per operator, but doesn't that fail to account for things like caches, prefetching, all the low-level hardware stuff? Yes, correct. And the reason is: yes, you could be more fine-grained in your cost model estimations, but as we'll see later from the HyPer paper you were assigned for today, they claim it's not even worth it. You're better off doing better estimations of the tuple counts, because that's a more dominating factor than something like "I predicted 10 cache misses and I really had 20"; the difference in performance just isn't worth it. It's like measuring the performance of a system from the outside: the latency of a query or transaction is an umbrella for pretty much all the lower-level metrics. Latency tells me something about how many cache misses I had; if latency is higher, I can infer there were more cache misses. Tuple counts are a catch-all in the same way.

All right, so again, the Smallbase model is kind of interesting, this two-phase thing, but nobody actually does it in practice.

Okay, so if we're going to go with the number of tuples processed and emitted per operator, there are three factors we have to talk about that will affect this number. The first is obviously the access method we use on the table: if you're doing a sequential scan, you're going to scan all the tuples, but if there's an index you can use for your predicate, that can reduce the number of items you have to look at significantly. Then we also have to deal with the distribution of values for the attributes in the database.
If we have a million tuples and they all have the same value for one attribute, then we know we're going to emit a million tuples; but if every value is unique, then it's one in a million for a query doing a straight lookup on it. And connected to this, the number of tuples we end up processing also depends on the predicate in the query. If it's an equality predicate on a primary key, we know we're going to get one tuple; but if it's a sequential scan with a range predicate, and most of the tuples fall in that range, then we're going to generate a lot of tuples. So in our cost model we're going to try to estimate all of these things, because that's what tells us how many tuples we'll process and how many we'll generate.

For easy queries this is not hard to do: a primary key lookup, WHERE id = 4, gets me one tuple. But for subqueries, for things with dependencies, for things generated from a nested query, this is really challenging, and as we'll see as we go along, all the major database systems are really bad at it.

All right, so we have to define some terms in order to show how we calculate the number of tuples. The first is the selectivity of an operator, which we'll define as the fraction of a table's tuples that satisfy, or are selected by, a particular predicate. In our cost model we'll treat this selectivity factor as a probability: a selectivity of 0.2 means 20% of the tuples satisfy the predicate.

The way we generate these is by using a variety of internal statistics that the database system maintains about the database. You can have domain constraints, like the min and max ranges for your type. Another thing you can maintain is min and max statistics, so you know what the lowest and highest values are. You often see this in disk-based database systems: you pre-compute aggregations for blocks of tuples so that you don't have to decompress the entire block. For a particular block you know the min and max, so you can use that against your predicate to decide whether you need to bother looking at the tuples inside it at all.
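Here's a tiny sketch of that block-skipping idea, sometimes called zone maps. The block size and data are made up, and note that it only prunes well when the column is clustered or sorted:

```python
# Keep (min, max) per block of tuples; skip blocks whose value range
# cannot possibly satisfy a range predicate. Illustrative sketch only.
import random

BLOCK_SIZE = 1024

def build_zone_map(column):
    return [(min(column[i:i + BLOCK_SIZE]), max(column[i:i + BLOCK_SIZE]))
            for i in range(0, len(column), BLOCK_SIZE)]

def blocks_to_scan(zones, lo, hi):
    # A block can match [lo, hi] only if its [min, max] overlaps it.
    return [i for i, (bmin, bmax) in enumerate(zones)
            if bmax >= lo and bmin <= hi]

col = sorted(random.randint(0, 1_000_000) for _ in range(100_000))
zones = build_zone_map(col)
print(len(blocks_to_scan(zones, 100, 200)), "of", len(zones), "blocks to read")
```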
The last one, and probably the most useful for us, is to maintain histograms: histograms of the distribution of values for each individual attribute, maintained for every single column. Usually what happens is that we assume the distribution of values is, for the most part, uniform, but then we try to identify some of the most common values and keep more fine-grained details about those.

If we have the selectivity, then we can compute the result cardinality. The cardinality is the number of tuples we're going to generate for a particular operator, based on the predicate, the access path, and the distribution of values. It's simply the selectivity multiplied by the number of tuples fed into the operator.

Cardinality is really hard to get right, and all the database systems have problems with it, because they make the following three assumptions. The first is that the distribution of the data is uniform: for every unique value in your column, they assume it occurs in a tuple with the same probability as every other value, except for the heavy hitters. The heavy hitters are the values that occur most often, and systems track them separately so they have more fine-grained information about how many tuples contain them. So for predicates that hit those values, you know the distribution and you know the cardinality; for everything else, you just assume all values occur equally often. This obviously doesn't always hold in real workloads, and we'll see the problems later. The way you usually track the heavy hitters is with an approximate data structure like a count-min sketch, so you're not tracking exactly how many tuples have each value, but you have a good idea of the range.

The second assumption is that the predicates in the query are all independent. That means if you have a conjunction in your WHERE clause, each of the two predicates has its own selectivity, which is a probability, and by the rules of statistics you can multiply those two probabilities together to get what you believe is the true selectivity of the entire WHERE clause. I'll show an example in a second of why this doesn't always hold. And the reason systems do this is simply that they usually don't have any information about how column values are related to each other.
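Here's a minimal sketch of how these estimates come together under the uniformity and independence assumptions. The class and names are illustrative, not any real system's internals:

```python
# Selectivity and cardinality estimation under the textbook assumptions.

class ColumnStats:
    def __init__(self, num_distinct, heavy_hitters=None):
        self.num_distinct = num_distinct          # number of distinct values
        self.heavy_hitters = heavy_hitters or {}  # value -> tracked frequency

def equality_selectivity(stats, value, num_tuples):
    if value in stats.heavy_hitters:              # tracked (approximately)
        return stats.heavy_hitters[value] / num_tuples
    return 1.0 / stats.num_distinct               # uniformity assumption

def conjunction_selectivity(selectivities):
    # Independence assumption: P(p1 AND p2) = P(p1) * P(p2)
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def estimated_cardinality(input_cardinality, selectivity):
    return selectivity * input_cardinality

# WHERE status = 'active' AND country = 'US' on a 1M-tuple table,
# with 4 distinct statuses and 200 distinct countries (made-up numbers):
N = 1_000_000
sel = conjunction_selectivity([
    equality_selectivity(ColumnStats(4),   "active", N),
    equality_selectivity(ColumnStats(200), "US",     N),
])
print(sel, estimated_cardinality(N, sel))   # 0.00125 -> 1250 tuples
```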
So you just assume they're independent. And the last assumption is that the join keys of the inner relation and the outer relation always overlap: for every key in your inner relation, it's guaranteed to exist among the join keys of the outer relation. If you're joining on a foreign key this always holds, because you can't have something in the child table that doesn't exist in the parent table. But you're not always joining on foreign keys, and then this is problematic: you're assuming every value in the inner table exists in the outer table, and that may not be the case.

As an example of how the independence assumption breaks, and the uniformity assumption along with it, I always like to show this example from Guy Lohman, the guy who did the Starburst paper we talked about last class. Consider a database with a bunch of tuples about automobiles. The number of unique makes is 10: Honda, Ford, Chrysler, and so on. The number of models is 100: Accord, Camry, Corolla... sorry, those are all Japanese cars, but you get the idea. Say we have a query that finds all the automobiles where make = 'Honda' and model = 'Accord'. If we assume, as on the last slide, that our data is uniformly distributed and our predicates are independent, then the selectivity of this predicate is 1/10, one over the number of makes, because it's an equality predicate looking for exactly one make, multiplied by 1/100, the same reasoning for the models. That gives us 0.001. But these two columns are actually correlated: only Honda makes Accords, so we don't even need the make part of the clause. The model alone determines the answer, and the true selectivity is just 0.01. That's pretty far off: if you had 10 million tuples, the estimate would say about ten thousand tuples when the truth is about a hundred thousand, an order of magnitude difference. This is why it's really hard to get accurate estimates in your cost model, and as we'll see in a second, it's especially problematic when you're trying to determine the join order.

One way to get around this problem is to use what are called column group statistics, or attribute group statistics. This is a feature mostly, actually only, found in the commercial systems. Basically, the DBA can say: I know these two columns are correlated, so don't compute statistics on them individually, compute statistics on them together. In the make-and-model case, the system would then know the make is correlated with the model and vice versa, and it could choose the correct selectivity.
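Here are the numbers from that example worked out, with the 10-million-tuple table size assumed for illustration:

```python
N = 10_000_000                  # automobiles in the table (assumed)

# Independence assumption: 1/|makes| * 1/|models|
est_sel = (1 / 10) * (1 / 100)  # 0.001
# Truth: only Honda makes Accords, so the model predicate alone decides.
true_sel = 1 / 100              # 0.01

print(est_sel * N)    # estimate:  10,000 tuples
print(true_sel * N)   # truth:    100,000 tuples -- off by 10x

# With column group statistics on (make, model), the optimizer knows there
# are only ~100 valid combinations, not the 10 * 100 = 1000 implied by
# treating the columns independently:
group_ndv = 100
print((1 / group_ndv) * N)   # 100,000 -- matches the truth
```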
So this sounds awesome, it solves our problem, but the problem is that it's very manual. The DBA has to tell the database system which columns, either pairs or some combination of them, are correlated, so that it maintains group statistics on them. Again, this is the kind of thing you pay a DBA a lot of money to do: look at your database, figure out where this occurs, and provide hints to the database system so the optimizer can make a better selection.

Yes? So the question is: can the database system try to find correlations in the background? Yes, it can, but it's very computationally expensive. Not only are you looking at an exponential number of attribute combinations, you may also be looking at a really large data set. When you call something like ANALYZE in Postgres, it just goes through every single column. I actually don't know whether the commercial systems, maybe in their newer versions, can do this automatically, but when the feature first came out it was manual.

Okay, so let's look... sorry, yes, good question. So the question is: if you have a query that invokes a user-defined function, a UDF, how does the database system calculate statistics? I should have talked about this: you can't do anything. The system treats the UDF as a black box. It doesn't know what the function actually does, and it doesn't know anything about its selectivity. So I think for the most part everyone just throws their hands up and picks some constant that looks good enough. There are two kinds of UDFs: one takes a set of tuples and produces a scalar, the other takes a set of tuples and produces another set of tuples. The basic idea is that you're taking tuples and doing some processing that would be very difficult to express in SQL. You can write UDFs in something like PL/SQL or SQL/PSM, or you can write them in C, and the system forks out a call to that; you just take the input in a particular format and generate output in the correct format.

Sorry, so the question is: can you take a sample of tuples, run the UDF on them, look at the output, and use that to determine the selectivity? If it's written in PL/SQL, I think there's no reason you couldn't, because you can't run arbitrary code with side effects inside of that. But in an external, unsandboxed UDF you can write whatever you want: you could have it send out emails every single time you invoke it, and the database system can't control that, so it usually won't run it speculatively.

Actually, one way you could do this is to build what are called, I forget the exact term, deterministic or function-based indexes: basically an index on a UDF. You say: for this table, for this attribute, build my index on the output generated by the UDF. Then when you want to find a match, for a given probe key you run the UDF on it, that gives you a value, and you know how to map into the index of UDF outputs. In that case you can derive a bunch of statistics from the index that the query optimizer could then use. I don't think the open source systems do that, but the commercial ones might.
That's pretty complicated, though. Question: so the UDF outputs are inserted into the index? Yeah, there's a semantic debate about whether an index like this is really a materialized view. In that sense, yes, you can think of the index as a materialized view, just stored in a B+tree rather than in a whole separate table, and it would give you the same thing. But then you have the problem that your query planner needs to recognize that you have a UDF index and can use it during planning. Likewise, if you had a materialized view, the planner needs to know: you asked to scan this table, but I see you're applying this UDF, so I can actually answer this from the materialized view. That's a whole other aspect we won't cover, but the UDF question is a good one.

All right, so to give you a rough sketch of why bad estimations hurt so much, especially with joins, let's take a simple query where we join tables A, B, and C. We're not going to put in real values; we'll just talk about how cost estimation propagates up the tree. Assume this is the plan we picked, and we're not going to permute the join ordering; let's just talk about why even this is hard. In the first step, you compute the cardinality of accessing the tuples from each base table. For A it's just the cardinality of A, and C is the same. But for B there's a predicate, B.id > 100, so we use our knowledge about the distribution of values for that attribute to estimate the selectivity, and that tells us the number of tuples we plan to emit. The join is where it gets hard, because the join's estimate is derived from the estimates we just made at the leaves.
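The standard way to estimate a join's output, sketched below, is the textbook System R-style formula: the product of the input cardinalities divided by the larger number of distinct values of the join key. The numbers here are made up to show how a leaf error carries through:

```python
# |A join B on A.id = B.id| ~= (|A| * |B|) / max(V(A.id), V(B.id))
# where V(R.c) is the number of distinct values of column c in relation R.

def join_cardinality(card_a, card_b, ndv_a, ndv_b):
    return (card_a * card_b) / max(ndv_a, ndv_b)

card_a, ndv = 100_000, 10_000

# Suppose the scan of B (after its predicate) really emits 50,000 tuples,
# but our selectivity estimate said 5,000:
true_est = join_cardinality(card_a, 50_000, ndv, ndv)   # 500,000
bad_est  = join_cardinality(card_a,  5_000, ndv, ndv)   #  50,000
print(true_est, bad_est)

# The 10x leaf error survives the join untouched -- and this output is the
# input to the *next* join's estimate, so errors compound up the tree.
```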
So in the case of joining A and B, you take the cardinalities of those two inputs and divide by whichever join key, A.id or B.id, with the B.id > 100 predicate applied, has the larger number of distinct values. Then the cardinality generated from that join gets fed into the cardinality estimate of the join above it. So if our estimates are wrong in the lower parts of the plan, the errors get amplified as we go up the tree: if we were wrong here, we'll be wrong at the next level, and even more wrong above that.

This is why, as we'll see now when we talk about the experiments the HyPer guys did with Postgres, these things end up wildly off: if your estimates are bad at the bottom, they're going to be bad everywhere else. And there's no easy way around this. We could, for every single query, do a complete scan of the table and compute exactly what the cardinality is going to be, so we'd have absolute correctness all the way up, but then we'd basically have executed the query already. So we make approximations about the data we're going to access, and because we make all those assumptions about predicates and everything else, things go wrong.

To give you an example of just how wrong things get: the paper I had you read, and I really enjoy it because it gets down and dirty and shows how terrible things are. For the first experiment, they measure the correctness of the cardinality estimates that a bunch of different database systems generate, compared to the true cardinalities. They developed a new benchmark called the Join Order Benchmark (JOB): they load the database, invoke whatever command each system needs to collect statistics about the data, then extract the cardinality estimates for each query plan and run the queries in regular SQL to see how much the estimates deviate. For this they compared five database systems, all of which expose the information needed to make this comparison, using a bit over a hundred different queries. The difference between this JOB benchmark and something like TPC-H is that JOB is derived from a real data set, so it exhibits the skew you'd expect in a real workload, whereas TPC-H, because it's meant to be a portable synthetic benchmark, doesn't have any skew; everything is perfectly uniform. So on TPC-H the cardinality estimates work out nicely, but on a real data set that doesn't turn out to be the case at all.

This graph shows the five different database systems they compared, and along the x-axis, for each one, they increase the number of joins in each query.
So you're doing a two-table join, then a three-way join, and so on going across. The way to read this is that the middle line corresponds to the cardinality estimate from the cost model exactly matching the true cardinality when the query runs; you want to be in the middle as much as possible.

What we see are three trends. First, this database system over here is actually doing pretty well: it deviates a little as the number of joins increases, but it's not as bad as the others. And an overall trend for all of them is that they always underestimate the cardinality for these queries; that's because they assume the predicates are independent, so the selectivity estimate is super restrictive when it actually shouldn't be. The second trend is that these three systems all degrade in roughly the same way as you increase the number of joins, fairly evenly from one join count to the next; this one does a little worse than the others, but they all look roughly the same. And then you've got this one in the middle where, even after just three joins, the variance in its correctness is way off, actually worse than all the others.

In the paper they tell you that this one is Postgres and the last one is HyPer. Want to take a guess what these three are? Let's pick the easy one first: this one's the worst, what do we think it is? What's that? He says DB2. He says MySQL. Anybody else? He says Oracle. All right. And then these two here, degrading more gently; this one actually performs the best of all of them; want to take a guess? He says DB2. He says SQL Server. Anybody else? All right: SQL Server, Oracle, DB2. And part of this, when we're talking about Microsoft, like I said, SQL Server is very, very good and has some really awesome people working on it.

They don't really say what SQL Server is doing to make this so much better. The paper mentions it does the same kind of sampling technique to estimate predicate selectivities that HyPer does, rather than relying only on statistics. DB2 and Oracle support the column group statistics, but I don't think they tuned the systems to use them. This gets into the black art of database systems: the optimizer is one of the key differentiators between what a commercial system can do and what an open source system can do, and there's a lot of secret sauce in there making this work better that's probably not documented anywhere. Maybe it's in a patent, I don't know.
But they're clearly doing a very good job of collecting statistics, and I don't think it's just the sampling technique the HyPer guys talk about.

All right, so this showed that the cardinality estimates for all these databases are wildly off, and that they always underestimate the number of tuples processed per operator. Next, the authors went to Postgres and asked: how bad are the query plans the optimizer generates when the estimates are this bad? The way to read this graph is that it shows the slowdown of each query compared to the plan you'd get with true cardinality estimates. What they did, which is really cool, is inject hooks into Postgres's cost model so that when it was generating a query plan, rather than looking at the statistics and computing an estimate, it was fed the actual true value; they pre-computed everything and gave exact numbers to the optimizer. Along the x-axis is the percentage of the tested queries, grouped by their slowdown versus the true-cardinality plan. Over here you see that a large fraction of the queries slow down only modestly, by up to around 20%, but there's a tail of queries that are up to 100x slower. That's pretty significant: if your estimates are way off, you can be 100x slower than the best query plan.

What they found, and this is why I like this paper, because they really get into the details of what the Postgres optimizer was doing, is that for the queries over here, because the system was underestimating the number of tuples an operator would process, it was choosing suboptimal algorithms. Since it thought the result sets would be tiny, it kept choosing nested-loop joins. You can think of it this way: a nested-loop join is just two for-loops; you don't have to set up hash tables, you don't have to sort, you don't have to copy data around. So if you assume the result set is really small, it genuinely is better to just run the two for-loops. But not in practice here. So the next thing they did was disable the optimizer from ever selecting nested-loop joins and run it again, and now things get a lot better: you still have some queries around 100x slower, but not nearly as many, and most queries are now within roughly 10% of the true-estimate speed. Then they looked at this and asked: what's still going on, if we're only doing hash joins, why is everything here still slow?
What they found is that in this version of Postgres, the execution engine relied on the cardinality estimates from the cost model to size the hash table used for the hash join. Because the system was underestimating the size of the result set, it would allocate a small hash table and then end up with really long collision chains, since there were far more tuples than anticipated. In Postgres 9.5, which I don't think is out yet, there's a new version that can dynamically resize the hash table during query processing: if you realize your hash table is getting too full, you can resize it, reshuffle things, and not suffer those long chains. They back-ported that patch from 9.5 to 9.4, and now most of the queries are within 20% of the optimal plan; something like 60 plus 35, so over 90% of the queries are almost as good as if you had the true estimates.

So the main lesson the paper provides is that having a really elaborate cost model, one that's fine-grained down to the low-level hardware metrics of what the system will do as it executes the plan, is not what matters; you're better off having better cardinality estimates.

These are the four takeaways the HyPer guys want you to get out of this paper. The first is that query optimization, meaning selecting good join orders, is more important than having a really fast execution engine. Even if you're in main memory and can process things very quickly, if your query optimizer generates bad plans, you're just executing bad plans very quickly. It's very important to get the join order correct. The second takeaway is that cardinality estimates are usually wrong, so make sure your operator implementations don't rely on anything the optimizer tells them. In the hash table case: don't trust the optimizer when it says you'll process 10 tuples, because it may be a million. All your data structures and implementations should be able to react dynamically as execution goes along, like the sketch below.
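Here's a minimal sketch of that idea: a hash table sized from the optimizer's estimate that grows itself when the estimate turns out to be wrong. This is illustrative (the load-factor threshold is an assumption), not Postgres's actual implementation:

```python
class GrowableHashTable:
    """Chained hash table that resizes when chains get long on average."""

    def __init__(self, estimated_cardinality):
        self.num_buckets = max(8, estimated_cardinality)
        self.buckets = [[] for _ in range(self.num_buckets)]
        self.size = 0

    def insert(self, key, value):
        self.buckets[hash(key) % self.num_buckets].append((key, value))
        self.size += 1
        if self.size > 2 * self.num_buckets:  # load factor limit (assumed)
            self._grow()

    def _grow(self):
        # Double the bucket array and rehash everything into it.
        old_buckets = self.buckets
        self.num_buckets *= 2
        self.buckets = [[] for _ in range(self.num_buckets)]
        for chain in old_buckets:
            for key, value in chain:
                self.buckets[hash(key) % self.num_buckets].append((key, value))

    def probe(self, key):
        return [v for k, v in self.buckets[hash(key) % self.num_buckets]
                if k == key]

# The optimizer guessed 10 build-side tuples; reality is 100,000.
ht = GrowableHashTable(estimated_cardinality=10)
for i in range(100_000):
    ht.insert(i % 10_000, i)
print(len(ht.probe(42)))   # 10 matches, and no pathological chains
```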
Another interesting takeaway concerns indexes: trying to rely on indexes to cut down the number of tuples you process can backfire, because with more indexes the optimizer has more access-path choices to make, and it ends up producing even more wrong estimates that percolate up the tree. You're often better off doing a brute-force sequential scan on the entire table with a really fast hash-join implementation. Even though you could use an index to jump to a particular subset of tuples and process just those, it may be better to just plow through everything. And the last takeaway, which I think is cool, is that they basically say trying to build a more accurate cost model, like the Smallbase example we talked about before, is usually a waste of time. Just estimating the number of tuples you're going to process, but making those guesses accurate, gives you more bang for the buck.

Okay, so the main takeaway from all the query optimization material is that I think the Volcano/Cascades search strategy we talked about last class, paired with a cost estimator based on the number of tuples processed and emitted per operator, is the right way to go. As far as I can tell, and I don't know exactly what search strategy every system uses, for the most part they all use the number of tuples as the main metric, because that's a good enough approximation for most queries. Any questions about cost models and estimations?

Okay, so let's jump into the other topic. I think it's important for you; it's not directly related to database systems, but it should help you throughout your entire career. I want to talk about things you need to be mindful of when you start working on a really large code base. You've already started on this in project three, but when you leave CMU you're going to encounter it over and over again.

A quick disclaimer: I'm obviously not some grizzled 25-year veteran of software development, but in my life I have worked on two database systems and one very large distributed batch-processing system, so I've spent some time working in really large code bases. When I was in grad school I worked on H-Store for five or six years, and that thing got pretty big. In addition to working on these systems, I've also read a large amount of what I'll call enterprise code, for some legal work I've done over the years.
I can't name names, but I've looked at code from a major software vendor for a year or two to really try to understand it and be able to write about it in litigation. So I'm not claiming everything I'm saying is gospel; take it all with a grain of salt. I'm not claiming to know exactly how, say, Google wants you to work on their software. I'm just talking about high-level things that have helped me in these different parts of my life, which I think should be useful for you. I'll also say there's no magic product here: I'm not going to recommend you spend a thousand dollars on IntelliJ and get some software that magically solves all your problems and makes it easy to work on large projects. I just want you to be mindful and think about these things when you go out into the real world, rather than throwing your hands up.

The main thing to be mindful of is that you are almost never going to start a new project from scratch. Even if you're building a new database system from scratch, there are still going to be libraries and components you want to incorporate that you don't want to write yourself. So there's always going to be a really large code base that you didn't write, where you don't know the person who wrote it, and there may not be any documentation on what it's actually doing or why it's written the way it's written. You need to be very independent: sit down with it and start making sense of it.

The reason I think this is important is that while I was planning this course over the last year or so, I talked to all these senior database developers, and I always asked them: I'm teaching this new class on database internals; if someone graduates from CMU and you want to hire them, what's the one skill you think is really important for them to learn? Do you want them to be masters of concurrency control and locking? Do you want them to know SIMD? LLVM? B+trees? And without me prompting them, and without me telling them what other people had said, they all said the same thing: they want students who can take a really large code base, dive into it, start working on it and making contributions right away, and be very independent about it. And actually, I won't name names, but there's a database company whose coding test, when you interview with them, is to hand you memcached and tell you to add a new feature to it. They don't sell memcached, and they don't expect you to know memcached; it's just: here's a large code base, go add some feature to it. They tell me this is the number one indicator of whether someone's going to be good at their company: whether they can pass that memcached test. That's a little different, because you're working on something within a constrained amount of time, but the point stands: this is a skill you should definitely work on.
There's no magic button; I can only talk about what has worked for me. The first thing people usually want to do when they start on a large code base is just start reading it, for the sake of reading it, and this is usually a waste of time. If you're reading without any direction, without something you're trying to accomplish, you're not going to internalize what's actually going on in the code. You may get the big picture, but when it's time to really understand the details that matter, you won't be able to grasp them just from reading. I'll also say, from my own experience managing students on projects: a lot of times I'll tell a student to work on something and come back in a week to tell me what they've done. The ones who come back and say "oh, I was just reading the code" don't turn out to be very good. It's the ones who come back and say "hey, I wrote this little bit of code; it may not be perfect yet, but at least I've started." So if you really want to understand a code base, you have to start mucking around with the code; that's the only way to fully comprehend what's going on.

There are two ways to do this. The first is to write test cases. By test cases I mean unit tests that actually invoke the functions in the program itself, not regression tests that exercise high-level functionality and may never call the part of the code you're looking at. The nice thing about this is that it forces you to understand what the code is actually doing: if your test case fails, then hopefully the program is correct and your test case is wrong, and fixing it forces you to understand the inputs and outputs you should expect from the parts of the code you're dealing with. It's also nice because you earn kudos from your co-workers: nobody ever complains about new test cases. Obviously, if your test case computes pi to the billionth digit and takes 20 hours to run, people will be pissed, but in general more tests are always a good thing. The other nice thing about writing test cases is that you're writing code and interacting with the system without actually touching the system's functionality, which prevents you from doing any real damage: if your test case is broken, that doesn't affect the actual system. So this is usually a good way to get started.

Another thing I find useful is refactoring: find the location of the code you plan to work on and just start cleaning it up. Fix variable names, fix comments, maybe break logic that's crammed into one giant function out into sub-parts.
Again, this is a nice, easy way to get your hands dirty with the code without having to figure out how to add a complex new feature or fix a bug you're not really sure about yet, and without exerting the mental effort of truly understanding everything up front. The more you do this, the more comfortable you get, because now you understand the different parts: these are the parts you wrote or worked with. Then you can expand from there to adding whatever it is you actually want to add. The only caveat is that you have to be careful, because you're not as familiar with the code as you will be later on, so you may not want to make a change that could affect correctness.

The last piece, and this is something we really don't talk about in CS courses, is that starting on a large code base is about more than just the code itself. It's also all the mechanisms, the build processes, and the tools used to build it. I think it's really important to understand the protocol the organization uses to build the software and become familiar with it: if I submit code, where does it go? What happens to it? What's the life cycle of that change? How does it get tested, where does it get tested, who tests it, and how does it get reported? All of these things are really important as well. At bigger companies, and even at well-run startups, they'll probably have on-site training to say "here's how we do it; follow the same protocol." Google is notorious for having all their source code in one giant repository tree. There will probably also be documentation on the coding standard, how to write tests, where the tests should go, and so on. One way to get more familiar with all of this, if the documentation isn't available and nobody's going to train you, is to just start writing it yourself, because in order to write down what you need to do to build the software, you have to actually, truly understand it. The best way to learn something is to try to teach it. And again, assuming there isn't already a tech writer, documentation is something nobody will complain about you writing, because it helps everyone and makes it easier for the next person to pick up where you left off.

Any questions about this? Like I said, this may all seem kind of obvious. I just want you to think about these things, for this project or the next one, or when you leave this class: try to be very independent when it comes to writing code. Nobody likes it when you write some code, it crashes, and you immediately throw your hands up and ask your manager or your peer why it doesn't work.

Okay. All right, so next class we'll talk about query compilation.
The way to think about compilation is that the query plan itself gets executed directly rather than interpreted. The hash join you wrote was an example of an interpreted hash join: we took a query plan, fed it into an executor, and then parsed it and figured out what we actually needed to do. With query compilation there's no interpretation; we just immediately execute our join. We'll see how this is done in HyPer and a few other systems. Okay, any other questions? All right guys, we're done. Thank you.