Real quick, the things on the docket: obviously the midterm is this Wednesday in class. I posted the URL to the study guide on Piazza, and at the end of this class we'll have a review of the midterm, go over the topics at a high level, and cover what it will include, what you need to bring, and what to do when you show up. Okay? And then project number two, checkpoint two, is due this Friday at midnight. Quick show of hands, who here is done with checkpoint two? Done the entire thing. We got one. Okay. Don't be so meek. He's done. Okay. Who here has not started on checkpoint two? I saw a hand. Okay. It's okay. All right. Fine. Okay. Again, we made this due on Friday because we realized you guys have midterms in all your classes; that's why we pushed it off. Okay? So the other thing I want to go over real quickly is some administrative stuff. The department basically got back to me and they want me to curse less in this class, so I'm going to try to do that. I can't always promise it. But there are some other things going on that I think are actually kind of interesting, kind of relevant. So first is that I found there's a Chinese version of YouTube that's basically copying all our videos. Right? So good for them. Okay. And the other thing is, you know, I don't post this stuff on YouTube just for sh*****gles, right? I do it because I want you guys to have access to everything. We get a lot of email, or at least I get email, from people complaining about the course. So this is, I don't have my list here. Hold up. Sorry. I don't want to use their names. Okay. Sorry. So this is J.L. in Seattle. He basically says that we're awful, he doesn't like us, and we should give you your money back. This is C.M. in San Jose. He says I'm a joker and that I smell like I have diarrhea or whatever. So it is true, I did have a medical problem a year before where I had bad hygiene, but I'm on medicine now, so I think this is old news. But occasionally you get actually useful emails. Right? So this is somebody named Eric Darling. He's from a major SQL Server DBA consulting company, and he watched the lecture from last class on joins, and he said that I said something that was actually incorrect, and he's absolutely right. When I was talking about joins, I said that systems will build hash tables on the fly, but that as far as I knew no database system would actually build a B+ tree on the fly; they're only going to build hash tables to do hash joins. And he points out that this is incorrect: SQL Server has a feature called index spooling where it will build a B+ tree on the fly just for the query that you're actually running, and SQLite does something sort of similar. But anyway, there's actually a really cool website where you can dump out the XML of a query plan from SQL Server and upload it, and it gives you a nice visualization of the plan. So we thank Eric for correcting us. There are more comments on YouTube for other things we've gotten corrected, but I'm not going to read all of them. So again, I'll take any feedback you guys can provide, because I want to make the course better year after year. So don't send me bad emails, send me good ones. So today's topic is query optimization. This is actually one of my favorite lectures.
It's also one of my favorite topics in databases, and part of the reason it's one of my favorite topics is that I'm actually really bad at it. This is the one area of database systems that, I'll fully admit, I don't understand to the extent that I would want to, and for whatever reason that makes me like it even more. So to understand what we're going to do in this lecture, we have to go back and understand what SQL actually is. SQL is a query that someone sends to the database system that says: here's the answer I want you to compute. It's done in a declarative way. There's nothing in the SQL that says run this join algorithm, or run your sorting this way. It just says give me this answer, and it's up to the database system to figure out the best way to actually do it. We saw this when we talked about joins and sorting: I said there are a bunch of different algorithms you could use. In the case of joins you could do nested loop, you could do hash joins, you could do sort-merge, and they have different algorithmic properties and trade-offs, and under different scenarios you may want to run one algorithm versus another. But when you write a SQL query that joins two tables, you don't say join table A with table B using a hash join. You just say I want to do a join; you're just saying what you want the answer to be. And so it's up to the database system, and in particular the query optimizer, which we're going to focus on today, to figure out the best algorithm to use for your particular query based on what it knows about the data. And we saw that there was a big difference in performance when we talked about joins: we said that if you do the stupidest thing with a nested loop join then your query could take 1.3 hours, but if you do a hash join then you're down to less than a second. So as we go along I'll mention in various cases that this is the one thing that really separates the commercial database systems from the open source ones. This is part of the reason commercial databases are so expensive: it's the query optimizer. These database companies have an entire floor of people with PhDs trying to figure out how to squeeze the best performance you can out of the database system through its query optimizer. And there have been studies that show that the Microsoft query optimizer is actually the best one out there. Not to say the open source ones are bad, in particular Postgres; it's just that in terms of the level of sophistication and the kinds of things they can support, the commercial ones are much better. So the original idea of a query optimizer goes back, as many things do in databases, to IBM System R. Remember I said at the beginning of the semester that System R was this internal project by IBM Research down in San Jose where they took Ted Codd's paper on the relational model and said let's actually try to build a system that implements this. Ted Codd never actually proposed SQL; he later proposed his own declarative language called Alpha that nobody ever actually implemented. So as part of the System R project, IBM came up with the earliest version of SQL. And given again that SQL is a declarative language, they then needed to be able to take a query and generate an execution plan that was efficient.
So at the time, in the 1970s, people were arguing that a database management system could never produce a query plan as efficient as what a human could write. Therefore SQL was a dumb idea, or having query optimizers was a dumb idea, because it's never going to be as good as what a human can do. You can think of this as the same argument people were having in the 1970s about C, the programming language. They argued that no human would ever be able to write applications in C and have a compiler generate machine code for the program that would be as efficient as what a human could write in assembly. Of course nobody writes assembly anymore, right? Everyone relies on compilers. So it's the same argument, just in a different context. For SQL, people argued you'd be better off writing CODASYL or whatever and doing your joins yourself rather than having a compiler figure things out. So as we go along today we'll see that the high-level concepts the System R guys came up with back in the 1970s are still used today. This idea of a cost-based query optimizer is the key one. The algorithms may be slightly different, and some of the assumptions that IBM made in the 1970s may not be valid anymore, but the core concept is exactly the same. And the woman who came up with this at IBM was Pat Selinger. I said that for System R they got like seven people with PhDs in one room and said go build a database system, and every person sort of carved off their own piece, and this one woman, Pat Selinger, carved off the query optimizer part. She was at IBM for a long time, then was at Salesforce, and just retired. And I had a former student from this class who was working with Pat Selinger; she took all my classes, knew about the Selinger query optimizer, and didn't put two and two together that the woman she was working with was the same woman we study in this class. So she's still around today and she has done a lot of great work in this area. So for query optimization we're basically going to have two approaches. The first is going to use heuristics, or rules written by us as humans, to do some massaging of the query plan: to fix some obvious inefficiencies and apply some simple optimizations. The idea is that your query shows up, you do some rewriting of the syntax tree (you never operate directly on the SQL string), and there are certain things we know we're always going to want to do to make the query run faster, so we just have rules to go ahead and do them before we go into the second phase, where we actually do a cost-based search. In the second phase we say: alright, we now need to figure out other things about our query plan, and I don't have rules I can apply that say here's exactly what you're always going to want to do, because the choices you make will depend on what your data looks like, and I'll explain what that means in a second. So we do a search to find an optimal query plan for our query. The idea here is that we're going to have a cost model that says, given two potential query plans for a single SQL statement, which one is actually better in terms of performance. So just to understand the pipeline of this in a real system, at a high level it would look like this.
Your application submits a SQL query, and the first step is that it always goes through a SQL parser, which converts it into an abstract syntax tree. It breaks up the tokens, the SELECT statements and WHERE clauses, and gives you this tree. But inside this tree all we have are the strings, the names of the things that were in the SQL statement. For SELECT * FROM foo, in our syntax tree at this point we just have the string foo. We don't know anything about it; we don't know that it corresponds to a particular table. So in the binder phase we do lookups in our internal system catalog, the metadata about the database: what tables do I have, what columns do I have, what indexes do I have? And it's going to map the names, the strings in the abstract syntax tree, to internal identifiers; just think of it like an internal ID number for a particular table. Then we feed this annotated abstract syntax tree to the query rewriter, the thing I showed as the first step on the previous slide, where we can apply some rules that we know we're always going to want to apply to our query plan to make the query run faster, and I'll show examples of what that means as we go along. Not every single database system has this; you can still have a query optimizer without this initial step, but it's actually very common in most systems. And then in the last step we take the output of the rewriter and feed it into our cost-based query optimizer. Again, think of this as doing an optimization search: looking at different query plans, getting information about what the tables look like from the catalog, and then using some kind of internal cost model to say here's the potential runtime cost of executing this query plan. The idea is you want to be able to examine as many query plans as possible within some kind of budget, either in terms of time or the number of plans you'll investigate, and then whichever is the best one you've seen, that's the one you'll spit out as the final query plan for this query. So as we go along, we'll see the optimizer is actually not always going to pick the globally optimal plan, because that would be really expensive. It's going to use methods to prune down the search space so that it can generate a good enough plan within a reasonable amount of time.
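To recap that pipeline in one place before we move on — this is just the high-level flow described above, not the exact module names any particular system uses:

    SQL query -> Parser -> abstract syntax tree
              -> Binder (name lookups in the system catalog) -> annotated syntax tree
              -> Rewriter (heuristic rules)
              -> Cost-Based Optimizer (cost model + catalog statistics) -> physical query plan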
So I said this in the beginning and I'll say it again: this is actually really hard to do. The joke in databases is that if you try to do query optimization and you fail, the fallback plan is you go work on rocket science, because query optimization is harder than building rockets. This is the hardest part about building any database management system, and it's part of the reason a lot of the NoSQL guys didn't end up using SQL in the first place: because if you have SQL then you need some kind of query optimizer, and this is hard to do. It's hard to get right. There are all these different corner cases; SQL is a huge language, you have nested queries and other things. This is really hard to get right. The one thing I will also say is that if you're good at this, if you really like query optimization, you will have no problem getting a job. I have friends at various database companies, and they email me asking what students are about to graduate and looking for jobs, and the one thing I always get is: yeah, we want database students, but we really want anybody that knows query optimization. So if you're good at this, you'll have no problem getting paid a lot of money, because when you think about it, what 22-year-old kid coming out of a grad school or undergraduate program has experience with query optimizers? It's mostly crusty old guys from the 1990s, and they don't really want to move to a startup and things like that. So this is the one thing that database companies are really, really interested in. Now, as we go along and I talk about the various problems we have to solve in our query optimizer, a lot of you are going to be saying in the back of your mind: can't machine learning solve this? Can I use AI to make this all go away, make it much easier? Yes, but no. There has been some early research in the last one or two years looking at using machine learning to improve the accuracy or the quality of the query plans that the optimizer generates. But as far as I know, this is still early work and no major database system actually supports it. Postgres has a genetic-algorithm-based query optimizer: if a query joins more than about a dozen tables, it falls back to the genetic algorithm. But what we'll describe today is basically how most systems actually do this. I'll also say that applying machine learning to query optimizers is something we're interested in here at CMU; we do have students looking into this now, but it's still very early. I love query optimization so much, and again I'm really bad at it, and that just makes me want to do it even more so I get better at it. So in the advanced class, in previous years we've spent two weeks on query optimizers and cost models, but if you take the advanced class this spring I'm probably going to add another week to go over more complex scenarios. We built our own query optimizer from scratch here at CMU and I'm very interested in making it better. Okay, so for today's class we're going to have four parts covering the different query optimization steps we have to do. We'll talk about how to use relational algebra equivalences to do the rewriting we talked about before; then we'll do plan cost estimation, how you actually say for a potential query plan how many tuples and how much disk I'm going to have to read to execute it; then we'll talk about plan enumeration, searching over different query plans, building up a query plan holistically and looking at the cost at each step; and then we'll talk about the two ways you can do optimization of nested queries. We'll finish up with the remaining time going over the midterm at a high level. Okay? Okay, awesome.
All right, so we didn't really mention this at the beginning of the semester when we talked about relational algebra, but the high-level idea of this rewriting step is that when a query shows up, we generate the relational algebra from the abstract syntax tree, and then, because there are rules we know about relational algebra for permuting an expression while keeping it equivalent, we can apply certain rules to generate a more optimal query plan. This has nothing to do with what my data looks like; these are just general rules we can apply for every single query because they're almost always going to be the right thing to do. The idea is that this can be done efficiently: we just have some sort of rule engine that can check patterns in our syntax tree to see whether we have matches, and if we do, we can easily apply those rules and modify the tree. To give you an example of what I mean by that, say we have a simple join on the student table and the enroll table: we have our join predicate matching the student ID in the student table against the enroll table, and then we have an additional predicate on the enroll grade. If you just take the SQL and generate almost exactly the corresponding relational algebra, you end up with a query plan that looks like this. But what's one obvious thing we can do to optimize this query? Yes, exactly: check the predicate before we join. We have our filter on grade equals A here, but we're doing it after we do the join. So again, always try to think in terms of extreme numbers to see how something would affect performance. Say we have a billion records in the enroll table and only two people got an A in the entire university. That means that when we feed this data into this join operator, we do a join on 1 billion tuples from the enroll table, then we have 1 billion tuples come out as the output, which is fed into the filter operator, but checking grade equals A is only going to match two tuples. So we did a join on a billion tuples where all but two of them is just wasted work, because it's never going to get produced in the output to the user. So a simple optimization we can apply here is called predicate pushdown (or selection pushdown), where we just move the filter operator down below the join. This is sort of a no-brainer and we're always going to want to do this. When you look at the relational algebra for these two different query plans, we know that they're equivalent: in this one we do the join first and then the filter, in this one we do the filter first and then the join. There are rules for relational algebra that tell us these two query plans are equivalent, so the query optimizer can rewrite this query plan and apply this change while still guaranteeing that it produces the correct result. None of these optimizations matter if you produce query plans that give you the wrong result; that doesn't help us. So there are some standard rules: for selections, obviously, perform the filters as early as possible. You can also reorder the predicates so that you apply the more selective ones first.
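Just to make that pushdown concrete, here's roughly what it looks like written out as SQL; the table and column names (student, enroll, sid, cid, grade, name) simply follow the example above, and the second form is only meant to show where the filter logically moves, not how any particular system prints its plans:

    -- Naive plan: join a billion enroll tuples with student, then filter on grade.
    SELECT s.name, e.cid
      FROM student AS s JOIN enroll AS e ON s.sid = e.sid
     WHERE e.grade = 'A';

    -- Logically equivalent plan after predicate pushdown:
    -- filter enroll down to the two matching tuples first, then join.
    SELECT s.name, e.cid
      FROM student AS s
      JOIN (SELECT sid, cid FROM enroll WHERE grade = 'A') AS e
        ON s.sid = e.sid;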
Back to reordering predicates: say I had something like grade equals A AND age above 30. If there are fewer students above age 30 than students with an A, then I want to apply that age filter first. So even though the grade predicate may appear first in the SQL statement, I'm free to reorder the expression any way I want, and I may apply the most selective predicate first so I throw away as much useless data as possible before I go to the next stages. You can also break up complex predicates and push them down selectively: if you have a WHERE clause with a bunch of conjunctions and a bunch of predicates, you don't have to take all those predicates and push them down in the query plan. You may want to push only some of them, and this may be because some predicates are more expensive to compute than others. Say one predicate is computing a hash of some value; maybe you don't want to do that down below, before a join, maybe you want to do it after the join, so you're only spending the computational resources on things you know are going to match, or at least get through the join. There are other optimizations you can do as well, relying on the transitive properties of Boolean logic. So here we have a complex predicate like X equals Y and Y equals 3, but we know that if Y equals 3 then X always has to equal 3, so we can rewrite the expression to be like this. Now maybe we can break up the predicate and do the filter on the X tuples and the filter on the Y tuples separately. Yes? So the question is: if you're doing a sequential scan, does it make sense to always move all the predicates down as close to the scan as possible and never break them up? Right, so the open source guys, I think, always push things down, but there could be a case where a predicate is expensive to execute. Predicates can be anything: you can write a user-defined function that makes a call to an outside system to compute whatever you want, and say that call costs you money. I'm not saying any system would actually do this, but you could recognize that this predicate is going to be expensive, so I don't want to do it before my join, I'm going to do it after my join, because there might be a join filter going on. You may have a billion tuples going into the join, but the join may only produce two things, so rather than invoking that expensive predicate for a billion tuples, you do it after the join. So is the DBMS actually aware of the cost of the predicate?
For that particular example, yes, it would have to be, but my UDF one is a bad example because UDFs are treated as black boxes that the database system doesn't know anything about. The distinction I'm trying to make is that when I say cost, I think in terms of how it relates to the actual database itself: the computational cost of invoking one predicate is sort of independent of how many tuples you have, and the number of times you'd execute that predicate over the total number of tuples is something we'll deal with in the cost model separately. For projections — and this is mostly going to matter for row stores and for distributed databases — the idea is that you want to push projections down as early as possible so that you reduce the size of the tuples you're copying from one operator to the next. We talked about the materialization model, the iterator model, and the vectorization model: you're copying the data an operator generates from one operator to the next, pushing it up the tree. If your tuples are really wide and you're copying the entire thing all the way up, then you're wasting a lot of space on columns you know you're never actually going to need later on. So go back to our example here: at the top of the query plan we're doing a projection on the student name and the course ID. Say the student and enroll tables are really wide, meaning they have a lot of columns, a lot of attributes. It would be better for us to move the projection down below the join, so one side only passes up the student ID and the name, and the other side only passes up the student ID and the course ID, because that's all you need to do the join here. The idea is that from one operator to the next we're copying less data. In distributed databases this matters a lot, because if you have to send data from one node to the next, you're going over the network, and that becomes very expensive. And I already said this isn't going to be an issue for a column store, because the column store can be smart and recognize that it only needs to read data from this one column, so it doesn't stitch the entire tuple back together and pass it up from one operator to the next; it just passes the bare minimum data you need. But if you're in a row store, then as you're scanning you're basically getting everything. Alright, so let's look at some other examples of how to do optimization without a cost model. This comes from a really great blog article from a DBA, where he came up with a bunch of different types of optimizations you can apply to queries without having a cost model, and he looked at how the different database systems actually support them: whether they can recognize the inefficiencies in the SQL query and try to rewrite them to be more efficient. Alright, so the first class of optimization we can apply is recognizing when we have impossible or unnecessary predicates. SELECT * FROM table WHERE 1 = 0: obviously one can never equal zero, so it doesn't make sense for us to scan through every single tuple and apply this predicate, because it's always going to be false and no tuple is ever going to match. So some systems can recognize that this will never evaluate to true and just skip the scan entirely. It'll still give you back a result set, because if I execute a query I always want a result set.
This is not an error; it's still valid SQL. The system can just be smart and say, I don't need to run anything, here's your empty result. Another example would be SELECT * FROM A WHERE 1 = 1. Obviously in our WHERE clause one will always equal one, so the optimization we can apply is to throw away this predicate entirely, rewrite the expression to not include it, and now we just do a complete sequential scan on the table and dump everything into our output. The thing we're trying to avoid is having to check whether one equals one for every single tuple in our table, because that's useless; so it can just rewrite it like this, and these are equivalent. Another optimization we can apply is join elimination. Say we have a query with a self-join: FROM A AS A1 JOIN A AS A2 ON A1.id = A2.id, and up above I'm showing the schema, where id is the primary key, so it's a unique column. There will be a one-to-one match between the id in A1 and in A2 because it's the exact same table, so the database system can recognize that this join is actually useless, remove it entirely, and rewrite the query like this. This is still equivalent, and again I don't need a cost model to figure this out; I can just look at my catalog, see that the column we're joining on is the primary key, or a unique column, and therefore throw away the join and throw away the WHERE clause. Alright, another one is removing useless projections. Here we're doing SELECT * FROM A AS A1, and in our WHERE clause we have an EXISTS whose inner query does a join between the outer A and the inner A, so A1 is the outer and A2 is the inner. Why is this useless? What does EXISTS do? Yes: it checks whether at least one row is returned, and then evaluates to true. So this is basically asking, for my tuple, do I exist in my own table? Yes, it's always true, so this is useless and we can rewrite it like that. Alright, the last one is merging predicates. In this case we're doing SELECT * FROM A, and in our WHERE clause we have two BETWEEN predicates — BETWEEN is a shorthand way of saying greater-than-or-equal and less-than-or-equal — so this is saying WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150. What can we rewrite this to? It's a disjunction: 1 to 150, right, because these two ranges overlap. So instead of looking at every single tuple and evaluating both predicates, I can just rewrite it like this, and now it's much more efficient to execute. So again, there are a bunch of these rules that the various database systems support, some better than others, and we can try this out real quickly to see what kinds of things they support. So let's do this. This is going to be a sample database from the blog article, and I highly recommend you guys check it out; the link is in the slides. He provides a sample database that's based on — that didn't work — that's based on, like, a video store: there are films and people and rentals. So it's sort of a sample schema, and what's awesome about it is they provide it for MySQL, Postgres, SQL Server, Oracle, SQLite — a bunch of different database systems — so we can try the same schema and the same query on all these systems.
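Before the demo, here's roughly what those rewrites look like written out; the before/after pairs are just sketches of the transformations described above, using the generic table A with a primary key id and a column val as in the examples:

    -- Impossible predicate: never scan anything, just return an empty result.
    SELECT * FROM A WHERE 1 = 0;

    -- Unnecessary predicate: 1 = 1 is always true, so drop it.
    SELECT * FROM A WHERE 1 = 1;       -- before
    SELECT * FROM A;                   -- after

    -- Join elimination: id is the primary key, so the self-join adds nothing.
    SELECT A1.* FROM A AS A1 JOIN A AS A2 ON A1.id = A2.id;   -- before
    SELECT * FROM A;                                          -- after

    -- Useless EXISTS: every outer tuple trivially exists in its own table.
    SELECT * FROM A AS A1
     WHERE EXISTS (SELECT 1 FROM A AS A2 WHERE A2.id = A1.id);  -- before
    SELECT * FROM A;                                            -- after

    -- Merging overlapping range predicates.
    SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150;  -- before
    SELECT * FROM A WHERE val BETWEEN 1 AND 150;                            -- after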
Alright, so this one is Postgres; we have MySQL; I've got the Linux version of SQL Server 2017 running on my machine; and then we also have SQLite. Let's start with the first one in Postgres: SELECT * FROM actor WHERE 1 = 0, and here I'm doing EXPLAIN ANALYZE because I actually want to execute the query and see what Postgres does. Again, this is an impossible WHERE clause, so we know it shouldn't look at any data. In this case, as far as I can tell, it looks like it didn't actually run the scan at all, because it says rows = 0 and there's a one-time filter that always evaluates to false; it knows this thing will never evaluate to true, so it just doesn't execute anything and gives you back an empty result. Let's try this in MySQL. MySQL doesn't have EXPLAIN ANALYZE like Postgres does, and its EXPLAIN is not as good as Postgres's — Postgres gives you the query plan tree, which is actually really nice. MySQL, this is 5.7, gives you some metadata, and you see here it says impossible WHERE clause, so it didn't actually run anything; it recognized that this will never evaluate to true, so it knows it doesn't need to scan anything in the table. We'll try this in SQLite. I don't think SQLite gives you a good explanation — yeah, it doesn't really tell you anything, it just says I did a scan on the table. If I remove the QUERY PLAN clause from it, it gives you a bunch of this. I won't explain this now, but basically the way SQLite executes queries is that it converts the query plan into these opcodes — think of the bytecode in the JVM — and then it has an interpreter for them, which is kind of interesting. So that's what all these opcodes are, and this roughly shows you what they're actually doing. As far as I can tell, it doesn't look like SQLite is able to recognize that it's an impossible query. Alright, the last one is SQL Server. SQL Server, as far as I know, does not have EXPLAIN; you have to do this SHOWPLAN thing, and it tells you that it's turned on, and then we can run a query and it doesn't actually run the query, it just tells you what it would do. And here again there's no metadata saying whether it recognized the query as impossible; it says constant scan, and I suppose it would evaluate to false and it knows not to do the scan at all. Alright, let's try the range query one. Go back to Postgres: Postgres is not able to recognize that these predicates can be merged. It rewrote the BETWEEN clauses into greater-than-or-equal and less-than-or-equal, but it still looks like it's going to perform the scan with both of those predicates, both of those ranges. Sorry, let me go back to this: I'm not doing the same range scan I had before, I'm actually doing an impossible range — between 1 and 2, and between 199 and 200 — these are disjoint ranges and we're doing a conjunction on them, so nothing should ever match. We go to MySQL, and it says no matching row, const table — so MySQL actually does the right thing here and recognizes it — whereas Postgres didn't recognize that these are impossible ranges and still did the scan. Let's try this in SQLite: again, as far as I can tell, it's still doing the range scan for the query; it doesn't recognize that these ranges are disjoint. And the last one, where are we — there we go — we'll try this in SQL Server.
Again, SQL Server doesn't tell you anything here with just the text plan; the way to get more detail is to turn on the more verbose XML query plan in SQL Server, but you need a viewer for that, which I don't have — we just have the terminal. So the last one I want to do is the projection pruning, and this is back in Postgres. Here we're going to do a SELECT statement where all we have is an EXISTS whose inner query is SELECT 1 divided by 0. Again, EXISTS returns true if at least one row comes back from whatever we have inside of it. So I could do something like SELECT 1: something will match, because there's a tuple with one attribute whose value is 1, and then the outer SELECT says, alright, is there anything inside that inner SELECT? If yes, return true. That's what we expect. The question is whether Postgres can recognize that no matter what I put in the projection of that inner query, there's always going to be a tuple, so it's always going to evaluate to true, so it doesn't even need to execute whatever that projection is. I could try to compute the millionth digit of pi in there and it's still going to evaluate to true. So we want to see whether Postgres can recognize this and not actually execute anything, and the way we can test it is by putting in an invalid division, 1 divided by 0: if it actually executes that projection, it'll throw an error; if it doesn't, it'll just say true. Raise your hand if you think it'll execute it — one, two — maybe five percent. Who thinks it's smart enough to know not to do it? A few people. It can do it, right. So we can try this in SQLite: it comes back as true. Let's try this in MySQL: it comes back as true. The question, though, is that I'm actually not sure whether MySQL really executes it, because if I just run the division directly, it gives you back NULL, but then it gives you a warning, and for MySQL you have to run this command to see it, and then it says division by zero. If I try that in Postgres it just throws an error: division by zero. So I'm not sure whether MySQL is actually executing this or not, but we can try another case where we give it an invalid power, and that comes out to be true too. So I think it's doing the right thing, because in that case it throws an error if you run the expression directly, whereas the divide-by-zero doesn't throw an error. So again, the main takeaway is that these are all rules the system can use before you even get to the cost-based search, because it knows these are optimizations it can apply that are almost always going to be the right thing to do. Okay.
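If you want to poke at that projection-pruning behavior yourself, the probe from the demo is basically a one-liner; exact syntax and results vary by system and version (SQL Server, for instance, won't accept a bare EXISTS in the select list):

    -- If the system prunes the inner projection, this just returns true;
    -- if it actually evaluates it, you get a division-by-zero error instead.
    SELECT EXISTS (SELECT 1/0);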
Okay, so for single-table queries, picking the right plan is pretty straightforward; joins are the tricky part, this is where we're going to have problems. And part of the reason we're going to have problems is that we can apply both commutativity and associativity rules to our joins and reorder them any way we want, and we'll still always produce the correct result. If I want to join R and S, that's equivalent to joining S with R; it still always produces the correct result. Or if I want to join R, S, and T, I could join R and S first and then join that output with T, or I could join S and T first and join that with R. All of these are equivalent, all of these are valid, but they're going to have different performance characteristics. We saw this when we talked about join algorithms: you always want to put the smaller table as the outer table. So now how are you going to pick which tables to join first, and which one should be the inner versus the outer? What makes this so hard is that the total number of combinations grows roughly like four to the n — and you know it's a big number when it has a special name; it's related to the Catalan numbers. Because there are so many different choices for our join orderings, and we're not going to be able to compare every single one, we need a way to prune the search space, because doing an exhaustive search is just going to be way too slow. So we're going to hold off on this for now; we'll first talk about how you estimate whether one join is better than another, and then we'll see how we can figure out what the join ordering should be. So in the cost-based search, the idea is that you take a query, generate the syntax tree, and now, say there are joins, you want to figure out what order you should join your tables in. You're going to try a bunch of different combinations of join orderings, and you need a way to figure out which one is better than another, so you use a cost model to do this estimation. The cost model is going to generate a number that's internal to the database system — it's not really tied to wall-clock time, it's not going to say this query takes 10 milliseconds and that query takes 100 milliseconds — it's an internal metric used just to compare whether one query plan is better than another. But it's derived from a bunch of different information the system can collect about your data and about your hardware: this join algorithm is going to take this many CPU cycles, I'm going to have this many cache misses, how many blocks will I have to read from disk, how much data will I send over the network, how much memory will I use. The cost model is a combination of all these things — and again the commercial guys are way more sophisticated here than the open source guys — and it uses that to determine whether one plan is better than another. Now, the way you figure out how many resources a query is going to use is based on statistics and information you derive about the underlying database: how many tuples do I have in table A and in table B that I'm joining, how wide are those tuples, how fast is the disk they're stored on. You need to maintain all this information in order to figure out how much it's going to cost to execute the query. So every single database system that has a cost-based query optimizer has to maintain these internal statistics that it collects about the data. If your data is read-only, this is less of a problem — I shouldn't say it's not a big deal, but it's less challenging — because you can just scan over it once and collect as much as you can; it's harder if you're in a dynamic environment where your data is always getting updated.
The way database systems manage this is that they typically have some kind of background stats collector — I think that's what Postgres calls it — that scans through your data every so often, or whenever you invoke it, derives some information about the distribution of values, and maintains that in the internal system catalog, where it can be used by the optimizer's cost model to make estimations. The different database systems have different rules about when to collect this data: it can be something like, if I update 10% of the tuples in my table, then I'll fire off the background stats collector; if I delete everything and load it back in, I'll run it again. You can also manually invoke these things. Postgres and SQLite and a lot of other systems have the ANALYZE keyword, which just tells the system to scan through a table and collect this information; SQL Server calls it UPDATE STATISTICS, and DB2 calls it RUNSTATS. There's nothing in the SQL standard that says how to do this; most of the time you see ANALYZE, but Oracle and MySQL have to be different with ANALYZE TABLE. So what do these statistics look like? At the highest level, the easiest thing is obviously the number of tuples in my table: if I want to know how many tuples I'm going to feed into my join algorithm after I do my scan, I need to know that, and I can derive it directly from the tuple count. But then we need to know something about the values inside those tuples, for all the different columns. The most basic thing we can maintain is the number of distinct values for a given attribute in a table. Let's define this as a function V(A,R): the number of distinct values of attribute A in table R. From this we can derive things we can actually use to estimate the cost of executing query plans. The first thing we'll derive is called the selection cardinality, and this is just the average number of records that have a particular value for an attribute in the table. Think of something really simple: I have 10 tuples and a single attribute with 10 unique values; then the selection cardinality is 1, because if I evaluate a predicate that asks whether the attribute equals some value, I'm going to get back 1 tuple. So it's just taking the number of tuples I have and dividing it by the number of distinct values for that attribute. Now, what's one obvious limitation of taking the number of tuples and dividing by the number of distinct values? He says you'd need to maintain a set of the distinct values; think more high-level than that. It's the distribution, exactly, yes. A big assumption we're making here, and this goes back to the System R days, is that we're assuming a uniform distribution of values. But we know that in reality that's not the case at all. Just think of CMU: CMU has ten colleges, and the School of Computer Science is one of them. There are way more students in SCS than in some of the other schools, like music or something like that. If we assume our data is uniform, then every college at CMU has the same number of students, but that's not the case at all. There are way more people living in New York City zip codes than living in Wyoming or Montana. Real data does not fit this pattern at all, but to simplify the explanation of what our cost estimations are doing, we're just going to make that assumption — and again, we can go through examples of how it doesn't work out.
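To keep the notation straight before the examples, here's a compact restatement of the statistics just defined — this follows the definitions above, not any particular system's catalog format:

    N_R     = number of tuples in relation R
    V(A,R)  = number of distinct values of attribute A in relation R
    SC(A,R) = N_R / V(A,R)
              (the selection cardinality: the average number of tuples with a
               given value of A, under the uniform-distribution assumption)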
Alright, so if we assume our data is uniform, then it's really easy for us to estimate the number of tuples we'd get for equality predicates on unique keys. Say I have SELECT * FROM people WHERE id = 123, and id is the primary key. The selection cardinality for this is obviously one, because only one tuple can ever have a particular value for the id field, so this equality predicate always returns one tuple. Things get tricky, though, when you start having other kinds of predicates: range queries, negations (negations are actually easy), and then more complex things like conjunctions and disjunctions. How do I estimate something like val greater than 100, or age equals 30 AND status equals something? So now we introduce a new concept called the selectivity of a predicate: think of this as the fraction of tuples that qualify for a given predicate, for one component of a WHERE clause. We'll have a bunch of formulas, based on the statistics we collect, to estimate the selectivity, and they depend on what kind of operation you're doing: equality, range predicates, and then conjunctions and disjunctions. We'll go through a bunch of these examples; again, we're going to assume that our data is uniform, and the math works out in most cases, but in reality that's not always true, and commercial systems don't always make this assumption. Alright, we already showed how to handle an equality predicate on a unique key; let's look at an example where the attribute is not unique. Assume we have the table people with an attribute age, the number of distinct values for this age attribute is 5 (0 through 4), and the number of tuples in the table is also 5. If you want to determine whether the attribute equals some constant — again assuming the attribute is not unique — the math works out so that the selectivity is the selection cardinality of the predicate divided by the total number of tuples. Say we're in this example here, where age equals 2. We can represent our data as a histogram: for each age, the number of tuples of people at that age. Again, in this case we only have 5 tuples, so each of these bars equals 1. The number of distinct values along the bottom, the x-axis, is 5, and the selection cardinality of age = 2 is 1, so the selectivity of this predicate is 1 over 5 — one fifth. Simple example, and it works out. Now look at range queries, where age is greater than or equal to 2. You can use this formula where you take the max value, subtract the constant you're comparing against, and divide that by the max value minus the min value; that produces roughly the right estimate, but not an exactly accurate one. The way to think about it is that we want the selectivity of this range here, where age is greater than or equal to 2: we take the min and the max, do the arithmetic, and get one half. It's not entirely accurate, because in this case the correct selectivity would actually be three fifths, but the formula comes out to one half.
So this is a good example where the math is not exactly right; there are some corner cases like this where you don't get it perfect. If instead of greater-than-or-equal it was just greater-than, then the range would start from here instead, and it would be two fifths. For negation queries this is actually really easy: age is not equal to 2. We already know the selectivity of age = 2, so we just take the selection cardinality of age = 2 and negate it, so we get these other bounds here, and now our selectivity is 1 minus one fifth, so we get four fifths. It should be pretty clear at this point that these selectivities look a lot like probabilities: what's the probability that my predicate matches a particular tuple in my input? And if we treat them as probabilities — which, again, is not entirely accurate — then we can take all the rules we learned in probability 101 and our intro math courses and apply them to our predicates. So if you have a conjunction, an AND clause — in this case age = 2 AND name LIKE some wildcard — I can take the selectivity of age = 2 and multiply it by the selectivity of the name predicate, and that produces the total selectivity for the entire predicate. The first region here is the set of tuples that match the first predicate, and P2 is the set of tuples that match the second predicate, so the region where both predicates evaluate to true is just the intersection, and we get that by multiplying the two probabilities together. Now, another big assumption we're making here is that the predicates are independent of each other, and that's why you can just multiply them together. Again, think of real data sets. I could have a table of people and their addresses, with zip code and city, and a particular zip code is always going to be in one and only one city. So if I assume they're independent and I write WHERE city = 'Pittsburgh' AND zipcode = 15217, and I just multiply those two selectivities together, I'm going to underestimate the number of tuples that match, because if I'm in zip code 15217 then I have to be in Pittsburgh — they're not independent, they're correlated with each other. So this math doesn't actually work out on real data sets. There are a bunch of ways the commercial guys handle this, but this is another big simplifying assumption that was made in the 1970s and still carries through today, because it makes things easier to reason about. Disjunctions are the opposite of conjunctions: we want all the tuples that match either this part or this part, so it's the entire combined area, and the math up here makes the same independence assumption when it combines the two selectivities. In the back, yes — the typo? Where, sorry, on this? Yes, yes, I'll fix that, thank you, that's a good correction. And I didn't curse. Okay.
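Pulling the selectivity formulas from the last few examples together in one place — same caveats as above, uniform values and independent predicates, which is exactly where these go wrong on real data:

    sel(A = constant) = SC(A,R) / N_R                  e.g. sel(age = 2)  = 1/5
    sel(A >= a)       = (A_max - a) / (A_max - A_min)  e.g. sel(age >= 2) = (4-2)/(4-0) = 1/2
    sel(not P)        = 1 - sel(P)                     e.g. sel(age != 2) = 1 - 1/5 = 4/5
    sel(P1 and P2)    = sel(P1) * sel(P2)
    sel(P1 or P2)     = sel(P1) + sel(P2) - sel(P1) * sel(P2)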
Alright, so where is this all leading? We want to estimate the result-set size of joins. This is the thing that all the optimizers get wrong, and it's what matters the most. The idea is that, given two tables R and S and some WHERE clause describing how you're going to join them, we want to estimate the number of tuples the join will produce as output. If we're just joining two tables this doesn't matter that much, because mostly we care about the number of tuples getting fed into the join so we can decide which one is the inner versus the outer. But if you're doing multiple joins in your query, it matters a lot to know how many tuples each join produces, because you then want to figure out which join should be at the bottom and which input should be the inner versus the outer at each step — now you're joining derived tables, the outputs of other joins, rather than the base tables themselves. In the interest of time I'm going to skip the math here, but the basic idea is that we estimate the number of tuples getting fed in from one table and from the other table, and then we have a simple formula to estimate how many tuples will match. But as I said, most systems get this horribly wrong — actually, all systems get this horribly wrong — because if you get the selectivity slightly wrong for the joins at the bottom of the query plan, then as you go up, that error compounds; you're stacking errors on errors and it gets worse and worse. We'll cover this in the advanced class; we don't have time here. SQL Server actually does the best job of figuring this out, and most systems underestimate it — you get underestimates if you make assumptions like a uniform data distribution and independent predicates. Alright, so this is fine: we now have some basic formulas we can use to estimate the selectivity of predicates, and we can use that to figure out how many tuples get fed into an operator and how many it will output. But how do we derive that selectivity — what is the underlying information we actually have? The way the database system maintains information about what your data looks like is through histograms. The basic idea is that for every unique value, or range of values, we record how many tuples actually have that value. In this case here, along the x-axis we have the distinct values for an attribute, and the y-axis is the number of occurrences. For this particular example I'm assuming our data is uniform, so there's actually nothing I really need to store: I just need to store one value and say that for every value of this attribute there are five tuples that match it. The problem is that real data looks something like this: the distribution is all over the map — sometimes it's a Zipfian distribution, a power-law curve — but for our purposes here this is fine. So now the problem is that for a particular attribute we would have to maintain a giant map that says, for this value, here's the count. If you think about how you'd actually implement this — for every single distinct value you maintain this mapping — it would be huge. For a column of unique numbers, if you have a 32-bit number that's unique throughout the entire column, then in your histogram map you'd have another 32-bit integer just to keep track of the number of tuples that have that value, and it's always going to be one. So you're essentially doubling the size of every single column, because you have to store this additional information.
So the way to fix this is to use buckets — an equi-width histogram. The idea is that for every three values in our histogram, we just bucket them together, add up the number of occurrences they have, and then for every bucket we just keep track of the min, the max, and the count. We're condensing the information down to a more manageable size. Of course, what's the obvious problem with this? Say that again, sorry — he says the problem is you have to maintain these numbers. Not so much; yes, you have to maintain the numbers, and we talked about that before — you run ANALYZE, or you have this background process that does it. I'm thinking more in terms of the accuracy of this information. Right: there can be wildly different counts within a bucket. We're sort of blindly bucketing together contiguous values in our range, and that may not be the right thing to do, because we're essentially just computing an average; there could be a really low count and a really high count in the same bucket, and we're missing those outliers. So the way to fix this is to use what are called equi-depth histograms, sometimes called quantiles. The idea is that for every bucket the count is going to be roughly the same; it's the number of distinct values in the bucket that varies. So in these three buckets here — bucket one, bucket two, bucket four — the count will be 12; we add as many values as we can to get the count up to 12. This middle guy here is sort of an outlier: he has a larger bucket width, and the count is only 9. So we end up with something like this: the count per bucket is roughly the same, but the number of distinct values in a bucket can vary. Another option is to maintain a separate data structure for heavy hitters: if I know I have one value whose number of occurrences is way different from all the other values in the column, I'll track it in a separate structure, because otherwise I wouldn't get accurate predictions for it. So in practice it's a combination of these things that makes this all work. Again, these histograms are essentially compressing the metadata about your data; we have to do this to make it manageable, but we sacrifice accuracy.
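As a tiny made-up illustration of the difference between the two bucketing schemes (the counts here are hypothetical, not from the slide): suppose an attribute has six distinct values, 0 through 5, with tuple counts 8, 1, 1, 1, 1, 8 — 20 tuples total.

    equi-width (every bucket spans two values):    [0-1] = 9    [2-3] = 2    [4-5] = 9
    equi-depth (every bucket holds ~7 tuples):     [0]   = 8    [1-4] = 4    [5]   = 8

The equi-width buckets hide the skew inside [0-1] and [4-5] (an 8 and a 1 get averaged together), while the equi-depth version spends its buckets where the tuples actually are.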
Now, there's another approach we could use instead of maintaining histograms to estimate the number of tuples for a particular value. I try this every year and nobody ever guesses it: sampling. The idea is that rather than checking a histogram to get the selectivity of a predicate, I collect some sample data from the underlying table, run a complete sequential scan on that sample, and use that to figure out the selectivity of my predicates. Say I have a query that selects all the people with age greater than 50, and the table has a billion tuples. I'll just take a sample, usually a much smaller amount, and store it in a sort of temporary table that's only accessible to the query optimizer. It's treated like a regular table, using all the same table storage and heap storage machinery we talked about before, you just can't run queries against it yourself. Then when my query shows up and asks for the selectivity of age greater than 50, I just do a sequential scan on my sample to figure it out; I don't look at any histograms. The idea is that if the sample is reasonably large, it will be an accurate reflection of what the underlying table looks like.

Yes, one question. The question is about the equi-width versus equi-depth naming. The first scheme, where every bucket covers a fixed range of values (three values per bucket in the example), is the equi-width one; this second scheme, where every bucket holds roughly the same count, is the equi-depth or quantile one.

All right, we have about ten minutes left, so I'm going to pause here and come back to this after the midterm, because I want to get through the review. But to recap where we're at: we have a way to derive information about the tables we run queries on, and we have some formulas we can use to estimate the selectivity of predicates. The next phase is figuring out how we can use that information to guide our search for a better query plan. Let's pause on this for now and go to the midterm review.

So again, the midterm is this Wednesday, in this class, at the same time. Everyone needs to take it, no one is excluded. You should bring your CMU ID, because this is a large class and I need to be able to check who's who. You need to bring a calculator to help with the basic formulas we've talked about, and you're allowed to bring one double-sided page of handwritten notes with anything you want on it, formulas or whatever. That means you can't take the slides, shrink them down, and try to fit them onto a page; everything has to be handwritten. I do that because I think you get more out of it that way than by just copying and pasting.

This is my list of what not to bring, and it gets longer every single year. A few years ago someone brought a live animal; don't do that. Last year somebody brought their wet laundry from their dorm, and I was like, what the hell is going on? It wasn't this room, it was another room; they brought it to dry while taking the exam, and it smelled kind of funky because it had been sitting in the washing machine overnight. So do not bring your wet laundry. If there's something crazy you want to bring, let me know beforehand and we'll figure out whether it's okay, because otherwise I have to add you to the list here. No snakes, no dogs, no wet laundry.

The exam covers everything up to and including the joins lecture, lecture 12. It's closed book; you get one sheet of handwritten notes, double-sided. If you need any special accommodations, please email me now and make sure those things happen. And this is the link to what I posted on Piazza for the midterm study guide: it lists the textbook chapters relevant to the exam, a link to the solutions for (I think) the odd-numbered problems if you want more things to look at, and a link to the practice exam we gave out a few years ago.

So what's going to be on the exam? Obviously the relational model, with a focus on integrity constraints, what they mean, and relational algebra. We didn't cover any relational calculus, so don't worry about that even if it's in the textbook.
I'm not going to ask you to write anything too complex; just understand what these basic operators actually do. For SQL, everyone should already know it from the first homework assignment, so I'll ask high-level questions about the more complex things we covered: joins, aggregations, common table expressions, the things you did in that assignment. I won't cover window functions because you weren't able to do those in SQLite. Again, we're not going to ask you to write really complex queries, because you'd have to write them on paper and we can't run them to see whether they actually work; we're more interested in the high-level concepts of what these queries are doing. And since this comes up every year: yes, common table expressions include recursive expressions.

Then we get into the actual implementation of the database system. We covered buffer management policies: LRU, MRU, clock, and how to handle sequential flooding. We talked about how to represent the database on disk: heap files as linked lists or page directories, things like that. We talked about what's inside a page: the slotted-page layout, the log-structured layout, how to represent data in tuples, and what's in the header.

Then we talked about hash tables. There are two high-level approaches: the static hash tables, like linear probing, Robin Hood, and cuckoo hashing, and the dynamic hash tables, extendible hashing, chained hashing, and linear hashing. We're not going to ask you to write code; it's more about the high-level concepts of what these hashing schemes do. I don't care about hash functions, it's about the hash table itself. You should also know how these data structures compare with B+ trees: when would you want a hash table versus a B+ tree?

Then we talked about order-preserving tree indexes, and we spent a lot of time on B+ trees, which you're building for the second project: insertions, deletions, splits, merges, how they differ from a B-tree or a B-link tree at a high level, and how to do latch crabbing or latch coupling, including leaf node scans, plus basic optimizations like delayed parent splitting. We also talked a little bit about radix trees and skip lists. I realize those aren't in the textbook, so we won't ask anything complicated; it's more about the high-level design differences between those data structures and B+ trees.

Then we talked about sorting, in particular the external merge sort. We covered two variants: the two-way merge sort and the K-way, or general-purpose, external merge sort. The basic questions will be like what you're seeing in the current homework assignment: if I have a data file that looks like this and I have this many buffers, how many passes am I going to do, or how quickly can I sort it?
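For that pass-count style of question, here is a refresher sketch of the standard formula, with N data pages, B buffer pages, and the usual textbook assumptions: pass 0 produces ceil(N/B) sorted runs, and each later pass merges B-1 runs at a time.

```latex
\[
\#\text{passes} \;=\; 1 + \left\lceil \log_{B-1}\!\left\lceil \tfrac{N}{B} \right\rceil \right\rceil
\qquad
\text{total I/O cost} \;=\; 2N \cdot \#\text{passes}
\]
% Worked example: N = 108 pages, B = 5 buffer pages
%   pass 0:       \lceil 108/5 \rceil = 22 sorted runs
%   merge passes: \lceil \log_{4} 22 \rceil = 3
%   total passes = 1 + 3 = 4, total I/O = 2 \cdot 108 \cdot 4 = 864 page reads/writes
```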
For query processing, we talked about the different processing models: iterator (also called volcano), vectorized, and materialized. What are the design trade-offs and performance characteristics of these approaches, and for what kind of workloads would you want one versus another? And then we talked about join algorithms: nested loop joins, hash joins, and sort-merge joins. Again, what are their performance implications: if I give you a data set with this many tuples, this many pages, this many blocks, how many disk I/Os will it take to do the join?

So that's it. Any questions? Everyone feels fairly confident? It should be fun. Okay, then we're done. Next class, after the exam, we'll finish off the query optimization material and then talk about parallel query execution. The idea is that everything we've focused on so far has been running single-threaded, your joins or whatever you want to do. Now we ask: in a modern database system with multiple cores, how does execution change if you can run parallel threads? Just to be clear, this is not the same thing as distributed execution, which will be covered later on; this is a single box, how can I run things in parallel. All right, I have office hours now; email me if you have any questions, or post on Piazza and we'll take care of it later today. All right guys, see ya.