Okay, let's get started, so we'll just jump right into it. The schedule for you guys: the major thing, obviously, is that two days from now, on Wednesday, in this classroom, we will have the midterm exam. The study guide has been online since last week, and so has the practice exam. Over the weekend I also uploaded all the lectures, all the lecture notes, and all the slides for everything that's covered on the exam. There are additional problems in the textbook if you want to look at those, and the solutions for the odd ones are available online. Any high-level questions about the midterm? You need to bring your CMU ID, because it's a class of a hundred people and I don't know everyone. You should bring a calculator if you can't do logs in your head. And you're allowed one 8.5-by-11 sheet of paper, double-sided, with handwritten notes. No shrunken-down slides, no shrunken-down lecture notes, no shrunken-down textbook; everything has to be written by hand, because you'll get more out of it that way. Any questions about any of these things? Okay, good. And then project two will be due on Sunday at midnight as well. I want you to post on Piazza with questions, things like that, and the TAs are helping as needed. All right, so we have a lot to discuss today. I don't think I'm going to get through all of it, so it might spill over to next week, after the exam, but we'll plow through and see what happens.
So today we're talking about query optimization and query planning. The high-level idea of what we're trying to do today is this: SQL is declarative, meaning the application that sends us the query says what answer it wants us to compute. Select this from this table, give me all the employees that have this particular attribute. It's not telling us how we're actually supposed to compute this. It doesn't say do a hash join versus a nested loop join, although you can do hints and things like that, but we can ignore that for now. The application is just telling us what answer it wants, and so it's up to us inside the database system, building the database system software, to take that SQL query and figure out the best way, the most efficient way, to execute it. We saw in previous weeks, when we talked about join algorithms, that there can be quite a big difference in performance depending on what algorithm you choose. The stupid nested loop join would take 1.3 hours for a table of 6 megabytes, but if you do a hash join, the worst case was about 0.5 seconds. So for this one it's pretty obvious, right?
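As a rough sketch of the arithmetic behind numbers like those, here are the standard textbook I/O-cost formulas for the two algorithms. The page counts, tuple counts, and per-I/O latency below are assumed illustrative values, not the lecture's exact tables:

```python
# I/O cost formulas (textbook versions):
#   simple nested loop join: M + (m * N) disk I/Os
#   hash join (no recursive partitioning): 3 * (M + N) disk I/Os
M, m = 1000, 100_000   # outer table: pages, tuples (assumed)
N = 500                # inner table: pages (assumed)
IO_MS = 0.1            # assumed cost of one disk I/O, in milliseconds

nested_loop_ios = M + m * N    # re-scan the inner table once per outer tuple
hash_join_ios = 3 * (M + N)    # read, partition, then probe each page

nested_loop_hours = nested_loop_ios * IO_MS / 1000 / 3600
hash_join_secs = hash_join_ios * IO_MS / 1000

print(f"nested loop: {nested_loop_hours:.1f} hours")   # -> 1.4 hours
print(f"hash join:   {hash_join_secs:.2f} seconds")    # -> 0.45 seconds
```

Same data, same answer, and the only thing that changed is the physical algorithm; that gap is what the optimizer exists to avoid.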
We don't want to do the stupid thing, and here we can figure that out, but when we start getting more complex queries and more complex joins, figuring out how exactly to go from this to this is not always obvious. And this is what's going to separate the high-end database systems that are very expensive (the Oracles, the DB2s, the Teradatas, the SQL Servers) from the open-source ones and the free ones. Postgres is still very good, but its query optimizer is nowhere near as sophisticated as SQL Server's, for example.

So the idea of a query optimizer goes back to the 1970s. I've talked about System R a little bit in this class so far. Back in the 1970s, Ted Codd wrote the first paper on the relational model, and there were basically two groups that picked it up and tried to actually implement it, because again, Codd was a theoretician, a mathematician. He didn't actually have any software; he just proposed the idea of the relational model and said this is the right way to build database software. There was a group at IBM in San Jose and a group at UC Berkeley who took that paper and built the first two relational database management systems, or at least the most famous ones. The one at Berkeley was called INGRES. If you ever wondered why Postgres is called Postgres: the same guy who made INGRES also made Postgres, because it's post-INGRES. That's where the name comes from. He was my advisor when I was in graduate school. The IBM guys built this thing called System R, and the project is actually really fascinating. They got eight or nine people who all had PhDs in math and computer science (this is early computer science, so there weren't many people with graduate degrees in the field), put them in a room, and said we're going to build a relational database system. Each of those PhDs carved off a part of the problem. One person worked on the storage layer, one person worked on the execution engine, one person worked on concurrency control. And one particular person, Pat Selinger, picked query optimization, the query optimizer. Now back then, the idea that a database management system could take a query in a declarative language like SQL was controversial. Codd never actually proposed a language initially with the relational model; SQL came later on and was invented at IBM. The Berkeley guys had this other thing called QUEL, which looks a lot like SQL, but the syntax is different. Mike Stonebraker, the guy I worked with in grad school, claims QUEL was superior to SQL. I disagree, and nobody writes QUEL now. But back then, people argued that there's no way a database management system could take a high-level language like SQL or QUEL and generate a query plan as efficient as what a human could write by hand, because that's what people were doing before the relational model: they were writing these query plans by hand, writing the for-loops to do joins and scans themselves. It's the same way people argued when the C language came out of AT&T: people thought, oh, C is too high-level, the compiler is never going to generate machine code as efficient as what humans can write in assembly. And of course now we know that, for the most part, nobody writes assembly by hand. Everyone writes in higher-level languages, even higher-level than C, and compilers do a pretty good job; in many cases they can do even better than the average human. This is what IBM proved back in the 1970s: you could take a declarative language like SQL and have the query optimizer, the planner, generate a query plan that was as good, if not better, than what a human could do, or at least the average human.

So we'll talk about how the System R optimizer works. But what I'll say as we go along, leading up to the cost-based search that IBM invented back in the 1970s, is that a lot of the design decisions and assumptions they made about what the data looks like and what the query plans look like, to simplify the problem and make it tractable, are still used in practice today. I'll go through what those are as we go along. There are essentially two approaches to doing this, and again, this also gets into what distinguishes the good optimizers from the bad optimizers. The first approach is to use static rules or heuristics. The idea here is that if we look at our query plan and a portion of it matches a certain pattern that we know about, then we fire off a rule that applies some kind of transformation or rewriting of the query plan to make it more optimal. For example, if you identify stupid things, like a predicate where one equals zero, then you can have a rule that strips that out very easily. The important thing to understand about these rules is that we may need to look at the catalog. The catalog is the metadata about the data: it tells us what our database looks like, what tables we have, what columns and attributes they have, and so forth. So for these rules we may have to consult the catalog to understand what our tables look like, but we never actually have to go into the actual data itself. We can fire off these rules without going to a table and asking what its tuples actually contain. The alternative is cost-based search. Unlike the rules, for this one
you are going to have to look at the data in some way. The idea of cost-based search is that we're going to enumerate a bunch of different plans in different ways, ideally in an intelligent manner that avoids redundant or stupid options. We generate a bunch of different plans that we could choose to execute a single query, and then we use some kind of cost model to approximate the execution costs of those different plans. And of course the idea is that we always want to pick the one with the lowest cost. Obviously, getting that cost model to produce accurate estimations is super hard; we'll see why as we go along. The way you enumerate the plans is difficult too, because, as I said, you have a finite amount of time to look at a bunch of different options, and you don't want to spend hours and hours doing a branch-and-bound search for a query that may take one second to run. It's actually quite amazing how fast these things can be. Every time in this class (and we'll show some demos later on), every single time I open up the terminal, write a query, and hit enter, it was doing both of these things in milliseconds. Now, again, the queries we look at are pretty simple.
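Here is a toy illustration of the cost-based idea: enumerate candidate left-deep join orders for three tables, score each with a cost model, and keep the cheapest. The table sizes and the single join-selectivity constant are made-up numbers standing in for the catalog statistics a real system would use:

```python
from itertools import permutations

# Assumed catalog statistics (illustrative only)
table_rows = {"student": 10_000, "enrolled": 1_000_000, "course": 500}
SELECTIVITY = 0.0001  # assumed fraction of the cross product that joins

def plan_cost(order):
    """Toy cost model: total tuples flowing out of each intermediate join."""
    rows, cost = table_rows[order[0]], 0.0
    for t in order[1:]:
        rows = rows * table_rows[t] * SELECTIVITY  # est. join output size
        cost += rows
    return cost

# Enumerate every left-deep order and pick the cheapest one
plans = {order: plan_cost(order) for order in permutations(table_rows)}
best = min(plans, key=plans.get)
print(best, round(plans[best]))  # the cheapest join order and its cost
```

Even here, the worst order costs about twenty times the best one, and a real optimizer faces far larger search spaces with pruning and time budgets, which is exactly the enumeration problem described above.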
So it's not that mind-blowing, but even still, you can't just run this thing forever to find the best possible plan; you try to approximate it. So the pipeline for our query optimization path looks like the following. At the very beginning we have our application; the application connects to the database system and sends us a SQL query. The first stage the query goes through in our optimization pipeline is called the SQL rewriter. The idea here is that, given SQL, we can have some transformation rules that allow us to rewrite the SQL in certain ways. Sometimes this occurs in distributed databases, or if you have views: the query will have an alias or a table name, and this stage can say, I see this table name, let me rewrite it to be something else, or let me annotate it with additional information to say this particular table can be found on this node over here or this disk over here. This stage is optional and not that common; most database systems don't want to operate directly on SQL, but it's something you could do for additional optimization. Then we take the SQL query that comes out of this, if we have a rewriter (otherwise it comes straight from the application), and pass it through our SQL parser. This is just the lexer and tokenizer stuff you'd see in a compilers class; there's nothing fancy going on here. We're just converting a SQL string into tokens, the syntax tree. Now we feed the syntax tree into our binder, and the binder is responsible for converting the named objects referenced in our SQL query into some kind of internal identifier. We do this by consulting the catalog. So let's say my query is SELECT * FROM foo. I don't want the rest of my query plan to have to operate on the string "foo". I go to the catalog and say, hey, do you have a table called foo? If yes, give me some internal identifier that allows me to find it later on. If it doesn't have it, if the table foo doesn't exist, then at this point we can throw an error and say, you're looking up a table we don't have, or a column we don't have. The binder is then going to emit a logical plan. I'll explain in more detail what this is in a second, but the high-order bit is that there's a logical plan and a physical plan. A logical plan says at a high level what the query wants to do: I want to scan this table, or read data from this table, or join these two tables. It doesn't say how you're actually going to do that; that's the physical plan, the actual specification of which algorithm to use, and that comes later on. So the binder outputs a logical plan; it's basically a converted form of the syntax tree, now with internal IDs, that roughly approximates the relational algebra. We can then feed this into a tree rewriter. Again, this is optional, but it's actually more common than the SQL rewriter; most systems have this, because these are the static rules. To do the tree rewriting, we potentially have to go to the catalog and ask, hey, what do our tables look like? What do their attributes look like?
But we don't have to go to the cost model, because these are just static rules that we want to apply for every query no matter what the actual data looks like. The tree rewriter then spits out the same kind of logical plan that the binder spat out. Now we feed this into our query optimizer, and this is the black-magic part. This is where we're actually going to do the search, using a cost model to figure out what the best plan for us is. This uses a combination of schema information that the catalog provides us and the estimates that our cost model can provide: this query plan is going to take X amount of work, this other query plan is going to take Y, X is less than Y, so I want to pick the first one. But as we'll see as we go along, this cost typically is a synthetic number that the database system computes internally and that has no meaning to the outside world. You can't take a cost model estimate and say, oh, that's going to be 20 seconds, or 20 minutes. It's just an internal thing that lets the system say one query plan is better than another; it's only used to compare plans relative to each other inside the database system and has no bearing on the outside world. So if Postgres's cost model spits out a number and MySQL's cost model spits out a number, you can't compare them; it's apples and oranges. Some systems will try to have the cost model's estimates be in terms of actual time, but that's tricky and usually not reliable, because hardware changes, and the environment changes, and so forth. So now the optimizer is going to spit out a physical plan, and the physical plan is what our database system can actually execute. When we talked about query execution, I showed these plans where you have sequential scans or index scans feeding into hash joins or sort-merge joins, then a hash aggregation or a sort aggregation; that specification happens here. Once you get outside the optimizer, there's not really anything else we have to do other than execute the query plan. So again, at a high level, this is how every single database management system implements it. Some stages, like the SQL rewriter and the tree rewriter, you don't necessarily have to have, but the parser, binder, and optimizer are the standard pipeline. Just to reiterate the logical-versus-physical distinction: the way to think about it is that the database system is going to generate a logical plan that is roughly equivalent to the relational algebra expressions within our query. It's not always a one-to-one mapping, but at a high level you can think of it that way. I want to do a selection on this table, a filter, a projection, a join: those are all contained in a logical plan. All those query plan trees I showed you before, where I just had the relational algebra symbols and didn't annotate them to say what algorithm I was using, would be considered logical plans. The physical operators in the physical plan are where we specifically define what execution strategy we're going to use for each of those operators in our query plan. I'm going to do an index scan using this index, in this order.
I'm going to feed the output of my index scan into a hash join operator. It's all the low-level details of how we actually execute it. There's a bunch of metadata we have to maintain in our physical plan that we don't really care about at this point in this class. But, for example, if I know I have an ORDER BY up above, and I can do a sort-merge join where the join key is the same as the ORDER BY key, then I can annotate my physical operators in the query plan to say: the data I'm spitting out is, by the way, sorted on this key. The planner can then reason about that and drop the ORDER BY up above. So it's roughly a one-to-one mapping from relational algebra to the logical plan; not always true, but by and large it is. But for the physical plan, we can't assume it's a one-to-one mapping from the logical to the physical. Again, I could have a join plus an ORDER BY, but if I do a sort-merge join as my physical operator, then I can get rid of the ORDER BY operator up above. The way to think about this is that the rewriting stuff we'll talk about, and then the cost-based search, are all operating on logical plans, but the final result always needs to be a physical plan. Okay, so before we get into the nitty-gritty details, I'm going to say that this is super hard. This is the hardest part of database systems. This is actually the part I know the least about; it's why I'm so fascinated by it and always want to do more of it, because I don't fully understand it. If you're good at doing query optimization, you can get a job immediately, and people will pay you a ton of money to do it, because, as I said, IBM invented this stuff in the 1970s, and there was a lot of query optimization work done in the late 80s and early 90s, but now it's all crusty old people who are retiring or moving on.
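The sorted-output trick just described can be made concrete with a tiny sketch. Everything here (the class names, the plan representation) is hypothetical, not any real system's planner API; it just shows how a physical property of one operator choice can eliminate another operator:

```python
from dataclasses import dataclass

@dataclass
class LogicalJoin:
    key: str   # join key from the logical plan

def pick_physical(join, order_by_key=None):
    """Choose a physical operator for a logical join (toy rule, no cost model)."""
    if order_by_key is not None and order_by_key == join.key:
        # A sort-merge join emits tuples already sorted on the join key,
        # so the ORDER BY operator above it can be dropped entirely.
        return ("SortMergeJoin", [])
    # Otherwise default to a hash join; any required sort stays in the plan.
    extra = [("Sort", order_by_key)] if order_by_key else []
    return ("HashJoin", extra)

print(pick_physical(LogicalJoin("sid"), order_by_key="sid"))
# -> ('SortMergeJoin', []): the explicit sort disappeared
```

A real optimizer tracks such "interesting orders" as properties on plan fragments and lets the cost model decide whether the sort-merge join is worth it; this sketch hard-codes the choice to keep the idea visible.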
You know, what 21-year-old knows about query optimization, other than my students? So if you can do this, you can get a job immediately. That's the one email I always get; I should show screenshots of the emails I get from friends at database companies. (I'm not wearing a mic; maybe it'll pick up most of it.) All right. So if you can do this, you'll get paid a lot of money, because it's super hard, and it's hard to hire people to do this kind of stuff. One database company told me they find that people with a programming-languages background can do this. Another database company told me that people with backgrounds in high-energy physics can do query optimization. There's a joke in databases: people always say, oh, query optimization is as hard as rocket science. And the punchline is, no: if you fail at query optimization, your backup career plan can be rocket science, because query optimization is even harder than that. Again, this is what separates the high-end guys from the open-source guys and the smaller systems. Oracle, SQL Server, IBM, Teradata, all the enterprise systems that have been around for a long time, have spent millions and millions of dollars and hired hundreds and hundreds of people to work on these things, and they're quite sophisticated. So if you can do this kind of stuff, you'll be in demand. Another angle we're not going to talk about much in this class: you may say, all right, this is super hard, can machine learning solve it? Can AI solve it? No, but yes, right?
People have tried applying machine learning, and people are trying again now, more recently, and seeing some promising results, but it's still not anywhere near what the commercial systems can do. IBM actually tried something similar back in the early 2000s, this thing called LEO, the learning optimizer. It turns out it sucked. They shipped it in production, but every DB2 DBA I've ever talked to says the first thing they do when they install DB2 is turn off that learning crap, because it never worked; it made things worse. So machine learning is a potential way to improve things, but it's not going to be a magic bullet that solves everything. As I said, this is the one thing about database systems that I'm probably most excited about and know the least about. Last semester, in the advanced class, we did three lectures on this; I think I'm expanding that to four or five next semester, just because we built our own query optimizer at CMU. It's super hard for me to get students to come work on it, but they all get jobs immediately. This is something we'll cover more in the advanced class. I don't want to badmouth any particular systems, but some systems' query optimizers are total crap. We'll cover that later. All right, so we have a lot to cover today, and, like I said, I don't think we're going to get through all of it, so this might spill over into next week.
I don't know why I didn't just break this up into two lectures beforehand, but it is what it is. First we'll talk about relational equivalences, the core underlying concept that's going to allow us to do manipulations and transformations of our query plans to find better alternatives. Then we'll talk about how we do cost estimation: how do we take a query plan and, with our cost model, estimate how much work we have to do to execute it? Then we'll spend more time talking about plan enumeration, the search strategy for finding different options in the cost-based search. And then we'll talk about how to handle nested queries. We'll probably only get through half of this, but that's okay. So, as I said, the core underlying concept that we're going to take advantage of in query optimization is the fact that we understand the high-level properties of relational algebra, and therefore we can permute or transform a relational algebra expression, or a plan that's equivalent to some relational algebra expression, in different ways and still produce the same result. We say that two relational algebra expressions, or two query plans so to speak, are equivalent if they produce the same set of tuples, and if they're equivalent then it's valid for me to transform my original relational algebra expression into the new expression. Notice I'm saying a set of tuples here, because remember: the relational model is unordered. Unless I have an ORDER BY clause, I don't care that one query plan produces tuples ordered this way and another query plan produces tuples ordered that way. They're still equivalent, and that means there are more options available to us when figuring out what a better query plan would be.
So we can apply the associativity, commutativity, and transitivity properties of relational algebra, and standard logic, to change the expression in different ways, moving operators around to produce a more efficient plan. This high-level technique is called query rewriting. This is that rewriting step I was talking about before: right after the binder, we can do our tree rewriting, where we look at the relational algebra representation of a logical plan and move things around to produce something more efficient. Let's look at a really simple example. Say I have two tables, student and enrolled, and I want to do a join between the two tables to get all the students that are enrolled in a class where they got a grade of A. If I took an almost literal translation of the SQL statement, I would roughly come up with a relational algebra expression like this: do a join on student and enrolled, then apply the filter on grade, and then the projection where I just produce the course ID and the name of the student. So what's a really simple optimization we could do for this? Yes? Right, he said push the filter onto the enrolled side, and he's exactly right. This technique is called predicate pushdown. He said push it "into"; the typical term is push "down", because you're pushing it down into the query plan. If this is the query plan tree, we can take this filter and have it run before the join. Why are we doing this? We want to reduce the amount of work we have to do in our join operator. Rather than having the filter above the join, where I take all the students and all the enrolled records, do the join on them, and then apply the filter, it's better for me to do the filter early. Always think of it in extremes: say my enrolled table has a billion tuples, but only one student ever got an A in any class. Then I would do a join on a billion tuples only to end up producing a single result as my output. But if I do my filter early, then I'm doing a join on one tuple, and in general the billion-tuple join is more expensive than the one-tuple join. So this is the general idea of what we're trying to do in query optimization: we're trying to identify cases where we can end up doing less work, because at the end of the day that's what we want. If we do less work, we'll have lower cost, faster runtimes, and potentially require less hardware. Yes? The question is, would a query optimizer also be able to change the join ordering? Yes, we'll get there, but not yet. The point is that this is a super simple example I can do as a heuristic: given a query plan, I always want to push down the predicate, and I don't need to look at what the data looks like to do it. To do what he's asking, deciding which table should be the inner versus the outer, I need to know how much data is getting fed into the join, because the smaller table should always be the outer table, and that requires a cost model. We can do this predicate pushdown without ever looking at the data. Yes? The question is, what does this look like in relational algebra? It's just moving the grade filter to be inside, right? There are parentheses, so this is saying: do the filter on the enrolled table first. From an algebraic standpoint, yes. Correct, yes: e.grade. Good question. Okay.
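The pushdown transformation itself can be sketched as a tree rewrite. This is a toy illustration with a made-up plan representation (nested tuples), not any real system's API; it moves a filter below a join when the filter references only one side:

```python
def push_down(plan):
    """Push a filter below a join when it touches only one input table.

    Plan nodes are tuples:
      ("scan", table)
      ("filter", table, predicate, child)
      ("join", left, right)
    """
    if plan[0] == "filter":
        table, pred, child = plan[1], plan[2], plan[3]
        if child[0] == "join":
            left, right = child[1], child[2]
            if left == ("scan", table):
                return ("join", ("filter", table, pred, left), right)
            if right == ("scan", table):
                return ("join", left, ("filter", table, pred, right))
    return plan  # no rule matched: leave the plan unchanged

before = ("filter", "enrolled", "grade = 'A'",
          ("join", ("scan", "student"), ("scan", "enrolled")))
after = push_down(before)
# The filter now sits directly above the enrolled scan, below the join.
```

Note that the rule only pattern-matches on plan shape and table names; it never inspects tuples, which is exactly why it can run as a static heuristic without a cost model.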
The question is, can predicate pushdown ever be bad? Yes. In this really simple example, a something-equals-something predicate is cheap to evaluate, but some predicates can be expensive to compute, and this is where the database system's cost model comes in. Let me give an example. We'll cover this next class, but there are things called user-defined functions. Instead of an expression that just touches a constant or an attribute in a tuple, I can have a function call that runs arbitrary code: a user-defined function written in C or Python, where the Python code makes a call out to a remote server, does some kind of computation, pays in Bitcoin for the microtransaction, and comes back. So if I know that the join predicate here is super selective, meaning there are not going to be very many tuples coming out of the join, I may be better off applying that expensive predicate up above, because monetarily it's more expensive than the join itself. So you don't always want to push things down, but in general, yes, you do. My microtransaction case is a far-fetched example; you could do it, but it's stupid, and you wouldn't want to. There are more realistic cases, though: if a predicate is computing a hash or some kind of crypto operation, you may want to keep it up above. Again, the database system can reason about that.
Okay, so let's now go through a bunch of the different operators and talk about what kinds of optimizations we can apply to the other relational operators in our query plans. We've already covered predicate pushdown: the idea is that we want to do the filtering as early as possible, ignoring for now the cost of applying the filter itself, because that allows us to throw away data more quickly. In a disk-based system, applying the filter sooner rather than later is usually better, because then I'm not copying as much data up the plan and polluting my memory. You can also reorder the predicates themselves so that the more selective ones are applied first. Say that in my last example, where I had grade equals A, I also had another predicate: age greater than 99. So, find me all students at Carnegie Mellon that got an A in the class and are older than 99 years. I don't think there is anybody, right? So I'm better off applying the age predicate first, because that's going to filter out way more things before I ever look at the grade predicate. Now you may say, isn't that just a slight computational overhead? Yeah, for a disk-based system maybe it doesn't matter that much, but in other systems, like an in-memory system, it can matter a lot. Yes? Yeah, he's right. His point is that for this reordering, unlike pushdown, we are looking at the selectivity of the data, so you couldn't actually do this one as a pure static rule. That's true. But pushing down, you always want to do. Yes? The question is, why would breaking the predicates up be better than evaluating them all at once? Again, if the computational cost of one of the predicates is higher than another, maybe you want to break that one off and have it be up above in the query plan.
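The reordering idea can be sketched as follows; the per-predicate selectivity numbers here are invented stand-ins for the catalog statistics a real system would estimate:

```python
# Each entry: (label, predicate function, assumed selectivity estimate)
predicates = [
    ("grade = 'A'", lambda t: t["grade"] == "A", 0.10),
    ("age > 99",    lambda t: t["age"] > 99,     0.001),
]

def apply_in_order(tuples, preds):
    """Apply the most selective predicate first so later ones see fewer tuples."""
    for _label, fn, _sel in sorted(preds, key=lambda p: p[2]):
        tuples = [t for t in tuples if fn(t)]
    return tuples

rows = [{"grade": "A", "age": 20}, {"grade": "B", "age": 102}]
print(apply_in_order(rows, predicates))   # -> []
```

With the age predicate evaluated first, almost every tuple is discarded before the grade comparison ever runs; the answer is identical either way, only the work differs.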
Yes Do you always want to break it up if you're like saying like a column store setting where like Yeah, it's not a place so if it's a column store Then maybe I want to do a pass through the column first That's everything out and then for the things that match jump to the next one. Yes, that that is the right example And then for this one here again, you can simplify complex This is it you know simplify complex predicates to be more easy to compute. This is a trivial example But x x equals y and y equals 3 well again through through the transitivity calls We know that x is just this is just x equals 3 so we can rewrite it like this So now what happens is that in when we was going through a billion tuples We can just look at each attribute x and y check to see whether equivalent to a to a constant Which is way cheaper than having to go get the reference of the y attribute in the tuple and then copy it You know to some register or some variable and then doing the comparison This is like a micro optimization that mostly only matters for the in-memory guys All right for projections we can also push them down as early as possible And the idea here is that we want to minimize the amount of data We have to copy going you know from one operative the next and this mostly matters in the row store systems Because if the rows really wide and I'm moving that entire tuple up from one operative the next then if I can Strip out as much data that I don't need it early as possible then I'm copying less data from one step to the next Right so and say in this example here say the the student table has a thousand columns But my query only needs two attributes and he's the student ID and he's the student name So I can then introduce now Projection in here before I feed into my join operator So I strip out the the 998 columns that I don't need and only pass in the two that I do need All right, this is very actually very common in distributed databases because Moving data over the network 
is expensive and slow. So if this table is on one node and that one is on another node, and I'm doing the join on the node where the enrolled table lives, then I want to strip out as much as I can before I send anything over the network.

All right, so everything I've shown you so far is applying optimizations to relational algebra expressions. We can apply the same kind of optimizations to the underlying predicates and expressions themselves in our queries. I'm going to show you a bunch of examples of how different database systems can rewrite query plans to simplify them or avoid doing stupid work.

This comes from a great blog article written a few years ago where somebody listed out the different optimizations you can do that don't require a cost model, and we'll do a demo of some of his comparisons. It's really nice because he provides a GitHub repository that has the sample database he uses for his analysis, and it works on SQL Server, DB2, Oracle — a bunch of different database systems. You can take the same database and the same queries and see how the different optimizers can rewrite things and fix things.

Okay, so the first kind of rewrite is to remove stupid or unnecessary predicates. In this case here: SELECT * FROM A WHERE 1 = 0. What does that evaluate to? False. So what does that mean? It means for every single tuple I scan in table A, I check: is this false predicate equal to true? It never will be, so no tuple can ever match. If the database can recognize that I have an impossible predicate where no tuple could ever possibly match, it can skip the scan entirely and just produce an empty result right away.

You may say, all right, this is stupid — why would anybody write WHERE 1 = 0?
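Rewrites like these don't need a cost model or any statistics — they're pure pattern matching. As a rough sketch (in Python, with made-up helper names; a real optimizer walks a full expression tree, nothing this simple), constant propagation over a conjunction of equality predicates handles both the x = y AND y = 3 rewrite and the WHERE 1 = 0 contradiction:

```python
# A toy constant-propagation pass over a conjunction of equality
# predicates. A predicate is a (lhs, rhs) pair: strings are column
# names, ints are constants. This is an illustrative sketch only.

def propagate(preds):
    """Return {column: constant} for every column that resolves to a
    constant, or None if the conjunction can never be satisfied."""
    parent = {}                        # union-find over column names

    def find(x):
        parent.setdefault(x, x)
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    # Pass 1: merge columns linked by col = col predicates (x = y).
    for a, b in preds:
        if isinstance(a, str) and isinstance(b, str):
            parent[find(a)] = find(b)

    # Pass 2: bind constants and look for contradictions.
    const = {}
    for a, b in preds:
        if isinstance(a, int) and isinstance(b, int):
            if a != b:                 # e.g. WHERE 1 = 0
                return None
        elif isinstance(b, int):       # col = constant
            root = find(a)
            if const.get(root, b) != b:
                return None            # e.g. x = 3 AND x = 4
            const[root] = b

    return {c: const[find(c)] for c in parent if find(c) in const}

# x = y AND y = 3 collapses to x = 3 AND y = 3:
print(propagate([("x", "y"), ("y", 3)]))   # {'x': 3, 'y': 3}
```

The union-find trick is the same idea a real rewriter would use for equivalence classes of columns; everything else here is stripped down to equality only.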
Nobody would actually write this by hand, correct — in theory people shouldn't be that stupid. But a lot of times in applications the query is not constructed in a single line of code. You're not going to have one string variable in your application code that you immediately send off to the database server. A lot of times these SQL queries are constructed by dashboards and tools, where people are clicking different things and adding different options that compose the query. So a bunch of different libraries and a bunch of different functions are all constructing pieces of the SQL query, and there isn't one single place you could look in the code and see the 1 = 0. It may come from smaller functions, from another part of your organization, from code that you didn't write. So if the database can recognize that this is stupid and not do it, that's a huge win for us, because now we don't even have to look at the data.

Likewise, you can do the opposite: a predicate where everything is going to match, 1 = 1. In this case every tuple is going to match, but I don't want to actually apply the predicate to find that out. Again, if I have a billion tuples, I don't want to go through every single one and say: hey, for this tuple, does 1 equal 1? True — yes, output it. Next tuple: does 1 equal 1? Yes — output it. Instead,
I can say, well, I don't need to evaluate 1 = 1 at all, and now I just do a sequential scan — a straight dump of the table as the output — without applying any predicate. That's cutting down on the computational cost; essentially the system just rewrites it as SELECT * FROM A.

So these are pretty simple, but we can do this for more complex queries too. We can do the same kind of thing for joins. Here on my table A — the schema is up there, and the primary key is the id column — I'm doing a self-join: A AS a1 JOIN A AS a2 ON a1.id = a2.id. What is this query actually saying? It's saying: for every single tuple in A, check whether it exists in A. Of course that's always going to be true, because we know id is the primary key — does this tuple exist in this other copy of the table? The answer is always yes. So the system can identify that this join is entirely wasteful and just rewrite it as SELECT * FROM A. Again, relying on relational algebra equivalences, we know that this is equivalent to this.

Yes? His question is why this is conditional on id being the primary key. Because it's unique, and you know it's NOT NULL — therefore you can rewrite it. If it could be NULL, then NULL doesn't equal NULL, so a tuple with a null id wouldn't match itself in the join. Remember, NULL is always null — we can open a terminal and try it: 1 doesn't equal NULL; actually the result of 1 = NULL is NULL. It's unknown. So then your question is: is NULL equal to NULL? Unknown — it's NULL. What about NULL IS NULL?
I think that one actually will evaluate to true But in this case here if it's null then it won't work All right, so what are more complicated things we do so we can also now ignore projections that are unnecessary So in this one here, this is sort of a rewrite of the of the last query the self-join So this is select star from a and then in my where clause I have an exists Where I just say this basically says return true if there's anything that matches any any result inside the the interquery here so this is saying if For every two point a Produce the output if there exists a tuple where the adi the id equals the a 2 to id and against the primary key So it's always going to match. So inside of this we're materializing or have a projection on the vowel column But this is completely unnecessary, right? This is us having to do not only the wasteful join but also copy this projection Which is got copy this attribute out as part of the projection and that's unnecessary as well Right, so in this case here we can just rewrite this Entirely this part here. It is entirely like a select our from a the last one is to do emerging predicates So this one here we have select star from a where bow between 1 and 100 and bow between or bow is between 50 and 150 So in this case here if you just think think of the number line This is completely redundant because if it's between 1 and 100 Then it can also can be between 50 and 150 if it's between 50 and 100 so for this one here I can just rewrite this to be a single between calls where one between 1 and 150, okay? 
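You can convince yourself the merged range really is equivalent using Python's built-in sqlite3 module. The table and column names here just mirror the example above — they aren't the blog's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO a (val) VALUES (?)",
                 [(v,) for v in range(0, 201)])   # val = 0 .. 200

# The disjunction of two overlapping ranges...
q1 = ("SELECT val FROM a WHERE val BETWEEN 1 AND 100 "
      "OR val BETWEEN 50 AND 150 ORDER BY val")
# ...is equivalent to one merged range.
q2 = "SELECT val FROM a WHERE val BETWEEN 1 AND 150 ORDER BY val"

assert conn.execute(q1).fetchall() == conn.execute(q2).fetchall()
print(len(conn.execute(q2).fetchall()))   # 150 matching values: 1..150
```

Note that this only works because the predicate is an OR of overlapping ranges; with an AND of disjoint ranges you'd get the impossible-predicate case instead.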
So this is how — you can obviously see why we'd want to do this. Let's see who actually does it, because everybody loves demos. As I said, the blog article I mentioned has source code you can download, and it includes the schema and insert statements for a bunch of different database systems. Unfortunately, I broke my SQL Server installation yesterday, so I can't get that to work, but we'll do a demo on Postgres, MySQL, Oracle, MariaDB, and SQLite.

So the schema looks like this. The sample data we're going to use is for something like an online video store — it's sort of like the IMDb stuff you looked at: there are actors, and people renting movies. We're going to focus on the actor table. Again, the nice thing about this is that they provide the same database for all these different database servers. There are 200 tuples in it — we can go into MySQL real quickly, SELECT * FROM actor, and there are 200 in every single one. Okay.

So the first thing we want to try out is that impossible query, WHERE 1 = 0. What I'm doing here is in Postgres. Postgres has the EXPLAIN keyword, which will spit out the query plan, and if you put the ANALYZE modifier after EXPLAIN, it will actually run the query but still show you the query plan and what it actually did. In this one here it says up at the top "One-Time Filter: false". So it recognized that 1 = 0 is always going to evaluate to false; it applied the filter once, saw that for the entire table no tuple is ever going to match, and it didn't actually read the data. It said: oh, it's false, here's nothing — it gives you back nothing right away.
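If you want to follow along without installing a full database server, the same impossible-predicate experiment works against Python's built-in sqlite3 (schema abbreviated from the sample database; the plan output's exact text varies by SQLite version, so it's just printed, not checked):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, "
             "first_name TEXT, last_name TEXT)")
conn.executemany("INSERT INTO actor (first_name, last_name) VALUES (?, ?)",
                 [("ACTOR", str(i)) for i in range(200)])

# WHERE 1 = 0 can never match, so no tuples come back.
rows = conn.execute("SELECT * FROM actor WHERE 1 = 0").fetchall()
print(len(rows))                                      # 0

# Whether the scan itself gets elided shows up in the plan output.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM actor WHERE 1 = 0"):
    print(row)
```

The empty result is guaranteed by SQL semantics; how much work the engine does to produce it is exactly what the EXPLAIN output in the lecture demos is probing.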
So now we can do this in MySQL. MySQL's EXPLAIN is not as good — if you just run it plainly you get something like this, but if you append \G, for whatever reason, you actually get a tree. It is what it is. But here you see, in the Extra column, it says "Impossible WHERE". So it recognized that nothing is ever going to match — don't even bother applying that WHERE clause.

It also has this little warning indicator, which is annoying — this is MySQL 5.7; I don't know whether it's been fixed in MySQL 8 — where you have to run SHOW WARNINGS and then it spits out what actually happened. You see here that in the WHERE clause they rewrote my 1 = 0 as just 0, so everything evaluates to false.

So now we'll try it in Oracle. With Oracle, the way it works is I say EXPLAIN PLAN FOR — the syntax is always slightly different — and then I have to go query another table that holds the plan it generated. The Oracle EXPLAIN output is arguably even better than Postgres's — the Postgres one is very good, but this one is really nice too because it shows you things like computational cost and the number of rows and the amount of data being accessed.

So what is it saying here? It knows I'm doing a SELECT statement, it has a filter, and it shows how it's going to access the table. The thing to point out, though, is down here they have information about the predicate: they say filter NULL IS NOT NULL. Oracle doesn't support booleans, so it has no way to say true or false. It rewrote my 1 = 0 to be NULL IS NOT NULL, which is always false, because NULL is null.
That's the way they're representing false. So they rewrote the query plan, recognized that this thing is always false, and therefore don't actually have to run the scan.

Let's go back to Postgres real quickly, to what he was asking before. I can treat the database like a calculator. I can say SELECT NULL — that's nothing. Is NULL equal to NULL? Nothing — it's NULL. Is NULL equal to false? No — nothing. Is false equal to false? True. So any time you say NULL equals something, the answer is NULL — the empty space in the output is NULL. The way to get around that is NULL IS NULL, which is true. So what Oracle is doing with NULL IS NOT NULL is: not true — false.

Okay. All right, so the last one: what does SQLite do? Again the syntax is different: here you say EXPLAIN QUERY PLAN, without the FOR, and it produces this — which is not actually that helpful; it's just telling us we're doing some kind of scan on the actor table. But if we remove the QUERY PLAN part, then you get something that looks a lot different from what everybody else showed — something that looks like assembly, or some kind of machine code. Without going into the details of how SQLite works: they convert the physical plan the optimizer spits out into a bunch of these opcodes — think of it like JVM bytecode — and then they have an interpreter that executes those opcodes. This is literally the program to execute this particular query. And from what I can tell, they're saying something like: where R1 is not equal to R2.
I don't know exactly what that means — then you jump to instruction 10. In this case here it looks like they're actually still doing the scan. Best I can tell, I'm not sure whether they're actually stripping the predicate out, or recognizing that they don't have to do the scan and truncating it there. So: everybody can handle this one. This is good.

So let's try the useless predicate now, 1 = 1, and go back to Postgres. Well, at first it looks like it's telling us it's actually applying it — but no, this is just saying that it recognized it can remove the filter and just execute the query right away. Try the same thing in MySQL — no ANALYZE, sorry — and this one figured out that it can throw away the filter entirely and just scan everything, so that's good. Let's try Oracle: EXPLAIN, and then go look at the plan — the filter is completely gone; it just announces a straight sequential scan. And then for SQLite, the QUERY PLAN output is not helpful, so we'll remove that part, and now you see that inside the for loop where they're scanning over the table, they've removed the predicate.

All right. So let's look at an example where the heuristic could look at the catalog and try to figure out what the right thing to do is. The schema for the actor table is pretty straightforward: we just have an actor_id, first name, last name, and last update.
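Since NULL comparisons matter for the next example, here is the "database as a calculator" experiment again, this time scripted against sqlite3 — SQL's three-valued logic surfaces NULL as None in Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Comparing NULL with = never yields true: the result is NULL (unknown).
print(conn.execute("SELECT NULL = NULL").fetchone()[0])       # None
print(conn.execute("SELECT 1 = NULL").fetchone()[0])          # None

# IS NULL / IS NOT NULL are how you get a definite true/false answer.
print(conn.execute("SELECT NULL IS NULL").fetchone()[0])      # 1 (true)
print(conn.execute("SELECT NULL IS NOT NULL").fetchone()[0])  # 0 (false)
```

This is exactly why Oracle can use NULL IS NOT NULL as its spelling of "false", and why a WHERE pk IS NULL predicate on a NOT NULL primary key can never match.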
Okay, so what we're going to try here is a query where we say WHERE actor_id IS NULL, and we're going to run that. What should happen is that the system looks in the catalog and says: well, actor_id is the primary key, and it can't be NULL; therefore I know that no tuple will match my predicate, and I can remove that WHERE clause — or just immediately return an empty set. So this is an example where the rewrite phase's rules can look in the catalog, understand something about the tables and attributes being accessed in the predicates, and start throwing out crap that's useless.

Just to prove that our table cannot have NULLs in the actor_id column: if I try to insert a tuple with NULL for the first column, Postgres kicks it out and raises an error. If I try the same thing in Oracle, Oracle flips out and says actor_id can't be NULL — throws an error. Now let's try it in MySQL. MySQL doesn't take DEFAULT here, so I think we have to use CURRENT_TIMESTAMP — actually, no, DEFAULT should work. Huh — it let me insert it. Let's see whether that's actually in there: WHERE the last name equals 'PABLO'.
Let me run it — right, it's in there. That's because of the way MySQL works: actor_id is an auto-increment key, so in this particular version of MySQL, instead of throwing an error when I try to insert NULL, it says: oh, you're an auto-increment key — go ahead, let me just run that and replace the NULL with the next value. Every other system will throw an error if you try to insert NULL; MySQL will let you do it.

All right, but let's try to run our query. We know that no tuple should actually have a NULL actor_id. So we run the query: WHERE actor_id IS NULL. Postgres — it looks like it's actually going to run it, and it actually did run it: it applied the filter and removed all 200 tuples. So Postgres was not able to recognize that no tuple's actor_id could ever be NULL, and it still ran the scan. Let's try it in MySQL — MySQL figured it out: "Impossible WHERE", because it knows the column can never be NULL. Try it in Oracle: EXPLAIN, and there's our lovely NULL IS NOT NULL — there's our false. It recognized that this is never actually going to evaluate to true and threw it away. So Postgres actually got it wrong; MySQL and Oracle got it right.

All right, two more examples. Let's do the range query now. We want to get all the films where film_id is BETWEEN 1 AND 2 AND film_id BETWEEN 199 AND 200. This is again another impossible WHERE — the two ranges don't intersect — so nothing should come back. Run EXPLAIN. Postgres couldn't figure it out: it's doing a lookup that should never happen, that will never produce any results. In MySQL — this one I'm not sure about; it says "no matching row in const table". I think this means they were able to figure it out. Yeah, this one I'm not sure about, but it has a warning, though.
Let's see what it rewrote it to. His question is: isn't this satisfiability — a SAT problem, which is NP-complete? Yes, in general — but for basic things the system can identify that a predicate is never going to produce anything and throw it away. For this one you don't need a SAT solver: you could write a heuristic that says, if all my lookups are on a single attribute, and I have a range between 1 and 2 inclusive and a range between 199 and 200, those sets don't intersect, and it's a conjunction where both have to hold, so I know it will never match and I can throw it away. That's not SAT — that's pretty straightforward. Full predicate satisfiability in general, sure, you'd have to deal with that. Yes.

All right, so any questions about what I'm showing here? I'm showing that you can reason about what the query looks like — without actually looking at the data, without actually having to run anything — and determine ways to simplify the predicates. This leads into the discussion of how we actually compute cost estimates.

For these simple examples, the joins are pretty straightforward: we can easily reason about which table should be the inner versus the outer. Things go bad, though, when you start having a lot more joins, because the complexity of making this decision explodes. The number of different ways we can order the joins for a query over n tables grows roughly as 4 to the n. There's clearly no way we can enumerate every single possible join ordering for our query and try to figure out the best one — it's such a large number that it has a name. So we're going to see how the database system can strip down the number of combinations and plans it has to enumerate for these join orderings, and that's going to
reduce the search space of the problem. But before we get there, we need to understand how we're actually going to estimate how much work we'll do for these joins or these different scans.

I've said this a little bit before, but the cost model is what allows us to assess and estimate how much work we think we're going to have to do in our database system. Again, this is an internal, synthetic number that only lets us compare the relative performance of different query plans within the same database system — we can't compare across systems. It's going to be a combination of a bunch of different things. We've already seen, when we talked about join algorithms, that you want to base it on the amount of disk I/O you're going to do. It also needs to account for how much DRAM — how much buffer pool space — you're going to take up while computing the query. And if you're in a distributed database, then the number of messages is also important, because network I/O is slow and inefficient.

So the main takeaway is that the cost model allows us to say whether one plan is better than another without actually having to run the query plans. Running them is the only way to get the true cost, but you don't want to do that, because again, with roughly 4 to the n different join orderings, I can't run 4 to the n different query plans for one query — that would take forever. A cost model allows us to approximate this: we're sacrificing accuracy of our estimates in exchange for efficiency.

Now, there are some database systems that don't have a cost model. They don't have a query optimizer.
They actually just fire off all the candidate queries and see which one comes back first — that's the one they pick. The only system I know of that does this is MongoDB. It seems kind of stupid — why would you do this? Why not just build a cost model? Well, they weren't supporting joins back in the day, so they didn't need a full cost model or query optimizer. If all I have to figure out is which index to pick, I can just fire off the candidate plans, see which one comes back first — that's the fastest — and then remember that choice every time I see the same query again. Eventually they'll run the trial again. As simplistic as it is, it actually works quite well for them.

His question is: what about writes? A write is doing an index lookup to find something, and then you make the modification — the index lookup is the more expensive part; finding the data costs more than the actual write. But yes, if I'm updating the table and my index's distribution is changing, then what they do is have a trigger that says: I've run the same query a thousand times, let me refresh and rerun the trial; or, my table — in their world, a collection — has changed by ten percent, let me rerun the trial. It's just heuristics.

Yes? His question is, are they running sequentially — are they running
one after another? So, MongoDB is a distributed database, and as far as I know the way they do it is: I have a query that needs to run on a bunch of different nodes, which hold different partitions of the data, and they all have the same indexes, just over different parts of the collection. So this node will use this index, that node will use that index, and so forth, and you see which one is actually the fastest. Then when you see the same query again that you need to distribute across all these nodes, you just pick the index that was the fastest, for everyone. Again, super simple, but it works pretty well.

All right, so the way we're going to be able to approximate the cost of executing a query is by maintaining internal statistics about what our tables look like. This has come up multiple times throughout the semester, where I say: oh, the database system can know something about the tables. This is what I mean — this is how they're actually able to do it. Inside the system catalog, we maintain metadata about what our indexes and tables look like and what values are in the tuples. How this information actually gets maintained varies across systems. Again, a simple heuristic is: if my table changes by 10%, re-collect my statistics. Other systems can say: while I run a query, look at the data as I'm going along, and then propagate what I see back to my internal statistics catalog.

You can also run this manually. The actual syntax varies across the different systems, but in general ANALYZE is the commonality across all of them: you call this ANALYZE command, and that kicks off a sequential scan that looks at the data and updates this internal information about what the data distribution
looks like. And then you could trigger this based on updates to the table, or with a cron job that runs every day, and so forth. When we talk about multi-version concurrency control, we'll talk about vacuuming — think of that like garbage collection in the JVM — and you can run ANALYZE at the same time, because when you're doing those sequential scans you're looking at everything anyway, and you can update these statistics as you go.

All right, in the interest of time I'm going to cut it off here, because next we'd actually get to the formulas for how you use the statistics to estimate these values. We'll pick this up next week, and then we'll get into how you actually do the enumeration and the algorithms as well. Okay.

All right, so again, on Wednesday in class we're having the exam. It's real — this is not a joke; one year somebody said, oh, I thought you were kidding. No, this is real. Bring your CMU ID, bring your sheet of paper with notes, and bring a calculator. I'm having office hours immediately afterwards, but if you need to talk to me at a different time, send me an email — I'm happy to meet with anyone. Okay. All right, guys — good luck, and see you on Wednesday.