Okay, let's just get started. All right, thank you, DJ.

Okay, so real quick, some announcements for the course. In terms of projects, project three will be released today-ish, tomorrow-ish, and it'll be due in November on a Sunday. I originally said the 15th, but I updated it yesterday; it's now due on Sunday the 17th. Homework three will be released next week, so we're not worried about that yet, but it'll be due before the project three deadline.

The other major announcement is that, as I said at the beginning of the semester, my wife is pregnant; I think it's mine. They are taking it out of her on Wednesday night. I had it scheduled after the SIGMOD deadline on Tuesday, which is the big database conference, so it's happening on Wednesday. So next week I won't be here. DJ Drop Tables will be here, and my PhD students will teach those two lectures, and then depending on what happens I should be back maybe the following week or so, and we may adjust. They told me originally it was coming out on the 30th, so I canceled class on the 30th, but now it's coming out whatever this Wednesday or Thursday is, so I might adjust whichever day off that is. The goal is to get you all the material you need to do homework number three in the next two weeks. Okay, any questions about this? I'll post on Piazza with more information.

The other change is that in the original schedule, the last class before the midterm was all about query optimization. We obviously ran out of time, so I've decided to split that lecture into two parts. This is part two, and I've dropped the lecture on embedded database logic, which is not really something we need to know in order to build a database system. It's just an overview for you to understand that there are other things you can do other than just throwing SQL at the database. If you're curious about that, I'll post a link on Piazza and you can watch last year's lecture on that topic. Query optimization is more important, so I think it's better to spend more time on it. Okay.

All right, so last class, when we started talking about query optimization, we focused on the first part: how to apply rules and heuristics to make changes to the query plan without having to examine the data or understand what the database actually looks like on the inside. We may need to look at the catalog to understand what our attributes are, whether they're unique or not, and whether we have foreign keys, but we don't need to know anything about whether the distribution of values in the tables looks a certain way, or that we have this many tuples. Right.
So these are all rules that we can apply without actually looking at or understanding what our tuples look like. Today we're going to focus on the second part, which is the more complicated part, where we're now going to use a cost model that allows us to assess the quality, or the amount of work we're going to have to do, for a query plan before we even actually run it. The idea here is that we want to be able to enumerate as many query plans as possible and then pick whichever one we think is the best. The more accurate our cost model is, the more accurate our selection will be for what the best query plan is. But as we go along, you'll see this is super hard and everybody gets this wrong. What I'll teach you today is the way the textbook tells you how to do it, which is really, really wrong, but we'll talk a little bit about how we can possibly fix these things. This is something we cover in the advanced class, or if you go further: people have done dissertations on this kind of thing, and it's still an unsolved problem.

All right, so today's agenda. We're going to first talk about how to do plan cost estimation with our cost model. Then we'll talk about enumeration: how we're actually going to intelligently iterate over different possible query plans in order to find the one we think is the best. Again, this is NP-hard; we can't do an exhaustive search, so we need to be smart about what we're actually looking at. Then we'll finish up talking about nested sub-queries, because those are slightly different from everything else, and there are rules we can use to rewrite them and make them more efficient.

We already talked about this last class: what is cost estimation, what is our cost model actually doing? Essentially, it's a way to approximate how much work, or how long, it's going to take to execute a query, and in general we always want to pick the plan with the lowest cost. This cost could be a combination of a bunch of different underlying hardware metrics, because those correspond to the work we're actually going to do. It could be how much CPU we're going to use; this is typically very hard to measure, and we don't bother for a disk-based system because the disk is the major bottleneck, but an in-memory system cares about it. We've already talked about counting disk I/O for our join algorithms and sort algorithms; that's the major thing we're going to focus on. We also care about how much memory we're going to use: there could be one algorithm that uses a lot of memory and gets faster performance, but we may not have that much memory available, so we'd choose a slower algorithm that uses less memory because that's less pressure on the system. And for distributed databases, the number of network messages is the high pole in the tent, because sending things over the wire between machines is always expensive. In general, at a high level, we're going to use the number of tuples we access as a proxy for all of these.
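As a rough illustration of what a disk-oriented cost model boils down to in practice, here is a minimal sketch of pricing a sequential scan as a weighted sum of page reads and per-tuple CPU work. The weights roughly mirror Postgres's default seq_page_cost and cpu_tuple_cost settings; the table sizes are made up.

```python
# Hedged sketch: a disk-oriented cost model reduces a plan to a weighted sum of
# low-level metrics. The weights below roughly mirror Postgres's default cost
# settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01); the table sizes are invented.

SEQ_PAGE_COST = 1.0    # cost unit for reading one page sequentially
CPU_TUPLE_COST = 0.01  # cost unit for processing one tuple

def seq_scan_cost(num_pages: int, num_tuples: int) -> float:
    # I/O dominates on a disk-based system; CPU work is a small per-tuple charge.
    return num_pages * SEQ_PAGE_COST + num_tuples * CPU_TUPLE_COST

print(seq_scan_cost(num_pages=10_000, num_tuples=1_000_000))  # 20000.0
```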
Essentially, we're determining how much data we pass from one operator to the next, and we can use that to derive which plan we think is the best. As I said, the way to get the most accurate estimation of what a query plan is going to do is to actually just execute it, but if we're looking at thousands and thousands of possible query plans, we can't possibly execute every single one. So we need a way to approximate this, and that's what our cost model is going to do.

The underlying component in our database system that we use to make these estimations is the internal statistics catalog. Every database system that has a query optimizer using a cost-based search is going to have a statistics module, which allows it to collect information about what the tables look like and what's actually inside the tuples. How you collect this information varies by implementation. All the major systems have a way to force the database system to collect new statistics: ANALYZE, ANALYZE TABLE, UPDATE STATISTICS, RUNSTATS. This tells the system, hey, do a sequential scan on my table and update my statistics information. Some systems can also run this as a cron job, periodically doing a pass every so often. Other systems piggyback off queries as they run and say, all right, as I'm doing a sequential scan I'll also update my statistics as I go along. Other systems have triggers that say if 10% or 20% of my table has changed, go fire off the run-stats command and update things. There's no one way to do this that's better than another. A common setup would be: if you're running an OLTP system, you disable this during the day, when you're doing most of your transactions, and then at night you take passes through and update your stats. During the day the stats will be slightly off, but that's still okay, because this is expensive to do; it's a sequential scan over the entire table.

So let's dabble in a little math, but it's simple math, not hard math, so everyone should be able to follow. The basic information we're going to maintain about every table is the number of tuples it has and the number of distinct values for every single attribute within the table. We maintain the tuple count as a separate counter, because we can't just assume that if I have X pages and can fit Y tuples per page, then it's X times Y. Not every slot in every page will be full, and when we talk about multi-version concurrency control, we'll see that we can have multiple physical versions of every single logical tuple. So we can't just count the number of blocks we have; we actually want to maintain this as a separate count. And we'll talk later about how we're going to compute and maintain the number of distinct values for every single attribute.

With this basic information we can now derive a new statistic called the selection cardinality, defined by this function SC. This is just computing the average number of records
that we're going to have for a given attribute with the same value. So for every single distinct value, I'm saying, here's the number of times that it occurs. We just take the number of tuples we have and divide by the number of distinct values for that attribute, and that tells us, for every distinct value, how many times it occurs on average.

What's wrong with this? Right: one value could occur a thousand times and another could occur once, and this formula smooths all of that away. Exactly. So this is one of the big assumptions we're going to make throughout the entire lecture: we're going to assume that we have uniform data. This formula is basically saying that every unique value for this attribute occurs the same number of times as every other value. But we know that's not how the real world works. Take CMU, for example. CMU roughly has 10,000 students (it's more than that, but it makes the math simple), and it roughly has, I think actually does have, 10 colleges. If you assume you have uniform data, then you'd take the 10,000 students, divide by 10, and say that's the number of students at each college: exactly the same for every college. But we know that's not the case. The School of Computer Science, where I'm in, has way more students than the School of Fine Arts. Real-world data is skewed, but to make our math easier for what we're talking about today, we're going to assume everything is uniform. Again, this is an example where the real world doesn't work this way, real systems have to account for it, and we'll briefly talk about how to do that.

All right, so with this selection cardinality, what can we do? Well, the goal is to figure out how many tuples we're actually going to select during our scans using our predicates, because that's what we need to figure out how many tuples each operator is going to spit out and feed to the next operator. Then we can use that to figure out how much work each operator is actually going to do: how much disk it's going to use, how much memory it's going to use. We use the selection cardinality to figure out, for the given input provided by our children operators, how much data is coming out of us.

If we have an equality predicate on a unique key, this is the easiest case and our math works out great. Say we have a simple people table with an id column that's the primary key. If I have a lookup that says id = 123, that's easy: I know the cardinality is going to be one. One tuple is going to match, no matter how many tuples I actually have in my table, because it's a primary key; it's unique. Where things get hard is when you have more complex predicates, like range predicates or conjunctions, because now I need to be able to combine the selection cardinalities for these different predicates in non-trivial ways.

So, based on the selection cardinality, we're now going to produce this idea of the selectivity of a single predicate. Selectivity is basically a function that says, for a given predicate on a table, what fraction of the tuples is actually going to qualify?
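To pin down the notation before looking at the formulas, with N_R as the number of tuples in table R and V(A, R) as the number of distinct values of attribute A, the two quantities being described are roughly:

    SC(A, R) = N_R / V(A, R)                     (average tuples per distinct value of A)
    sel(A = constant) = SC(A, R) / N_R = 1 / V(A, R)

With the CMU example, N_R = 10,000 students and V(college, R) = 10, so SC = 1,000 students per college under the uniform assumption.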
The formula we use to compute this depends on what kind of predicate we're looking at. The last one I showed you was an equality predicate on a unique attribute, but now we need to account for cases where the attribute may not be unique, or where we're looking at inequalities or ranges.

So let's look at some simple examples. Assume now that in our people table, for the age column, we only have five unique values, zero through four. Think of this as, instead of storing the exact age of somebody, we're putting people into groups; internet advertisers do this all the time, like people under 18, 18 to 35, 35 to 50, and so forth. So we have five distinct values, and for our table here we have just five people.

If we want to compute the selectivity of an equality predicate, where something equals a constant, we take the selection cardinality of the predicate and divide by the number of tuples we have, and that tells us what fraction of the tuples in our table are going to match. So in this case, for the selectivity of age = 2, assume we have a simple histogram of all the tuples. Since we're assuming that our distribution of values is uniform, every distinct age has exactly the same number of occurrences. To compute the selectivity, we take the selection cardinality of age = 2 (we're only looking at one value), look it up in our histogram to find the number of occurrences, and divide. So it's just one out of our five. Again, we're assuming a uniform distribution, and we're assuming we know exactly what this count is, so the math works out great; this is exactly what we want.

But now let's do something more complex, like a range predicate. Say we want age >= 2. The formula here assumes we're looking at a continuous range of integer values. You take the max value, subtract the value you're looking for, and divide by the range, max minus min, and that tells you roughly what the selectivity is. In this case we're looking for everybody that's two or greater. The range, max minus min, is four. The max minus the value we're looking for is four minus two, which is two, and two over four gives us one half. So this is wrong, right? The real answer is actually three fifths, but the way the formula works out we get one half. This is a good example of how these formulas don't always work correctly; they produce errors, and in this case we're underestimating the selectivity: it should be three fifths, not one half.

This becomes a problem when you start doing estimations for complex queries that have a bunch of different predicates and a bunch of different operators, because now we have errors built on errors built on errors. Say this is the scan at the bottom of the tree, and now we have an underestimation of the number of tuples
that we're going to produce as our output. When we do the calculations above it in the tree, we're now taking wrong estimations as the input to those operators, doing more wrong math on them, and producing more errors. In fact, the research shows (there's one particular paper that evaluated this) that basically every single database system they looked at underestimates the selectivity of these operators. You may say, all right, who cares, why is that a big deal? But now when you start sizing up your data structures, your hash tables for joins, your buffers for sorting, you're going to underestimate what those sizes should be, and you may have to correct that once you realize you have more data than you expected. So these errors have real runtime implications for systems, and on top of that we're making wrong estimations about which plan might be better than another.

The last one I want to look at is negation. This one's pretty straightforward: it's just one minus the selectivity of the predicate we're negating. In this case, the selection cardinality of age = 2 is one, so the negation is just everything outside of that, and you get four fifths, which is the correct answer for this one.

The major observation we can make is that this selectivity estimate for a predicate is basically the same thing as a probability. It's just saying: what is the probability that a tuple is going to match my given predicate? If we make this assumption, we can use all the tricks we learned in Statistics 101 to start combining these predicates in more complex ways.

So let's say we want a conjunction: age = 2 AND name LIKE some pattern with a wildcard. We compute the selectivity of the first predicate, age = 2, compute the selectivity of the second predicate, and then combine them by multiplying the two probabilities together to get their intersection. This is where we say a tuple has to match both predicates, the inner part of the Venn diagram. And the same thing for disjunction; the formula is slightly different, but you're still assuming that the predicates are independent, and therefore you can use the standard inclusion-exclusion trick to figure out what the union is.
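To make the formulas from the last few slides concrete, here is a minimal sketch that puts them together using the five-tuple, five-distinct-ages example. It bakes in exactly the uniformity and independence assumptions just discussed, so it reproduces the same errors, for example reporting 1/2 for age >= 2 when the true answer is 3/5. The 0.1 selectivity used for the LIKE predicate is made up.

```python
# Hedged sketch of textbook selectivity estimation. All numbers come from the
# lecture's example table: 5 tuples, attribute "age" with 5 distinct values 0..4.
num_tuples = 5
distinct = {"age": 5}
min_val  = {"age": 0}
max_val  = {"age": 4}

def sel_eq(attr):
    # equality predicate: SC(attr) / N = 1 / V(attr)  (uniformity assumption)
    return 1.0 / distinct[attr]

def sel_range_ge(attr, value):
    # range predicate attr >= value: (max - value) / (max - min)
    return (max_val[attr] - value) / (max_val[attr] - min_val[attr])

def sel_not(s):
    # negation: 1 - sel(P)
    return 1.0 - s

def sel_and(s1, s2):
    # conjunction under the independence assumption: sel(P1) * sel(P2)
    return s1 * s2

def sel_or(s1, s2):
    # disjunction via inclusion-exclusion, again assuming independence
    return s1 + s2 - s1 * s2

print(sel_eq("age"))                # 0.2  -> one out of five tuples
print(sel_range_ge("age", 2))       # 0.5  -> underestimate; true answer is 3/5
print(sel_not(sel_eq("age")))       # 0.8  -> the 4/5 from the negation example
print(sel_and(sel_eq("age"), 0.1))  # 0.02 -> age = 2 AND name LIKE ... (made-up 0.1)
```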
So I've now talked about two assumptions that are problematic, but again, this is the way every textbook covers it. The first is that we assume our data is uniform, and I showed a simple example where that's not the case. The second is that we assume our predicates are independent, which is also not always true. And there's actually a third assumption we're going to make that's problematic, called the join inclusion principle.

So there's a bunch of assumptions we're making about computing the cardinality of our predicates that make the math easier but end up producing incorrect approximations. The uniform data assumption says that every value occurs with the same probability. The way to get around that for heavy hitters (a heavy hitter being, in really skewed data, one of the ten or so values that occur the majority of the time) is to maintain a separate hash table or histogram to keep track of just those values, assume everything else is uniform, and derive the cardinality estimates from that. That's the standard trick for the uniform data issue. Then there's the independent-predicates assumption, which is what lets me take two predicates in a conjunction and just multiply their selectivities together to get the combined cardinality. And the inclusion principle says that if I'm joining two tables, then for every single tuple in my inner table there will be a matching tuple in the outer table. That's not always the case, but the way to think about it is: why would I join two tables if there were no corresponding values that actually match? So we make that assumption, even though in the real world you could have dangling tuples, references that don't exist anymore in the outer table. The first two assumptions are probably the most problematic; the third one shows up in more advanced material we don't need to worry about.

I always like to show one example to emphasize exactly why this is problematic. It comes from a blog article written by a former IBM researcher, Guy Lohman, who worked on one of the early IBM optimizers from the late 1980s and early 1990s that's still used today in DB2; it was pretty influential. In his blog article he shows why the assumptions we're making are problematic. Say we have a database with a single table of cars, with two attributes: the make and the model. The make would be like Honda, Toyota, Tesla; the model would be like Camry, Accord, Escort. Now say we have a query that says WHERE make = 'Honda' AND model = 'Accord'. If you make the two assumptions we've made so far, independence and uniformity, then when we combine these two predicates we'd say one over ten, because we have ten makes and Honda is one of them, times one over a hundred, because we have a hundred models and Accord is one of them. Multiply them together and our selectivity estimate is 0.001. But we as humans know these two predicates are correlated: no manufacturer other than Honda makes an Accord. So if you know the model is Accord, you know the make has to be Honda, and the correct selectivity for this particular query is actually one over a hundred. We're an order of magnitude off from what the formula tells us versus what it should actually be.

Yes? So the question is: if you had a foreign key, would that make your life easier?
No; you would have to know whether the foreign key child is unique, because it could be one-to-n or one-to-one, so you'd have to know something about that. But even then, I don't think the foreign key helps you for this particular example. I'll think about that, but for this one it's a single table, right? It's just one table: give me all the cars where the make is Honda and the model is Accord. It has nothing to do with a join. I'd need to think about whether the foreign key helps when we're doing joins; we can take that offline.

So again, we're an order of magnitude off. Now, when we start making estimations about how much work we're actually going to do for our query plan, and sizing up our intermediate data structures and buffers, we're going to be way off. The independence assumption is going to cause us to underestimate how much work we're actually going to do. The way to get around this particular issue, and this is something I think only the high-end commercial systems actually do, is correlated column statistics. I can tell the database system: all right, model is correlated with make. I can't have an Accord unless the make is Honda. Another example: if I know the zip code in an address field is 15217, then I know the state has to be Pennsylvania. If you declare these columns as correlated, the database system can special-case its estimations to avoid these pitfalls; it knows the columns are correlated and can use the right formula to derive the selectivity. But as I said, only the high-end systems do this.

Yes? Your question, or statement, is that it doesn't need to know how they're correlated, just that they are correlated. I'd have to go look at what the syntax supports; I think you just declare that they're correlated and the system figures out the rest. I think only Oracle, SQL Server, DB2, Teradata, maybe Snowflake can do this, but MySQL and Postgres can't as far as I know, and SQLite certainly can't.

All right, so let's talk about how we're actually going to get this information, the stuff that tells us the number of makes, the number of models, the number of occurrences of every value. I've already sort of mentioned this, but the database system is going to maintain histograms internally to keep track of these statistics. The simplest histogram would be: for every single distinct value in my column, I count the number of occurrences. In this case, this is our uniform data: I have 15 unique values and each one occurs five times. Now if I want to ask how many tuples match something-equals-five, I can look at this and know the answer exactly. But again, real data doesn't look like this; real data is skewed. And even with skewed data, a histogram like this is fine, because I can still look up exactly how many tuples have the value five.

What's the problem with this, though? For every single value in my column, I'm storing an entry in my histogram. That's going to be a lot. Say each count is 32 bits. In my simple example here I have 15 unique values, so 15 times 32 bits is roughly 60 bytes.
That's nothing. But now, if I have a billion unique values, with a 32-bit counter for every single one, that's a billion times 32 bits, which is 4 gigabytes. And that's just the histogram for one column; now do this for every single column. So nobody actually stores exact counts like this, except for the heavy hitter stuff I talked about before. For the heavy hitters you keep the exact values, but you only store maybe the top 10 or 20 values for a column, not every single possible one.

The way to get around this is to start combining values into buckets, so that we only store a single count per bucket rather than an individual count for every single element in the bucket. This is called an equi-width histogram. Basically, you take every three values here (whatever the bucket width is), sum up the occurrence counts for every value in that bucket, and the new histogram just stores that aggregate value. I'm using buckets of size three, but you can size them any way you want. Now, to estimate how many times, say, the value two occurs, I look to see which bucket my value falls into — two is between one and three — and ask what the count for that bucket is. In this case it's roughly nine. The bucket holds three values, so I take nine divided by three and estimate that two occurs three times. So we're saving space, and we're saving the computational overhead of maintaining the histogram, but we're introducing more errors into our approximations, because there's no way around that other than storing exact values. And this isn't so great either: going back to the earlier histogram, within one bucket the value eight had a high count while seven and nine were much lower, but once I combine them I can't tell which one actually had the high count. The heavy hitter trick could handle that, but we can ignore it for now.

A better way to do this is to use quantiles. With this approach we vary the width of the buckets. In the last one the buckets always had the same width; now we vary the width so that the sum of the counts in each bucket is roughly the same. In this example, the first bucket holds values one through five, the next holds six, seven, and eight, and so on, and the totals per bucket come out roughly the same, something like twelve, nine, and twelve. So now I have variable-width buckets, but potentially more accurate estimates of the occurrences of values within those buckets. Here I'm showing quantiles; you can also do deciles or other grouping sizes.

Any questions about this? Again, this is what gets populated when we run ANALYZE or RUNSTATS in our database system. It generates this information and stores it in the catalog, and it's durable on disk: when we restart the system and come back, we don't have to run ANALYZE again; all of our statistics are still there.
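Here is a small sketch of the two bucketing schemes just described: an equi-width histogram with fixed-size buckets, like the buckets of three above, and an equi-depth (quantile) histogram whose bucket boundaries are chosen so each bucket holds roughly the same total count. The per-value counts below are invented; within a bucket, both schemes estimate a value's count as the bucket total divided by the bucket width.

```python
# Hedged sketch: equi-width vs. equi-depth histograms for estimating the number
# of occurrences of a value. The counts below are invented for illustration.
counts = {1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 5, 7: 4, 8: 10, 9: 2}  # value -> occurrences

# Equi-width: fixed-size buckets; store one aggregate count per bucket.
def equi_width(counts, width):
    buckets = {}
    for v, c in counts.items():
        lo = ((v - 1) // width) * width + 1            # bucket covers [lo, lo + width - 1]
        buckets[lo] = buckets.get(lo, 0) + c
    return buckets

def estimate_width(buckets, width, value):
    lo = ((value - 1) // width) * width + 1
    return buckets.get(lo, 0) / width                   # assume uniformity within the bucket

ew = equi_width(counts, width=3)
print(estimate_width(ew, 3, 2))    # bucket [1,3] holds 9 in total -> estimate 3 occurrences

# Equi-depth: variable-width buckets chosen so each holds roughly the same total count.
# Estimation inside a bucket works the same way: bucket total / bucket width.
def equi_depth_bounds(counts, num_buckets):
    target = sum(counts.values()) / num_buckets
    bounds, acc = [], 0
    for v in sorted(counts):
        acc += counts[v]
        if acc >= target and len(bounds) < num_buckets - 1:
            bounds.append(v)                            # inclusive upper edge of a bucket
            acc = 0
    return bounds

print(equi_depth_bounds(counts, num_buckets=3))         # rough boundaries, e.g. [5, 8]
```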
Yes? The question is: if I now add ten more occurrences of the value five, so its count shoots up, what happens to my histogram? This gets blown away every single time I run ANALYZE; I recompute everything. So yes, in that case the bucket sizes and counts could change. As far as I know, most systems don't maintain these things as you do inserts and updates, because it's just too expensive. We'll talk about transactions on Wednesday, but when I'm running a transaction I want to minimize the amount of work I have to do, so anything that's not important right now I want to put off until later, because I'm holding locks on tuples and that interferes with other transactions running at the same time. So I don't want to maintain this as I go along. Now, you could say, all right, I could have a separate background thread that looks at recent changes from the log and applies them. Yes, you can do that, and some systems might, but in general everyone blows it away and restarts from scratch. The one system that did try to do updates on the fly was IBM DB2's LEO, the learning optimizer: it would run a scan and then go back and update the statistics, but it had issues.

Histograms, and sketches — we didn't talk about sketches — plus the heavy hitter stuff: that's the way most database systems do this. Another alternative, instead of using these additional data structures, is to just maintain a sample of the table and derive our estimates from the sample. The way to think about histograms is that they're essentially a lower-resolution copy of the table, an approximation of its contents. But rather than keeping histograms and deriving statistics from them, what if we just took a smaller copy of the table itself, ran our predicates on that smaller copy, and assumed that the distribution of values within the sample is the same as in the real table? Then any selectivity estimates we derive from the sample should accurately reflect the real table.

So say our people table has a billion tuples, and we take a sample by just copying every other tuple into a sample table. There are obviously more sophisticated sampling algorithms you can use, but for our purposes this is fine. Now when my query comes along and I want to compute the selectivity of age > 50, I go to my sample and I see, well, Obama is the only one over the age of 50, so it's one third, and I assume the distribution of values in my full table matches that. And just as with the histograms, we could maintain this as we go along, periodically refresh it, or trigger a refresh whenever we know a large portion of the table has changed, like after a bulk load or a bulk delete. The idea again is that rather than maintaining histograms, which may be inaccurate, we just maintain a sample.

As far as I know, this only shows up in the high-end systems. SQL Server famously does this, and their optimizer is probably the best one; they actually do a combination of histograms and sampling, which I think is the right thing to do. Okay?
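A tiny sketch of the sampling idea: keep a smaller copy of the table, evaluate the predicate over it, and assume the fraction of the sample that qualifies carries over to the full table. The every-other-tuple sampling and the toy rows mirror the naive example from the slide.

```python
# Hedged sketch: estimate selectivity by running the predicate over a sample
# and assuming the sample's distribution matches the full table.
def sample_every_other(table):
    return table[::2]                      # naive "every other tuple" sampling

def estimate_selectivity(sample, predicate):
    if not sample:
        return 0.0
    return sum(1 for row in sample if predicate(row)) / len(sample)

people = [{"name": "Obama", "age": 61}, {"name": "A", "age": 20},
          {"name": "B", "age": 45},      {"name": "C", "age": 30},
          {"name": "D", "age": 35},      {"name": "E", "age": 25}]

sample = sample_every_other(people)        # three rows: ages 61, 45, 35
print(estimate_selectivity(sample, lambda r: r["age"] > 50))  # 1/3, as in the example
```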
Yes? The question is: why is it that only the high-end commercial enterprise systems actually do this, versus the open-source guys? Good question. If you already have ANALYZE, you're going to do a sequential scan anyway to compute your histograms, so you might as well generate the sample too. Yeah, I actually don't know the answer. It might just be that histograms are the way it's always been done; that's not a very satisfying answer.

Actually, here's probably the right answer: the histograms are way faster. Again, I'm enumerating all these possible query plans, and I can go to my histogram very quickly and derive the statistics I need to estimate the selectivity of a predicate or an operator. Whereas with the sample, computing the selectivity estimate means doing a scan over it, and that's definitely going to be slower than running through the histogram. We can cover this in the advanced class, but the way it probably works in SQL Server is: if it recognizes that a query is super simple, it just uses the histograms; if it thinks the query is going to be a lot of work, like it's going to take minutes or hours to run, then who cares if it spends an extra couple of seconds on the sampling, because that'll make a big difference when the query actually runs. That's probably the reason: the histogram is faster, the sample takes more work, and it's also sort of weird from an engineering standpoint, because you're running a scan on something while you're still inside the optimizer, so it might be hard to set up.

Sorry? Running stuff on this table is expensive? No, but the cost model itself is super simple: what's the selectivity of this predicate? That component of the cost model is, I think, independent of whether you use a histogram versus a sample. The formulas up above are what let me say, I'm going to do this plan instead of that one, or this hash join is better than this other join. If all that other machinery isn't very sophisticated, then maybe this doesn't matter.

Question over here? Yes: how do you create a sample that comes out accurate for every query you might throw at it? My stupid sampling here is every other tuple, but clearly that's stupid, because maybe data inserted today has a different distribution than data inserted yesterday, and maybe I want to sample differently, or I could look at my predicate and say, well, I'm only looking at data that was inserted today, so I should make sure my sample includes that. This is where it gets hard, and it's probably the reason why the advanced systems do this and the open-source guys don't. There's reservoir sampling, and a whole bunch of other sampling techniques, that try to come up with different ways to do this.
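Since reservoir sampling was just mentioned as one of the more principled ways to build such a sample, here is a minimal sketch of the classic Algorithm R, which maintains a uniform random sample of k tuples from a stream without knowing the table size in advance.

```python
# Hedged sketch of reservoir sampling (Algorithm R): after n tuples have been
# seen, every tuple has the same probability k/n of being in the reservoir.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, row in enumerate(stream):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = random.randint(0, i)       # uniform over [0, i], inclusive
            if j < k:
                reservoir[j] = row         # replace a random slot
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```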
I don't know what the commercial systems actually do, but hopefully you see why query optimization is super hard: you need to know some gnarly math to figure out the right way to sample, and it could depend on the query; for some queries uniform sampling might be perfect, for others a different technique might be better.

Yes? Say that again: from an engineering standpoint, this could be hard because now you have a separate table, and you want to run a sequential scan on it, ideally using the same execution code you already have for sequential scans, in order to compute these statistics. Yeah, it's a bit of a chicken-and-egg problem: I can't run a query until I have a query plan, but I can't get a query plan until I can run the query that computes the sample. Okay.

So at this point, what do we have? We can now roughly estimate the selectivity of our predicates. What do we actually want to do with that? As I said at the beginning, this is where we do our cost-based search for query optimization. In the pipeline, after we do all those rewrites that just use rules, we enter this cost-based search to try to figure out how to convert the logical plan into a physical plan. The physical plan is what the database system actually executes. The logical plan says: I want to join these two tables. The physical plan says: join these two tables with this algorithm, this buffer, this sort order, and all that good stuff.

For single relations it's pretty straightforward, and we'll talk about it briefly. The case we're going to spend most of our time on, because it's the hardest, is multiple relations, n-way joins, because now it's not only about which join algorithm I use but also what order I do my joins in. As I said last class, the number of possible query plans is something like four to the n, where n is the number of tables we're joining: for all the different join algorithms, I can join the tables in different orders, and I can flip which side is the inner versus the outer. The search space explodes.

Because this is NP-hard, we don't want to do an exhaustive search; we'd never finish. It doesn't make sense to run the query optimizer for an hour if the query is only going to take one minute to run; that's not a good trade-off. So we need a way to shed work, to cut off query plans we don't want to examine, to reduce our search space and make the problem more tractable. Let's first talk about how we handle single relations, and then spend most of the time on multiple relations.

For single-relation query plans, the hardest problem we have to deal with is picking the access method. The fallback option is always a sequential scan.
It's the slowest, but it's always correct. Or we might want to do a binary search if we have a clustering index, or pick one or more indexes to use for an index scan. The other thing we care about is the order in which we evaluate predicates. If I have something AND something, and the second predicate is more selective than the first, maybe I want to evaluate that one first, so I throw away more data sooner rather than later, and the less selective predicate gets evaluated second.

In most new database systems — all these startups and new systems that have come along in the last ten years — if they have a query optimizer at all, they're probably just using heuristics to pick these things. You don't truly need a sophisticated cost model for this: you just ask, which index is the most selective, and that's the one you always pick; or which predicate is the most selective, and that's the one you evaluate first.

For OLTP queries this is especially easy, because they don't access much data and they're mostly doing single-table lookups. So for OLTP queries, the query planning we do is essentially trying to identify whether a query is sargable. This is a term from the 1980s; I don't know who invented it. Sargable just means "search argument-able," and all it means is that there's an index we can pick for the query that we know is the best one to use. That's it. We don't need an exhaustive search: we just look at all the indexes that could satisfy the query and pick the one with the best selectivity, because that routes us to the data most quickly. Really simple: I have my SELECT * FROM people WHERE id = 123, and I have a heuristic that says, oh, there's a primary key index on id, done, I pick an index scan on that index. Most newer systems built for transactions, for OLTP workloads, support this first.

For joins, though, that's where things get hard. As the number of tables we're joining grows, the number of alternative plans grows, so we need a way to prune it down. We're going to rely on a core assumption that the IBM people made back in the 1970s with System R, when they built the first query optimizer: they only consider left-deep join trees. That means that for any other join tree structure — I'll show what those look like on the next slide — they don't even bother doing any search or cost estimation; they just won't consider it.

So a left-deep tree looks like this: all the joins run along the left side of the tree. We join A and B, and then the output of that A-B join is joined with the input of C, the scan on C, and so forth. This middle one here is sort of a hodgepodge, right?
Some of its joins have a base table on the left, some on the right. And this one here is called a bushy tree, where I do the join on C and D, do the join on A and B, and then the outputs of those two joins are joined together at the end. IBM and System R just don't consider those other shapes; they only look at the left-deep ones.

Yes? His statement is that the result is always the same. Yes, and that's the beauty of relational algebra: the join operator is commutative, so I can join these things in any order and the final result is always the same; it's always correct. So it's perfectly safe to do this.

Why do they do this, other than just reducing the number of plans they have to look at? Think back to when we talked about query processing models. Back then they were using the iterator model, the Volcano model, also called the pipeline model. With a left-deep join tree, I don't have to materialize the output of a join operator; it's always being fed directly into the next join operator. I do my join on A and B, then I take the output of that join and use it to build my hash table, or do whatever kind of join I want, with C. If I have a bushy tree, I do the join on C and D, and that output gets written out to a temp file on disk, because now I need to go back over here and do the join on A and B, and then read the C-D result back in, build my hash table, and do the join with it. Left-deep join trees are not always fully pipelined, but they make your life easier, and back in the 1970s they didn't have a lot of memory, so they had to spill to disk a lot. You minimize the amount of data you have to write to disk if you always go left-deep. In today's systems, not everyone makes this assumption, but every textbook talks about it: you cut these other shapes out.

All right, so how are we actually going to enumerate our query plans? The first thing we do is enumerate, at the logical level, all the different orderings of the tables we could join. Say I want to join R, S, and T: maybe I join R and S first, maybe T and S first; I enumerate all of those. Then, for each of those, I enumerate all the different join algorithms I could use: hash join, sort-merge join, nested loop join. And then, for each of those, I enumerate all the possible access methods, giving all the possible physical query plans I could have.
As you can see, the search space is exploding. So what the IBM guys came up with in the 1970s was to use a technique called dynamic programming to make this more tractable, by breaking the problem up into smaller discrete problems: we solve the smaller problems first, and at the very end we combine everything together.

Let's look at a really simple example. Say I want to join three tables, R, S, and T. The way to think about this is as a search tree that I'm showing horizontally. This is our starting point for the logical plan, where none of the tables are joined, and our end goal is over here, where R, S, and T are all joined together. In the first step we want to figure out the first join to do: we could join R and S first, or T and S first, and for the sake of space I'm not showing all the others, but we enumerate all the possible orderings for that first join down here. Then, for each of those, we compute the cost of performing that join with our different join algorithms; for simplicity, say we can do either a sort-merge join or a hash join. We use all those formulas we talked about before to compute the cost of executing each of these join operators, approximating the amount of disk I/O we're going to have to do. For each node in this first step, we pick whichever path, whichever join algorithm, has the lowest cost, and that's the one we retain. Then, starting from each of those nodes, we do the same thing in the next step, computing the estimated cost of the different joins that get us to the end goal, and for each node we keep only the lowest-cost way to get there. At the end, we walk back and figure out which complete path has the lowest cost, and that's the plan we end up using.

This is an oversimplification of how it actually works, but it's the general idea from System R, the dynamic programming approach. At a high level there are sort of two categories of query optimizers, and we're only talking about one of them; it's the most common one. Postgres does it this way, MySQL does it this way, Oracle does it this way: the first node is always the starting point where nothing is joined, and you work from the beginning to the end to figure out how to reach the goal where everything is joined together. The other thing I'm oversimplifying here is that I'm not showing anything about the physical properties of the data we're emitting from one operator to the next: whether things need to be sorted, whether they're compressed, row store versus column store, all the extra information you'd have to consider in your search algorithm. For our purposes we're ignoring that.

All right, so let's walk through this example a bit more concretely.
So that's the dynamic programming approach: we build out our search to figure out which path to the end goal, where everything is joined together, has the lowest cost. Now let's put it all together and do the three steps we talked about: enumerate all the join orderings, all the join algorithms, and all the access methods. Again, I want to emphasize that no database system does this exactly the way I'm showing here; it's way more complicated. But at a high level, if you understand this, you can see how to apply it to more sophisticated configurations and setups.

The very first step, to join R, S, and T, is to enumerate all the different possible join orderings. We said that for System R we can prune anything that is a cross product or that isn't a left-deep join; those we drop immediately. Then, for each of the remaining candidates — let's pick this one — we go inside it and start enumerating all the different join algorithms we could use. So to do the join of R and S followed by T, I could do a nested loop join or a hash join; I enumerate all possible configurations of those, and those are the edges going from one step to the next in the dynamic programming graph. We do the same thing for all the other join orderings from the previous slide. Then we pick one of these and enumerate all the different access methods: we could do a sequential scan or an index scan, and for each index we might have, there's another enumeration. So you keep fanning out, with more and more options, and then you use the dynamic programming technique to figure out the cheapest path.

Okay, so is it roughly clear how this works? I always like to show this next part every year: Postgres has a specialized optimizer, but in general what I've described is, at a high level, how every System R-based query optimizer works. There's a cost model that lets it make estimations as it does a dynamic programming search toward the end goal.
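Here is a minimal sketch of that System R-style dynamic programming idea: compute the best plan for every small subset of tables, then build larger joins only on top of the best sub-plans, restricted to left-deep trees. The base-table cardinalities, the flat 0.1 join selectivity, and the toy cost function are stand-ins for what the statistics and cost model would actually provide.

```python
# Hedged sketch of System R-style join ordering via dynamic programming.
# best[frozenset of tables] = (estimated cost, estimated output cardinality, plan string)
from itertools import combinations

card = {"R": 1_000, "S": 5_000, "T": 200}    # base table cardinalities (invented)
JOIN_SEL = 0.1                                # made-up join selectivity for every join

def join_cost(left_card, right_card):
    # toy stand-in for comparing hash join / sort-merge / nested loop costs
    return left_card + right_card

def best_left_deep_plan(tables):
    # base case: single-table "plans" with zero cost (scan costs ignored in this sketch)
    best = {frozenset([t]): (0.0, card[t], t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            key = frozenset(subset)
            for right in subset:                      # last base table joined in (left-deep)
                left_key = key - {right}
                lcost, lcard, lplan = best[left_key]
                cost = lcost + join_cost(lcard, card[right])
                out_card = lcard * card[right] * JOIN_SEL
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, out_card, f"({lplan} JOIN {right})")
    return best[frozenset(tables)]

print(best_left_deep_plan(["R", "S", "T"]))   # keeps only the cheapest plan per subset
```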
Postgres actually has two optimizer search algorithms. It has the System R one I just talked about, but it also has a special one called the genetic query optimizer, GEQO. If a query has fewer than roughly a dozen tables, Postgres uses the System R approach; beyond that, it switches to the genetic algorithm, because that can deal with a larger search space. Postgres will consider all different types of join orderings — left-deep, right-deep, bushy, it doesn't matter — and then it falls back to the genetic one when things get too complex.

At a high level, this works like a standard genetic search algorithm. In my first generation, I enumerate a bunch of different random configurations of my query plan, where a configuration is the join ordering plus the choice of index scan versus sequential scan plus the actual join algorithm I want to use. For each of these I compute the cost, and I pick whichever one is best: the one with the lowest cost, which I keep track of up above as the best plan I've ever seen. Then I throw away the one with the highest cost, and I mix up the traits of the ones that weren't thrown away, doing random flips of the genes, if you will, of the components of the query plan to produce new query plans. It's sort of like a random walk. Then I do the same thing in my second generation: I find the one with the lowest cost — in this case the one up here with a cost of 80 — and that becomes the new best plan I've ever seen; I throw away the worst one; I do a random mix-up of the plans I kept around; and I generate the next generation. I keep doing this until some timeout: either I haven't seen anything better than the best plan so far in a certain amount of time, or I exhaust a fixed time limit, and then whatever has come out as the best is the plan I use.

Yes? The question is about the tree shapes: for simplicity I'm only showing left-deep trees here, but you could mix it up with right-deep and bushy trees. This is only three tables; imagine you had another three tables to join, and maybe one part of the plan is right-deep and one part is left-deep. You can mix and match as needed; it's just hard to draw.

How does it pick the plans to consider in the first generation? Randomly. It has to pick some specific number of candidates, right, because if it enumerated every query plan you wouldn't need the genetic part at all. Your question is how many candidates? I don't know; I think it's probably configurable. I don't know what the number is, but it can't be all of them; it's some fixed amount. And again, the cost estimation here is the same thing we already talked about: the same histograms or samples, applied as you go.

Postgres is the only system I know of that actually does this. There was some work on simulated annealing and other techniques. This is a randomized algorithm, so it's not guaranteed to converge; you're not guaranteed to find the best possible plan. You're doing a random walk in the solution space, and hopefully you land on something reasonable. And to his point, you obviously want your initial candidates to be reasonably good. How they actually do that, I don't know, because if you start with all crap you'll have ugly children in the first step and ugly children in the second step, and it's not going to be good. So there's probably some way to figure out what your initial candidates should be.

All right, so any questions about this?
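For completeness, here is a very rough sketch of the genetic-search loop just described: start from random plans, remember the best cost seen so far, drop the worst plan each generation, mutate the survivors, and repeat until a generation limit. The plan encoding (a simple permutation of table names) and the cost function are toys, not what GEQO actually uses.

```python
# Hedged sketch of a genetic search over join orderings. A "plan" here is just a
# permutation of table names; the cost function is an arbitrary stand-in.
import random

TABLES = ["R", "S", "T", "U", "V"]

def cost(plan):
    # toy cost: pretend smaller tables should appear earlier in the ordering
    sizes = {"R": 5, "S": 3, "T": 8, "U": 1, "V": 6}
    return sum(sizes[t] * (i + 1) for i, t in enumerate(plan))

def mutate(plan):
    p = list(plan)
    i, j = random.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]                  # swap two tables: a random "gene flip"
    return p

def genetic_search(pop_size=8, generations=50):
    population = [random.sample(TABLES, len(TABLES)) for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(generations):
        population.sort(key=cost)
        population = population[:-2]                             # drop the worst plans
        population += [mutate(random.choice(population)) for _ in range(2)]
        candidate = min(population, key=cost)
        if cost(candidate) < cost(best):
            best = candidate                                     # best plan seen so far
    return best, cost(best)

print(genetic_search())
```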
Yes? Okay, so the question is: how often do you get queries that have thirteen or more joins? Very often. It depends: for OLTP, rarely; for analytics, quite often. Think about it this way. In a data warehouse, a very common setup is what's called a snowflake schema, where you have a single big table called your fact table, and then dimension tables around it. Using the Walmart example: Walmart has a fact table that records every single item anyone has ever bought at Walmart, so that table is massive, billions and billions of things. But you don't want to store the name of every product, the price, and so forth inline in that table, so you have dimension tables on the side that say, here's the product they bought, here's the store, here's its location. When you run a query that says, find me the best-selling item in the state of Pennsylvania during the winter, for this month range, for people over the age of 35, those are all joins against those dimension tables, and the join count racks up very quickly.

The other thing I'll say is that whenever I talk to database companies, I always ask about the query optimizer, because that's the part I'm most interested in, and the metric they always give me for "our query optimizer is good" is the number of joins they can support. I don't think it's a good metric, but it comes up every single time. MemSQL told me once they could do 35-table joins, somebody else said 75, Splice Machine told me they could do 135, and Snowflake said they can do thousands and have customers that actually do that. It's hard, yeah.

Question: in this case, T and R are being joined on one attribute, and R and S on a separate one; if all of them were being joined on the same column, would we also consider plans that first join T and S and then join with R? You're reading too deeply into the ordering here: again, joins are commutative, so you can join them in any order you want. If the same column is present in all the tables, you could permute them any way you want, yes. Now, where I thought you were going: I've shown you two-way join algorithms, the two nested for-loops; there are also multi-way joins, where you can say, I'm joining R, S, and T on exactly the same attribute, so let me join them all at the same time. That's hard, and only the expensive systems do it; Postgres and MySQL don't.

Yes? What distinguishes a dimension table? What do you mean? So this is actually something that a human creates; I'll cover it maybe next class, or I can post something on Piazza. A human has to design the database and say, here's my fact table, here are my dimension tables; it's not something the system automatically figures out. I'm using the term dimension table because that's the vernacular for describing what that table looks like in a snowflake schema: you have a fact table in the middle, and the things around it are called dimension tables. But from the database's perspective, unless it's a system specifically designed for OLAP
But from the database's perspective, unless it's a system designed for OLAP or analytical queries, there's no special designation in SQL to say "this is your dimension table" or "this is your fact table." It's just a term we use as humans. Yeah, I'll send the slides from the advanced class; I'll post that on Piazza.

Yes? So the idea here is that, again, it's a randomized algorithm. The algorithm says this plan has the highest cost. I don't know exactly why it's the worst; there's just something about it that's bad. So rather than propagating whatever is bad about it, its deformed genes if you will, into the next generation, I just throw it away. And there's something about the one with the lowest cost that I do want to propagate forward. That's why I let this one get friendly with that one, and before you know it you've got to go to the hospital on Wednesday and your wife is giving birth, right?

Your question is: how is this plan different from that one? Isn't this just right-deep versus left-deep? No, no: that's the outer table and this is the inner table. So if I'm doing a hash join, I'm building the hash table on T in one and probing with T in the other. The join order matters, yes.

Oh, yeah, that should be here. That's a mistake on the slide. This is just saying that this plan is allowed to propagate forward; that one should be down there. I'll fix that, sorry.

Okay, so let's finish up quickly with nested subqueries. This is something we'll cover more in the advanced class if you take that, but the main way to think about it is that a subquery is not quite the same as a join, and we want to be smart about how we evaluate it. The dumbest thing we could do is evaluate the inner query, the subquery, once for every single tuple we look at in the outer query. MySQL used to do that, and it was awful. So there are two approaches, and we can often apply them without running a cost-based search; we can do them as part of our rewriting phase. The first is to rewrite the query to decorrelate it, or flatten it out. The second is to extract the inner query, run it separately as its own query, and then feed its result into the outer query.
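Here is a small sketch of those two rewrites, again with Python's built-in sqlite3 module and a made-up sailors/reserves schema (in the spirit of the Wisconsin textbook's sailing-club tables, not the exact slide queries): first the correlated EXISTS form versus the flattened join, then the trick of running the inner block once and substituting its value.

```python
# Hypothetical sailors/reserves data; illustrates decorrelation and
# extract-and-substitute rewrites, not any particular optimizer's internals.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sailors  (sid INT PRIMARY KEY, sname TEXT, rating INT);
    CREATE TABLE reserves (sid INT, bid INT, day TEXT);
    INSERT INTO sailors  VALUES (1,'Dustin',7), (2,'Lubber',8), (3,'Rusty',10);
    INSERT INTO reserves VALUES (1,101,'2019-10-25'), (3,102,'2019-10-25'),
                                (2,103,'2019-11-01');
""")

# Correlated form: the inner query references S.sid from the outer query, so a
# naive executor re-runs it once per sailor tuple.
correlated = """
    SELECT S.sname FROM sailors S
     WHERE EXISTS (SELECT 1 FROM reserves R
                    WHERE R.sid = S.sid AND R.day = '2019-10-25');
"""

# Flattened / decorrelated form: the same answer written as a plain join, which
# the optimizer can then feed into its normal cost-based join-order search.
flattened = """
    SELECT DISTINCT S.sname
      FROM sailors S JOIN reserves R ON R.sid = S.sid
     WHERE R.day = '2019-10-25';
"""
assert sorted(conn.execute(correlated).fetchall()) == \
       sorted(conn.execute(flattened).fetchall())

# Second rewrite: run the inner block once up front, stash the value, and
# substitute it into the outer query instead of recomputing it per tuple.
(max_rating,) = conn.execute("SELECT MAX(rating) FROM sailors").fetchone()
print(conn.execute("SELECT sname FROM sailors WHERE rating = ?",
                   (max_rating,)).fetchall())   # [('Rusty',)]
```

Whether SQLite itself flattens the first form internally is beside the point; the point is that both forms compute the same answer, so a rewriter is free to pick whichever one is cheaper to plan and execute.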
So let's look at an example. Say we have a table that keeps track of sailors. This is from when we used to use the old textbook from Wisconsin, and Wisconsin has a sailing club, so that's where this comes from. Basically we're trying to get the names of all the sailors who have reserved a boat on a given date. We have this inner query, and inside it we're referencing a record from the outer query. So in that case we know the two are correlated, because they reference each other, and we can just rewrite this as a join. Again, we can do this in the rewriting phase: we recognize that we have this predicate up here, we rewrite the query like this, and then we just run our cost-based search to figure out the right join ordering, same as before.

Let's look at a more complicated example. In this one, for each sailor with the highest rating over all sailors, we want to get the reservations for red boats and then find the sailor id with the earliest date on which that sailor reserved a red boat. The main thing is that we have this inner query where we're just trying to get the maximum sailor rating over all sailors. Worst case, when we do this lookup up here where rating equals the inner query, we rerun that inner query over and over again, once for every single tuple of the outer table, the sailors table. That would be slow; that would suck. So what we can do is extract that out, or rewrite it, so we don't have to rerun it every single time. The first approach could be to just take that nested block out, run it up front, store the result in some kind of variable, and then substitute that value down below.

The main thing I'm trying to show here is that the query optimizer doesn't necessarily have to take a single query and treat it as a single unit. We can rewrite it into pieces and execute them one after another in order to fill in the values we actually need, and this is something the more sophisticated systems can do. And this is saying: this is the outer block, and that's the inner block above it.

Okay, so that's the bulk of what we're going to talk about for query optimization for the rest of the semester. As I said, this is something I'm really interested in, and I plan to cover more of it in the advanced class. If this is something you're interested in getting involved in and you want to get started on it, contact me, because that would guarantee you a seat in the advanced class as well. So: we talked about how to do selectivity estimation, we talked about the major assumptions we make (uniformity and independence) and the problems with them, we talked about how to use dynamic programming for join orderings, and then we saw simple techniques to rewrite nested queries.

All right, any questions about any of this? So again, every single time you fire up a query in SQLite or Postgres or MySQL, whatever database you're using, it's going through all of this, which is amazing given how fast it actually is, even though it's a really complex problem.

All right, so next class I will be here, and we're going to start talking about concurrency control. This is the second hardest thing in database systems, and it's also something that, if you're really good at it, can get you a job. We'll cover this on Monday, or sorry, Wednesday this week. We'll cover the
basics of concurrency control theory. Again, Andy theory, not real theory. And then Monday next week and Wednesday next week I'll have the PhD students cover two-phase locking and timestamp ordering, because those are the two things you'll need for the fourth homework assignment. Okay.

Okay, I was going to do extra credit. Hold up, let me get the slides. So we're releasing the extra credit assignment today. You can earn 10% extra credit toward your final grade, not for the exams, not for the projects, but for the final grade of the entire semester, if you write an article about a database management system. You can pick any database management system you want, other than the ones that students have written about before.

The way to think about this is that you're writing something like an encyclopedia article, except that instead of writing free-form text like you would on Wikipedia, it's set up so that you specify exactly how different parts of the system are implemented. There are options to choose, like how it does concurrency control and what kinds of indexes it has, and then you write a description of what the system actually does. You have to provide citations for everything, but everything will be explained on the website.

So I've created this website called the Database of Databases, dbdb.io. I wanted db2.io, but that's asking for a lawsuit from IBM, so we didn't do that. The way it basically works is that it has different ways to categorize database systems so you can find what you're looking for. In this case here, these are all the database systems implemented in Rust, and then there's an article for each one that describes exactly how everything works, with citations for everything.

Okay. So I will post a sign-up sheet on Piazza for you to select which database system you want to write about. This is the second year we've done this, so not every single system will be available to you, but there are enough of them out there that everyone should be able to pick something they want. It'll be first come, first served: when I post the sign-up sheet online I'll announce it on Piazza, and whoever gets there first gets whatever system they asked for, as long as it's approved.

The way to think about this is that if you pick a system that's widely known and used, like Oracle, there's a lot of documentation and a lot of information about how it actually works, so we'd expect you to write something very comprehensive with a lot of citations. If you pick an obscure system, that might be problematic, because there may not be as much documentation available. In that case I can put you in touch with the people who are actually implementing it, and they can provide you information. I've had previous students post on message boards or message people on Twitter to ask how a system actually works. One student last year picked a system that only existed for three years in the 1980s, and she reached out to the guy who actually wrote it, because he was an MIT alum. It's actually quite impressive what people have done to get this information.

So what happens is you'll sign up and get an account, and there'll be a page like this that you can go edit. Again, you can see it's not free-form text for everything; you select exactly how different parts of the system are implemented. So you may say: Andy, what database system am I going to pick?
And I'll just say that there are enough of them out there that you should have no problem finding one you want. I'm currently aware of 636 different database systems: distributed systems, embedded systems, systems written in different languages, going back to 1968, or even 1965. So again, you should have no problem finding a database system that piques whatever your real interest is.

Okay, yes? (A student asks about one of the system names on the screen.) Yeah, yeah, I know. It's not real, yeah. And this isn't even all of them; this is still a subset. This is probably about 400 of them, and there's another 230 or so that I know about.

Okay, so how can you decide which system to pick? You can pick by country of origin: if you care about databases written in Brazil, or China, or India, we have those. By popularity: I keep track of which pages get viewed the most, so I can tell you which ones you should probably look at. There's one from France written in JavaScript that, for whatever reason, Google picks up, so we get a lot of traffic for that. By whatever programming language you're interested in: if you like Rust, or Go, or C, we even have a database system written in Bash. Whatever you want, we have something for it. You want distributed databases, embedded databases, single-node databases, we have those. Disk versus memory, row store versus column store, commercial versus open source versus enterprise: we have it all. If we have time, we can look at the leaderboard and see which systems appear the most, which programming languages, which countries.

All right, I have to say this because of what happened last year: do not plagiarize. You're going to submit this in two parts. The first part will be a checkpoint, maybe after Thanksgiving, where we'll look it over and give you feedback on whether you're on the right track and doing the right thing. Then there'll be the final submission during finals week, and that's where you get your final grade. For both of those submissions, do not plagiarize. Last year we had somebody copy and paste from Wikipedia and put that in there, and the reason we caught them is that they didn't go delete all the brackets for the citations. We had to report them to Warner Hall. It was very messy. So do not plagiarize, do not copy anything, do not copy images from the internet. Don't assume that because someone else described the system they're an expert who knows better than you, and that therefore it would be wrong for you to rewrite it. I don't care; do not copy anything. Your name is going to go on this, and it's public on the internet, so if you plagiarize we have to report you.

Okay, yes? The question is whether you can use an image if you cite it. I'd rather you didn't; you can get by without images. Yes? Is there a difference between the image that represents the database and other images? Are we responsible for finding an image? I wouldn't worry too much about images, yeah.

Yes? How does editing the pages from previous years work, if they're outdated? So the way that works is, since students have done this in the past and the pages don't all have the same quality, I will provide a list of the ones you should not pick from. No, I mean more like when...
Oh, yeah, so I'll provide guidance for this. You should always assume the latest version. So if they did something a certain way five years ago and it got fixed, I don't care about five years ago; just describe what it does now. You can mention that they used to do it a certain way, but I would focus on how the system exists now.

Yes? And in the future, like five years from now, I'll be dead or you'll be gone, so don't worry about it. Yeah, actually we've had some companies reach out to us, and we allow them to edit their page, but again, I'm curating it. I want them to avoid the marketing stuff, "we're the fastest database," all that crap. Keep it very scientific. Yes? We'll take that on a case-by-case basis. Is there any system in particular you're thinking about? No? Because the Mongo one is actually pretty good already.

Okay. So if you've got to go, you can go, but let's quickly look at the website. All right, so what country has created the most database systems? USA. Who's number two? He said Russia. China? He says China. India? France and India. All right, so this is the website, and we can go to the leaderboards. Actually, let's do this: you can click on any database system, and over here it has the country of origin. Somebody said China, so it'll tell you how many databases China has: China has 28. The U.S. has 300. Somebody said, what was it, India? What's the country code for India, IN? Five. Russia? 21. You're missing it. Germany: 44, right? France is less.

All right, so I can click on this, and then I can click on the programming language it's implemented in. This one is written in C++, and there are 101 of those. What is probably the most common programming language used for a database system? Everyone says C++, or C, or the combination of C and C++. It's Java. Right, so I have this leaderboard thing here; it shows the breakdown by country (Germany is number two), and Java is the most common programming language, but if you combine C++ and C, that outpaces it.

We also keep track of which database systems use other database systems. These are the embedded database systems like RocksDB: a bunch of systems, like CockroachDB, use RocksDB as the internal storage engine and then have a Go layer on top of it. There are also systems that have forked other systems and use that as their implementation, so a lot of companies have forked Postgres, as well as MySQL; those are the two most common ones. So again, you can see why we're trying to keep track of all these things. In total, as I said, I'm aware of 630-something. So if you're from India or China or whatever country you come from, and there's a database system from there that I don't know about, let me know, because I want to know. This thing just goes on and on; there are tons of them.

So I'll post this on Piazza today. Project 3 will be announced either today or tomorrow, and then we'll get Gradescope set up for that. We'll have class on Wednesday, and then starting Monday next week it will be my grad students teaching. I'll have office hours immediately after this if you want to see your midterm exam, and we'll be around later in the week. Okay? All right guys, see you, thank you.