Okay, so thank you for coming to this talk, where I will talk about histogram support in the upcoming MySQL 8.0. My name is Øystein Grøvlen, and I work for the MySQL optimizer team at Oracle.

The agenda for my talk is that I will start with a motivating example of why you should care about histograms, then give a quick start guide on how to use them, explain how they are used internally in MySQL, show some examples of how you can get better query plans by using histograms, and end with some concluding advice.

For the motivating example on the need for histograms, I have this simple join query. It joins two tables, the orders and customer tables, and has some rare conditions on the order date and the account balance of the customer. If I run EXPLAIN on this query, I get the query plan the optimizer has chosen in order to execute it. What you can see from the output is that it will start with the orders table, and then for each row of the orders table it will do lookups into the customer table to find the matching rows. You can also see which index is used: in this case it will use a table scan, and that is because when you are looking for more than about 30% of the rows, using an index is no longer optimal; you will access so many rows that it is more efficient to just scan the entire table. We have an index on order date, as we can see here, but we choose not to use it.

The filtering estimate you have here is pretty accurate, because when you have an index, what the optimizer does is ask the storage engine, in this case InnoDB, how many rows are in this range. What InnoDB does is take the start of the range and the end of the range, navigate through the B-tree, and estimate the distance between the two pages where it finds the start and end rows. That way it has a pretty good estimate of the size of the range. However, for the other table, the customer table, we do not have an index on the account balance column, so the optimizer has no information about the selectivity of this condition. What it does in MySQL 5.7 is make a guess, and since this is a range condition, it guesses that maybe one third of the rows qualify. But this is just a guess; it has no basis for it except the type of expression.

So you might wonder: is this really the best join order? One way to test the difference is to use a hint. In 8.0 I can force the join order with this hint; I say that customer should come before orders. In earlier versions you have STRAIGHT_JOIN, which can do the same, but then I would have to rewrite the query so that customer actually comes before orders in the join. The new hint has the advantage that you do not have to edit your query in order to specify the join order. And what I see when I execute these queries is that if I switch the join order, it is much faster: it goes from 15 seconds to one second. The reason is that there is no customer with an account balance of less than minus 1,000. So with this join order you scan through the entire customer table, you find nothing, and there are no lookups into the orders table at all. But the optimizer is not able to know that, because it does not have any information about the selectivity of this condition. And that is where histograms come in.
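Before moving on, here is a rough sketch of the two variants compared above. The join condition, the date range, and the exact column names are my reconstruction in the style of a DBT-3 schema, not copied from the slides; the JOIN_ORDER hint is the 8.0 hint mentioned above.

EXPLAIN
SELECT COUNT(*)
FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate BETWEEN '1997-01-01' AND '1997-03-31'
  AND c_acctbal < -1000;

-- Forcing customer to come first, without rewriting the join itself:
EXPLAIN
SELECT /*+ JOIN_ORDER(customer, orders) */ COUNT(*)
FROM orders
JOIN customer ON o_custkey = c_custkey
WHERE o_orderdate BETWEEN '1997-01-01' AND '1997-03-31'
  AND c_acctbal < -1000;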
So what I can do is create a histogram on this column, using a syntax I will come back to. And after that, you see that it now picks this join order. There is no join hint forcing it; it picks it by itself, because now it knows that the filtering here is 0%. So now you see why the histogram is a useful addition when you do not have any other statistics on your columns.

So, a quick start guide on histograms, or I could also call it all you need to know about histograms; it is not that much. Histograms are statistics on columns: the distribution of the values you have in a column. The data values are grouped into buckets (I will get back to what that means), and for each bucket you calculate the frequency of values in that bucket. Our histograms have a maximum of 1,024 buckets. MySQL builds the histogram either based on all the rows of the table or on a sample of the rows; that basically depends on the amount of memory you reserve for creating the histogram. And there are two histogram types, chosen automatically by MySQL: it can either be a singleton histogram, which has one value per bucket, or an equi-height histogram, which can have multiple values per bucket. I will show you now what that means.

If you have a singleton histogram, you have one value per bucket, and for each bucket you store the value and the cumulative frequency up to and including that bucket. Here is an example: a column has a domain of values between 0 and 10, and you have 14% zeros, 22% ones, and so on. This is very useful because, since there is one bucket for each value, you can easily estimate both equality and range frequencies. For equality, as I said, if you have the column equal to 0 and you know 14% of the rows are in that bucket, that is the selectivity. For a range like less than 5, the sum of these four buckets gives you the selectivity. And if there is no value, like 4 here, there is no bucket for it, and the same for values larger than 10, so you can easily determine that a certain value does not occur in your data at all.

But often you have columns that can take very many values, and then you cannot have one bucket for each value. For that we have equi-height histograms, with multiple values per bucket. The ones we have implemented are not true equi-height, because unlike basic equi-height histograms, we do not split values across buckets. This is the same set of values as on the previous slide, but now we have only five buckets, with potentially more than one value per bucket. A nice feature of this not-quite-equi-height approach is that a high-frequency value gets a separate bucket. The most frequent value is 1, and in this case it actually gets a bucket of its own, so if you have something like equals 1 or not equals 1, you get a pretty good estimate. But in other cases, equi-height is best for ranges and not always that good for equality. Take the example of a condition equal to 6: you have this bucket with 5 and 6, and you see that the frequency for this bucket is 13%, so we estimate there is six and a half percent of each. But in this case there was actually 12% with 5 and 1% with 6.
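To make that concrete, here is roughly how the estimates fall out of the bucket data just described (the numbers are the ones from the slides as described above):

  col = 0  ->  frequency of the singleton bucket for 0 = 0.14
  col < 5  ->  sum of the singleton buckets for 0, 1, 2, 3 (the cumulative frequency of the bucket for 3)
  col = 4  ->  no bucket for 4, so the estimated selectivity is 0
  col = 6  ->  the equi-height bucket holding 5 and 6 has frequency 0.13 and two distinct values,
               so the estimate is 0.13 / 2 = 0.065, while the real frequency of 6 was 0.01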
So you see that with equi-height histograms, such estimates are not as accurate as with a singleton histogram. For these equi-height histograms we store, per bucket, the minimum value, the maximum value, the cumulative frequency, and the number of distinct values.

So how do you create or refresh a histogram for a column? You use an ANALYZE TABLE command, but a special variant where you say UPDATE HISTOGRAM on a set of columns with a number of buckets. Note that when you say UPDATE HISTOGRAM, this ANALYZE TABLE command will only build the histogram; it will not do the normal things you do with ANALYZE TABLE. You can also drop histograms on columns. As I said earlier, whether to sample or use the entire table is based on available memory, which you can set with a variable; the default is to use 20 megabytes while creating the histogram. This is the amount of memory used when creating the histogram, not the final storage size of the histogram. We have implemented a new storage engine API for the sampling part, but the default implementation, even in the sampling case, will do a full table scan. So if there is only enough memory for 20% of the values, it will still read the entire table but just pick every fifth row. It is possible, though, for a storage engine to implement more efficient sampling, and we want InnoDB to implement this so that instead of reading every fifth row, it reads every fifth page, because that would be much more efficient.

The histogram is stored in a JSON column in the data dictionary, and you can use an information schema view to inspect it. I select from the COLUMN_STATISTICS view of the information schema, and specify the schema name, table name, and column name to get the histogram for a specific column. Then you will see something like this: you see the buckets, and in this case the histogram type is singleton, so each bucket has two values, the value for the bucket and the cumulative frequency of that bucket. You see that they increase all the way up to 1, since it is a cumulative value. You also see the data type, how many null values there are for this column, and the collation ID; that does not matter for integers, but for strings it tells you how values are compared. You see when the histogram was last updated; there are no automatic updates, so if you want to refresh it because your values have changed, you need to run another ANALYZE TABLE command. And you can see the sampling rate. So if you want a histogram over the whole table and you see it only sampled 20%, you could increase the memory limit to, say, 100 megabytes and base it on the entire table. Finally, there is the number of buckets specified; I forgot to mention that in the syntax: there is a WITH ... BUCKETS clause where you can specify how many buckets to use. And this field is there because when you do a mysqlpump or mysqldump, it will actually put these ANALYZE TABLE commands into the dump file, so that when you load it again, the histograms get recreated. That is why it stores the original specification: this histogram is a singleton with only seven buckets, but if you had specified it with 1,024 buckets and the data has changed, you want it recreated with the same specification so it can, for example, still be a singleton.
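Putting the pieces from this part together, here is a sketch of the statements involved; the schema, table, and column names are assumptions in the style of the DBT-3 examples:

-- Allow more memory for histogram creation so sampling may not be needed (default is 20 MB):
SET histogram_generation_max_mem_size = 100 * 1024 * 1024;

-- Create or refresh a histogram (this does not update the regular index statistics):
ANALYZE TABLE customer UPDATE HISTOGRAM ON c_acctbal WITH 1024 BUCKETS;

-- Drop it again if it is no longer wanted:
ANALYZE TABLE customer DROP HISTOGRAM ON c_acctbal;

-- Inspect the stored histogram:
SELECT histogram
FROM information_schema.column_statistics
WHERE schema_name = 'dbt3' AND table_name = 'customer' AND column_name = 'c_acctbal';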
We can use this information schema view to look a bit closer at the histogram. First, for strings, a maximum of 42 characters is considered. So if you have very long strings and they all share the same first 42 characters, this is not ideal. The string values are also base64 encoded. This is because what is stored in the data dictionary, or in JSON in general, is UTF-8, but in theory you can have any characters here, so it is not necessarily a valid UTF-8 string you are storing; you have to decode the values in order to see the strings. And there is what I would consider a bug, which I think we will fix before GA: there is some prefix here that should probably be hidden from you.

But the basic thing here is that I use the JSON_TABLE function in 8.0 to convert these histograms, or the buckets array, into a relational table. This way you get a table of the values, and you can see the cumulative frequency. If you want to see the actual frequency of each bucket, you can use a window function: as Sergei talked about earlier, you take the value of this bucket and subtract the value of the previous bucket. That is the LAG function, which here refers to the immediately preceding row. And since I know this array is already sorted, I do not need anything in the OVER clause. That way you can actually see that for this order status column we have 48.6% F's, 48.8% O's, and just 2.6% P's, for example. And that is a typical column where you want a singleton histogram, because if you have a condition like equals 'F' and you optimize that without a histogram, the optimizer thinks it is really selective. It is not, because about 50% of the rows have this value.
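A sketch of the kind of query meant here, for a singleton histogram where each bucket is a two-element array of value and cumulative frequency; the schema, table, and column names are assumptions:

SELECT v AS value,
       c AS cumulative_frequency,
       c - LAG(c, 1, 0) OVER () AS frequency   -- the buckets array is already sorted
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
                COLUMNS (v VARCHAR(60) PATH '$[0]',
                         c DOUBLE      PATH '$[1]')) AS buckets
WHERE schema_name = 'dbt3'
  AND table_name  = 'orders'
  AND column_name = 'o_orderstatus';

Note that for a string column the bucket values come back base64 encoded, as mentioned above, so you may want to wrap v in FROM_BASE64() to read them.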
So how are histograms used? The most important use case is when the optimizer wants to optimize join queries. We do something on the first table, there is some filtering, and what is left goes into lookups into the next table. It is this filtering that is important to know, because then you know how costly the next step will be. Probably one of the major issues with many query plans is that the optimizer does not get the cardinality right here; it does not estimate the effects of the intermediate steps of a join. As I have already touched upon, there are several ways the optimizer gets these estimates. In MySQL 5.7 it can base them on the range estimates I talked about earlier, where you do dives into the index and compute the distance. There are also index statistics, which you get with the normal ANALYZE TABLE command. And we added in 5.7 some guesstimates, which say that if it is an equality it is 10%, if it is a less-than or similar range it is one third, and so on, just to have some estimate, because some filtering estimate is better than none: if you have one table with some filtering and one table without any, you would prefer to start with the one with some filtering. What is different now is that you also have histograms. The range estimates are still the most accurate, but histograms are much better than guesstimates. This is also why histograms are not as important in MySQL as in many other databases: most databases do not do this range estimation using an index. So in MySQL, histograms are mostly useful for columns where you do not have an index.

So you have different ways of getting estimates for the conditions: you have the range estimate for a column with an index, and for an equality you can use other kinds of statistics. Then you have these guesstimates if you do not have histograms, and you compute the total: we assume the conditions are not correlated, so you compute the combined selectivity by multiplying the individual ones. If you then add histograms, look at this one third for age greater than 21 for employees; that is probably not a very good estimate, but since the optimizer does not know what the words employee or age mean, it needs something to guide it, and that is the histogram in this case. It actually turns out that 95% of the employees are older than 21, and that means the combined selectivity computed here goes up from 0.01 to 0.03. And here you see we still have one condition that is a guesstimate, and 10% is probably too high even if John is a frequent name. You could add a histogram here too, but it is not necessarily that good, since you probably have too many names for a singleton histogram, so the equality estimate will probably not be very accurate. In this case you should probably add an index on name, because if you look at these conditions, it is actually the name that is the most selective one. So in this case you probably want to start with the employee table and filter on name using an index. The other conditions are not that selective, so an index might not be useful there, and you could use a histogram instead.

Here is an example of how we use histograms. This is still an age column, but in this case it is not employees, it is the entire population. You see it is an equi-height histogram: there are 10 buckets with around 10% in each, and most of them cover about 7 or 8 years, except the last one, which for natural reasons is a bit wider in range. For each bucket we have the cumulative frequency stored, and to show you why that is useful, I can go through an example where we are looking for everyone who is 21 years old or younger. The first thing we do is identify the bucket for 21; it is this one. Then, to compute the selectivity, we first check the cumulative frequency of the previous bucket, which is 20.3%; we know all of these are younger than 21, and since we have the cumulative frequency, we do not have to sum up all the earlier buckets, it is enough to check the previous one. So we know we have at least 20%, but then we need to figure out how large a part of this bucket falls under the condition. We check the frequency of that bucket and take into account that less than or equal to 21 covers 5 of the 8 values in the bucket, so we add that fraction of the bucket to the selectivity and end up with 0.267 as the filtering effect of this condition. And if you want the opposite, you just take 1 minus that, so you get that 73% are older than 21.
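A sketch of that computation with the numbers mentioned above; the roughly 10% frequency of the bucket containing 21 is implied by the result, not something taken from the slides:

  selectivity(age <= 21) ≈ cum_freq(previous bucket)
                           + freq(bucket containing 21) * (values <= 21 in bucket) / (distinct values in bucket)
                         ≈ 0.203 + 0.102 * 5/8
                         ≈ 0.267

  selectivity(age > 21)  = 1 - 0.267 = 0.733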
Another example is a query from the DBT-3 benchmark. You do not need to understand the details; it is called the volume shipping query, and what it does is calculate the revenue of goods going from suppliers in Russia to customers in France and vice versa. We do not have an index on the nation name, and as I said earlier, if we were guessing, we would say that 10% would be Russia and 10% would be France, which is probably not accurate, since there are more than 10 countries in the world.

Without a histogram we get this plan: a normal join plan starts with a full table scan or a range scan, and then you do lookups into each of the other tables, through either a secondary index or the primary index, until you have chained together all the tables. In this case that means you get about 855,000 primary key lookups into the other nation table. But when you add a histogram, the optimizer sees that there are very few combinations here, so it is actually better to combine France and Russia early on, and then you avoid the thousands of primary key lookups into the final table that you would otherwise have. If you look at the performance, it is not as big an improvement as in the first example, but you go from about 1.7 to 1.35 seconds, something like that.

So, to conclude with some advice: when should you create histograms? They are useful, as I said, for columns that are not indexed, or for columns that are not the first column of an index, because if you have, say, a two-column index, there is no way for the optimizer to do index dives to find the values for the second column. They are used, in addition to join queries, which are the most important case, also for queries with IN subqueries, because there we try to estimate the size of the subquery, and for single-table queries with ORDER BY and LIMIT: there you have the choice of sorting the entire result and then applying the limit, or using an index that is already ordered, but to know whether that performs well you need to know how far into the index you need to scan to find enough rows for your limit, and then the filtering estimate is important. The best fit by far is low-cardinality columns like gender, order status, or day of week, because then... (Okay, I am told I have to end, but I only have this one slide left.) If the column has an index, there is no use for a histogram, and if it is not used in a WHERE clause, forget about histograms. How many buckets? If you can get a singleton histogram, use as many as you need; if not, I would guess 100 should be enough, because then you have a granularity of about 1%, which should be useful. For more information, there is a nice blog post by a colleague about this that you can look up; the slides are already online, so you can get all that information from there.

[Question about how the histograms are stored.] Yes, they are stored in the data dictionary, and with the new 8.0 data dictionary it was very easy to make an information schema view over it; we got a lot for free by using that. [Question about how using histograms relates to partitioning, where you want equal-sized partitions in some way.] I have not thought about that. One thing I forgot to say is that histograms are good estimates of your data, so instead of running costly queries against your data, you could actually just query the histograms to get approximate answers; you can use the information schema table for that. [Question about automatic histogram updates.] Yes and no; we do not currently have that. To detect whether to update the histogram, I guess you need to have some awareness of how frequently your distribution changes. In the order status example I had, the distribution is usually pretty stable: even if you add more orders, it would probably still be roughly the same split between the two big values and a few percent for the last one, so in that case you would not need to update your histograms even if you add a lot more data. That might not be the case for other columns, of course.
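Since there are no automatic updates, refreshing is just a matter of re-running the statement when you believe the distribution has drifted. A small sketch following the bucket advice above; the table and column names are assumptions:

-- Low-cardinality column: a handful of buckets is enough, and it stays a singleton histogram
ANALYZE TABLE orders UPDATE HISTOGRAM ON o_orderstatus WITH 8 BUCKETS;

-- Higher-cardinality, non-indexed column: around 100 buckets gives roughly 1% granularity
ANALYZE TABLE employee UPDATE HISTOGRAM ON age WITH 100 BUCKETS;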
If you have more questions for Øystein, go to the bar with him, buy him a beer; he will be very happy and he will answer your questions.