Now we will have Mr. Amit Bhattacharya here from MySQL, Oracle, and he will be talking about histograms and the way we use them in MySQL.

Hi, good morning everyone. I am Amit. I work with the MySQL team here in Bangalore, India, and I have been with the team for the last six years. Before I start the presentation on histograms, I would like to know — not my colleagues — how many of you have used MySQL? Oh, great. And how many of you have faced the problem of query performance getting slow? Yeah, it's a common problem, right? And how many of you have heard about histograms during your college or school days? It's a statistical way of collecting data and using it to plot bars and graphs. So histograms are one of the new features that went into MySQL 8.0. What I'll do is walk through the feature, and at the end I can take questions on related topics: how a query gets executed, what to look for in MySQL while analyzing a query, things like that.

As an example, I'll show you two slides. This is an EXPLAIN plan for a MySQL query: a join of two tables with certain filtering columns. If you compare the two slides, you will see a difference in the number of rows that get filtered out. With the plain query the estimated number of rows is higher; when we do a little optimization as a DBA or a user — by giving a join-order hint, which changes the order of the joins — the estimate improves. The point is that the MySQL optimizer generates an explain plan, and that plan might change over a period of time based on certain factors.
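The slides themselves aren't reproduced in this transcript, but a join-order hint of the kind being described looks roughly like this (table names `t1`/`t2` are hypothetical; `JOIN_ORDER` is a real MySQL 8.0 optimizer hint, shown here as query strings):

```python
# Hypothetical tables t1/t2; JOIN_ORDER is a MySQL 8.0 optimizer hint
# that forces the listed join order instead of the optimizer's choice.
default_query = (
    "EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.t1_id "
    "WHERE t2.flag = 1"
)
hinted_query = (
    "EXPLAIN SELECT /*+ JOIN_ORDER(t2, t1) */ * "
    "FROM t1 JOIN t2 ON t1.id = t2.t1_id WHERE t2.flag = 1"
)
```

Comparing the two EXPLAIN outputs shows the row estimates shifting as the join order changes, which is the difference visible between the two slides.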
Maybe a lot of new rows have been inserted into the table, or you might not have updated the statistics after your inserts and updates — all those things. What I'm trying to focus on here is that at certain times you might not get the optimal plan for a particular query. The reasons are, first, a bad estimate of the number of rows in the table, and second, incorrect information about the number of distinct values in a column. This is one of the most important statistics that goes into the generation of a query plan in the optimizer, and if it is not accurate, it can result in a bad query plan. The third point is the data distribution within a column. There are cases where we have seen a lot of NULL values inside a particular column, or the data is heavily skewed: you might have, say, 80% of the rows of a table holding one particular value, with the remaining values scattered all over. Because of that, too, the optimizer can generate a non-optimal plan — I wouldn't say a bad plan, but a non-optimal plan.

Here is a very nice use case. Assume we have a table with a time column recording the time a person wakes up in the morning. If you think about the values that will go in there, they can range from, say, 4 or 5 a.m. to maybe 8 or 9 a.m., if the person doesn't work a night shift and sleeps at around 12 o'clock. The data in this column will be heavily skewed towards particular values. So we need something in the optimizer that can use proper statistics from a column that is heavily skewed — and that's where histograms come into the picture. A histogram captures information about the value distribution of a column: it calculates the frequency of the column's values, grouped into what are known as buckets.
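The bucket idea can be sketched in a few lines (a toy model of a singleton histogram, not MySQL's implementation; the wake-up hours are made up):

```python
from collections import Counter

def singleton_histogram(values):
    """Map each distinct value to its cumulative frequency,
    which is essentially what a singleton histogram stores per bucket."""
    counts = Counter(values)
    total = len(values)
    hist, running = {}, 0
    for value in sorted(counts):
        running += counts[value]
        hist[value] = running / total
    return hist

# Heavily skewed wake-up hours: most people wake at 6 or 7 a.m.
hours = [6] * 50 + [7] * 30 + [5] * 10 + [4] * 5 + [9] * 5
hist = singleton_histogram(hours)
# Selectivity of "hour = 6" is the step in cumulative frequency:
sel = hist[6] - hist[5]
```

The step between adjacent cumulative frequencies gives the fraction of rows holding exactly that value, which is what makes skew visible to an optimizer.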
Right now we support a maximum of 1,024 buckets. While creating the histogram, you don't need to take all the data in the column: you can do sampling based on the memory you make available and take a random sample — even this factor is configurable.

There are two types of histogram. One is the singleton histogram. Say you have a column for the gender of a person: it will have male, female, maybe some other values — basically two, three, four distinct values. In that case you can have one value per bucket. Then there are scenarios where the number of distinct values is more than 1,024; in that case something called an equi-height histogram is used. You don't have to configure this yourself: based on the number of buckets you define, MySQL automatically generates the appropriate type.

This is how a singleton histogram looks: one value per bucket, together with the cumulative frequency of each value. It is useful when you have an equality predicate or a range filter predicate in your query. And this is a pictorial representation of an equi-height histogram. Here the values are not one per bucket; the bucket boundaries are decided based on the distribution you have. If a region of the domain has few values, a whole range — say values 7 to 10 — will fit into one bucket. Each bucket stores the minimum value, the maximum value, the cumulative frequency, and the number of distinct values. It is best suited for range-style queries. So this is how you define a histogram.
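Collected as concrete statements (the `customer`/`c_acctbal` names are stand-ins; the statement syntax is MySQL 8.0's, held here as strings):

```python
# Table/column names are hypothetical; the syntax is MySQL 8.0's.
create_histogram = (
    "ANALYZE TABLE customer UPDATE HISTOGRAM ON c_acctbal WITH 64 BUCKETS"
)
drop_histogram = "ANALYZE TABLE customer DROP HISTOGRAM ON c_acctbal"
# Memory cap in bytes for the sampling pass during histogram generation:
set_sample_memory = "SET histogram_generation_max_mem_size = 20000000"
# Histograms are stored as JSON in the data dictionary and exposed via:
inspect_histogram = (
    "SELECT HISTOGRAM FROM information_schema.COLUMN_STATISTICS "
    "WHERE TABLE_NAME = 'customer' AND COLUMN_NAME = 'c_acctbal'"
)
```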
You have to specify the column and the number of buckets. This is how you drop a histogram in case you don't need it. And this is the configurable parameter that tells you how much data is going to be sampled — it's in bytes. There is a storage engine API for sampling, which currently does a full table scan while drawing the sample. Histograms are stored as JSON values in the data dictionary, and they can be queried from an information schema table, so you can see what a histogram looks like: the values with their cumulative frequencies — mostly the metadata.

What I'll do is skip a few slides and come to this, because this is a very important slide: how the optimizer finds out how many records are going to be filtered out by a particular condition. There is something called the condition filtering effect, an estimate derived from the kind of filter predicates you have. On columns that do not have an index, the MySQL optimizer uses what we call guesstimates: based on the predicates, it picks one of a set of default values as the statistic that goes into the query plan. If you look at this query, you see several predicates: an equality predicate, a range predicate, and a BETWEEN predicate — which is also a range. Before histograms, the optimizer would use those per-predicate defaults to estimate how many rows get filtered out when reading the table.
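The per-predicate defaults being referred to are roughly the following (the exact constants are my assumption — they are the values commonly cited for MySQL's heuristics when no index or histogram is available; the query and column names are hypothetical):

```python
# Commonly cited MySQL default selectivity guesses for predicates on
# non-indexed columns (treat the exact values as an assumption):
COND_FILTER_EQUALITY = 0.1       # col = const         -> assume 10% match
COND_FILTER_INEQUALITY = 0.3333  # col < const, col >  -> assume ~33% match
COND_FILTER_BETWEEN = 0.1111     # col BETWEEN a AND b -> assume ~11% match

# For a query like: SELECT ... WHERE a = 5 AND b > 10 AND c BETWEEN 1 AND 7
per_predicate = {
    "a = 5": COND_FILTER_EQUALITY,
    "b > 10": COND_FILTER_INEQUALITY,
    "c BETWEEN 1 AND 7": COND_FILTER_BETWEEN,
}
```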
That's the condition filtering effect: the optimizer multiplies all these values and comes to a conclusion about the filtering factor it's going to use. Now, this is a pure guesstimate based on those default values, and if the actual number of qualifying rows is more or less than that, it's not going to generate an optimal plan. With a histogram, you get the actual frequency of the data distribution — the frequency of, say, the values with age greater than one — and this gives you much better statistics while generating the query plan. This is something internal that the optimizer does; I've put it here just to show you how the calculations are done.

We do a lot of performance tests with certain benchmarks. This is one of them, a DBT-3 query, and the schema should look familiar — it's a TPC-H-style schema. Here we have two columns that do not have an index, and we create histograms on those two columns. If I represent the query execution plan pictorially, the join order starts from supplier, lineitem, orders, customer, nation, and nation2, and execution happens from left to right. Without histograms, the nation and nation2 tables sit far apart in the join order, and the number of rows that qualify in the early stages is high. With a histogram on the nation table, the statistics improve, the filtering happens earlier, and the number of rows that qualify is smaller. So the cost of all the subsequent operations goes down, and that translates into a better-performing query.
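Putting rough numbers on that comparison (a sketch with made-up row counts and selectivities; the 80% skew echoes the earlier wake-up-time example, and the default constants are assumptions):

```python
table_rows = 1_000_000

# Guesstimate path: fixed default selectivities multiplied together
# (equality * range * BETWEEN, using commonly cited MySQL defaults).
guess_filter = 0.1 * 0.3333 * 0.1111
rows_guess = table_rows * guess_filter

# Histogram path: the equality selectivity now comes from the stored
# cumulative frequencies.  Suppose the histogram shows 80% of rows
# hold the compared value (heavily skewed data):
hist_filter = 0.80 * 0.3333 * 0.1111
rows_hist = table_rows * hist_filter

# The estimates differ by 8x -- easily enough to change the join order.
```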
Without the histogram it used to take around 1.7 seconds, and with it, 1.335 seconds — but it again depends on the volume of data you have.

Very important: you use a histogram on a column where you don't want to create an index. If the column is part of your query but you don't want an index on it, a histogram is going to help you out. It's a one-time operation: when you create the histogram, an amount of data is sampled and the resulting values are stored in the metadata tables. The best fit is a column with low cardinality — something like gender, order status, day of week, stuff like that — and a non-uniform, heavily skewed data distribution. The only problem is that if your data is changing too much, you will have to keep updating your histograms. But if it's something like these examples and you think it won't change over a period of time, then this is one of the best solutions you can go for in 8.0 for your query performance.

Some advice on when not to use them: if your column is the first column in an index, don't create a histogram on it; and if the column is never used in a WHERE clause — if it's just part of the select list — you don't need a histogram. Also, if possible, go for a singleton histogram, because that's the best fit: with an equi-height histogram, estimating a single value means scanning through the buckets to find it, which is extra work. And if your data is getting updated very frequently, you would probably not want to create a histogram at all.

So, I think this was a small topic that I wanted to present, but I can take up questions.
There are blogs that we have written about this feature, so you can go ahead and read them. I would also like to thank my colleagues Eric and Oistin: Eric developed this feature and Oistin has written a lot of the blogs about it. So I'm done with my presentation, and I'd like to take up any questions you have.

Q: Is it also possible to analyze at the schema level, not only the table level — in the sense of analyzing all the tables in a schema?
A: You're talking about the ANALYZE command that creates the histogram? In Oracle RDBMS you can do schema-level analysis for histograms — here, no, I think it's on columns only.
Q: And on the column, can we define what percentage of the data is analyzed to build the statistics?
A: Not as a percentage. If you have, for example, a 100 GB table, analyzing all of it would take forever — a full table scan is going to take ages. So instead you define how much data you want to sample, in terms of size: say 20 MB, 50 MB, or 1 GB.
Q: Yeah, that's the sampling.
A: Yes, that's the sampling.

Okay. Thank you so much, and thanks for coming.