So, this is the talk about MariaDB and MySQL, about the statistics that the optimizer needs to optimize your queries, and when and how not to use indexes. OK, so I work at MariaDB Corporation as chief architect, and I have been working on the MySQL code base for like 20 years, so I am pretty familiar with it. But let's start the talk.

So, what statistics does the optimizer need? When the optimizer optimizes your query, it needs to know something about the data, and traditionally this was done by the optimizer asking the engine different questions, like, I don't know, how many rows does this table have, or how much would it cost to do a full table scan, and the cost is in some arbitrary cost units, not in microseconds or CPU cycles or anything. Or, for example, how much would it cost to read a thousand values from index number three, and then the optimizer would compare this cost with the cost of the full table scan and decide which is cheaper. Or how many distinct values are in the index: if the index has, say, 10 million rows and there are 1 million distinct values, then the optimizer would assume that every value is present about 10 times in the index. Or the optimizer could ask how many index values lie in the range between, say, 10,000 and 11,000, then it would ask for the cost to read them, then it would compare that with the full table scan, and so on.

That works reasonably well, but there are some issues with it, and one of the problems is instability. Not all engines have these values ready all the time. They calculate them on the fly, and InnoDB, for example, is doing it using random index dives. Because they are random, they supposedly don't have any bias; on the other hand, because they are random, they do not give repeatable results. These are tests I've done on the DBT-3 benchmark. For DBT-3 query Q8 there are eight tables involved, and just by repeating it many times I have seen four different execution plans, with the query execution time ranging from seven minutes to 1.2 hours. And for Q7 I've seen seven different plans; the fastest one was executing in 12 minutes, and with the longest one I just waited a few hours and killed the query because I got impatient. So it's not what you would prefer to have.

Some might ask, what about InnoDB persistent statistics? It does solve the instability problem, in a way: InnoDB is still doing those random dives, but then it remembers the result in an internal table, and next time it'll just show you the same result. So there's no instability, you always get the same result and the same execution plan. But because it still started from random dives, if you're lucky you'll always get the same execution plan and the query will always execute in 12 minutes. If you're unlucky, you'll always get the same plan and it'll always run for many hours and time out. And you don't even know how many plans there are, so you might just think that the query is slow if it always times out after many hours, but it may be just that you got bad random dives.

Another problem is that all engines report statistics in arbitrary cost units, but they are different cost units for different engines, and engines lie about what data are in the table. Here is an example: I have a table with three integer columns, and I populated every column with random numbers between zero and one hundred thousand, repeated four hundred thousand times.
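The setup could be reproduced roughly like this; the table and column names here are just illustrative, not necessarily what was on my slide, and seq_1_to_400000 assumes the MariaDB Sequence engine is available:

    -- three integer columns, values roughly between 0 and 100,000,
    -- 400,000 rows, one column indexed so the engine can estimate cardinality
    CREATE TABLE t1 (a INT, b INT, c INT, INDEX(a)) ENGINE=MyISAM;

    INSERT INTO t1
    SELECT FLOOR(RAND() * 100000),
           FLOOR(RAND() * 100000),
           FLOOR(RAND() * 100000)
    FROM seq_1_to_400000;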
And now I do SELECT COUNT(DISTINCT ...), which is an honest, exact SQL calculation to get the number of distinct values. What number do you think I'll get here? Anybody want to guess? Who's not looking at their own laptop? OK, there were four hundred thousand rows and the numbers are between zero and one hundred thousand, so there should be about one hundred thousand distinct values, and this is exactly what I've got. But this query takes a lot of time to calculate, so usually the optimizer relies on some imprecise estimations done by the engine. The engine is MyISAM here, and I asked the engine for its estimation of how many distinct values are in the index. MyISAM estimated it; it's not the exact number, but it's within one percent, which is good enough from the optimizer's planning point of view.

Now I change the table to InnoDB, and, well, it's not worth your time trying to guess: InnoDB reports two hundred thousand distinct values, and if you remember that the table contains integers between zero and one hundred thousand, there cannot possibly be two hundred thousand distinct values in there. But what does it mean for the optimizer? The optimizer knows that there are four hundred thousand rows in the table and one hundred thousand distinct values. It means the optimizer would expect every single value to be present in the table four times; if you look for a specific value, the optimizer would expect to find four rows. But InnoDB lies that there are twice as many distinct values as there actually are, so the optimizer will think there are only two rows to be found. So the optimizer will be more likely to prefer an index in the InnoDB case, because InnoDB lies about statistics.

And why is InnoDB doing that? Well, like 15 to 20 years ago, the original InnoDB author, back then Heikki Tuuri, noticed that for some query InnoDB didn't use an index, and he would have liked MySQL to use the index. So he just added a times-two to make the index look twice as attractive to the optimizer, and he was happy with that. And since then, because many years have passed, the optimizer kind of adapted, and the assumption that InnoDB is doing that is probably embedded in many places. As far as I know, they've removed this times-two in MySQL 8.

The other problem, of course, is that if you need all these statistics, you need an index, and you need to pay the price for an index. An index takes space, it takes time to be maintained, and often, if you just need the statistics, an index is too high a price to pay, it's too expensive.

Now for the solution. From here on I'll be talking about MariaDB for a while, and later I'll describe the differences with MySQL and newer MariaDB versions. The solution that was implemented in MariaDB is called Engine-Independent Table Statistics, and it provides stable, precise, detailed and comparable statistics. It's stable because the statistics are calculated once and then stored in persistent tables; they are not recalculated every time using random dives. It's precise because MariaDB analyzes all the data, not just a very, very small percentage of it, so it gets a very good picture of what the data is, and the statistics are actually close to the actual data. It's detailed because these engine-independent statistics remember a lot more information than just cardinality, than just one number, so they allow the optimizer to know more about the data and create better execution plans. And it's calculated by the server, not by the engine, so it's calculated identically for all engines.
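Going back to the distinct-values example for a moment, the comparison can be reproduced roughly like this, again with the illustrative table from above; the Cardinality column of SHOW INDEX is the engine's estimate that the optimizer normally relies on:

    -- exact, but slow: scans all 400,000 rows
    SELECT COUNT(DISTINCT a) FROM t1;      -- about 100,000 on this data

    -- cheap estimate coming from the engine (MyISAM here)
    SHOW INDEX FROM t1;

    -- switch the engine and look at the estimate again
    ALTER TABLE t1 ENGINE=InnoDB;
    ANALYZE TABLE t1;
    SHOW INDEX FROM t1;                    -- older InnoDB reported roughly twice the real number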
So, because the statistics are calculated by the server, cost estimations are identical and comparable: the optimizer can calculate the cost for one table in one engine and for another table in a different engine, then compare them and decide in what order it is better to execute the query, for example a join.

Now, I'm not going to recite the manual, so I'm just giving you a few pointers, a few keywords. If you're interested, you can either download my talk later or just take a picture of the slide, and these are the words you can later look up in the manual. Very briefly: it is implemented by having new tables in the mysql schema where all the statistics are stored. There's a new system variable called use_stat_tables that tells the server when to use these engine-independent statistics; it can be set to never, complementary, or preferably. There's optimizer_use_condition_selectivity, which I'll be talking about for the rest of my talk, which tells the optimizer exactly which parts of the engine-independent statistics it is allowed to use. And there are optional clauses in ANALYZE TABLE where you can specify for what parts of the table the statistics should be calculated: for the whole table, or only for some indexes, or only for some columns.

But enough of the theory, let's get to the examples. For the examples I used the employees database, which can be downloaded from Giuseppe's GitHub; Giuseppe was the guy who was giving the talk just before. It's not a very big data set, but it's not exactly small either: it's 167 megabytes of data, the employees table has 300,000 records, and the salaries table has about 2.8 million records. So let's load it: you download it from Giuseppe's GitHub, then you do source employees.sql, which creates all the tables and performs all the LOAD DATA statements to load the data. Then I enable engine-independent table statistics by setting use_stat_tables to preferably, and I analyze all the tables to collect the statistics. There's no need to specify any optional clauses for ANALYZE here, because if statistics are enabled with preferably, then ANALYZE will automatically collect all the engine-independent statistics. So you don't need to modify your application and rewrite all the ANALYZE TABLE statements; you just enable engine-independent statistics once, and then it automatically works for your whole application. And you can see that it's not a very big data set, because this took less than 20 seconds.

So how does this help to optimize your queries? First, I set optimizer_use_condition_selectivity to 1, which tells the optimizer not to use any engine-independent statistics; basically it only uses the data that was available before engine-independent statistics were introduced. And I want to find all the managers from all departments in this employees database, so I join departments with employees and with titles, where the title equals 'Manager'. For the purpose of the example, it's important to remember that the title column is not indexed. MariaDB does something with the data and returns the result set in 15 seconds. Now I set optimizer_use_condition_selectivity to 3, which allows the optimizer to use per-column statistics, statistics for non-indexed columns, and the same query executes in under a second. That's a 16-times improvement. It's not 16%, it's not 1.6 times, it's 16 times, and that's pretty remarkable to me.
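For reference, the whole sequence looks roughly like this; the join below is only an approximation of the query on my slide (the employees schema has departments, dept_emp, employees and titles), and the exact column list may have differed. The PERSISTENT FOR clause is shown only as a comment, since it is not needed when use_stat_tables is set to preferably:

    SET use_stat_tables = 'preferably';

    ANALYZE TABLE departments, dept_emp, dept_manager, employees, salaries, titles;
    -- restricted collection would look like:
    -- ANALYZE TABLE titles PERSISTENT FOR COLUMNS (title) INDEXES ();

    SET optimizer_use_condition_selectivity = 1;   -- no engine-independent statistics
    SELECT d.dept_name, e.first_name, e.last_name
    FROM departments d
    JOIN dept_emp de ON de.dept_no = d.dept_no
    JOIN employees e ON e.emp_no = de.emp_no
    JOIN titles t    ON t.emp_no = e.emp_no
    WHERE t.title = 'Manager';                     -- title is not usable through an index here

    SET optimizer_use_condition_selectivity = 3;   -- use per-column statistics
    -- same SELECT again: about 16 times faster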
And let's see why this happens. This is the result of EXPLAIN EXTENDED for the first query, when the optimizer cannot use per-column statistics. It is EXPLAIN EXTENDED, so there's a filtered column, which tells what percentage of the table the optimizer thinks will pass the WHERE clause. And because there's no data available for non-indexed columns, the optimizer doesn't know, so it just says 100%, everything might pass the WHERE clause. But if we allow the optimizer to use per-column statistics, statistics for non-indexed columns, then it knows the cardinality and the minimal and maximal values of non-indexed columns, and it can figure out that not all employees are in fact managers: only about 14% of the titles will match the WHERE clause. In this case the optimizer decides to join the tables in a different order and starts from a full table scan of the titles table, and this gives the 16-times improvement.

Now, before talking about histograms, I want to explain what an equi-height histogram is. This is not an equi-height histogram; it's the usual histogram that you see everywhere, an equi-width histogram. The red line just shows some imaginary data distribution, and this is the histogram for the data distribution under the red line. I've drawn it with five bars, five buckets. They all have the same width, that's why it's called equi-width, and the height of each bar is proportional to the area under the curve within that bucket. And this is the equi-height histogram, where all the bars have the same height but different widths. There are also five of them, but in this case the area under the curve within each bucket is the same. And why would we want to do this strange thing instead of normal histograms? Because all we store is those five numbers, well, a few more than five, but still a fixed number of values to describe a histogram, and later the optimizer needs to use this data to kind of restore the original data distribution, to understand the original data distribution. If the optimizer tried to restore the original distribution based on the equi-width histogram, this is what it would get: you can see the huge error in the most interesting part of the curve, where the values are high. And if you try to restore the curve from the equi-height histogram, the interesting part of the curve is described much more precisely, the error is much smaller. Basically, an equi-height histogram allows you to describe the data distribution much more precisely with a small number of values. This is what MariaDB is using for histograms.

Now let's see how histograms help to optimize queries. For this we need an example of some data with a non-uniform distribution. Let's look at the salary values from the employees database and calculate the mid-point, the salary that's exactly in the middle between the minimal and the maximal salary. That's the first query; it's about 100,000. And now let's select all the employees and their titles from the salaries table where the salary was above the mid-point, above the 100,000. Again, after churning through the data for some time, MariaDB returns the result in 18 or 19 seconds. But if we enable the use of histograms by setting optimizer_use_condition_selectivity to 4, we get the same result three times faster.
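The two statements look roughly like this; this is a sketch against the employees schema, and the exact column list on the slide may have been different:

    -- mid-point between the minimal and maximal salary, about 100,000 here
    SELECT (MIN(salary) + MAX(salary)) / 2 FROM salaries;

    -- employees and their titles with a salary above the mid-point
    SET optimizer_use_condition_selectivity = 3;   -- column statistics, no histograms yet
    SELECT e.first_name, e.last_name, t.title, s.salary
    FROM salaries s
    JOIN employees e ON e.emp_no = s.emp_no
    JOIN titles t    ON t.emp_no = e.emp_no
    WHERE s.salary > 100000;                       -- about 18 to 19 seconds

    SET optimizer_use_condition_selectivity = 4;   -- additionally use histograms
    -- same SELECT again: about three times faster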
Again, why is this happening? If you don't use histograms, then all the data the optimizer knows about the salary column is its minimal value and its maximal value. And because we are searching for everything above the mid-point, the optimizer can figure out: OK, this is the mid-point, so about half of the data should be above it and about half below it. That's what the optimizer thinks, and indeed, you can see the filtered column is about 50%: the optimizer thinks that half of the rows will pass the WHERE clause. But half of the rows would only pass the WHERE clause if the salaries were distributed uniformly, and in this employees database, just like in the real world, salaries are distributed very non-uniformly. If we enable histograms, the optimizer can see that in fact less than 5% of the salaries are above the mid-point. In this case the optimizer again joins the tables in a different order, and this gives the three-times speedup.

The next interesting feature that was implemented is sampling. This is a query which could realistically be used by, I don't know, an HR department analyzing whether salaries are distributed fairly, or whether some people are getting lower salaries based on, I don't know, sex or origin or whatever. This query takes the average salary per department of the people whose last name ends with 'off', which supposedly could mean people of Slavic origin. As you know, for this kind of LIKE query you cannot use an index, you cannot get any reasonable histogram for it, you cannot get any reasonable statistics for it, so the optimizer doesn't know anything about this. And it finds five rows, because it groups by department, in five seconds. If we enable sampling, we get the same result 12 times faster. So how does sampling work? Because there's no other way for the optimizer to get any reasonable information about the data, it reads a few rows from the table and applies this LIKE condition to them, and then it extrapolates the result to the rest of the table. And indeed, without sampling you can see that the filtered column is 100%, because there are no histograms, no indexes, nothing that could help the optimizer know anything about the data. With sampling, the optimizer can see that in fact very few rows match the WHERE clause, and it joins the tables in a different order; this is the 12-times speedup.

Everything that I was talking about so far was implemented in MariaDB 10.0, which went GA in 2014, so it's quite stable. Well, in fact, 10.0 has already reached its end of life, so don't use that one, use at least 10.1. In MariaDB 10.0, engine-independent statistics remember the number of rows per table and the cardinality per index, and for every column they remember the number of distinct values, the minimal and maximal values, the amount of NULLs, the average value length for strings, and this equi-height histogram. And sampling also works.

Last year MySQL 8.0 went GA, and they implemented a conceptually similar feature; it's called differently, it's called optimizer statistics in MySQL. They remember the amount of NULLs per column, but they have two different kinds of histograms: equi-height histograms and singleton histograms. And singleton histograms work much better than equi-height ones if you have data with very few distinct values, like a sex column or a column with yes/no values; that's the case where singleton histograms work and equi-height ones basically don't.
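Going back to the sampling example, the query was roughly of this shape; again a sketch, the exact text on the slide may have differed, and I am assuming here that "enabling sampling" corresponds to the highest level of optimizer_use_condition_selectivity:

    SET optimizer_use_condition_selectivity = 5;   -- additionally estimate selectivity from row samples

    -- average salary per department for people whose last name ends in 'off';
    -- a LIKE '%off' predicate cannot use an index or a histogram
    SELECT d.dept_name, AVG(s.salary)
    FROM departments d
    JOIN dept_emp de ON de.dept_no = d.dept_no
    JOIN employees e ON e.emp_no = de.emp_no
    JOIN salaries s  ON s.emp_no = e.emp_no
    WHERE e.last_name LIKE '%off'
    GROUP BY d.dept_name;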
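For comparison, in MySQL 8.0 histograms are collected explicitly, per column, and the server picks a singleton or an equi-height histogram automatically depending on how many distinct values the column has; the bucket count below is just an example:

    -- MySQL 8.0 syntax
    ANALYZE TABLE salaries UPDATE HISTOGRAM ON salary WITH 100 BUCKETS;
    SELECT * FROM information_schema.COLUMN_STATISTICS WHERE TABLE_NAME = 'salaries';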
Also, if you try to collect a histogram for a really, really huge table, then MySQL starts skipping values to keep all the data in memory, and that gives a significant speedup compared to how MariaDB 10.0 collects histograms on huge tables. And in a few months MariaDB 10.4 will become GA, where these engine-independent statistics are enabled by default, which is basically the main reason why I'm giving this talk. So when you start upgrading, you should know they're enabled by default; you don't need to do anything anymore, everything will just automatically and magically be faster. MariaDB 10.4 also collects histograms by sampling, by skipping values, so that ANALYZE TABLE will be fast. In MariaDB before 10.4, ANALYZE for engine-independent statistics was slow, and this was the reason why it wasn't enabled by default. That feature was mostly implemented as a Google Summer of Code project last year. That's all, thank you. Any questions? Yes?

Yeah, yeah, I'm talking about this one. So the question is about how I compared the results and execution times for the same query. It was the same system, and I would just repeat it many times. I repeated EXPLAIN many times, I repeated just the SELECT many times, in different order, and I was counting how many different EXPLAIN outputs I would see. Caching does not affect the EXPLAIN output. And execution time as well: I think the cache was warmed up by then, so that shouldn't be a factor. I was also collecting the query results, and the results were stable.

I'm sorry, I didn't get it. So the question is whether there are any plans to collect statistics on the fly. We are thinking about that, but it will not be in MariaDB 10.4.