The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. Hi guys. Thank you for coming. It's another quarantine database session. Today we're happy to have Norvald Ryeng, who is a software engineer at Oracle on the MySQL team. He's the lead of the MySQL optimizer team, which covers query optimization and query execution, as we'll talk about here. He has a PhD from the Norwegian University of Science and Technology, where he's also teaching faculty, teaching database architecture and design. So we're super happy to have Norvald here today. It is 11 p.m. where he is in Norway, so we thank him for staying up late. And as always, the way we do this is: if you have any questions, please unmute your mic, say who you are and where you're coming from, and ask your question. Feel free to do this anytime; we want this to be a conversation. And as always, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring this event. So with that, Norvald, the floor is yours. Go for it. Thank you. Thanks, Andy. So when I was asked to present here, Andy said, can you do something on the MySQL query optimizer? And I said it might be more interesting to talk about the whole query processing pipeline. I'll get back to details about that. First, I'm going to talk about things that are in MySQL today, things we've done before, but also a few things about what we're thinking about for the future. And if you're from the future watching this, then it might not actually have turned out that way, just a disclaimer. The U.S. election is tomorrow, so there might not be a future. Throw that out. Yeah, yeah. Good luck with that one. So I'll briefly go through the history of MySQL, just to give you an understanding of why MySQL is the way it is. Then I'll give you an overview of how query processing in MySQL works, and then I'll talk about refactoring. We've done quite a lot of refactoring of the whole query processing pipeline in MySQL over the last few years, and there I'll go into more detail about each stage of processing. And then, since Andy asked, I'll give you an overview of the query optimizer. I'll take that at the end and talk about how it works and how it doesn't work. So let's start with history. MySQL started with its first release in 1995, or rather an internal release; version one was not released to the public as far as I know, which makes MySQL 25 years old today. It became part of the LAMP stack, with Linux, Apache, and PHP, and this really became a popular thing in the 90s and early 2000s. A lot of simple websites started up using MySQL. They grew more and more advanced with time, but they started with very simple OLTP workloads. Basically single-point selects: a single row based on primary key, that kind of thing. Maybe a few joins if you were advanced. That's where we started, and MySQL and PHP have a lot in common. For instance, they have the same view on type safety and that kind of thing, but I'm not talking about that today; they all evolved together. It started with very simple things and it grew into something more complex. So the timeline, very briefly: version 3.21 was the first one that was called production quality and released to the public. In 2003, they got UNION, but not INTERSECT or EXCEPT.
So what's happening here is that it's a small company producing an open-source product and basically jumping on whatever the largest customer is asking for. There's a talk from LinuxConf Australia in 2014 with Stewart Smith from IBM, where he talks about why did we do this and why did we do that with MySQL, and basically it all boils down to: this customer paid for this, and that customer paid for that. So it's not until 2008, with Sun Microsystems, that you get a big corporation backing MySQL. And Sun Microsystems wasn't afraid to spend money, so that gave MySQL a bit more time to step back and look at how does this work, how do we want it to work, not just jumping on the next big feature that a customer asked for. And then in 2009 or 2010, depending on whether you look at the start of the acquisition or the approval by the EU, Oracle acquires Sun. Sun Microsystems had spent time on adding storage engines to MySQL, a transactional storage engine, because InnoDB had already been acquired by Oracle, so MySQL didn't have a transactional storage engine that wasn't owned by Oracle. But now, in 2010, Oracle owned both, and there was no reason to do anything else, basically. So with version 5.5, MySQL had InnoDB as the default storage engine. Then I joined in 2011, so the first version I worked on was 5.6. All the major projects were already assigned to other people, so I did bug fixing for a year, which brought me around the whole code base to look at various problems. There were bugs, right, so I saw all the things that were wrong with the source code. Then 5.7. Version 8 is a big change in that the catalog of MySQL, called the data dictionary, is moved to InnoDB. Before that it was based on various files external to the storage engines. That changes a lot of stuff: you actually have a transactional engine holding the catalog. You also see that CTEs, window functions and so on, so more advanced queries, more analytical capabilities, are being added. In 2019, we're still adding features after the release of 8.0. There's a change in policy here; the users actually asked us to change policy and start adding features in point releases. Before this, we added features in 5.5, 5.6, 5.7, but we tried not to add them in the point releases of each release. But now we are adding features as we go. One interesting thing to note here is that you see in 2019 it says hash join. That's actually true: MySQL didn't have a hash join until 2019. I'll get back to that later on. MySQL basically became very popular very fast, and the focus was on adding features, not on keeping the design, the architecture, up to date. I would say, based on what I know, that it outgrew the design. They had to add new features, and the design was not kept up to date. Meanwhile the users grew from small websites to large websites, and they had more data, so it's more important that the optimizer picks a good plan. And lately they also want to do analytics: more complex queries, more joins, many more tables, window functions, that kind of thing, making the job harder for the optimizer. There's more optimization to do, it's harder to find a good plan, and at the same time it's more important. There's a growth in complexity here, and it's more important that the optimizer does a good job. If you look at how MySQL does query processing, it's the classical model. In MySQL we have this step called prepare, which has different names in different systems, and then optimize, and then execute.
So prepare basically does name binding and query transformations, like rewriting IN subqueries to semi-joins, that kind of thing. One thing that's different in MySQL from some other systems is the pluggable storage engine architecture. We use it internally for mainly two storage engines: InnoDB, which you get with MySQL Server, and NDB, which is the backing for MySQL Cluster, a distributed row store that's used mostly for telecom. I think it's also behind some of the big online games. They have different properties, so the optimizer has to adapt to the different qualities of these engines. And recently we also got a third internal engine to care about, and that one is a bit special. Rapid is an analytical engine, a distributed in-memory column store, and it actually has its own execution engine, so both optimization and execution happen in there. So that affects some of our decisions going forward. This will be released as a cloud service only; it's currently in beta testing with some customers. But it affects quite a lot of our choices going forward, having this second execution engine. The parser is written in Bison, and the executor is a typical Volcano-style iterator executor. What's a bit special about MySQL is that the design, as I said, started out simple, and it's very much designed to do nested-loop inner joins. That's the idea behind the whole thing, and you can see it in all the data structures from the parser to the executor. That was the intent. And everything else has been added on as you go, but with no big refactorings. That's the reason for the work we've been doing lately. So as I said, it probably started out pretty straightforward, and then it gradually diverged. And there was an idea behind this; these weren't random changes. For instance, doing name resolving during execution, that actually happened. The idea was: hey, we don't need to optimize this until we actually need to run it. So we don't need to spend time optimizing a subquery until we actually use it, because maybe we don't have to evaluate it. Maybe the query executes in a way where it's not necessary, and then maybe you don't have to resolve it either. And that's fine as long as that's true, but what we saw was that we started trying to execute things that weren't resolved. So you're trying to execute and refer to a column that hasn't been bound to anything. That's of course a bug, but we had many of these bugs. And it was clear that we had to clean this up and get back to a cleaner stage separation. So we started refactoring. And no user ever asks for refactoring. They like the effects, but they don't ask for it, and they don't ask for anything that doesn't have a visible effect or benefit for them. But they do come to us saying that it has fixed a lot of bugs. Stability in MySQL has increased a lot. And whether they know it or not, the features they've been waiting for depend on refactoring. So I'll show some features later on and tie them to the different refactorings. And the new thing with 8.0 is that we are delivering features in point releases, which means that we're also delivering refactoring in point releases. So we have to try not to break things. We're trying really hard, but it's not always that easy. So the main thing is to separate refactoring from removal.
So even if we refactor and try to fix a problem, we may have to put some extra code in there to maintain some almost-buggy behavior, just to have backwards compatibility. Then in a future version we can actually remove that. So, maybe this is jumping ahead: what does the MySQL testing pipeline look like? If you're doing major refactoring, then your unit tests are all broken, so you have to rely on SQL-level stuff. Presumably you have all these logical tests that check the correctness of queries? We have lots of them, yeah. Of course, yeah. And then you had existing C++ unit tests. As part of the refactoring, are you maintaining them and updating them? Or do you just say screw it for a while and assume the SQL-based tests can test whatever the C++ tests were testing? That makes sense. Yeah. So from the old days we mostly have SQL-based tests, actually, not many unit tests. So we are writing unit tests for new stuff, but there are often not many existing unit tests to change. That's kind of the sad truth, but also the convenient truth: there are no unit tests to fix, but there are a lot of SQL tests to make sure things still run. And those test suites are pretty extensive. Not always systematic, but there are a lot of regression cases added over the years, and they really pick up a lot. And we do a lot of fuzzing of queries and that kind of thing. Do you use SQLsmith or do you have something specific? Oh, we have our own. Got it. Is that open source? Let me see, does it have a name? RQG is one; I think that one is open source. We have a new one called Gator that is in the process of being open sourced, if it's not open source already. That's the idea at least, to have this open source. I'm not sure of the state of everything here. I know they were working hard to open source Gator; I hope it's open now. Awesome, thank you. So we have lots of SQL tests. The whole test suite takes ages to run, so we try to do something on every push and something on daily or weekly schedules. But sometimes we do break things. We just try very hard not to. So the first obvious thing to do was to separate the stages. We started 10 years ago; we finished a few years ago. It was not a concentrated effort. It was something we did in the background, something we did when we had the opportunity, at least to begin with. Then, as we got closer to the goal, we started setting stricter schedules and actually planning for getting a full separation. This is an idealized figure; it's not necessarily the entire truth. We still execute a bit during optimization, for instance, but it's under control. We know what we're doing there. So the separation is pretty clean now. And this was really the main thing that improved stability. As I said, these unresolved things that you're trying to execute, that was a huge class of bugs. And it also helps feature development. We have fewer surprises. If you do a query transformation, for instance to unnest a subquery into something else, then you have fewer surprises when everything is properly resolved first. That was the first thing we did. Then we started looking at each stage. The first one was the parse stage. We're using Bison, which is normally supposed to do bottom-up parsing, but we had so many semantic actions doing weird things that it wasn't really bottom-up. So we fixed that for DML, for SELECT, INSERT, UPDATE, DELETE, but DDL statements are still a bit messy there. So the idea here is to get a parser library that we can reuse in other products.
For instance, we have MySQL Router, which is a proxy between your client and your servers. It can do HA or load balancing or sharding, that kind of thing. And sometimes you need to understand queries, and then the best thing would be to have the exact same parsing in both products. So that's a goal that's within reach now, I think. I don't know, next year or so we should probably be able to do that. And lately we've done a lot to the prepare stage as well. So MySQL, until recently, didn't really do prepared statements as well as it should. Basically, the prepared statement was re-resolved: name and type resolution was done every time you executed it. Now we have reached a state where we only do optimization and execution every time you execute a prepared statement (there's a small example after this paragraph). That affected type resolution quite a lot. Do you know whether... I think Postgres caches things. Do you know whether the commercial guys are doing aggressive caching? Caching of plans, you mean? Yeah, for prepared statements, because it sort of defeats the purpose of a prepared statement if you're going to plan it every single time. Yeah, so our goal would be, of course, to cache plans as well. We're not there yet; we are now at least at the step before. So I'm not sure if Postgres does that. I know Oracle and SQL Server, I think, do it for any type of statement, actually. They can just recognize patterns instead of only looking at what is actually prepared. Postgres, I think Matt can tell you this because he implemented it in our protocol: I think if you execute the same query like five times, then it triggers it, like it caches the plan. Yeah. So the challenge there is that the different parameters you give to that execution may change what the optimal plan is. Yeah, I understand. I mean, if you know it's a primary key lookup, the selectivity is always the same, right? Yeah, if it's a primary key lookup, then it's safe. But if it's more complex, then it might be slightly different. I'm not saying you always want to cache it, but again, I think SQL Server can be a bit intelligent about what they're caching and then generate multiple plans. But it surprises me that you're basically saying you're going to rerun the optimizer every single time on the invocation of a prepared statement. Yeah, we do. Our goal in the future, of course, would be to not do that, but we haven't solved that. We're just at the step where we are caching the logical plan before optimization. Ah, okay. But we're there at least. That's a good thing. Okay, cool. Awesome, thank you. Yeah. So yeah, we cleaned up how we do transformations so they're all done the same way. We do one dive down the AST. On the way down we do name resolving, and on the way up we do type resolving and do transformations as we go, and also constant propagation and constant folding on the way up. Without this, and especially without the stage separation, we would never have been able to deliver CTEs and window functions and LATERAL, and the first table function in the SQL standard, JSON_TABLE, is also made possible because of this. I think it's not impossible to implement this on top of the old stuff we had, but we wouldn't have been able to do it by now, I think. It was just too complex a structure to deal with. So it's been simplified a lot for us, and we get faster feature development. We really notice that. The next step we looked at was execution. People would say we had an iterator model.
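As a quick aside on the prepared-statement behavior described above, here is a minimal sketch, assuming a hypothetical users table. In current MySQL 8.0, the parse and prepare work (name and type resolution) is kept across executions, while optimization and execution are redone for each EXECUTE:

```sql
-- Hypothetical table; prepare once, then execute many times.
PREPARE get_user FROM 'SELECT name FROM users WHERE id = ?';

SET @id = 42;
EXECUTE get_user USING @id;   -- optimized and executed
SET @id = 43;
EXECUTE get_user USING @id;   -- re-optimized, but not re-resolved

DEALLOCATE PREPARE get_user;
```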
The engineer who did this would say that we had seven. We had seven different interfaces, some push-based, some pull-based, inside the executor. That was interesting. And it was very much nested-loop based. It was nested loops through and through; it didn't know how to do anything else. Do you think that, sort of, I remember looking at the MySQL code when we were trying to build a storage engine, we looked at it in 2008, and it looked like, because the MySQL layer, the top part, couldn't assume what the storage engine was, it had sort of this hodgepodge API. Do you think that was a remnant of that? Like it was MyISAM and then InnoDB came along and therefore things were sort of different? Or was it just written assuming MyISAM from the very beginning, and sometimes it was pushed and sometimes it was pulled? I don't think that assumption really affected it. I think they just locked themselves into nested loops early and wrote things explicitly for nested loops. I think that was the mistake they made. So over the last two years we have refactored the whole thing. The code is reused, but it's wrapped in a single iterator interface. So we can now put things together the way we want. And as it says in the last bullet point here, nested-loop join is suddenly not the core of the algorithm, it's just an iterator, which means we can add hash join. So I like to say that the old one was like a jigsaw puzzle, where it's modular but you can only fit it together in a very few ways. Now we have Lego. The immediate effect of this was that we got rid of quite a few temporary tables, because temporary tables were the go-to solution for connecting things that didn't fit together, since almost everything could read from or store into a temporary table. So if the iterators didn't fit together, a temporary table was inserted as a buffer. That's not good for execution times, but if you're in a rush and have to deliver a feature, of course that's the solution you go for. So this is now all modular. And it's necessary, because you can't really start looking at the optimizer until you're able to express the plans and the execution you need to do. So here are three features that came directly out of this refactoring. MySQL EXPLAIN has different formats. You can do JSON if you want; MySQL Workbench uses the JSON format to do visual explain, so you get a graphical view of your query. The traditional one is to print a table, and that table is also very nested-loop-oriented. It shines through there as well. So in order to really express the new execution trees we had, we added a new explain format. It's just a very simple walk of the tree, as I can show you here: you have an iterator for a join, for a table scan, and each one just returns a line in the explain. This is very much like Postgres, so it should be familiar to Postgres users. Actually, the old explain code interpreted the plan separately from the execution engine. Surprisingly, I can only remember, I think, one bug where the explain output said something other than what we executed. That's surprising; I would have expected much more, but we were able to keep it in sync. But now, with this implementation, we are sure that what we are printing is what we are executing. And that also makes it simple to add EXPLAIN ANALYZE. I remember asking the guy who did this, I think I asked him before lunch one day, how much work would it be to add EXPLAIN ANALYZE now? He said, not much.
A few hours later, he came back with 100 lines of code and said, this is a prototype, it works. It was that simple. Of course, now it's a bit more code, but it's basically the same thing. We wrap each iterator in a timing iterator; we intercept all init and read calls, store some timestamps and counters, and then you get EXPLAIN ANALYZE. This is a really good outcome of just a simple refactoring. And then hash join, of course, was also added there. So, let's look at the optimizer. That's what you want to hear about. This is Hellerstein, Stonebraker, and Hamilton in 2007, on the query optimizer of MySQL. The full quote is more like this: that it was entirely heuristic and relied mostly on exploiting indexes and key/foreign-key constraints. I don't know when they looked at the code base. I did a search a few days ago, and I saw at least traces of cost-based computation in there from 2000. Maybe it wasn't released in 2000, maybe it was released in 2003 or something, but it's been cost-based for many years, or parts of it have been cost-based for many years. Maybe they did read the code, to be honest. But I admit it's still, today, relying too much on heuristics and rule-based stuff; there are not enough cost-based decisions. Maybe you'll get into this: is MySQL collecting the statistics that would allow you to do a cost-based search better? We do have statistics, but not enough. That's my conclusion; I'll get back to that. Actually, on this slide: for tables, we have the cardinality and the number of pages on disk. For indexes, we have the number of distinct values, the number of null values, and the number of pages. If you have a multi-part index with several columns, you have an NDV for every prefix of columns: if it's (A, B, C), then you have NDV for A, NDV for (A, B), and so on. But if you don't have indexes, you don't have any statistics for any column. And if you compare that to, for instance, Postgres, you see there's a lot of stuff lacking. We do have histograms, though, single-column histograms only, but they have to be manually added and manually updated; there are no automatic updates (there are small examples of EXPLAIN ANALYZE and histograms after this paragraph). We did have students last year who looked at automatic updates, so that might be coming. And histograms can be singleton histograms, if you have enough buckets, or equi-height histograms. But you have to update them manually. And if there is an index on a column, we will ignore the histogram, because since histograms are manually updated, index statistics are assumed to be much more up to date; index statistics are updated automatically. And of course, like everyone else, we have these fallback assumptions for when you have no information. But in MySQL, since we don't have column statistics, these are probably used more than in many other systems. And they are very crude; they don't really reflect reality, they're just made-up numbers. And the default values, are those exposed through my.cnf, or can I tune them? No, you can't configure these. They are hardcoded in the server. But even if you could configure them, you could never find one number that's good for all columns, right? Oh, yeah, yeah. It's just a way to make the cost formulas work in the end; you need a number. But yeah, whenever you hit these, you have to be lucky for them to fit your query. So let's look at what's actually happening in the optimization. Let's start with the transformations. These are in the prepare stage of MySQL, so they're not part of the optimizer, really, but they really affect the optimizer.
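To make two of the things above concrete, here is roughly what the tree explain format, EXPLAIN ANALYZE, and the manually maintained histograms look like in SQL. The table and column names are made up for illustration; the statements themselves are standard MySQL 8.0 syntax:

```sql
-- Hypothetical tables t1 and t2.
EXPLAIN FORMAT=TREE
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;

-- The same plan, but actually executed, with per-iterator timings and row counts.
EXPLAIN ANALYZE
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;

-- Histograms are built and refreshed manually, one column at a time.
ANALYZE TABLE t1 UPDATE HISTOGRAM ON b WITH 64 BUCKETS;
ANALYZE TABLE t1 DROP HISTOGRAM ON b;
```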
So we do static transformations that are not cost-based. They're always done. There are switches if you want to turn them on and off, but if they're enabled, they will always be done when possible. This is things like view expansion, merging derived tables and views into the outer query, IN-to-EXISTS, IN-to-semi-join, that kind of stuff, the standard ones. We also have some transformations that are made specifically to make the analytics engine, the Rapid engine, work. It can't execute all types of subqueries, so we rewrite queries into something that is not necessarily more optimized, but something that it can execute. Those are only enabled for that engine, so they're very specific transformations. These things are kept between executions of a prepared statement, unlike the actual optimizer step, which is currently repeated every time. So, the transformations we do for derived tables, views, and CTEs: we treat them the same. They are either materialized or merged into the outer query. Recursive CTEs we always materialize; I don't know exactly how to do it otherwise, but there might be techniques we haven't looked at. If we materialize, we are able to push some conditions from the outer query block into the derived table, so we can filter out rows early on. There's more to do there as well, but we can do some of it, at least. If you have subqueries in predicates, we can materialize, we can do semi-join and anti-join, or we have a way of rewriting IN queries to EXISTS. And we have a special execution mode for EXISTS that doesn't really materialize the thing; it just returns true or false depending on whether anything exists. So it's an optimized materialization, in a way. When you decide to do that, do you decide it in the optimizer and basically hint down to the execution, okay, just check true or false, don't actually produce any results? No, this is actually not decided in the optimizer; it's in the prepare stage. This is always done, and it's not a cost-based optimization. Okay, I got it. So this is rule-based. And we do constant folding, as I said, but we also do constant folding at the start of the actual optimize step, because MySQL does execute materialized single-row or zero-row tables during optimization. If it's at most one row, MySQL will materialize it if it's constant, extract the values, put them into predicates, propagate all these constants, and do a separate round of folding. And this seems to filter out quite a lot of joins, actually. We are able to reduce the complexity of the query. So while the transformations we did in the prepare step are designed to add more joins, to unnest some queries and give the join planner more options, this one is actually trying to reduce those options. It's cutting away things that we don't really need to join anymore. An interesting thing here is that someone tried to run the Join Order Benchmark on MySQL the other day to get some numbers, and some of the intermediate results were just optimized away because of this constant evaluation and propagation. But the way that's triggered is basically, you say, okay, you're doing the lookup and it has one or zero rows, then you make that move? No, it's if we can deduce from the query that it's only going to be one row. So it has to be evident from the query that it can't be more than one row.
A primary key lookup, or if you have it, for instance, in the select list, a scalar subquery that can only return one value, for instance. Select one plus one. Yeah. Yeah, okay. So if it's obvious from the query that this can't return more than one row, zero or one row, then we do this. But we don't really check statistics. InnoDB gives us statistics, but it can't tell us exactly how many rows there are. Right, okay. We can't really rely on that. MyISAM did that very predictably, but not InnoDB. How can you, I mean, it's hard to say, because you're not a cloud service, you don't see every query anybody ever runs in MySQL. But with all the customer workloads you have access to, how often does this optimization get fired off, can you say? Like 20% of the time, 95% of the time? I don't have numbers on this, but it's frequently used, yeah. So basically, if someone wanted to, this should be a worthwhile optimization that somebody should add to their system. Well, it does take time to execute it, so I'm not 100% sure. But you would have to execute it anyway at some point, so I'm not really sure. You are executing it early, and you are able to propagate these numbers. But if you didn't do this, you would have more joins in your query, and I'm not sure if it would be that much more expensive. Got it, okay. I can't really guarantee that. No, you misunderstand me. I mean, for select one plus one, that makes sense, but accessing a table, it could be a select from a function, and that function could be, you know, computing a hash and be expensive. This is very cool. Sorry, I don't mean anything bad by that. This is cool. It's an optimization, but it also breaks down the barrier between optimization and execution. So we haven't really decided if we want to keep it or not. It could go away at some point. I guess we'll have to at least do a proper evaluation of its effectiveness. Okay. So let's go to the actual join ordering and optimization. For base table access, we have what, on this slide, is called the range optimizer. It mostly looks at range scans, so we can handle all these different operators, and it produces an index range scan, or several index range scans, several ranges in the same index. Or it can do an index skip scan, which was actually a feature that Facebook contributed to us. If you have an index on A, B, and C, but your predicates are only on B and C, we can skip from the first value of A to the second and do the ranges on B and C (there's a small example after this paragraph). This of course depends on how many distinct values of A there are, but we have those statistics, so we can actually make a cost-based decision on using this one. So I guess at least Facebook finds value in it. It's pretty new, so I haven't really seen enough numbers to know how many queries it helps. I think it's just you guys and Oracle that have it, like the regular Oracle, because I learned skip scans from Oracle. I don't know anybody else that has it. Yeah, well, we have it now. But it seems to work for some use cases at least. Then we have index merge, where you read from two or more indexes, read the primary keys, do the union or intersection, and then read the full table. In InnoDB the primary key is a clustered index, so everything is sorted by primary key, and all secondary indexes refer to the primary key of the base table. So it's not like, for instance, Postgres, where you have a row pointer instead of the primary key.
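As a rough illustration of the skip scan case just described, with a hypothetical table: the composite index is on (a, b), but the predicate only touches b, so the optimizer can consider skipping through the distinct values of a instead of scanning the whole table:

```sql
-- Hypothetical table with a composite index on (a, b).
CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  PRIMARY KEY (a, b)
);

-- No predicate on the leading column a; if a has few distinct values,
-- a skip scan over the index may be chosen (cost-based, using the NDV of a).
EXPLAIN FORMAT=TREE
SELECT a, b FROM t WHERE b BETWEEN 10 AND 20;
```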
And in addition to statistics, the range optimizer can use index dives to get accurate costs. So it actually goes down into the index and asks how many rows have this value. There's a certain configurable limit, I think it's like seven or something by default; it depends on your workload, of course. The thing here is that you get the accurate cost, not only estimates, but it's quite expensive, so that's why it's limited. It helps for, for instance, IN lists and that kind of thing. This is also where we do spatial access methods. So within, overlaps, contains, that kind of thing; it's mapped to a scan of, or a lookup in, the R-tree. What's a bit special about MySQL is that this is a separate module that the planner then uses to get one suggestion. So it can choose between a table scan, an index scan, a primary key lookup, a secondary index lookup, or whatever the range optimizer suggests. For some reason this was not baked into one algorithm; there's one module for ranges, and then you have all the other stuff. So, if I try to be an archaeologist here and look at what Stonebraker and Hamilton saw, maybe they saw this one, and at the time they looked at it, the range optimizer didn't exist. I think that was added to the code base in 2000 or something. Maybe this was heuristic-based then; now it's cost-based. So that might explain that quote. But as I'll say, the weakness here is that the range optimizer only returns one access method, and especially if you were trying to do anything like interesting-order tracking, that would be a bad idea, I think, because you don't get the full picture of all the access methods that are available to you. Another thing we can do is to use a covering index instead of the table. So if you have an index that happens to cover all the columns you need, then we can use that. So that's base table access in MySQL. And then comes the actual join ordering. No, sorry, one more step of things we can do for access: condition pushdown. If you have a condition that is not used in the index lookup, but it can be evaluated on the index before you read the base table, then you can push it down into the index and get it evaluated there, so it filters out rows early on. And if you're using NDB with MySQL Cluster, you can push down conditions so the filtering happens internally on the storage nodes, saving network traffic. Then we get to join ordering. That's where it starts getting interesting. I was shocked when I read this the first time and saw the code. So basically, MySQL from the start did an exhaustive search over all plans. It first generates one full plan: it has a list of tables and it adds one after the other and ends up with the full list. Then it has one complete plan, and then it tries permutations of that plan, basically keeping a prefix and changing the end of the list until it has generated all combinations. This is not a very efficient algorithm. At some point they realized that and added pruning: if a partial plan is worse than the best full plan you've seen so far, then you just stop there, cut off that branch, and start generating a new prefix. But even then it's bad if you have more than six or seven tables or thereabouts. So there's a cutoff where it just gives up the exhaustive search and switches to greedy. Let's say you have eight tables: it will generate all possible prefixes of, let's say, six tables, then it reaches the cutoff, and the last two tables will be added by greedy search. So this is kind of the worst of both worlds.
It has a horrible execution time for the exhaustive search, and then the greedy search just misses out on opportunities. So not a good choice, I would say. But it actually works; that's the surprising thing. It has served MySQL well for 25 years, for some reason. You're not even picking the join algorithm at this point, right? You're just trying to do the ordering? That's all we have: left-deep, nested-loop kind of thing. So the sorting we do up front, that's based on an estimate of how many rows you have to read. Not rows returned, but rows scanned when reading. When sorting that way, we are kind of seeding the exhaustive search; we're steering it towards a pretty good plan to begin with, so it can prune off more stuff. I actually remember when we looked at this and added that sorting, and it helped a lot for us. But still, the base algorithm just screams for dynamic programming or some more reasonable approach. If only there were a paper that did this, right? For instance, in 1971, someone published something about this. Or '79, yes. So we're still trying to catch up with the late 70s in MySQL. But the surprising thing is how well it actually works. I'm surprised that it has stood the test of time so far. It hasn't really crashed yet, but it's very close to collapsing. For instance, if you start doing proper analytic queries with this one, you have far too many tables for the exhaustive search to handle, and then you start doing greedy, and it just goes downhill from there. So, just to finish this discussion: it goes left-deep. It's smart enough to pick an index lookup on the inner table if it can; that's cost-based. We used to have block nested loop; that was removed in 8.0, actually, and replaced with hash join. We looked at it and tested it, and for almost all reasonable use cases it's better to just switch to hash join, so it doesn't do block nested loop anymore. We didn't want to touch the cost computations too much. We could have tweaked them to do a cost computation that's closer to hash join, but we tried to be more conservative and use hash join less, because we were doing this in 8.0, which is already a public release. So basically we do nested-loop join and, almost by accident, hash join. It does not track interesting orders at all; it doesn't try to. So I think the next slide is on challenges. So: it's exhaustive combined with greedy. We know how to do this better; this is not the way. And this idea of the range optimizer on the side, I think that misses out on some good plans. It's not necessarily obvious that it does, but after analyzing it, I believe it does. There's no real choice of join algorithm in there, not a good one. It's built for nested loops, and it doesn't use hash join to the extent it should. And we actually have heuristics to select sorting or index order, and it turns out that sometimes they fail so spectacularly that we recently added a switch to turn off these heuristics and just keep the plan you had, rather than letting the heuristics change your mind. So this sorting thing really doesn't work; we're not able to make a good choice there. Surprisingly enough, it works in most cases, but in some cases it really fails. By the way, you're getting the engineering view here now, where everything is bad. But that's because we are realizing that this is the end of life for this optimizer. We have to do something. We have to do something that can last for a longer time. (A few of the knobs mentioned here, like the exhaustive-search cutoff, the pruning, and the ordering heuristic switch, are shown in a small example after this paragraph.)
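For reference, the exhaustive-plus-greedy search and the pruning described above are exposed as session variables, and the recently added ordering heuristic lives in optimizer_switch. This is a minimal sketch; the flag name prefer_ordering_index is my recollection of the 8.0 switch and may differ in your version:

```sql
-- Roughly: how many tables are considered in the exhaustive part of the
-- search before it turns greedy (0 lets the server pick a value).
SET SESSION optimizer_search_depth = 6;

-- Whether partial plans already worse than the best complete plan are
-- pruned (1, the default) or the search is truly exhaustive (0).
SET SESSION optimizer_prune_level = 1;

-- The switch mentioned above for the sort-vs-ordering-index heuristic
-- (flag name from memory; off means the optimizer keeps the plan it had
-- instead of changing its mind to use an ordering index).
SET SESSION optimizer_switch = 'prefer_ordering_index=off';
```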
So our requirements are: not only left-deep plans, allow for some bushy plans; we have to be able to select different join algorithms; we have to track interesting orders. That's the base level. And then we have this stuff with NDB and Rapid requiring different things, so we have to somehow be modular and extensible enough to handle the different cost models and different operations for these. It's not necessarily straightforward, but it's not that complicated to get it good enough, at least. I think a regular System R-like optimizer should work well for both cases. That's what we've seen so far, at least. But you're going to need better statistics, right? So MySQL is going to have to beef up ANALYZE, right? Yes, the second-to-last point on this slide: we'll probably need more statistics. What comes to mind immediately is histograms for all columns, like Postgres, and maybe also this top-100 most-common-values thing that they do. But we are definitely going to need more statistics, so this will show up at some point. We also know that we can't incrementally improve on this. We basically have to rewrite it from scratch. But the good thing is that with the stage separation we have the input interface, we have the output interface, and we can try to do as much as we can inside that contained little box. It does of course affect data structures that go all the way from parsing down to optimization, so some changes will happen everywhere, but we try to concentrate as much as possible into that stage. And we're going bottom-up. We're not doing any Cascades-type optimizer. Bottom-up is safe and well known, and we know it works for Rapid, for the analytics engine. And if we were to do something like Cascades for this, I'm afraid we would have to have different transformation rules for Rapid and for InnoDB. That's my main worry there. Another thing is that Cascades optimizers are, I would say, a bit less well known; it's hard to find people who really understand them that well. Bottom-up seems safe. I have students that do Cascades, but the Germans make strong arguments that bottom-up is the better approach; Thomas Neumann has his papers on this. We're doing Cascades in our system because that's what I like. I like the idea of Cascades as well, but we decided to do bottom-up. We considered it less risky, for us at least. And we are looking at these hypergraph algorithms coming from Germany; that's an interesting improvement on System R, so those are really interesting to us. We definitely want to take this range optimizer and make sure we evaluate all access paths for interesting orders. I don't know if you've done a lot of work on spatial queries, but for instance k-nearest-neighbor queries seem to be much easier in a model where you just keep track of interesting orders than when you try to detect these queries based on the query structure, which is what we would have to do with the current optimizer. Then you just produce your points in the interesting order of distance to the neighbor, and it just pops out at the end whether you do a distance scan of an index or you do sorting and then distance calculations. But yeah, more statistics. Then the main question here is: how do we switch? If we're creating a new optimizer, how do we decide that it's good enough? The old optimizer is so simple it might get lucky, and a proper cost-based optimizer will always have estimates and that kind of thing and can get unlucky.
I'm really hoping that it will turn out to be lucky all the time, but I think at some point we just have to declare that it's good enough and deal with it. We've seen similar problems before when changing the cost model. If you tweak the cost model a bit, then suddenly queries behave differently, and you always get some regressions. It never fails to happen. Whatever you do, you get performance regressions; whatever good you try to do, someone always has a query that goes wrong. That's also why MySQL people like hints so much, because the optimizer is a bit weak. For analysis we have EXPLAIN and EXPLAIN ANALYZE. We also have the optimizer trace, which is more or less a brain dump from the optimizer. It's in a non-documented format that is written by optimizer developers to be read by optimizer developers. If you have a query that has a problem and contact MySQL support, they will ask you for an optimizer trace. You set a session variable saying I want an optimizer trace, you run your query, and then it's stored in the information schema. This brain dump is just all the major decisions made by the optimizer, and it gives us cost numbers and raw numbers from statistics, that kind of thing. So it's really helpful. So EXPLAIN ANALYZE and the optimizer trace, we always ask for those if we get a performance issue. And then the solution, unless we can find a bug or anything, is usually to use a hint or an optimizer switch. Hints can of course turn on or off any type of optimization and can force or disable access methods or join orders. And the optimizer switch is a session variable that does all this enabling and disabling of different transformations and optimizations. Normally, if it's only a few queries, you use optimizer hints; if it affects a lot of queries, you use the optimizer switch. And just a final note on these hints: we've started changing our hint syntax from the old style, which I call intrusive hints, because they intrude on the query. Instead of writing proper SQL, you're writing STRAIGHT_JOIN instead of JOIN, so you're actually rewriting the query. But we took the idea from Oracle Database and added hints as comments, so you can remove the comment and the query still works. And that makes it easy to automatically insert hints. So we have a plugin that can do pattern matching on incoming queries and rewrite the queries. People use it for all kinds of things, but the intention here is to use it to add hints. So let's say you have an application you can't change, and it has two queries that are performing badly. Then you use this to pattern match those two queries and add the hints you need. So this is kind of a bailout for us if you have performance issues in an application where you can't change the queries. On the other hand, most of our customers are able to change their applications; they have in-house or open-source applications they can change. But this is what we have used so far when things go wrong because we changed cost numbers, and I think it will be the go-to solution for anything that affects optimization as well. We're just hoping we won't have to use it that much. (A small example of the hint styles, the optimizer switch, and the optimizer trace follows after this paragraph.) So, to summarize: between 5.6 and 8.0 we have refactored almost the whole query processing pipeline, almost the whole thing. We haven't really touched the optimizer. And so far no user has discovered this; at least they haven't had any problem with it. We've been talking a bit about it the last year or so, but people really haven't seen an issue here.
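To illustrate the mechanisms just described, here is a hedged sketch using made-up table names. The comment-style hint, STRAIGHT_JOIN, optimizer_switch, and the optimizer trace are all standard MySQL 8.0 features; the specific hint and flag chosen here are just examples:

```sql
-- Old, intrusive style: the hint is part of the SQL itself.
SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.a = t2.a;

-- New style: the hint is a comment, so removing it leaves a valid query.
SELECT /*+ JOIN_ORDER(t1, t2) */ * FROM t1 JOIN t2 ON t1.a = t2.a;

-- Optimizer switch: affects all queries in the session, not just one.
SET SESSION optimizer_switch = 'derived_merge=off';

-- Optimizer trace: a brain dump of the optimizer's decisions.
SET SESSION optimizer_trace = 'enabled=on';
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET SESSION optimizer_trace = 'enabled=off';
```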
And our experience is that refactoring is so worth it. We get a lot of bugs fixed immediately, and we see feature development speeding up a lot. When I joined in 2011, I was surprised by how much refactoring we were allowed to do. You always hear that managers don't want you refactoring, they want you delivering features and all that. But no, we were really encouraged to refactor, because for the first time, and the product was like 15 years old by then, there was actually financial backing to invest in the future of the product, not just run after the next big customer. So we had a lot of opportunities, and we still have that. We still have a lot of freedom to do engineering work that makes us proud. We want to do a good job, and we are allowed to actually rewrite things; we don't have to run after the next feature, and that's a really good thing. And I think we've convinced management above me as well that this is worth it. We are delivering new features based on this refactoring, and we tell them very clearly that this is the reason. I think that works. And currently the optimizer is the weak link, so that's the obvious next step for us. We have a prototype at the moment. We'll see how it turns out, but it's looking very good so far. There's a lot of stuff to re-implement, though, so it will take some time to get there. Then there's a long process ahead of us of performance testing to prove that it's good enough to be put into production. And I'm talking about refactoring things in 8.0, but I don't think I dare turn this on in 8.0, not by default at least. This would be a 9.0 thing, I think. I think that was my last slide. Before I open for questions again, I just want to say thank you for inviting me to talk about MySQL. We don't talk much about these kinds of internals. We usually talk about features users can use, or switches they can touch, or settings they can tweak, that kind of thing. That's what I like; it's fun to actually talk about internals. We're building a new system too, and we hit a lot of the same issues, so I'll applaud on behalf of everyone else. Thank you for staying up late and for being here. First off: we did all the refactoring, here's what we refactored, and then here's all the problems with the optimizer. Well, you asked about the optimizer; I had to put some positive things in there as well. That's why I wanted to do the whole pipeline. This is awesome. I teach this stuff. This is awesome. Please unmute yourself if you have any questions. Joseph in the chat, do you want to ask your question if you're still here? Let's look at the chat. The question is: are you planning to feature flag the new optimizer? I assume yes, that's what he said at the end. Definitely. I think we'll feature flag it in 8.0. At the moment we can't run all queries; that's just the thing, we don't support all syntax in there yet, the prototype doesn't do that yet. So we'll feature flag it. Maybe 9.0 will be aggressive and remove that flag, we'll see. I don't want to keep two optimizers at the same time; that costs us a lot. As soon as possible, when we think it's good enough, we'll throw away the old one. That's my idea at the moment, at least. Got it. The last question I'd ask would be: in the same way that you have a pluggable storage engine model for the architecture of MySQL, would you suspect you'd have to have a similar thing for the cost model for the query optimizer? The query optimizer can have...
If you're just doing the DP search bottom-up, that's just taking the joins and doing the search. But at some point the cost model would be specific to the engine, because if you want to pick which join algorithm to use, if it's NDB, you might want to do a broadcast join or a distributed hash join. Actually, I don't know if Rapid is distributed or not. Well, I don't know the details, but in my head it's like C-Store in a way. That was the project at Oracle Labs, right? I thought they killed it and everyone quit, right? No, the Rapid project was there, but it never got integrated into Oracle. They tried to do it differently: Oracle Database tried to do this for partial plans. That idea was scrapped for MySQL. Instead, we said, no, we're going to do a complete offload or nothing, and then maybe later on we can look at partial plans. But at the moment it's complete offload or nothing. So the idea here, just to talk about it, is to do analytics in sync with your OLTP. There are snapshot isolation guarantees, snapshot guarantees, between InnoDB and Rapid, so it's automatically synced between the two engines. You insert into InnoDB, it's automatically moved to Rapid, and then your analytic queries go to Rapid. It's fractured mirrors. This is like what Oracle does with their in-memory column store. Yeah, yeah. And then the idea is, of course, to compete against Redshift and all of that, but with one integrated solution. So that integration is kind of my headache. So yeah, we need some kind of pluggable cost model. I don't like the word pluggable; it has a bad feel to us. But some kind of adaptable cost model, where Rapid can say something to us about the costing of subplans and that kind of thing. They also have several different variations of hash join that we can evaluate, and other semi-join operations, but they match pretty well what we have planned for in MySQL. So it's a pretty good match, but they have some different ways of thinking, and of course, being a column store, the costs are different. So we have to give them some configurability in the statistics. Okay. Again, Norvald, thank you so much for doing this. The talk was fantastic. This is one of the best talks we've had all season, because again, it's the internals of the system that most people don't care about, but we care about. Thank you for doing this. Thank you for staying up late.