Hi, my name is Erik. Welcome to this presentation. I'm a software engineer at Oracle, working on the optimizer team in MySQL, and I will be talking about hash join in MySQL, which is a new feature in MySQL 8.0.18.

The agenda for today is that I will first talk about how we got to the point where we were able to implement hash join; we have done some major refactorings which are now finally starting to pay off. I will go through what hash join actually is before going into details about hash join in MySQL 8: what it is, what kind of performance you can get from it, how you use it, and other things that can be useful to know about hash join.

First, some background history. For the longest time, MySQL only supported variations of the nested loop algorithm for executing joins: the basic nested loop, nested loop with index lookup, and also block nested loop, which is a nested loop with an in-memory buffer for some speedup. And because we only supported nested loop, our executor (and our optimizer, for that matter) has been very focused on nested loop execution, making it very hard to extend with join algorithms that are not some kind of nested loop. For the last two years, we have been working very hard on implementing a brand new executor, an iterator executor, which is now released in 8.0.18.

This new executor is based on the Volcano iterator model, which is sort of the textbook example of how to implement an executor. One of the basic ideas in this model is that all operations should share a common interface. In MySQL, this interface has two methods: Init and Read. Init initializes the iterator, making it ready for returning rows, and Read simply returns the next row from that iterator. Another important thing is that each operation is contained within one single iterator, so that responsibilities are not scattered around the code base; things are more self-contained now. Because of these two things, we now have a much more modular executor, which allows us to implement new features faster and makes things a lot easier to understand. A colleague of mine usually compares the old executor to a jigsaw puzzle, where pieces fit together in only one specific way, while the new executor is more like LEGO bricks, which you can connect however you want.

With this new executor in place, hash join was just a new iterator. Of course, there are some details around it, and there are other features as well that stem from this new executor: EXPLAIN ANALYZE is one of them, and the new tree-based EXPLAIN output is another very nice result of it. An execution tree can be visualized something like this: you have two table scans at the bottom, the left one followed by a filter; both of these feed into a hash join iterator before ending up in a sort. And because all of these share the same interface, you can easily imagine replacing the hash join with a nested loop, for instance, or any other operation.

So what is hash join? Hash join is a join algorithm that uses hashing to find matching rows between two inputs. It was first described in the mid-80s, so it is a fairly old join algorithm. For hash join to work efficiently, it requires at least one equi-join condition, which is a condition of the form column = column. You can of course have multiple equi-join conditions in the same query.
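To make the equi-join requirement concrete, here is a small SQL sketch; the tables t1 and t2 and their columns are made up for illustration:

    -- An equi-join condition has the form column = column:
    SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;

    -- Multiple equi-join conditions in the same query are fine:
    SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;

    -- A condition like t1.a < t2.a is not an equi-join condition and
    -- cannot drive the hash table lookup on its own.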
Hash join usually comes in three different flavors: the classic hash join, the GRACE hash join, and the hybrid hash join. Each builds on top of the previous one, adding a bit more complexity. There are also more exotic variants, like distributed hash join, but these are the three basic flavors. I will go through each of them so you can understand how hash join works in MySQL.

For these examples, we are going to join together two tables, countries and persons, on the column country_id, which simply describes which country a person lives in.

The classic hash join is divided into two separate phases, called the build phase and the probe phase. In the build phase, one of the tables is designated as the build input (it doesn't have to be a table; it can be any other input). The build input is loaded into an in-memory hash table, where the hash table key is the part of the join condition that belongs to the build input; in this case, that's the column countries.country_id. Once all the rows have been loaded into the hash table, the build phase is complete.

The next phase is the probe phase, and the other table is the probe input. For each row in the probe input, you create a hash table lookup key using the part of the join condition that belongs to the probe input. For each match you find in the hash table, you return a joined row to the client, or to the next step in your execution tree. Once all the rows from the probe input have been read, the join is complete.

The benefit here is that you read each input only once. But for that to work, the build input has to fit in memory. That is why we choose the smallest table as the build input, measured in bytes, not in number of rows. Larger inputs can still be handled, the way block nested loop does it: once the hash table is full, you do the probe phase by reading the entire probe input. You then clear the hash table, continue reading the build input and filling the hash table up with rows, and then scan the entire probe input again, repeating this until you have consumed the entire build input. The drawback is that you end up reading the probe input multiple times, which is something we do not want to do; we can be more efficient than that.

To handle larger inputs more efficiently, the GRACE hash join was introduced. It was first implemented in the GRACE database system, hence the name. The first step in a GRACE hash join is to partition each input out to a set of smaller chunk files on disk, with an equal number of chunk files for both inputs. Which file a row goes into is decided by hashing the part of the join condition that belongs to that input. The result is a guarantee that matching rows are located in the same pair of chunk files. After the partitioning phase is done, you run the classic hash join algorithm over each pair of chunk files: you take the first chunk file from the build input and load it into the hash table, then do the probe phase using the corresponding chunk file from the probe input. You clear the hash table, load the rows from the next chunk file, and continue like this until you have processed all the chunk files.
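To make the running example concrete, here is a sketch; the talk doesn't show the exact schema, so the column layout below is an assumption:

    -- Assumed schema for the running example:
    CREATE TABLE countries (
      country_id INT PRIMARY KEY,
      name       VARCHAR(64)
    );
    CREATE TABLE persons (
      person_id  INT PRIMARY KEY,
      name       VARCHAR(64),
      country_id INT  -- which country this person lives in
    );

    -- The example join. countries, being the smaller input in bytes, would
    -- typically become the build input: its rows are loaded into a hash
    -- table keyed on countries.country_id, and persons is then scanned as
    -- the probe input, looking up each persons.country_id in that table.
    SELECT persons.name, countries.name
    FROM countries
    JOIN persons ON countries.country_id = persons.country_id;

    -- When spilling, GRACE-style partitioning hashes country_id on each
    -- side to pick one of N chunk files per input, so matching rows always
    -- land in the same pair of chunk files.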
This is sort of a divide-and-conquer algorithm, where you take a big problem, divide it into a set of smaller problems, and handle those. Each input is still read only once, but you have to write it out to chunk files and read those chunk files back again.

There are two things to note here. One is that you compute each input only once. Remember that each input to a hash join can be a more complex subtree; it can be another join, or anything else. By writing the rows out to chunk files, you compute that subtree only once, and when you read the files back, the result is pre-computed. Compare this to a classic hash join, where you end up re-computing the probe input every time you read it. The other thing to note is that reading from a chunk file is a lot cheaper than reading from a table: you do not have the overhead of locking, and you don't have to care about transactions, multi-version concurrency control, and so on. So the cost of reading from a chunk file compared to a table is very low.

The drawback of the GRACE hash join is that for small inputs you end up doing unnecessary disk I/O, because you could have done everything in memory. That is where the hybrid hash join comes in: it is a combination of the classic and the GRACE hash join. You start out trying to do everything in memory, reading as much as you can into the in-memory hash table, and if all the rows fit, you simply do the classic hash join. If at any point the hash table becomes full, you take the rest of the rows from the build input and write them out to chunk files on disk, as in the GRACE hash join. During the probe phase, you do the normal probing of the hash table and send all matching rows to the output. But in addition, if you did spill to disk during the build phase, you also have to write the probe rows out to chunk files, because a probe row may match one of the build rows that were written out to disk. Once you have read the probe input once, you can do the classic hash join for each pair of chunk files, just as in the GRACE hash join. Once that is done, the join is complete. This gives you the best of both worlds: you stay in memory if you have small inputs, you get spill-to-disk for large inputs, and you still compute each input only once.

That brings us to hash join in MySQL 8. Hybrid hash join is the algorithm we implemented, and since hashing is used so much in hash join, we chose the hash function xxHash, which is a fast, good-quality hash function. If we decide to spill to disk, we will write up to 128 chunk files per input, which means each hash join iterator will potentially write up to 256 files to disk; this upper limit exists so that you don't risk hitting the open-files limit that MySQL has. With hash join, there is no guaranteed output ordering anymore, which you might have been used to with the nested loop algorithm. 8.0.18 supports inner hash join, but in the upcoming 8.0.20 release we have also added support for the rest of the join types: semijoin, antijoin, and outer join. Currently, hash join replaces block nested loop whenever possible, which is in almost every case. And since hash join replaces block nested loop, you use the optimizer switch block_nested_loop to enable or disable hash join.
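As a sketch of that switch (session scope assumed here):

    -- Since hash join replaces block nested loop, it is toggled through
    -- the block_nested_loop optimizer switch:
    SET SESSION optimizer_switch = 'block_nested_loop=off';  -- no hash join
    SET SESSION optimizer_switch = 'block_nested_loop=on';   -- the default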
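And to anchor the EXPLAIN discussion that follows, here is roughly what the tree output looks like for the running example; the exact cost numbers and node layout vary by version, so treat this as an illustration:

    EXPLAIN FORMAT=TREE
    SELECT * FROM countries
    JOIN persons ON countries.country_id = persons.country_id;

    -- Illustrative output shape: the join condition is visible, and the
    -- build input (countries here) is the side under the hash:
    --   -> Inner hash join (persons.country_id = countries.country_id)
    --       -> Table scan on persons
    --       -> Hash
    --           -> Table scan on countries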
It is enabled by default, and if you want to see whether or not your query is using hash join, we recommend using the new EXPLAIN FORMAT=TREE output; that is the one EXPLAIN format that gives you an accurate representation of our iterator execution tree. In the EXPLAIN output you can see very clearly that we have an inner hash join, you can see the join condition, and you can also see which of the tables was chosen as the build input; in this case it's the countries table. And if we turn off the optimizer switch block_nested_loop, you can see that the query is executed using nested loop instead of hash join.

If indexes are available, the optimizer will tend to favor nested loop with index lookups instead of hash join. If you want to force a hash join to be used, you can either use the IGNORE INDEX syntax or use the new invisible index feature to turn off indexes and force the optimizer to choose a hash join plan (both are sketched below). (From the audience: you can use the optimizer hints as well.) Yes, you can. The system variable join_buffer_size controls how much memory is available for the hash table, but please note that a larger buffer size does not necessarily mean better performance. Larger hash tables come with costs from rehashing and general maintenance, so in some cases spilling to disk may be faster than doing everything in memory. This is where you need to benchmark your queries to see what fits for you.

I have five minutes left, I think, so I will give you some simple performance numbers. Here we have two single-column tables, each with around 42,000 rows of integer values, uniformly distributed, all on MySQL 8.0.18 with default settings. If I execute this simple join with block nested loop (we compare against block nested loop since that is what hash join replaces), the join takes almost 50 seconds. With hash join, however, you can do the same join in 0.03 seconds, an improvement of around 1600 times. The main reason for this speedup is that a hash table lookup is a constant-time operation, versus the linear search block nested loop has to do, so you get very good speedups for these kinds of queries. If you add more rows, say 32 times more, hash join can still execute the join in 1.3 seconds, which is still far less than block nested loop needed. Going up to, say, 5 million rows, the join still runs in 10.6 seconds, which is still faster than block nested loop was with 42,000 rows. Here you also get the benefit of spill to disk, since the inputs are large and the join buffer size is low (the default setting); in the first example, the tables were small enough to fit in memory.

Hash join can even be more efficient than a nested loop with index lookup. In this case I've added an index on both columns, so the query plan is now a nested loop with index lookup, and the execution time for this join is 12.17 seconds, about one and a half seconds slower than hash join. The main reason hash join is faster here is that probing a hash table in memory is faster than doing a secondary index lookup. With fewer duplicate values, nested loop would probably be quicker than hash join.

That was all I had, so: any questions? Simon? (Question: the join buffer size, is that something that is per thread?) Yes, that's per thread, and per hash join iterator.
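Picking up the join_buffer_size point from that question, and the steering techniques mentioned earlier, as sketches; the index name idx_country is hypothetical:

    -- join_buffer_size bounds the in-memory hash table, per hash join
    -- iterator (and thus per thread); bigger is not automatically faster:
    SET SESSION join_buffer_size = 1024 * 1024;

    -- Nudging the optimizer toward hash join when an index would otherwise
    -- make it pick nested loop with index lookup:
    SELECT * FROM persons IGNORE INDEX (idx_country)
    JOIN countries ON persons.country_id = countries.country_id;

    -- Or hide the index with the invisible index feature:
    ALTER TABLE persons ALTER INDEX idx_country INVISIBLE;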
(Question: Is there any control on memory usage there? You might prefer to limit the number of concurrent hash joins running, to preserve memory, so you don't suddenly hit a memory issue.) Okay, so the question was: the join buffer size is per thread, and do we have a global control over it? It is per thread, and no, we don't have global control over it currently. The memory usage, or the join buffer size, applies per hash join iterator, so one query may use a multiple of that buffer size. (Comment: But the point is that you're suddenly changing behavior; all of a sudden you've got 30,000 queries running, and you've changed things for the users.) That is the same as with block nested loop today; block nested loop uses the same buffer size, so compared to block nested loop there is no behavior change here.

(Comment: With many hash joins doing large joins, you can get a lot of files on disk, which can of course create some contention on the disk and cause problems. Why do you use many files instead of one big file?) We use many files for simplicity.

More questions? (Question: In these examples the two tables have exactly the same number of rows. Is there any mechanism that will choose the best strategy even when an index is available?) So you're asking if we are going to change the optimizer. We are of course thinking about choosing a better plan for these kinds of cases, yes. Currently we haven't changed anything, but we are thinking about it.

As for what we store: in the files on disk, we store exactly the same as in the hash table, which is whatever parts of the row are needed to complete the join. If you do SELECT *, we must store the entire row; if you select one column, we store that single column. So no, a chunk file does not necessarily contain the entire table. Also note that this happens after filtering, so you don't have to write the entire table out to chunk files.

(Question: Where do these chunk files go?) The same place as all temporary files in MySQL go. Okay, thank you.
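As a closing illustration of that answer about what gets stored (same assumed schema as before): only the columns needed to complete the query go into the hash table and the chunk files, so the projection matters:

    -- Needs every column of the stored side, so whole rows are kept:
    SELECT * FROM persons
    JOIN countries ON persons.country_id = countries.country_id;

    -- Only persons.name and the join keys are needed, so far less is
    -- stored per row:
    SELECT persons.name FROM persons
    JOIN countries ON persons.country_id = countries.country_id;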