Hash tables are one of the most important data structures in software. They're built into many programming languages under different names: dictionaries, associative arrays, hashes, objects, or just hash tables. Because hash tables are used all over the place, they're carefully optimized and tuned to make your programs fast. But sometimes the built-in hash table isn't enough, so I set off to make the perfect hash table. In this video I'll show my journey of making the perfect hash table, while also dropping some performance nuggets so you can make your code fast too. Let's start with a simple use case. Say we have the days of the week and we want to map numbers to names. We map the number 1 to the word Monday, the number 2 to the word Tuesday, and so on. How do you implement this? One way is to pack all the strings together into an array, then index into the array by subtracting one from the day number. This is something arrays are very good at. But what if we want to go the opposite direction, from the strings to the numbers? Now we can't just use an array index. What we can do is take the first character of each string and map it to a number: the letter m is the 13th letter of the alphabet, the letter t is the 20th, and so on. Then we use the number derived from the first character as the index into an array. We end up with a bunch of empty slots in the array, but we're able to access the array quickly. We call this step (taking the first character) the hash function. It's one potential hash function, and not necessarily a good one; we'll discuss better ones later. But this is the essence of a hash table. There's a problem, though. What happens if we have two days that start with the same letter? For example, Thursday starts with the letter t, just like Tuesday, so it would take up the same array slot.
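As a sketch, the first-character scheme just described might look like this. The video doesn't show code, so lowercase day names and letter positions (a = 1) are assumptions here:

```cpp
#include <cassert>

// Hash a day name by its first character: 'a' -> 1, 'b' -> 2, ...
// so "monday" -> 13 and "tuesday" -> 20 (the letter positions above).
inline int first_char_hash(const char* day) {
    return day[0] - 'a' + 1;
}

// A sparse table indexed by that hash. Most slots stay empty (0),
// which is the tradeoff mentioned above: wasted space for fast access.
inline int day_number(const char* day) {
    static const int table[27] = {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1,                // index 13: Monday
        0, 0, 0, 0, 0, 0,
        2,                // index 20: Tuesday
        0, 0, 0, 0, 0, 0,
    };
    return table[first_char_hash(day)];
}
```

Note that this already shows the collision problem: "thursday" would hash to 20, the same slot as "tuesday".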
Different hash table schemes do different things in this case, but all general-purpose hash tables need to deal with this problem. We don't want this behavior; it slows down the hash table, but it's a fact of life for most hash tables that they have to deal with collisions. But what if we don't want to deal with collisions? What we can do is carefully construct our hash table so that it avoids any collisions in the array. Here's an example hash function where we take the second character, double it, and add that to the first character. This gives us a bunch of unique numbers. With these unique numbers we can index the table, and there are no collisions with our data. This is called a perfect hash function, and that's the subject of this video. Pop quiz: what happens if the hash function always returns the same index rather than unique indexes? How many keys would need to be checked when you look things up? So let's talk about my real-world problem. I'm writing a TypeScript parser and I need to recognize keywords. In particular, I need to map each keyword to a number. The most naive solution is a linear search, where we go through and do a bunch of string comparisons one after the other. If we find a match, we return the number associated with it. That performs okay: we're able to reach six million lookups per second. Now, you may have heard in your computer science class that binary search is a way to cut down the number of comparisons if the data is sorted. In our case we know the list of keywords ahead of time, so we can pre-sort it. How fast is binary search? It turns out binary search was a little slower than linear search. This talk isn't about binary search, so I'm not going to go into the details of why. Let's try something even better than binary search: a hash table.
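Before moving on, that weekday perfect hash (first character plus twice the second character) can be sketched like this; lowercase names and letter positions are assumptions, since the video doesn't show the code:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Perfect hash for the weekday example above: first character plus
// twice the second character, using letter positions (a = 1).
inline int perfect_day_hash(const std::string& day) {
    int c0 = day[0] - 'a' + 1;
    int c1 = day[1] - 'a' + 1;
    return c0 + 2 * c1;
}

// All seven days hash to distinct numbers, so a table indexed by this
// hash needs no collision handling at all: a perfect hash function.
inline bool no_collisions() {
    std::vector<std::string> days = {"monday",   "tuesday", "wednesday",
                                     "thursday", "friday",  "saturday",
                                     "sunday"};
    std::set<int> seen;
    for (const auto& d : days) seen.insert(perfect_day_hash(d));
    return seen.size() == days.size();
}
```

For example, "monday" gives 13 + 2 * 15 = 43, and no other day produces 43.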
So let's use the standard C++ hash table, which is called unordered_map. unordered_map is twice as fast as linear search. Just by switching a basic data structure (from a vector, in this case, to an unordered_map) we get a 2x performance speedup, which is great: very little work for a massive gain. But a lot of people say C++'s unordered_map is kind of slow, stuck in the 1990s in terms of design. So let's look at a more modern hash map implementation, this time from the Rust programming language. It looks like Rust's HashMap is not as well optimized as C++'s unordered_map, at least for our use case. So performance tip number one: try it. Try the first thing that comes to mind. Enter the performance mindset. The first thing I had in mind was a linear scan. I tried it; it works, of course. Then I tried a binary search and noticed it was a little slower. So I kept trying other things until I got something satisfactory. Off-the-shelf hash maps use a general-purpose hash function. It doesn't know anything about our dataset, so it has to look at every character in our strings. It doesn't know ahead of time how many characters there are in a string or how many are expected. CPUs like working with a fixed amount of data, but with a general-purpose hash function, a key can be one character, or ten characters, or five. It's a variable number, and CPUs don't really like that kind of data. What we want is a hash function that picks a specific number of characters rather than a variable number. That way the CPU doesn't need to do any loops or anything like that. The fastest possible hash function for strings is just taking the length, because the length is information that's readily available. It's already a number, so it can easily be used as the index into the table. The problem is the length is not unique: there are a lot of keywords that are four or five characters long.
So we end up with a lot of collisions with this hash function; we need something a little more sophisticated. Something more unique is taking the first character, but as we saw with the day-of-the-week example, collisions can still happen. So what if we can pick a set of characters that leads to no collisions, say the first two or maybe the first three? Well, if we look at just the first character, we have some collisions: null and new both start with n; true and typeof both start with t. If we look at the first two characters, this and throw share the prefix th. Let's try looking at the last character. If we take the first and last characters, it's still not unique: case and continue both start with c and end with e. What if we take the first two characters and the last character? Unfortunately, declare and delete both start with de and end with the letter e. Okay, what if we take the first character and the last two? Then never and number collide. All right, what if we take four characters instead of three? Hey, we get no collisions if we pick these four characters. Let's look at this in more detail. Say we have the keyword debugger. It's split into a bunch of characters. To select the first two characters, we do a 16-bit load (16 bits meaning two bytes), and for the last two characters we also do a 16-bit load. Now, our index needs to be a single number, not two numbers, so we need to pack all of these characters into one number. What we can do is shift one of the pairs up 16 bits, leaving the remaining bits zero, and then use a bitwise OR to bring in the other 16 bits. We end up with a 32-bit number. Let's look at another example, the let keyword. We do a 16-bit load of the first two characters and a 16-bit load of the last two characters. Notice the middle character, the e, gets loaded twice, but that's fine.
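The character selection just described (two 16-bit loads packed into one 32-bit number) might look like this sketch. memcpy stands in for the unaligned loads (compilers turn it into a plain mov), and the caller is assumed to have already done the bounds check for a length of at least two:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Select the first two and last two characters of a key and pack them
// into one 32-bit number: load 16 bits, load 16 bits, shift, OR.
// Precondition: len >= 2 (otherwise the loads read out of bounds).
inline std::uint32_t select_characters(const char* s, std::size_t len) {
    std::uint16_t first_two, last_two;
    std::memcpy(&first_two, s, 2);            // 16-bit load of s[0..1]
    std::memcpy(&last_two, s + len - 2, 2);   // 16-bit load of the tail
    return (std::uint32_t(first_two) << 16) | last_two;
}
```

For a three-character key like let, the middle character gets loaded twice, which is harmless, as noted above.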
There's no real penalty to loading the same character twice. Then we do the same shift as before and the same bitwise OR, and we end up with another 32-bit integer. Here's what the code looks like in assembly: a 16-bit load, another 16-bit load, the shift instruction, and the OR instruction. There's no loop involved in computing the hash; all we have to do is load two things from memory and mix them together with four simple instructions. Now, to make this work, we need to make sure we don't go out of bounds. First we need a bounds check to make sure we have at least two characters, because if you only have one character and you do the 16-bit load, you read out of bounds of your array. So let's see how this performs. We're going to plug this hash function into the standard C++ unordered_map and see how fast it is. Well, we got a significant performance improvement just by changing the hash function. With a small tweak to something that already exists, we can get a massive performance improvement. So performance tip number two: make assumptions about your data. In our case, the first two and last two characters were unique across our keywords. Because they were unique, we could feed them to the hash function and end up with few collisions. You should make assumptions about your data, because your data is never random; there's always some pattern to it. In our case, the pattern is that there's uniqueness in the first and last characters, and also that all of our keys are at least two characters long. Pop quiz: why do we want to pack the characters into a single 32-bit integer? And what else could we assume about the key length? So far we've discussed off-the-shelf, general-purpose hash tables. But because we have some uniqueness properties in our dataset, and because we know our dataset ahead of time, we can use tools to generate code for the hash table and bake it into our program. Let's look at a few third-party libraries that accomplish this. First up is Rust phf.
As the name implies, this is a Rust crate for generating perfect hash tables and perfect hash functions. Let's say we have the key if. What phf does is hash if using a normal hash function, in this case SipHash-1-3, and we get back some number. It uses the bottom bits of that number to look up in a table. Now, this isn't the table with the data; this is just a helper table. Once it has looked up the data from the helper table, it mixes that with the hash to generate a second hash, and that second hash is used as the index into the real table. They use this middle table to prevent collisions: if there is a collision, they can just tweak the middle table to make sure there's none. So let's see how this performs compared to the off-the-shelf hash tables. It looks like we got a small performance improvement compared to the standard Rust HashMap. However, we're still much slower than C++'s unordered_map with our custom hash function. Let's take a look at another library, this time C++ frozen. How it works is it takes our key and hashes it, similar to Rust phf, using an off-the-shelf hash function. It uses this hash to index into a table, just like Rust phf, and then directly indexes into the result table. Now, what's the purpose of the middle table here? Well, in the case of collisions, what C++ frozen does is add a marker indicating that the slot in the helper table is not an index but actually a seed that needs to be mixed in with the hash. It rehashes the key using the seed to generate a new index, and then uses that new index to go into the real table. So how does this perform compared to Rust phf? C++ frozen is much faster, almost twice as fast as Rust phf, and it's also significantly faster than our custom hash function. So we're definitely onto something with this library. Now, I said that frozen uses an off-the-shelf hash function. What if we gave it our own hash function instead?
Wow. So it looks like C++ frozen's hash function had some room for improvement, and if we make those improvements, we double the performance yet again. This is looking pretty promising. Let's look at a third library, in this case gperf. What gperf does is take our key and split it up into characters. It looks up some of those characters in a helper table, adds the results together to generate a number, and that number is used as the index into the real hash table. Now, this intermediate table is part of the hash function itself; it is not separate from it. So we can't plug our custom hash function into this approach; we have to rely on gperf's hashing scheme. But let's see how gperf performs on its own. Gperf is beating the pants off of everything so far. It's no wonder many compilers use gperf for keyword recognition and other tasks. Before I made my own custom hash table, I used gperf, because it's pretty fast. So let's talk about the new hash table that I made. Similar to Rust phf and C++ frozen, I apply some hash function to the key, and then we directly look up in the table. It's that simple. How do we avoid collisions? How is this a perfect hash table? We're kind of cheating, right? What happens is, when we're generating the table, if there are any collisions, we tweak the hash function. The way we tweak the hash function is by giving it a different seed. So the hash function needs to be seeded; we start with a seed of one, then try a seed of two, then a seed of three, and keep trying until we get no collisions. If we still get collisions, we make a bigger table and try again. And the hash function we pick doesn't get the full string; we give it those four characters we talked about. So let's see how this performs compared to gperf. Well, I tried a bunch of different hash functions, because I didn't know which one would perform best.
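The seed-retry generation strategy just described can be sketched as follows. The mixing function here is a placeholder (an FNV-style loop), not the actual four-character hash from the video, and the attempt limit is an illustrative choice:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Placeholder seeded hash; the real table uses the four selected
// characters instead of the whole string.
inline std::uint32_t seeded_hash(const std::string& key, std::uint32_t seed) {
    std::uint32_t h = seed;
    for (unsigned char c : key) h = h * 0x01000193u ^ c;  // FNV-style mix
    return h;
}

// Does this seed place every key in a distinct slot?
inline bool try_seed(const std::vector<std::string>& keys,
                     std::uint32_t seed, std::size_t table_size) {
    std::vector<bool> used(table_size, false);
    for (const auto& k : keys) {
        std::size_t slot = seeded_hash(k, seed) % table_size;
        if (used[slot]) return false;  // collision: this seed fails
        used[slot] = true;
    }
    return true;
}

// Try seed 1, 2, 3, ... until no keys collide; if no seed works after
// many attempts, grow the table and start over.
inline std::pair<std::uint32_t, std::size_t>
find_perfect_params(const std::vector<std::string>& keys) {
    std::size_t table_size = keys.size();
    for (;;) {
        for (std::uint32_t seed = 1; seed <= 50000; ++seed)
            if (try_seed(keys, seed, table_size))
                return {seed, table_size};
        table_size *= 2;  // give up at this size; try a bigger table
    }
}

inline bool demo() {
    std::vector<std::string> keys = {"if",     "else", "while",
                                     "return", "let",  "const"};
    auto params = find_perfect_params(keys);
    return try_seed(keys, params.first, params.second);
}
```

Generation can be slow, but it runs once, offline; lookups pay nothing for it.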
I tried FNV-1a first, because it's public domain and I've used it before. It doesn't perform as well as gperf. I also tried XXH3, because I heard it was pretty fast, but it turned out to be slower than FNV-1a. I also tried the Intel CRC32 instruction, which is just one instruction on x86 processors, and it was actually faster than FNV-1a, so that's good. Unfortunately it's not portable, which was kind of a downside. I also tried a basic linear congruential generator as my hash function, and it performed about the same as the CRC32 instruction. Pretty good. Now, my suspicion is that CRC32 performed worse than it could have because I couldn't use a power-of-two table size with it, and non-power-of-two hash tables are a bit slower for lookup. All right, so let's take the best of these hash functions and try some more. I tried the Intel AES instruction, which should mix better than CRC32. Unfortunately, it performed worse; I guess it's a bit slower. I also tried a clever hash called the Pearson hash. It's kind of like what gperf uses: it uses a separate table for the lookups. Unfortunately, this was also slower than the LCG. So I'm not getting the performance that I want. We're not matching gperf, and I want to match gperf; that's my goal. So let's take a look at the pseudocode for my algorithm. First we do a bounds check, and I think this is pretty fast; it seemed to be only two instructions or so. Next we load those four characters, and then we do the hash. Loading the four characters is pretty fast: it's just a few instructions, as we discussed. For the hashing, I tried a bunch of different hash functions, but I'm not certain we got the fastest; maybe that's something we can improve later. But then we have this matches-key function, which compares the keys as strings, and it turns out that's pretty slow. How do I know this? Well, I profiled it.
I tried Intel VTune, which is a profiler for Linux (I think it also works on Windows). It told me that __memcmp_avx2_movbe is slow, taking about 60% of my CPU time. memcmp is the function I'm using to compare the strings, and the rest of our code is taking only 25% of the benchmark time, so clearly memcmp is the slow part here. I also profiled gperf and noticed that its string comparison wasn't nearly as much of a bottleneck. So I decided to take a look at gperf's generated code. It does a string compare, which is basically the same as a memcmp, and it also does a length check, which I'm also doing. But it also has this piece of code: *str == *s. What's that doing? Well, that's checking the first character of the string, and only if the first character matches does it do the string comparison and the length check. This seemed a bit weird, because the string comparison is supposed to be doing the same thing anyway, right? But I thought, what's the harm? I'll try it, and we'll see how my hash table performs with this trick. And we beat gperf, just like that. That one little thing was all we needed to get a huge 50% speedup and beat the pants off gperf. Pop quiz: we saw that checking the first character made lookup faster, but why does it make it faster? I also figured, hey, if checking the first character is faster, why not check the first two characters? We already loaded those characters as part of our hash function. Well, it turns out that's a little slower than just checking the first character. I didn't really investigate why, but there you go. Perf tip number three: see how other people did it. In my case, I looked at gperf to see what trick it was using to make its string comparisons faster, and I was able to use the same trick to get a massive speedup.
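The gperf-style early-out just described might look like this sketch; the entry layout is an assumption for illustration:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical table entry: a keyword, its length, and its token number.
struct Entry {
    const char* key;
    std::size_t key_length;
    int value;
};

// Check the first character before paying for a full memcmp.
// Most lookup misses fail at the first comparison and never call memcmp.
inline bool matches_key(const Entry& entry, const char* s, std::size_t len) {
    return entry.key[0] == s[0]                  // cheap early-out
        && entry.key_length == len               // length check
        && std::memcmp(entry.key, s, len) == 0;  // full comparison
}

inline bool demo() {
    Entry e{"return", 6, 42};
    return matches_key(e, "return", 6)
        && !matches_key(e, "static", 6)    // first character differs
        && !matches_key(e, "returns", 7);  // length differs
}
```

The && operators short-circuit, so the memcmp call is skipped whenever an earlier check fails.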
But make sure to measure, because a technique you copy from somebody else might not actually improve performance, and might even make things worse. Speaking of seeing how other people did it, I also looked at general-purpose hash tables to see what tricks I could learn from them. One thing I saw is that for collision detection, some hash tables store the hash inside the table entry, in addition to the string and the value. This means that before checking the string in the table, they can check the stored hash and see if there's a match. In this case, we would check the hash for if and see that there's a match, and check the hash for fetch and see that there's no match. So let's see how this performs compared to checking the first character. It actually performs about the same. Unfortunately, this approach needs a little more memory, so I decided to stick with checking the first character. So, I beat the performance of gperf. What else can we do with our perfect hash table? Well, let's look at our keyword set again, and focus on this keyword: constructor. This is the longest keyword in our set. It takes up 11 characters, which means all of our keys are at most 11 characters. We're using memcmp, but maybe there's a faster way to compare a key in the table with the input key. So let's do a web search: what's the fastest way to compare 11 bytes in C++? Well, our search engine gives us a thing about std::byte, but then it points us at memcmp, which is the function we're already using. That's not really helpful. Let's think about this again. What we want to do is compare the input string constructor with the constructor that's in the table, all in one go. But if we have a short input, like let, we want to treat it the same as if it were as long as constructor. So we're going to add some zeros at the end and compare that against what's in the table, which also has zeros at the end.
So now the question isn't how do we compare 11 characters, but how do we load 11 characters and then do the comparison? Let's do another web search. This seems a bit more promising, but there are some buzzwords here, like XMM and __m256. What are those? On x86-64 there are a bunch of different registers. The XMM we saw in that web search refers to these 16-byte registers, XMM0 through XMM15, which hold integers or floats. They can be accessed from C++ using the __m128i type. So let's say we're able to load our strings into these registers. Now we want to compare the strings; how do we do that? Well, let's do another web search. It takes us a while to find, but we end up with this _mm_cmpestrc intrinsic. What is this? It stands for multimedia (mm), compare (cmp), explicit-length strings (estr), and I can't figure out what the c stands for. Now, this is specific to Intel, but I'm on an Intel processor, so let's see how it performs. Compared to our fastest approach... oh, we got a crash. What's going on here? This is how I was thinking about it: we have the word let, and then we have garbage after the let in the rest of the string. We zero out the stuff after the string and then do a comparison with the thing in the table. The problem is that if we're near the end of the input buffer, then loading the remaining bytes is out of bounds, and that can cause a crash. What we want to do instead is make sure the input string has a bunch of trailing zeros. That way, even though we're reading past the end of the string, it doesn't cause a crash. So if we make sure, after we load the input file, to add some null bytes at the end, and we run our benchmark... it's better than our inline hash approach from before. So perf tip number four: improve your data. In our case, we were able to use cmpestri without any extra bounds checking by changing the input data to have null bytes at the end.
In general, the more assumptions you can make about your data, the faster you can make your code. And how do you get more assumptions? You can massage your data so those assumptions become valid. Perf tip number five: keep asking questions. Simple web searches can expand your toolbox. Think about what your hardware has built in, and do some research. Now, while developing the cmpestri approach, I noticed that it requires SSE4.2. But I want my software to run on older processors where SSE4.2 is not available, so I developed a version that works on all x86-64 processors. Let's see how that one performs. It's actually faster than cmpestri, even though cmpestri is dedicated to string comparison. While developing this portable movemask approach, I learned about the SSE4.1 instruction ptest. So let's try using that to solve the problem instead. Ptest is even faster than either cmpestri or movemask. So let's stick with that, huh? Perf tip number six: smaller isn't always better. The cmpestri solution was the smallest, using only 31 total instructions for the entire algorithm, but it ended up being slower than both movemask and ptest, which used more instructions. Fewer instructions doesn't necessarily mean faster, even if an instruction seems to be dedicated to the specific task you're doing. Pop quiz: can we optimize the string comparison without using special SIMD instructions, just normal integers? So now our string comparison isn't showing up in Intel VTune. The VTune profiler we used earlier tells us about functions; we need to drill down deeper than functions and get to the assembly instructions. For that we're going to use a tool called perf. If we perf record our benchmark and then perf report, we get an assembly printout. It shows the percent of time taken in the left column, and the assembly instructions in the right column.
The instructions are color-coded based on how hot they are, so let's zoom in on the hot part. If we read carefully, we can see this is the string comparison code: we're loading things, we're doing a ptest, we're doing some comparisons, and then we're returning. I don't see a way to optimize this; there don't seem to be any wasted instructions here. So let's try a different approach to profiling. Instead, we're going to use perf stat, which gives us performance counter data rather than per-instruction data. It gives us a lot of data, but let's focus on these two: branch misses and L1 d-cache load misses. Our branch misses are almost 4% of all branches. Now, 4% of branches missing doesn't sound like a lot, but each branch miss can cost a lot of time. So let's talk about branches for a bit. Say you have this code: if strong, health is 4; otherwise health is 1; and we return the health. This might get compiled down to a test and a jump. The jump instruction, jne, is what we call a branch. Sometimes jne will jump to another instruction, and sometimes it just falls through to the next instruction. Branches are something CPUs predict, and if they predict incorrectly, it can be disastrous for performance. One technique to get rid of branch misses is to get rid of the branches entirely. In this particular program, we can avoid branches with a little bit of math. This formula gives us the same answer, but it doesn't use any branches; in fact, it compiles down to just one assembly instruction on my machine. But we can't always use this technique, right? So let's use something a bit more general. Let's rewrite our if this way. If our compiler is nice to us, it'll generate code like this: the if used to be a test and a jump; now it's a test and a cmov, cmovne in this case. Now there are no branches, which means we won't get any branch misses from this code.
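The two branchless forms just described might look like this; the arithmetic formula is a reconstruction, since the video doesn't show the exact expression:

```cpp
#include <cassert>

// Branchy original: if (strong) health = 4; else health = 1;
// Branchless version 1: pure arithmetic. bool converts to 0 or 1,
// so this evaluates to 1 or 4 with no jump at all.
inline int health_arithmetic(bool strong) {
    return 1 + 3 * int(strong);
}

// Branchless version 2: a ternary, which compilers *may* compile to
// a cmovne. As noted above, they're unreliable about this, which is
// why the video resorts to inline assembly.
inline int health_cmov(bool strong) {
    return strong ? 4 : 1;
}
```

The arithmetic version always works here because the two outcomes differ by a constant; the cmov form is the general fallback when no such formula exists.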
Unfortunately, C++ compilers are very unreliable when it comes to cmov, so we often need to resort to inline assembly to get the machine code we want. In our particular example, we would convert these jumps into a few cmovs. So let's see how the cmov version performed compared to the branching version. Here are the three SIMD approaches we made, and the branchless version of cmpestri is flying. We're at almost 10 times the speed of the standard hash table. We can also apply this technique to the other SIMD approaches, movemask and ptest. Interestingly, in the branchy versions, ptest was a bit faster than movemask, but in the branchless versions, movemask is faster than ptest, and even cmpestri is faster than ptest. So it's a good idea to try all combinations of the approaches you come up with, rather than assuming the winner at one point will remain the winner later. Perf tip number seven: use multiple profiling tools. I wasn't able to notice the branch miss issue until I used perf stat rather than perf record or VTune. Sometimes your profiler can mislead you into thinking something is slow for one reason, even though the real reason is totally different. Perf tip number eight: benchmark on real data. In my case, I was benchmarking against keywords and variable names in jQuery, which is a real-world JavaScript code base. If I had instead benchmarked on, say, just keywords or just variable names, I wouldn't have noticed the branch miss issue, because the comparison would either always pass or always fail. If your data always passes or always fails, that's pretty easy for a CPU to predict, so you don't end up with branch misses. But because I used real-world data, where only 20% of identifiers are actually keywords, the CPU wasn't able to predict as well; there were a lot of branch misses, and I was able to optimize for that case. If you use fake data in your benchmarks, you might be optimizing for fake workloads rather than real workloads.
Pop quiz: could we make the length check at the top of our function branchless? Let's talk about more ideas. After we do the hashing, we need to compute the array index. At a low level, you take the hash, modulo it by the table size, then multiply by the entry size in bytes to get the byte offset. If our table size is a power of two, such as 512, we can use a bitwise AND instead of a modulo. A bitwise AND is much faster than a modulo, and it's something the compiler might do for you, but that you should probably do by hand if you really care. In the same vein, instead of doing a multiplication, we can use a bitwise shift if the entry size is a power of two. In our case we'll try an entry size of 16, which lets us shift left by four. But my thinking is: what if we can get rid of the shift? We're already doing a bitwise AND; what if we use the bitwise AND to clear the bottom bits, which kind of acts like the shift? If we do our address calculation a little differently, we can get rid of the shift and do just one bitwise AND, plus the addition of the table's base address. What does this look like at the assembly level? On the left we have the original code, which does the AND we talked about, plus some additions and multiplies to get the right address. On the right is our new way of doing things, which has only an AND instruction. So let's see how this performs. Originally we had a 13-byte entry, so let's try a 16-byte entry with this technique. It performs just a smidge faster, or maybe the same; it's hard to tell. One theory is that we made the table bigger, because we had to pad the entries from 13 bytes to 16 bytes each. So let's try shrinking the table and see if that gives a performance improvement. Well, how do we make the table smaller? As I mentioned earlier, the way my hash table generator works is that we try inserting all the keys with a certain hash function seed.
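As an aside, the two ways of computing the byte offset can be compared in code; a 512-entry table of 16-byte entries is assumed:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kTableSize = 512;  // power of two
constexpr std::uint32_t kEntrySize = 16;   // power of two

// Straightforward form: mask (the modulo) then multiply (a shift).
inline std::uint32_t byte_offset_two_ops(std::uint32_t hash) {
    return (hash & (kTableSize - 1)) * kEntrySize;
}

// Single-AND form: the mask 0x1FF0 keeps bits 4..12, selecting one of
// 512 slots while clearing the low 4 bits so the offset stays 16-byte
// aligned. It indexes by different hash bits than the two-op version,
// which is fine for a hash table.
inline std::uint32_t byte_offset_one_op(std::uint32_t hash) {
    return hash & (kTableSize * kEntrySize - kEntrySize);
}
```

Both produce a valid, aligned offset into an 8192-byte table; the second just folds the shift into the mask.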
If that doesn't work, we change the seed and try inserting all of them again. We try that a bunch of times, and if we can't fit them after 50,000 attempts, we make the table a little bigger and start the process again. But what if we just try harder? What if we do more attempts and don't give up so early? It'll make generation slower, but maybe we can get lucky and find a collision-free hash table at a smaller size. Here we have a hash table generated with the default attempt count, and it ended up at 262 entries. If I crank that up to a million attempts, we end up at 194 entries, which performs the same. So it looks like a smaller table doesn't mean better. It might be better for memory usage or something like that, but it doesn't mean better lookup performance. In addition to those failed experiments, I've had other missteps. I had trouble making compilers generate the two 16-bit loads that get merged together for the character selection. I also couldn't make compilers generate cmov instructions; that was a real pain. I had to resort to inline assembly, and the problem with inline assembly is that it's kind of a black box for compilers, so they can't optimize much around it. Also, for the several SIMD approaches I talked about, I had to write something like 20 versions until I got something that performed well and didn't crash. In particular, generating the mask that takes the input string and removes the extra stuff at the end was kind of tough. Fortunately, while I was developing it, I learned about the ptest instruction, because Clang generated ptest for one of my mask generation approaches. Even though I struggled making the implementation, I learned something by virtue of Clang optimizing my code. So I did all these optimizations, and then realized my length check wasn't actually working properly.
I had to go back and rewrite a bunch of the different approaches to make sure they did the length check correctly, and I had to do the length check slightly differently for each one because of the nature of the string comparison algorithms. So perf tip number nine: keep trying new ideas. You will hit a lot of dead ends, but you will learn things as you go. Winners don't give up. Pop quiz: what ideas do you have for my perfect hash table? I have more ideas to try myself, but performance is good enough and I've been hacking at this for two weeks, so if anybody wants to try these ideas, be my guest. So let's take a look at those pop quizzes and see if we can answer them. First pop quiz: what happens if the hash function always returns the same index? How many keys need to be checked? The answer is: bad things. All values are put in the same bucket, and the lookup function needs to check every single key to see if there's a match. Lookup devolves into a linear scan, for both hits and misses, and it's probably slower than blasting through an array. Next pop quiz: why do we want to pack the characters into a single 32-bit integer? The answer is that some hash functions, such as the Intel CRC32 instruction, work efficiently on 32-bit registers rather than being fed one byte at a time. What else could we assume about the key length besides the minimum? Well, the longest keyword has 11 characters, which means we can exit early, just like we exit early if the length is less than 2. It also means all the keywords fit in a 16-byte SIMD register, and we tried doing this and it gave us some pretty good speedups. Now, why did checking the first character ourselves make lookup 50% faster? I'm not 100% sure, but my guess is that with the early check, on lookup failures (which are 80% of the time) we avoid calling memcmp. memcmp is pretty cheap, but it's not as cheap as an if statement. Next question.
Can we optimize our string comparison without using special SIMD instructions? The answer is yes. If we do a memcmp with a constant size of exactly 12, all compilers will generate a 64-bit load followed by a 32-bit load, and we can do the same technique ourselves, plus some masking. So let's try it. Here are our fast branchless SIMD approaches, and without using special SIMD instructions, we get almost the same performance as our ptest approach. That's pretty good. Could we make the length bounds check branchless? The answer is yes, but we've got to be careful. If we're looking up a single-character identifier, such as the letter x: our hash function reads the first two characters, which means it reads one character out of bounds, but it also reads the last two characters, which means it reads one byte before the string; and our string comparison function loads either 12 or 16 bytes, depending on the algorithm. So we have to make sure that one byte before and 15 bytes after the string are allocated and not out of bounds. If we can make this assumption, we can just delete the bounds check. So let's see how this compares to our fastest SIMD approach... and it's even better. This surprised me, because now we're doing all this hashing and table lookup and string comparison work even for one-character variable names. I imagine there are a lot of one-character variable names, but apparently the length check isn't predictable by the CPU, so it's faster to do all this work than to have the CPU mispredict. So perf tip number 10: talk about your optimizations. I didn't have this idea of deleting the bounds check until I made this video. When I made the video, I figured I should probably check the performance of deleting the bounds check, because I knew it was possible; I tried it, and I was pleasantly surprised that it made things even faster.
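As a sketch, that integer-only comparison might look like the following. It assumes little-endian byte order, a table key null-padded to 12 bytes, and an input buffer readable for at least 12 bytes past the key's start (the padded-input assumption from earlier); the branches on the length are for clarity and could themselves be made branchless:

```cpp
#include <cstdint>
#include <cstring>

// Compare an input key against a 12-byte null-padded table key using
// one 64-bit and one 32-bit integer load each, masking off the input's
// bytes past its length instead of calling memcmp.
inline bool equals12(const char* padded_key, const char* input,
                     std::size_t input_length) {
    std::uint64_t k_lo, i_lo;
    std::uint32_t k_hi, i_hi;
    std::memcpy(&k_lo, padded_key, 8);      // key bytes 0..7
    std::memcpy(&k_hi, padded_key + 8, 4);  // key bytes 8..11
    std::memcpy(&i_lo, input, 8);           // input bytes 0..7
    std::memcpy(&i_hi, input + 8, 4);       // input bytes 8..11
    // Keep only the first input_length bytes of the input (little-endian:
    // low bytes of the integer are the first bytes in memory).
    std::uint64_t lo_mask = input_length >= 8
        ? ~std::uint64_t(0)
        : (std::uint64_t(1) << (8 * input_length)) - 1;
    std::uint32_t hi_mask = input_length >= 12 ? ~std::uint32_t(0)
        : input_length <= 8                  ? 0
        : (std::uint32_t(1) << (8 * (input_length - 8))) - 1;
    // The key's own null padding makes a length mismatch fail too.
    return ((i_lo & lo_mask) == k_lo) && ((i_hi & hi_mask) == k_hi);
}

inline bool demo() {
    char key[12] = {'l', 'e', 't'};                 // null-padded "let"
    char input[16] = {'l', 'e', 't', 'x', 'y'};     // garbage after len 3
    char other[16] = {'l', 'e', 's', 't'};
    return equals12(key, input, 3)     // "let" matches
        && !equals12(key, other, 4)    // "lest" doesn't
        && !equals12(key, input, 2);   // "le" is too short to match
}
```

Because the stored key is null-padded, an input shorter or longer than the key differs from it in at least one compared byte, so no separate length check is needed inside the comparison.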
So if you talk about your optimizations with people, or even if you just rubber-duck and talk to nobody, going through that thinking process can unlock optimization opportunities. The last pop quiz question was about what ideas you have for my perfect hash table, and of course you should leave a comment with your answer below. So that's my journey of making a hash table that is 10 times faster than C++'s built-in hash table. Preliminary results say my compiler is about 5% faster overall, even though detecting keywords is actually a pretty small part of the compiler. We covered 10 performance tips you can use to make your own programs faster. None of them are specialized for hash tables or data structures; they're general performance tips. I hope you learned something. Have fun optimizing your code! I want to thank Dave Churchill and the other chatters in my Twitch stream who helped me make my perfect hash table even better.