This Thursday, we have another speaker coming for our seminar series on hardware-accelerated databases. This will be Todd Mostak from MapD. He's the CEO and co-founder. So again, this is another database system that uses GPUs to accelerate the execution of queries. We're not quite there yet in discussing how we execute queries, but we've talked a little bit about doing sequential scans. These systems are essentially doing super parallel sequential scans on data that's been put inside of the GPU, and they can run that in parallel really fast. Of course, the downside, as we know, we've been talking about this already, is how to move data back and forth between the disk and our buffer pool in memory. These guys have another problem: they've got to move the data from memory up on the CPU down to memory in the GPU. So this talk will describe how they actually manage that.

Administrative things: a reminder that project one is due Wednesday, September 26th, that's next week, at midnight. Homework two will go out later today after class. It originally was due the same day as project one, but I bumped that to the Friday, the 28th, just to give you some extra time. Again, everything's on Gradescope. So any questions so far about project one? I know a couple of you have some technical questions on Piazza. We'll sort those things out. Yes? So, a question on the autograder: I was trying to do the first task, I submitted many times and found out that you have to pass all of the tests for a task in order to get any points for it. So if you fail even one test, you get zero. I only failed the last one and I got zero. Okay, I'll fix that. Did you post on Piazza about this? You did, okay, then we'll follow up on Piazza and fix it. Okay. Any other non-technical questions, besides the grader not working right? Any high-level questions, like whether you're confident you can build the buffer pool manager? Okay.

The last thing I also want to bring up is just a reminder of what I said at the beginning of the semester. I really want you guys to stop me as we go along if you have questions about the material as I'm speaking right now. What I don't want is for people to come up to the front and ask me questions about something on slide 20 that occurred 30 minutes ago. Right? This serves two purposes. One, if you have questions about the material, then somebody else probably does too, so it's better just to stop me and make sure I clarify or speak more clearly about whatever it is that you're confused about. But it also serves as a way to make this material better for next year, because I can always go back and watch the video, see where students asked questions or where something wasn't clear, and actually try to make the slides better so that the thing you were confused about can be described in a better way the next time around. Okay? So again, if you come up at the end of class and ask questions about something that's not about the homeworks or something beyond the course material, I will not answer the question. Right? I'm not trying to be an asshole. I want you guys to stop me, because it's not like I just want to stand up here for an hour and a half and go through all the slides. If you have questions, stop me and we can go over things in more detail. Okay? Again, there's no stupid question.
The only stupid question there is, is "Is this a stupid question?" Right? So you can ask me anything. I don't care.

All right. So where we're at now, again, just to ground ourselves in where we're going in the semester: after we know how to organize our data on disk and in memory, now we're going further up the stack and talking about how we're actually going to have queries execute in our system and read and write data. So for today's lecture, we're talking about the data structures we use to support things like internally maintaining metadata, among other things. Broadly, you can think of it like we're at this point here called access methods. And access methods are almost self-describing: they're the methods or the mechanisms inside our database system that are going to allow queries or threads to access data. And the data structures we'll be talking about for the next two weeks can be grouped into two categories. We'll have the hash tables that we talk about today, and then we'll have the order-preserving trees that we'll talk about on Wednesday this week and then Monday next week.

So to understand where these data structures are going to be used and how they're going to help us execute queries, we're going to first go over the cases where we can actually use them inside of our system. So as I already said, we can use these to maintain the internal metadata of the database system. And you guys are already doing this for your buffer pool: you have to build a hash table for the page table to map page IDs to frames in the buffer pool. That's what I mean by internal metadata. They can also be used for the core data storage of the system. So we can use either a hash table or an order-preserving tree to actually organize the underlying pages or the tuples inside of our pages. The easiest way to think about this for a hash table: there are some NoSQL systems that are key-value stores, and these are just hash tables that map keys to values. In the case of order-preserving trees, like B+Trees, we'll see some systems like MySQL actually organize all the pages of the tables themselves inside of the tree. It's not just a separate auxiliary data structure. The next thing we can use them for is temporary data structures during query execution. That means as we're executing our query, we can actually build a hash table on the fly based on the data we're reading for our query, do whatever it is that we need to do using that hash table, and then throw it away immediately when the query is done. And this actually turns out to be much faster than just having to do sequential scans over and over again. We'll see this in hash joins, but you can also compute aggregations this way: the GROUP BY clause, you can think of that as just using a hash table to group the different categories or clusters together. And the last one is probably what most people think about when they think about hash tables and order-preserving trees: using them for table indexes. And the main takeaway I wanna get across is that there are certain trade-offs, which we'll talk about as we go along, where sometimes the design decisions for certain hash tables or certain order-preserving trees will be good for table indexes but may be bad for internal data structures, right?
The way we might design our hash table for our page table is not the same way we wanna design our hash table to do joins, because again they have different trade-offs in terms of performance and the amount of memory they have to maintain. So again, we'll see this as we go along throughout the semester, where what we're building and talking about this week are the building blocks we wanna use to solve all of these problems. And then we build on top of this and do more complicated things.

So the two major design decisions we're gonna have when discussing our data structures are the data organization and the concurrency methods. Data organization is essentially just how we're gonna lay out the physical bits of the data we're trying to store in our data structure, either in memory (in the heap) or in pages on disk, so that we can support efficient access to the data that we want, right? It's how we're gonna lay out the data plus the additional metadata we may have to store inside of our data structure to find what we need. The second part is how to allow multiple threads to safely access and modify our data structures at runtime. For this discussion, for this week and for Monday next week, we're actually gonna for the most part assume that we only have a single thread accessing our data structure. But obviously in a modern system we're gonna have multiple cores and multiple threads running, so you're gonna want to have multiple threads access your data structure at the same time. And now you need to protect its contents both at a logical level and a physical level, to make sure that one thread doesn't write something that another thread is reading and the reader gets corrupted data.

So by logical versus physical, what I mean is: the physical side is the underlying bits that are stored inside the data structure, and we don't want to modify something in a way that leaves a pointer to an invalid memory location. That's what people normally think about when they think about allowing multiple threads to access a data structure, and we can protect the data structure using latches. Protecting the logical contents of the data structure has to do with things like what queries actually see while they're running at the same time as other queries. This is a more complicated, nuanced topic we'll cover much later on when we talk about transactions. A simple way to think about it: say I have my query delete an entry in my data structure, my hash table. If I go back and check whether that key is still there, it better not come back and say it's still there. Physically it could still be there, because maybe the data structure is gonna remove it later on with a garbage collection process, so the bits are still there, but logically we don't see it. There's a whole bunch of other mechanisms we're gonna need to protect these data structures to make sure this occurs. We'll cover this on Wednesday next week, and we'll see it in more detail when we talk about transactions and concurrency control.

So our purpose here for this discussion is really gonna be focused on the first design decision, and we're just gonna assume that we have a single thread accessing our data structure. It makes things much easier. So our focus today is on hash tables.
So at a high level, a hash table is gonna provide you with an associative array interface that maps keys to values. For some arbitrary key, we wanna map it to a value. If you're familiar with Python, it's the dictionary data structure; same thing if you've written Java, it's the HashMap class. Same idea: keys to values. The core concept of how this is all gonna work is that we're gonna have a hash function that we can use to compute some location or offset in this associative array that points us to the value that we're looking for, for our given key.

So the absolute easiest way to implement a hash table is what is called a static hash table. Think of this as just a giant array of slots. And for this we're gonna assume that the values we wanna store in our hash table are fixed length, so we know the offset of every location and how to jump to it, whether we want the 10th item or the 100th item; we know how to do the arithmetic to jump to that memory location. So what'll happen is that we'll just have some hash function, and for the given key we wanna look up, a simple hash function could just take the key's bits and mod them by the number of elements that we have. So we just number these different slots from zero to n, assuming that we have n keys that we wanna store. And when we wanna store something, again, we just hash the key and then we can dump the value in there. Super simple, this is the easiest hash table you could actually build.

What are some obvious problems with this? What assumptions did I make? No collisions, but why would there be no collisions? First of all, what is a collision? Right, he said a collision is when two different keys hash to the same offset in my array. So why am I assuming that there are no collisions? Because I'm assuming I know exactly the number of keys I wanna store, n, right? And I just take every key, mod by n, and it puts each one in its own location. That's the real simple case here. The other big assumption is that my values are fixed length, so I know how to jump exactly to the location that I want. That's not a big deal, right? To handle arbitrary-length values, all we have to do is have a separate storage location over here, and now our giant array doesn't actually contain values, it contains pointers to some other memory location or some other page location that has the values that we want, right?

So again, the problems with this approach are that we assumed we knew the exact number of keys ahead of time. We also assumed that all the keys would be unique, which in many cases is not going to be true, and at this point we don't have any way to handle that. And it also assumed that we had what is called a perfect hash function. A perfect hash function is a theoretical function where, given two keys that are not equal to each other, the hashes generated by our perfect hash function will not be equal either, right? The way to think of this is: if I had my domain of every single possible key I could ever have, I could have a unique hash output or hash value for every single key. Again, these exist in the literature, and in theory you could build one, but in practice nobody actually does this because it's very expensive to maintain. So the way we want to handle this is by approximating.
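To make that concrete before we go on, here is a minimal sketch of what such a static hash table might look like in C++. The class and member names are purely illustrative, and it keeps all of the unrealistic assumptions above: a known number of slots, unique keys, fixed-length values, and a perfect hash function (so no collision handling at all).

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Toy static hash table: a fixed-size array of slots indexed by
// hash(key) % num_slots. Assumes unique keys and no collisions,
// which only really holds if the hash function were perfect.
template <typename Key, typename Value>
class StaticHashTable {
 public:
  explicit StaticHashTable(std::size_t num_slots) : slots_(num_slots) {}

  void Insert(const Key &key, const Value &value) {
    slots_[SlotFor(key)] = value;          // jump straight to the slot
  }

  std::optional<Value> Find(const Key &key) const {
    return slots_[SlotFor(key)];           // empty optional means "not found"
  }

 private:
  std::size_t SlotFor(const Key &key) const {
    return std::hash<Key>{}(key) % slots_.size();
  }

  std::vector<std::optional<Value>> slots_;  // fixed-length value slots
};
```

Note that Find here cannot tell the difference between "this key was never inserted" and "a different key landed in this slot", which is exactly the collision problem just described.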
So the two design decisions we're going to have in our hash table are what hash function we want to use and what hashing scheme we want to use. And I would say the combination of these two things is actually what defines a hash table, right? When you say you have a hash table implementation, it's really these two things.

So the hash function: the way to think about this is that we want to take a large key space of every possible key we might want to store and map it to a smaller domain, right? And the reason we want to do this is because otherwise we either have to have a perfect hash function or store a potential slot for every single key you could ever see. The big trade-off when we talk about which hash function we actually want to use is between how fast the hash function is versus what our collision rate is, right? Because again, we're doing this while we're running queries, we're doing this while we're trying to access pages in our page table. We want our hash function to be really fast, because we don't want to have to do some huge traversal and long lookup just to figure out where the key we want is in our hash table. But we also don't want all our keys mapped to the same location, because as we'll see in a second, depending on your hashing scheme, that's gonna require you to do essentially a sequential scan to find the thing that you're looking for. So what's the fastest hash function that you can ever possibly build? What's the dumbest, fastest hash function? Yes? Always return one. Exactly, she said always return a number; I'll make it even easier, always return one. So for every single possible key, no matter what name you have in the directory or what email address, it always returns one. It's super fast because it just returns one. The problem is the collision rate is abysmal, because everybody's gonna map to one, everything maps to the same location in our array. So we want to be a bit smarter about this, and that's what the different hash functions do for us.

The next decision we have to make is the hashing scheme, and this is essentially how we're gonna handle collisions, which are gonna be unavoidable because we don't have a perfect hash function and we're not trying to store a giant array that has a slot for every possible key. So we wanna have a hashing scheme that describes exactly what method or heuristic we're gonna use to handle collisions when they occur. And for this, the trade-off is gonna be almost the classic computer science trade-off: we use less memory to store the hash table in exchange for having to execute more instructions to deal with collisions when they occur. If I could build my hash table with two to the 64 possible entries, which would take all the memory I have in my machine, I'd never have a collision, so the number of instructions I have to execute to deal with collisions would be minimal, but I've allocated all that memory and now I can't do anything else on my machine, right? So again, this is the classic trade-off that we have to handle.

So for today, we're gonna start off talking about the different types of hash functions you can have, and this is not an algorithms class, this is not a cryptography class, so we don't really care how they're actually implemented.
We're more interested in the properties that they have. And then we'll talk about the two different types of hashing schemes that you can have: static hashing and dynamic hashing. And just to ground you guys in what we're talking about, extendable hashing is an example of a dynamic hashing scheme, okay?

Okay, so for hash functions, can anybody give me an example of a hash function they may be familiar with? One they may have used at an internship or for the projects, other than std::hash. CRC, yes, that's a good one. SHA-256, perfect example, okay. So CRC is a non-cryptographic hash, right? You can take some arbitrary byte stream and it'll generate you a hash value. He said SHA-256. SHA-256 is a cryptographic hash function, and it has certain properties: it's not gonna leak data no matter how many keys you give it, right, and it's very difficult to reverse. We don't care about security on the inside of our system; let me caveat that. For our internal data structures, where the information is never gonna be exposed to the outside world, we don't care about cybersecurity and we don't care about cryptography. So SHA-256 is overkill for what we want. First of all, it's expensive. It's more expensive than CRC-32, which is a single instruction on x86 CPUs. And you actually can reverse it: if you have the public/private key, you can take a SHA-256 and reverse it, right? We don't care about reversing it. We just wanna know what offset to jump to in our table. So we don't wanna use a cryptographic hash function. We want something that's fast and something that provides us with a low collision rate. So CRC-32 is a good example, and we'll see some other ones from Google and some other people that are pretty common.

So, I should actually include CRC-32 on this slide; next year I'll update it. But these are some of the most common hash functions that are used in real systems today. I will say the commercial guys and Postgres and MySQL, these are much older systems, so they have their own custom hash functions. But most of the new data systems that I know about that have come out in the last 10 years or so, when we talk to the developers, they tell us they're using one of these guys here. And these are all open source. So, MurmurHash came out in 2008. It was put out by just some random dude on the internet who posted it on GitHub or something, right? And then people sort of picked it up and started using it. It was designed to be a fast, general-purpose hash function. And again, these are one-way hash functions: we give them some key, they produce some output. In theory, if you saw enough key-value pairs, or keys and their hashes, you could reverse it, but we don't care about that, right? We're not worried about leaking any information because this is all internal. So MurmurHash was designed to be this fast, general-purpose hash function. Google ended up picking it up and extended version two of MurmurHash into CityHash, and they designed it to be better, or faster, for shorter keys, keys that are less than 64 bytes, right? And they wanted to do this because in their internal workloads, in their services, this is the kind of data they saw all the time, so they wanted a hash function that was optimized for these things. Then Google extended CityHash in 2014 with FarmHash, and this further improved the collision rates.
In 2016, they actually came out with another hash function called HighwayHash. This is one that actually does have some cryptographic guarantees, or protections against analysis of the data. Again, we don't care about that, so we're not gonna use that for this. And then the last one here is CLHash. This one's interesting: it came out of Canada, from a professor up in Montreal named Daniel Lemire. And it's actually using a different kind of math called carry-less multiplication. There's a link there to the Wikipedia article if you wanna learn more about it. What's interesting about this is that the idea of carry-less multiplication is not new; it's just that around 2014 or so Intel and AMD added instructions to do this kind of math in the hardware itself. So now it's actually possible to do this arithmetic very fast, and this hash function became an actual viable option.

So just to give you an idea of what these things look like in terms of performance, this is an experiment I ran on my workstation with one of the newer Intel Core i7 CPUs. The benchmark was written by somebody on GitHub, and I extended it to use CLHash. Basically what it does is hash a bunch of random strings of different sizes as fast as possible and measure what the throughput is. And what you see is that when the keys are small, the hash functions are all more or less the same, but the differences show up as you increase the key size. The y-axis is throughput: how many bytes of key can you hash per second? Up to a certain point they all sort of plateau, and you can see std::hash in there, the black line, which is what you guys are using for your project. It's reasonably good, but the two sawtooth-pattern lines are CityHash and FarmHash from Google, and as I said, they are designed to be optimized for keys that are less than 64 bytes. The points where the line goes up and then back down are where the keys align to cache lines. A cache line: when you do a fetch from memory, you don't get just the one thing you want, you get a whole bunch of nearby stuff along with it, because the hardware assumes you're probably gonna need that too, and it packs things into 64-byte cache lines. So if you can do a bunch of operations or instructions on data that's sitting within a single cache line, you never have to go back and fetch more from the upper levels of your cache. That's why you see that when the key size gets to 32 bytes or 64 bytes, you're getting much better performance: from one cache miss they're doing more work. For one cache miss you're generating more output for the same amount of data read. So that's why there's the sawtooth pattern. And just to throw CLHash in there, again you see the same kind of ups and downs based on how you're actually packing things into registers.

The main takeaway I wanna give you guys from this is that for things that are less than 64 bytes you can use CityHash or FarmHash, and for larger strings you may wanna use something else. For our project purposes we're not trying to run as fast as possible, because we're running in SQLite and on Gradescope, so std::hash is good enough. So the main takeaway here is that different hash functions have different properties, and you may wanna choose one versus another based on what you think the distribution of your data is and what you're actually trying to do with it.
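One practical consequence, sketched below and anticipating the next point: in code, the hash function is usually just a pluggable parameter of the table, so swapping std::hash for something like MurmurHash or CityHash doesn't change anything else about the scheme. The ToyHash function here is a stand-in written for illustration (an FNV-1a-style mixer), not any of the real libraries named above; note how the same algorithm can be given different seeds to get two independent-looking hash functions, which will matter later for cuckoo hashing.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Any callable with this shape can serve as the table's hash function.
using HashFn = std::uint64_t (*)(const std::string &key, std::uint64_t seed);

// Illustrative stand-in hash (FNV-1a style). A real system would call
// MurmurHash, CityHash/FarmHash, or CLHash here instead.
std::uint64_t ToyHash(const std::string &key, std::uint64_t seed) {
  std::uint64_t h = 0xcbf29ce484222325ULL ^ seed;  // offset basis mixed with seed
  for (unsigned char c : key) {
    h ^= c;
    h *= 0x100000001b3ULL;                         // FNV prime
  }
  return h;
}

// The hashing scheme only ever asks "which slot does this key go to?";
// which algorithm computes the hash is orthogonal to the scheme itself.
std::size_t SlotFor(const std::string &key, std::size_t num_slots,
                    HashFn hash = ToyHash, std::uint64_t seed = 0) {
  return static_cast<std::size_t>(hash(key, seed)) % num_slots;
}
```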
But the hash functions are, as far as I know, interchangeable with all the hashing schemes that we'll talk about next. So when we talk about these hashing schemes, it doesn't matter whether you're using CityHash or MurmurHash; the actual algorithm still works the same. The question is whether you're gonna have a higher collision rate with one function versus another, and again that can depend on the distribution of your data.

Okay, so now we're gonna talk about static hashing schemes. Again, the hashing scheme is the protocol or the method that the hash table is gonna use when there's a collision. And we said collisions are essentially unavoidable, because we're trying to store a large key space in a smaller amount of memory, so different keys may end up hashing to the same slot. So first I'll talk about linear probe hashing, then Robin Hood hashing and cuckoo hashing.

Linear probe hashing is probably the most common hash table. It's not the one everyone thinks of when you think of a hash table; most people think of the chained hash table, which we'll see in a second. But in terms of implementations inside of database systems, this is the one that's the most common, because it's just so simple and so fast. So the hash table is just gonna be this giant table of slots, and you take the key you wanna insert, you hash it, and that puts you in some position in the hash table, right? And what'll happen is, if two keys hash to the same slot in our table, then we just keep scanning down the table until we find a position that is free, and that's where we put the thing that we're trying to insert, right? This is why it's called linear probing: we're going literally, almost as a sequential scan, down the table until we find the thing that we're looking for. For lookups, again, we hash our key, we land in some slot, and we scan down until we find the thing that we're looking for, or we find an empty slot, at which point we know that the thing we're looking for is not there.

Of course, now what's the problem with this approach? Yes? If you do a deletion you make an empty slot, and then a later read breaks. Yes, he's correct. He said that when you do a deletion you now make an empty spot, which sort of disconnects the two chunks of entries next to each other. Yes, absolutely. What's another problem? Say I'm doing an insert, the thing I want is not there, and I keep scanning. What if I never stop? What if I reach the bottom, loop back around, keep scanning down, and land back in the original position I started at? I'm stuck in an infinite loop. So when we do inserts, and also in searches, we need to keep track of where we started so that we don't loop forever.

So let's look at an example. Say on the side here, A, B, C, D, E, F, these are the keys I want to insert, and again my hash table is just this giant array of slots. The first thing I want to do is insert A, so I take A, I hash it, and it lands at this position here. Because nobody's there, I can go ahead and take the slot. So inside of the entry in my hash table, I need to store the key that I just inserted, the original key, as well as the value. The value could be something like a record ID, a page ID and an offset, or it could be an actual value. It doesn't matter, but I need to store both.
And the reason is that when I come back and try to insert something else or do another lookup, the hash function is just going to tell me where to jump into the hash table. I still need to do a comparison as I'm scanning to see whether the entry I'm looking at is the actual key that I want. Because again, two keys may hash to the same location, and I need to know whether the entry actually has the same key as mine.

All right, so now let's say we hash B; that points up here, nobody's in there, so that's fine. Now we want to hash C; it points to the same slot that A is in, but that's occupied. So again, the linear probe method just says go down to the next one, and that's where we put C. So now if I do a lookup on C, when I hash it, I'll land where A is, I do my comparison between the key in that slot and C's key, they're not equal, I keep scanning down, and then I find my match on C. Right, and so forth and so forth: D goes where C is, that's occupied, so it has to go one below; E goes where A is, that's occupied, so it has to go all the way down till it finds a slot that isn't taken. And then the last one for F, right?

So coming back to his good point about deletions: we are gonna punt to make this easy and just not deal with deletions. For certain operations, this is fine. For hash joins or aggregations, this is fine, because you're not gonna go back and delete entries as you're doing your hash join. You might not know what a hash join is, but the basic idea is: I wanna join two tables, I build a hash table for one side, put its keys in there, and then for the other table I probe the hash table and see whether I have a match, right? So the hash table is read-only; same thing for aggregations, I never go back and delete something. In that case, this is fine and this is super fast. For deletions, exactly as he said, you have to do some reshuffling and copy things around to fill in the gaps, and that becomes expensive. Another example of a trade-off: this is great for workloads that are insert-only and read-heavy, but if you wanna do updates and deletions, you don't wanna use this linear probe hashing method.

The other big issue that we have to talk about is how we actually handle non-unique keys. So far I've assumed that all our keys are unique, but that's not always the case. There are two approaches you can take to handle non-unique keys. The first is that you maintain the values for a duplicated key in a separate linked-list area. So let's say I have my hash table with keys XYZ and ABC. Instead of having the values embedded inside the hash table itself, the entries just have pointers to some pages or some memory location in the heap with a list of those values. So now when I wanna say, give me all the values for XYZ, I do my hash, I jump to my location in the hash table, I may have to scan down depending on where it actually is, and then when I find the entry that I want, I have a pointer to this other area, and I know that the values in that area correspond only to the key that points to it. The other approach is to just store the redundant keys and the values together. So instead of keeping XYZ and ABC separate from their values, in my hash table I just store XYZ together with each of its values, like that. So what are some downsides of the top approach? Is it more space efficient?
No, because I'm not allocating single entries for every value; I'm usually gonna organize things in pages. And so if a single key has only one value, I may have to allocate an entire page just to store that one value. Whereas in the bottom approach, I don't have to do that, because I just add another entry into my hash table. The downside, of course, is that the top one is more efficient when I wanna say, give me all the values for a single key: I just jump to my one key, find it, and then jump to the value list, and everything's already there. In the bottom approach, I have to scan through. So there's another good example of a trade-off in databases: do we wanna favor the reads or the writes, right? The top one is more efficient for reads but maybe less efficient for writes, because we have to allocate this extra space. The bottom one is super efficient for writes, because we just plop our new entry in there and don't have to allocate any extra space, but reads may be more expensive because you have to scan through to find everything. So that's how you handle non-unique keys. For all my examples in these hash tables going forward, we'll just assume the keys are unique, but if you need to handle non-unique keys, you have to do one of these two methods. And I would say most people do the bottom one; I actually don't know anybody that does the top one.

Okay, so we said that one of the big issues with the linear probe hashing scheme is that we have all of these potentially wasteful comparisons when we have collisions. Again, we try to pick a really good hash function to reduce the number of collisions, but on real data sets collisions are gonna be unavoidable. So the issue is, when we have a collision, we have to scan through, literally, and start looking at every entry to find the thing that we want. The worst-case scenario is that the key that we want is in the slot just above where we hashed to. So we land at one position, and then we have to scan through the entire table, loop back around, get to the very end, and only then find the thing that we're looking for. In that case, the hash table is essentially useless to us, because we could have just done a sequential scan on the data to find the thing we were looking for, but now we're paying the penalty or the overhead of maintaining this hash table and building it, and it didn't actually help us in any way. What should have been an O(1) lookup is now an O(n) lookup in the worst case.

So I said the way we can try to reduce the number of collisions is just by allocating more memory. The math works out that, in practice, the best average reduction in collisions comes from allocating a hash table with two times the number of slots as keys you actually want to store. But that still leaves the question: yes, this reduces the collisions and reduces the amount of work we have to do, but maybe there are other approaches we could apply to reduce the amount of scanning or the number of lookups we have to do to find the keys that we're looking for. The two approaches that do this are Robin Hood hashing and cuckoo hashing.
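Before looking at those two, here is a minimal sketch of the linear probe insert and lookup just described, assuming unique keys, no deletions, and a table that is never allowed to fill up completely (otherwise the probe loops would never terminate). The names are illustrative, not from any particular system.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Open-addressing table with linear probing: on a collision, keep
// scanning (wrapping around) until a free slot or the matching key.
template <typename Key, typename Value>
class LinearProbeTable {
 public:
  explicit LinearProbeTable(std::size_t num_slots) : slots_(num_slots) {}

  void Insert(const Key &key, const Value &value) {
    std::size_t i = std::hash<Key>{}(key) % slots_.size();
    while (slots_[i].has_value()) {        // occupied: probe the next slot
      i = (i + 1) % slots_.size();
    }
    slots_[i] = Entry{key, value};         // store the key AND the value
  }

  std::optional<Value> Find(const Key &key) const {
    std::size_t i = std::hash<Key>{}(key) % slots_.size();
    while (slots_[i].has_value()) {
      if (slots_[i]->key == key) {         // compare against the stored key
        return slots_[i]->value;
      }
      i = (i + 1) % slots_.size();         // keep scanning down
    }
    return std::nullopt;                   // empty slot: the key is not here
  }

 private:
  struct Entry {
    Key key;
    Value value;
  };
  std::vector<std::optional<Entry>> slots_;
};
```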
So Robin Hood hashing is a variant of the linear probing scheme where, when we do an insert, instead of just taking the first free slot that comes after where we should be in the hash table, we will actually allow new keys being inserted to steal the slots of existing keys if those keys are "richer" than we are. Robin Hood is from the folklore of medieval England: Robin Hood would steal from the rich and give to the poor. So the idea in our hash table is that we wanna have poor keys steal from the rich keys. And I'm defining rich in terms of how many positions or jumps a key is from its optimal position in the hash table. So when we do an insert, if we can't get to the position where we wanna be, then as we start scanning down, if we come across a key that is closer to its optimal position than we are to ours, then we'll go ahead and steal its slot and move it down to the worse position.

So let's go back to the table we had before. Again, the keys we wanna insert are A, B, C, D, E, F, and we said that when we hash A, it gets inserted here. And now you can see that in our hash table, in addition to the original key and the value, I'm also gonna store this counter that records the number of jumps we are from our optimal position. So in this case here, when we inserted A, nobody else was in A's slot, so A got to take it, and A's distance from its optimal position is zero, because it's exactly where it should be. Same thing for B: B hashed up here, its position counter is zero. Now we wanna insert C. C wants to go where A is, so now we do a comparison between A's position counter and our position counter. At this point, C hasn't gone anywhere, right? We hashed straight to this slot, so our position counter is zero, A's position counter is zero, so we leave A alone. And we just do what we did before in linear probing, the original version, and jump down to the next slot. But now you see that for C, we set its position counter to one, because it's one jump away from where it wants to be, which is where A is. Now we wanna do an insert on D. D goes where C is now; D's position counter is zero, C's position counter is one, one is greater than zero, so D is not allowed to steal from C, and D gets moved down here.

Now we can see the stealing actually work when we try to insert E. E wants to go where A is, but they're both at zero at that point, so E doesn't steal from A. Then we go down to where C is: C's position counter is one, E's position counter is one, so it leaves C alone. But then we get to D: now E is two hops away from where it wants to be, D is one hop away, two is greater than one, so E is allowed to steal from D. It steals its slot, kicks D out, and moves it down to the next position here, and now D's counter is two. Same thing with F, one more time: F wants to go where D is, two is greater than zero, so F can't steal there, and F goes down below.

So this is what this scheme is supposed to do better than the linear probe hashing scheme: on average, the amount of scanning you have to do to find the key that you're looking for is reduced, because we're shuffling things around to minimize the number of hops any one key can ever be from its slot.
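As a rough sketch of that stealing rule (again with illustrative names, unique keys, no deletions, and a table that never completely fills up), the insert path differs from plain linear probing only in the displacement comparison and the swap:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Robin Hood variant of linear probing: an incoming entry may displace a
// resident entry whose probe distance (jumps from its optimal slot) is
// smaller, i.e., one that is "richer" than the entry being inserted.
template <typename Key, typename Value>
class RobinHoodTable {
 public:
  explicit RobinHoodTable(std::size_t num_slots) : slots_(num_slots) {}

  void Insert(Key key, Value value) {
    std::size_t dist = 0;  // how far the entry in hand is from its optimal slot
    std::size_t i = std::hash<Key>{}(key) % slots_.size();
    while (slots_[i].has_value()) {
      if (slots_[i]->dist < dist) {        // resident is richer: steal its slot
        std::swap(key, slots_[i]->key);
        std::swap(value, slots_[i]->value);
        std::swap(dist, slots_[i]->dist);  // now carry the evicted entry onward
      }
      i = (i + 1) % slots_.size();
      ++dist;
    }
    slots_[i] = Entry{std::move(key), std::move(value), dist};
  }

 private:
  struct Entry {
    Key key;
    Value value;
    std::size_t dist;  // jumps from the slot this entry originally hashed to
  };
  std::vector<std::optional<Entry>> slots_;
};
```

On a tie the resident keeps its slot, which matches the example above where E does not steal from A or C.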
So my example from before, where I said the worst-case scenario is that the key I'm looking for is actually one slot above where I hashed to: that can't really occur under this scheme, because before you got there, things would get swapped around. Right, because there's likely to be some key along the way that's worse off than you.

So this is an old technique, from like 1985, and it's one of those things where the paper came out, nobody in the systems community actually read it or paid attention to it, and then in the last 10 years it showed up on Hacker News a couple of times and some systems started implementing it. In practice, though, what I'll say is that the research shows this is actually not better than linear probing. It seems like it would be, but it's not. Because on modern CPUs, all this checking of counters and moving things around if necessary causes additional branch mispredictions, which are slow on superscalar architectures: you can get a flush of your pipeline and have to load everything back in, and it's also extra copying. For a single insert, I may have to copy one entry out and move it to the next slot. I don't do that in linear probing, because I just keep scanning to find the slot that I want, and the scanning is actually cheap. Yes?

E is stealing the slot from D, but isn't D worse off now? So your question is: back here, when we're inserting E and we want to steal from D, is it ensured that D is not going to end up worse off than E? Yes, correct, so his statement is that when we're stealing, the idea is that the victim, the key we're stealing from, should end up with the same number of steps, but not worse, and we're defining worse, again, as the number of jumps you are from where you should be. So in this case here, E was at two, D got moved down, and D ended up at two as well. So it's not the scenario where D would be at three, right? Actually, it could be: if the slot below was occupied by something at two, then D would keep going down. But then at some point it would find somebody it could steal from. So there's actually no guarantee that stealing won't make the victim worse off than the key that stole from it. The way to think about this in a human metaphor: I could steal from a rich person, take all their money, and now they're poorer than I am, right? That can happen under this scheme. Yes?

I don't see how this makes it better. If D and E want the same spot and either one has to get moved down to the next free slot, you're going to take those steps no matter what. You're just shifting who takes them, but the total number of steps is still the same. Right, so his statement is that in my example here, the total number of steps is essentially still the same across everyone. In this example, my hash table is tightly packed, right? In a real hash table, if you allocate it large enough, then there's likely to be an empty slot nearby, so you're not just cascading down and ending up in an even worse position. But it's not always the case.
The other thing I'll say is that, on average, this minimizes the number of steps per key. Yes, but the total number of steps is the same, so you can only say that you are avoiding the worst case. Exactly, this is trying to avoid the worst-case scenario where I have to scan through the entire table. That's all the scheme does. But you're just trading off the good ones to improve the worst case. Right, so his statement is, and I agree with it, that this is a trade-off between the keys that are in or near their optimal positions versus the ones that would otherwise be really bad. Yes, absolutely right. So again, the literature says that although this seems like a nifty idea, it's actually worse than linear probing, because you're not always gonna get the big benefit of minimizing everyone's number of jumps, and you have to do more work to reshuffle things. Under linear probing, I just scan through until I find a slot that's free and put my entry in there. Under this, I have to copy the new one in, copy the old one out, and move it down. That means cache misses, branch mispredictions on the CPU, and other issues like that. All right, good.

So another alternative for handling these collisions is what's called cuckoo hashing. With cuckoo hashing, the idea is that we're gonna maintain multiple hash tables at the same time and use multiple hash functions to hash keys into the different hash tables. And a key can only exist in one of the hash tables; for simplicity we'll say there are just two, but you could have more than two. When we wanna do an insert, we hash the key twice, find a free slot in either table, and pick that free slot. When we wanna do a lookup, again we hash our key and check both of the hash tables. Now, for how we handle collisions, we're gonna ping-pong back and forth between the two hash tables: if we have a collision because we wanna insert something and that slot is taken, we'll steal from whoever's in there now and then move them to the other hash table. The idea here is to minimize the amount of work we have to do on a lookup, but we're gonna pay the penalty of more expensive insertions.

So let's see an example of this. Again, for simplicity, say we have two hash functions; in practice everyone pretty much always does two. I have heard of some people using three, but no one does more than three, right? The overhead is just unnecessary. So let's say I wanna insert A. I'm gonna have two hash functions, one for each hash table. When A comes along, I hash it with both of them, and that points to these two locations here. At this point, both hash tables are empty, so I could go in either position; I'll just flip a coin, and it tells me to go into the first one. Now I wanna insert B, same thing, I hash it twice. The first hash function points to where A was inserted in the first hash table, and that's occupied, but we don't want to steal it, because in the second hash table there's an empty slot. So we'll always choose the empty slot, and we put B in there. Now let's say we wanna insert C. We hash it twice; it points to where A is and to where B is in the two hash tables. So now we need to make a decision about which one we actually wanna steal from. And in practice, it's just random.
You could be a bit more sophisticated, maybe compute some metadata about the collision rate in one hash table versus the other, but as far as I know nobody actually does this; it's just not worth the engineering overhead. So we'll flip a coin and decide that we wanna steal from the second hash table. We'll steal B's slot, put C in there, and now we need to take B and put it in the other hash table. So now we use the first hash function, the one that corresponds to hash table one, and we hash B, but as we saw in the beginning, it wants to go where A is, right? Remember, when we first inserted B, it hashed to where A is in the first table and to an empty slot in the second table, so we chose the empty slot. Now we wanna put B back on the other side, and we have that collision again, because we're going back to where A is. So here again, we steal from A, put B there, hash A with the second hash function, and that puts it over on the second side here, right?

These two hash functions can essentially be the same algorithm, it can be MurmurHash, it can be CityHash; we just provide each with a different random seed so that they have different distribution properties. So when I say hash function one and hash function two, it can be the same algorithm with just a different seed value to give it different randomness, right? And of course, what's the issue with this, with doing these insertions and moving things back and forth? If it loops forever, exactly right. So we have to maintain some metadata to record where we started when we first inserted, so that if we come back around and see the same key trying to go back into the same spot, we know we're stuck in an infinite loop. And at that point, we have to rebuild the entire hash table. We didn't talk about rebuilding so much for the other two schemes, but typically what you do when you recognize that your collision rate is too high and you're stuck in these infinite loops is double the size of the hash table. So you do the same thing here: if my hash tables are too small, for both of them, and I have too many collisions, I'll allocate new hash tables that are double the size and then take all the keys from the old ones and rehash them into the new ones, right? You essentially have to lock the hash table, or put a latch on it, while you do this, to prevent anybody from reading or writing it, because you're rebuilding the entire thing from scratch.

The math works out that with two hash tables and two hash functions, you probably won't have to rebuild the hash table until you're around 50% full, meaning the likelihood that you'll hit a key that gets stuck in an infinite loop doesn't really come up until you're about 50% full. If you have three hash functions and three hash tables, the math works out that you probably don't need to rebuild until you're about 90% full. Of course, this means you're allocating a third hash table, so you're paying the penalty of extra memory in exchange for having to execute fewer instructions.

So cuckoo hashing shows up in a couple of different systems. I know IBM DB2 does this for their in-memory accelerator. This one actually is pretty common. The best open-source implementation of a cuckoo hash table is actually from CMU, from Dave Andersen. So if you Google libcuckoo, you'll find the CMU version of it, written by Dave Andersen and his students. With Robin Hood hashing, I only know one system that does this.
When we asked them why they do it, they said that an engineer saw it on Hacker News, thought it was a good idea, so they implemented it. But the literature pretty much shows that out of all these approaches, linear probing is always the best. It's always gonna be the fastest, especially for hash joins.

Okay, so again, these were static hashing schemes. That means we knew the number of keys we want to store ahead of time. And you may be asking, when does this occur? When would I actually know the number of keys that I have? In your page table, you don't, right? Because the data is always gonna keep growing, you add new pages, so you don't know the number of keys you need to store. Actually, that's not quite true, and we'll come back to that: the page table is static because your buffer pool size is static, so it's always a fixed number of frames. It's the page directory that grows. A really common scenario is again hash joins. Let's say I have a simple query here with no filters on these tables, so I know I'm gonna read, or do a join on, exactly the number of tuples that are in these two tables. In this case, I know exactly the size of my hash table, because I know the number of keys I need to examine. So I can use linear probing, I can use cuckoo hashing, or I can use the Robin Hood approach. If we get too big, then we have to stop the world, double the size, and rebuild it. And we'll see this later on when we start doing estimations of the cardinality or selectivity of predicates in our query plan: we get this wrong all the time, and that's gonna make it really expensive for us if we have to resize our hash table. In the case of what you guys have written for the buffer pool, the in-memory page table is always gonna be a fixed size because you have a fixed amount of memory, but the page directory can keep increasing in size. So you may not wanna use one of the methods we've been talking about, because you may have to resize it as you go along.

So this is where a dynamic hash table is gonna help us. The basic idea is that we're gonna be able to grow our hash table incrementally as the number of entries goes up or down, without having to stop the world and rebuild everything. There are three approaches we're gonna look at: chained hashing, extendable hashing, and linear hashing.

All right, so chained hashing is what pretty much everyone thinks of when you think of a hash table. This is what you get when you use the HashMap class in Java's JDK; this is the underlying data structure it uses. We're just gonna have a slot array that maintains, for each slot, a pointer to the head of a linked list of buckets, and we're gonna store all the values we have in these buckets. The way we're gonna handle collisions is that we just scan along linearly inside the bucket chain until we find either the thing that we want, if we're trying to do a lookup, or a free slot that allows us to insert something, right? So again, it looks like this: we have our slot array that points to these linked lists. If nothing has hashed to a particular position in our slot array, then we just store null, right? There's no reason to allocate memory that we don't need yet.
And then we have our buckets, and the buckets again are just gonna hold the same entries that we had in the linear probe hashing method, where we have to store the original key and the value. And let's say we wanna do an insert into this first bucket at the top, and as we follow along the linked list, it's full. Then all we have to do is allocate a new bucket and extend out our bucket chain, our linked list, like that. Now we see the same kind of issue we had in linear probing: if everything hashes to the same bucket, then I'm essentially gonna have to do a sequential scan, a linear scan, across every single element in the bucket list to find the thing I'm looking for or find a free slot to put something in, right? It's sort of unavoidable: if you have a lot of collisions, then you end up degenerating into a sequential scan. So these bucket lists can grow infinitely, and that becomes problematic. In terms of how you actually implement this to be thread-safe, it's actually really easy, because you just have to take a latch on the bucket itself, right? The easiest thing to do is to take a latch on the entire hash table, or you take a latch on the slot inside of the slot array with the pointer, but more fine-grained latching would be to take latches on the individual buckets or pages, and then you know nobody's gonna modify one as you're scanning it. So again, the downside of this approach is that these buckets grow infinitely, and it's hard to shuffle things around, because I'd essentially have to rehash everything, or just rebuild the entire hash table.

A more incremental approach, which is what you guys are building for your first project, is called extendable hashing. Again, this is an old idea, from like 1982 or '83, but it's widely used in a bunch of different systems. It's basically an extension of the chained hashing approach, but instead of letting the linked list of buckets grow forever, we're gonna split buckets and move elements around. And the idea here is that rather than rebuilding everything from scratch, we wanna do this incrementally, so that the impact of having to do a split is not a major stall in the thread that's executing. Rebuilding an entire hash table is expensive: we have to take a latch on the entire thing, no one can read or write from it, and I have to copy everything from one hash table into another one. The idea of extendable hashing is that we can do this a little bit at a time, to sort of smooth out the access time across all threads.

So let's look at an example. We're gonna have our slot array; again, these are just gonna be pointers to our buckets. And then we're gonna have these counters that keep track of the depth, the number of bits we have to look at to figure out which bucket we should be going to. We're gonna have the global depth, which says the maximum number of bits we have to examine across all of our slots, and then each bucket is gonna have a local depth, which says the number of bits that were actually needed to get there, the number of bits it's actually representing. For lookups, we don't actually need the local depths to figure out where we need to go; the global depth we do need, but the local depths are essentially metadata for us to keep track of who's potentially pointing to us, so we know what to update when we split. All right, so the first thing we see is that now, in our slot array, we index by the output of our hash function, the hash values.
And for this purpose here, the global depth is two, so I only care about the first two bits of the hash value. In my example here, I'm going from left to right; I think the textbook goes left to right too, but some examples go right to left. It doesn't actually matter, the algorithm is still the same. Whether you go from the least significant or the most significant bit, everything still works the same.

So let's say I wanna do a find on A. The first thing I do is take the key A, hash it, and produce some bit sequence like that. Then I look at my global depth, and it says I need to examine the first two bits of the hash value to figure out where to go to find the bucket that I want. In this case, the first two bits are zero one. That tells me I want that entry in my slot array, and it points to this bucket here at the top. Then I just do a sequential scan on that bucket and find the thing I'm looking for. This is a good example of the difference between the global depth and the local depth. The global depth is two, which means I always look at two bits. But as we see, zero zero and zero one both map to the same bucket, which has a local depth of one. That means that to get here, all that actually mattered was the first bit, which was zero, and that's why they both point to the same location. Again: the global depth is used for the lookup, the local depth just keeps track of what was needed to get there.

Now I wanna do an insert on B, same thing. I look at the global depth, the global depth is two, so when I hash B I look at the first two bits of the hash value. That points me to one zero here, which points to the second bucket, and there's a free slot, so I can go ahead and add my entry, right? That's fine. Now let's say I wanna do another insert, on C. Again, the global depth is two, I look at the first two bits, and that points me to the second bucket, but there are no more free slots; this bucket is full. Under chained hashing, I would just add a new bucket and extend out the linked list. But in extendable hashing, I'm actually gonna split this bucket and rehash it, so that some of the elements in the bucket go to a new bucket and some of the elements stay in the old bucket.

So what happened here, let me go back again: I'm gonna change the global depth to three, extend out my slot array, and for this bucket the local depth is now three, so I update it like that. And now, for all the elements in that bucket, I look at the first three bits, and I use that to figure out where in my hash table they belong. I should maybe have drawn this in more steps. So back here, I say, oh, I need to split, I'm too full. I look at my local depth, which is two, and that says I now need to go to three. So for each key in here, I look at the first three bits, and once I extend out my slot array, that tells me whether it stays in the original bucket or goes to the new one. Now the local depth is three as well, and the global depth goes up from two to three, because that's the largest local depth we've seen so far. And note that I still haven't inserted C yet, right? I just did my split because I couldn't insert it where I wanted.
So now when I go and do the insert of C, I look at three bits, and that points me to this bucket here and tells me where I need to go. I'm seeing blank faces, is this clear? Right, it's not that tricky, right? Yes? So when you find a bucket, do you still just scan it to get a slot? His question is, when I find a bucket, do I still have to do a sequential scan to find a free slot in it? Yes. Right, it's just some block of memory, I know where my starting point is, and I just scan to find the slot that I'm looking for. And I know what the boundary of the bucket is, so if I go beyond that, then I need to split. Okay, so the local depth is actually just for extending, not for lookups? Correct, his statement is that the local depth is just for extending; it's not actually used to find what you're looking for. Once you're inside the bucket, you don't care how many bits it took to get there; you just do the same checking that you would do in all the other cases. It's when you split that you say, all right, I was originally at two, now I'm at three, so for anything that's inside of me, I need to figure out where it actually needs to go: it can stay in the bucket it was originally in, or it can go into the new bucket I just created. Yeah. The question is essentially whether you still have to rebuild everything, both inside the bucket and in the entire hash table. No, the change is localized to just the bucket that overflowed. All of these other buckets, this guy here and this guy here at the bottom, stay the same. And actually, when I now look at my slot array with three bits, I see that the entries that were pointing to the first bucket still do. Again, since that bucket has a local depth of one, only the first bit matters, and it's zero for all of those entries, so they all still point to the same bucket with a local depth of one. Yes? So it seems that we need to be able to increase the size of the slot array? Yes, you are increasing the size of the slot array; you have to double the size of it. But that's not the expensive part. The expensive part is all of this stuff in here, right? Because again, I'm storing keys and values, and if I have to reshuffle those, that's expensive. Extending the slot array is easy. Rebuild actually means something very specific for a hash table: it means literally building a brand new hash table. This is just a small change to the internal metadata, which is cheap. Right, because these are all going to be 64-bit pointers in some small array. I can create the new array, copy everything over, and do a compare-and-swap to install the new one. That's way cheaper than hashing and reshuffling everything. Yes? So the question is, what would happen to this first bucket if I inserted maybe two more things and it overflows? Right, so again, its local depth is one, so I'll increase its local depth to two.
So now, for every single key I have in here, I'll look at the first two bits, and that'll tell me where it needs to go. The keys will get split up into two separate buckets based on those two bits. Before, everybody whose hash starts with a zero points to this one bucket here, right? So now I'll look at two bits, and I'll have a bucket for zero zero and a bucket for zero one. There will be two buckets: the zero zero slots will point to one, and the zero one slots will point to the other. Then I look at the first two bits for all my keys, and that tells me which of the two buckets each one goes into. So does that mean you still have some kind of pointer from the bucket back to the directory slot? His question is, do I still have to maintain a pointer from the bucket back to the slots in the directory? No, because the local depth tells you who is pointing to you. But when you're splitting, how do you know which pointers to update? You know who's pointing to you based on your local depth. The local depth is one, which means anybody whose hash has a zero in the first bit, and I can see what's inside of me, has to be pointing to me, right? So you go back to the directory, find any slot starting with zero, and update its pointer. That's cheap, it's nothing. Yeah. Do we need to shrink the hash table? If we remove a lot of entries from the hash table, a lot of these buckets will be empty. So his question is, do we have to shrink the hash table, and is that in the real world or in the project? In the real world, yes; in the project, no. We'll see with linear hashing, if we have a little time, that shrinking is actually really easy to do. It's actually not that hard to do here either: you just have to recognize when you can go back from a global depth of three to two and collapse the directory without having to reshuffle everything. Shrinking is more expensive in this scheme than extending, though. Yes? So his question is, if I have four entries that hash to exactly the same value, what do you do after you split? Say in this case here, I have to split because I want to insert C: I want to put in a fourth element and I can only store three. Let's say I split this, and in the worst case everything hashes to the same bucket again, which then overflows. You have to split again. If the entire hash value is the same for all the keys, then they're always going to land in the same bucket, and that's the worst case scenario. It means you have a terrible collision rate, and you have to keep extending until they separate. His follow-up is, do you need to care about this in the project? Your page table is not going to be that big, right? The hash values can be up to 64 bits, and the likelihood that two page IDs in your project map to the same value out of two to the 64 is basically nil. All right, I know there are a bunch of other questions, but I want to finish up and talk about linear hashing before we run out of time. So, linear hashing. I actually don't know if it's in the textbook. This is another approach that I think is kind of elegant. Again, same thing, it's from the 1980s, but a lot of systems use it.
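Before moving on to linear hashing, here is a minimal sketch that pulls the extendible hashing pieces together: lookups use only the global depth, and a split either repoints existing directory slots or, when the local depth has caught up with the global depth, doubles the directory by copying pointers. The bucket capacity of three, the fixed 32-bit hash width, and the use of Python's built-in hash() are illustrative assumptions, not the project's implementation, and the sketch does not handle running out of hash bits.

```python
HASH_BITS = 32   # illustrative fixed hash width

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []            # (key, value) pairs

class ExtendibleHashTable:
    BUCKET_CAPACITY = 3              # matches the three-slot buckets in the example

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # 2 ** global_depth pointers

    def _hash(self, key):
        return hash(key) & ((1 << HASH_BITS) - 1)

    def _index(self, key):
        # The most significant `global_depth` bits, reading left to right as in the example.
        return self._hash(key) >> (HASH_BITS - self.global_depth)

    def find(self, key):
        for k, v in self.directory[self._index(key)].entries:
            if k == key:
                return v
        return None

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.entries) < self.BUCKET_CAPACITY:
                bucket.entries.append((key, value))
                return
            self._split(bucket)      # split the full bucket, then retry the insert

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # The directory itself must grow: double it by copying every pointer.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        # The split bucket and its new sibling are one level deeper.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Directory slots whose next distinguishing bit is 1 now point at the sibling.
        split_bit = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> split_bit) & 1:
                self.directory[i] = sibling
        # Rehash only the entries of the overflowed bucket; nothing else moves.
        old_entries, bucket.entries = bucket.entries, []
        for k, v in old_entries:
            self.directory[self._index(k)].entries.append((k, v))
```

Note that the split rehashes only the entries of the bucket that overflowed; every other bucket, and every directory slot that pointed elsewhere, is left untouched, which is the incremental behavior discussed above.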
We're going to do incremental splits, but instead of splitting the exact bucket that overflowed, we're going to have a pointer that says what the next bucket to split is. And what happens is, every time we do a split, because some other bucket overflowed, we split whatever the pointer is pointing to and then move the pointer down by one, and we keep going until we reach the bottom, then loop back around and split again. So this is even more incremental. We'll define overflow for our purposes here as just whenever we go from one bucket to two buckets, but in an actual implementation it can mean a bunch of different things. It can mean that the length of the chains has gotten too long on average across the table, or that you have low space utilization for some buckets; you can go in different directions. It doesn't matter, but for our purposes we just assume it's whenever we go from one bucket to two buckets. The thing we're trying to solve here, and I think some of you brought this up with extendible hashing, is that in extendible hashing, whenever a split forces the global depth to grow, I have to double the size of the directory, the slot array. In linear hashing, we don't have that problem: we grow the slot array one entry at a time. So say, again, we have four buckets that we're pointing to. The split pointer is going to point at some bucket, and that's the bucket we're going to split whenever anything overflows. This is also going to look a lot like cuckoo hashing in that we have multiple hash functions, but instead of applying them all at the same time, we apply them one at a time, and we don't need to look at the additional hash function if the thing we hash to is below our split pointer. That'll make more sense in a second, but as we extend the directory and add more hash functions, we always start off with the first hash function, and we may not need to look at the other hash function unless we land above the split pointer. All right, so let's say we want to do a find on six. Again, a simple hash function: we just mod by the number of slots we have. Six mod four is two, so we go find our entry there, right? That just looks like a chained hash table, no problem. Now we're going to do an insert of 17. 17 mod four hashes to one, so it goes here, but again, we can only store three entries, so we have to extend that bucket by adding an overflow bucket. Now this triggers our overflow, because we said that every time we create a new bucket, that counts as an overflow. So now, whatever the split pointer is pointing at, that's the bucket we're going to split, not the one that just overflowed. Extendible hashing always split the bucket that overflowed; the split pointer in linear hashing always splits the one it's pointing to. In this case here, the split pointer is pointing at position zero, so we're going to split the one at the top. All we have to do is add an additional location to our slot array, the slot directory, and add a new hash function, which is just the key mod double the number of slots we had in our slot array.
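Here is a minimal sketch of that scheme, assuming the hash functions are of the form h_i(key) = hash(key) mod (n * 2^i), where n is the number of buckets we started with. The bucket capacity of three, the class layout, and the use of Python's built-in hash() are illustrative assumptions.

```python
class LinearHashTable:
    BUCKET_CAPACITY = 3          # entries per bucket before we consider it overflowed

    def __init__(self, n=4):
        self.n = n               # number of buckets we started with
        self.level = 0           # how many times the table has doubled
        self.split_ptr = 0       # next bucket to split, regardless of which one overflowed
        self.buckets = [[] for _ in range(n)]   # each bucket is a list of (key, value) pairs

    def _address(self, key):
        # First hash function: mod by the number of buckets in the current round.
        idx = hash(key) % (self.n * (2 ** self.level))
        # Buckets above the split pointer (index < split_ptr) were already split this
        # round, so they need the second hash function, which mods by double that number.
        if idx < self.split_ptr:
            idx = hash(key) % (self.n * (2 ** (self.level + 1)))
        return idx

    def find(self, key):
        for k, v in self.buckets[self._address(key)]:   # scan the whole chain
            if k == key:
                return v
        return None

    def insert(self, key, value):
        idx = self._address(key)
        overflowed = len(self.buckets[idx]) >= self.BUCKET_CAPACITY
        self.buckets[idx].append((key, value))          # overflow entries just chain on
        if overflowed:
            self._split_next()

    def _split_next(self):
        # Split whatever the split pointer points at, not the bucket that overflowed.
        self.buckets.append([])                         # image bucket at split_ptr + n * 2**level
        moved = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = []
        next_mod = self.n * (2 ** (self.level + 1))
        for k, v in moved:
            self.buckets[hash(k) % next_mod].append((k, v))
        self.split_ptr += 1
        if self.split_ptr == self.n * (2 ** self.level):
            # Everything in this round has been split: start the next round.
            self.level += 1
            self.split_ptr = 0
```

When the split pointer wraps around, the level increments, so the second hash function of one round becomes the first hash function of the next, which matches the step where the original hash function gets retired and the pointer resets to the top.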
So where we originally had n slots and hashed with mod n, the new hash function is mod 2n. But as we'll see in a second, even though that second hash function could map keys across 2n slots, nothing is ever going to land beyond what we've already allocated, because you always look at the first hash function first, and only if you land above the split pointer do you look at the second one, and the math works out that you never hit anything beyond the slots that exist. So in this case here, we add slot four, and then we split the first bucket at the top: it had 20 in it, and 20 now moves down to the new bucket at the bottom. And we move the split pointer down by one. So now let's say we want to do a lookup on 20. We always start with the first hash function, right? 20 mod four, where four is the number of buckets we had initially, equals zero. That points to the slot at the top, but we know that zero is above where the split pointer is. I tend to think of the split pointer as a threshold: if, after using the first hash function, I find I'm above where the split pointer is, then I have to use the second hash function, and that maps me down below, where I can find the entry I'm looking for, right? So every time there's an overflow, I do a split and move the split pointer down by one. If the first hash function lands me above the split pointer, I have to use the second hash function to figure out where I really need to go. And the math works out that you can never have anything map to slots five, six, or seven, even though you haven't allocated those spaces yet, because any key that the second hash function would send there lands at or below the split pointer under the first hash function, so you never even look at the second hash function for it. Yes? What if you want to find 13? Simple: I take the first hash function, 13 mod four is one, one is not above the split pointer, so I don't need to look at the second hash function. I map right to the bucket and find what I want. If I want to do a lookup on eight, eight mod four is zero, I know that's above my split pointer, so I look at the second hash function: eight mod eight is zero. So there we go, it's right there. Yes? Does this only work for integer keys? No, because the hash function is always going to return you an integer, like 32 bits or 64 bits; I'm just showing integer keys to keep it really simple here. Is it a different hash function each time, or the same one? It's always going to be the same hash function, but you mod it by a different number, right? Again, hash functions return a value from zero up to two to the 64, and we always mod it by the number of slots to ground us in the range of buckets we actually have. At some point, the split pointer reaches the bottom, and at that point we've extended the table out to double the number of buckets we started with. So then we retire the first hash function, reset the split pointer back to the top, and start over again. Yes? What if you want to insert 21? So what's 21 mod four? One, right? So you hash to one, you land here, this bucket will overflow, and you insert into the overflow chain there. Oh yeah, and the follow-up question is, would that count as an overflow?
Again, it depends on how you implement it. If you say it's only when you create a new bucket, then no, it wouldn't count as an overflow. If you say it's whenever the thing you hash to already has an additional overflow bucket, then yes, it would count as an overflow. It depends on the implementation, and the math still works out correctly in both cases. So how do we determine when to split a bucket in linear hashing? The question is, how do we determine when to split a bucket in linear hashing? It's whenever I run out of space. At the very beginning, right, I want to insert 17, it hashes to bucket one, the bucket already has three elements, I can't put another element in, so I have to overflow it, right? And I said, globally, that whenever a bucket overflows and creates a new bucket in its chain, that triggers the split pointer to split whatever it's pointing at. So at that point we split the first one. So if we insert two more elements into position one, does that count as another overflow, another split? It depends on how it's implemented. You can say, whenever I create a new bucket, split. Or you can say, if the split pointer is pointing at a bucket that's already overflowed, then split. So you could have it say, all right, I move my split pointer down here, this thing has already overflowed, so immediately go split it and move the pointer down again. It depends on the implementation; no one way is better than another. So we have like two minutes left. Is it a quick question? Which pointer points to the bucket holding 17? This one does. It's like in chained hashing, it's an overflow pointer: it just says, if the thing you're looking for is not in this bucket, by the way, here's the pointer to the next bucket you should go scan. And you have to scan that entire chain. Okay, I think we covered most of this. The only other thing I'll say about linear hashing is that what's really nice about it is that it's really easy to go in the other direction. You can say, all right, whenever a bucket becomes empty, that's the reverse trigger, and I just do a reverse split on whatever the split pointer was pointing at before and move it back up. So you can do expansion and contraction very easily, much more easily than in some of the other schemes. All right, so just to finish up, all the data structures we talked about today ideally give you O(1) lookups. But the constant factors actually matter, and it depends on your collision rate: in the worst case you end up having to do a sequential scan across all the elements or keys to find the thing you're looking for, right? So again, there's this trade-off between having very fast lookups and insertions and the flexibility of not having to rebuild everything every time you touch the table. I don't have time to do the demo; we can keep going if people have to leave, but I could give a quick Postgres demo. What I'll say is, again, that a hash table is usually not what you're going to want to use for table indexes, because it can only do single-key lookups: does something equal something? You can't do range scans, right? You can't use partial keys; you have to have the whole key, because otherwise it's not going to hash to the same thing. The way we do better is with order-preserving indexes like the B+ tree, okay? Any questions? So again, next class we're going to do B+ trees.
We'll spend most of the time on those, plus a little bit of time on skip lists and radix trees, or tries. But B+ trees are really the granddaddy of all these data structures, okay? Okay.