Round of applause for DJ Mushu. So I heard you have an upcoming gig. Yeah, I got a gig overseas. Overseas? I'm playing in France. Quiet. Quiet. Wait, you're going where? I'm playing in France. I'm going to Paris and Nice. You have shows lined up in Paris and Nice? Yeah, yeah, yeah. How did that happen? Actually, I don't want to get into it too much, but connections, I'll just say that. Like a tour? Again, not going to get into it. Okay. All right. Congrats. I mean, is it a legal show or like... Okay. It's not illegal. It's just... It's not illegal, but it should not be happening. That's good. Okay. All right. So Charlie's asking whether he gets his own **** for the spring version of 4 ****. Specifically... DJ Mushu? Yeah. I mean, if he gets arrested overseas, no. We can take that one offline. Okay. All right. So for you guys in the class, homework two will be due September 25th. And then project one will be due October 2nd. Again, as I posted on Piazza last night, there will be an info session this Thursday at 8 p.m. over Zoom, and the link is on Piazza. And then we'll have the extra special office hours on Saturday, 3 p.m. to 5 p.m., and that will be in Gates. There's also a post on Piazza for this. Any questions about homework two or project one? All right. And then, if we haven't yet, we'll release the answers to homework one in a bit, today or tomorrow. All right. Some other talks that are coming up that you guys might be interested in. They're going to be kind of interesting. So again, we have our Monday talks from people in industry. So next week we have the folks at Rockset talking about their database system. Rockset was founded by the people that built RocksDB at Facebook, and they forked it off and built a new analytical engine based on it. There's somebody from Yandex coming to talk about the Odyssey proxy. We'll talk about proxies later in the semester when we talk about distributed databases.
Just think of it as something that sits in front of Postgres and handles incoming connections and can do connection pooling. And then on October 10th, we'll have a developer from fly.io talk about Litestream, which is a version of SQLite that can replicate to S3 files on AWS. If you pay attention to Hacker News, SQLite is kind of the hot database now. There are trends: Postgres is still a hot one, and SQLite is hot as well. So the fly.io folks are invested in this pretty heavily. So where we're at in the semester is that we're continuing to go up the stack in our database system architecture. We talked about how we store pages on disk. We talked about how to manage things in memory. And now we're at this level here in the middle, at the intersection between the pages in the buffer pool and the execution engine. These are going to be called access methods, right? They're the methods the database uses to access data. And so for the next two classes, we're going to talk about the two main data structures we're going to have in our database system that we can build on top of our tables. And that's going to be hash tables and trees. So today we're going to do hash tables. We could use these for table indexes, but we're also going to use these for auxiliary data structures like the page table you're building for project one. We'll use these for joins and a whole bunch of other things in the database system. And then on Thursday, we'll talk about B+ trees, which are the best data structure ever built. What was that, sorry? Splay trees are best. You can go talk to Danny Sleator about that. I know of no database system that uses a splay tree. Okay? I can't talk about it right now because we're recorded, but we'll deal with that later. He's a good guy. He's a good guy.
All right, so again, we'll talk about hash tables today and B+ trees on Thursday. These different data structures have different trade-offs, and we're going to use them in different circumstances inside of our database system. And again, I realize a lot of you... All of you should have taken the algorithms course, so everyone should know what a hash table is at a high level. But the thing that's going to differ in our discussion today, and in our tree discussion on Thursday, is how we're going to use them in a database system, where we care about sequential reads versus random reads, where we care about organizing data in four-kilobyte pages, and so forth. That's going to be slightly different from how you may have thought about these things before. Yes? Thursday. What is it today, Tuesday? Yeah. Is that okay? Yeah, yeah, yeah. Okay. I know, it's this Thursday. Yeah. All right, so these data structures, whether trees or hash tables, we use throughout the entire database system. A bunch of these uses we've covered already, like internal metadata: we talked about using a hash table for the page table in the buffer pool. We can use them for core data storage, storing the tables themselves. We'll see this in B+ trees, where the leaf nodes of the tree could actually be the tuples themselves, and there are certainly database systems that store the tuples inside the value portion of a hash table. We can use them for temporary data structures: if we're running a query and we realize, hey, it'd be really nice if we had a hash table or B+ tree right now to make this query run faster, we can build one on the fly, use it for that one query, and then immediately throw it away. And then, of course, we use them for table indexes, like the index in the back of a textbook that you use to jump to individual pages based on keywords. So again, these data structures can be used all throughout the system.
And then sometimes we're going to care about parallelism, sometimes we're going to care about durability and recoverability, and there are different trade-offs we can make based on what data structure we want to use in different circumstances. The two main design decisions we have when we choose a data structure are how we're going to organize the data, either in memory or in pages that we want to write out to disk, and we want to do this in a way that gives us the most efficient access for the use case we're using it for in our system. For now, we're going to mostly talk about single-threaded access. Next week we'll talk about multi-threaded B+ trees, and we'll sprinkle in a little discussion today about how to handle multiple threads accessing our hash tables, but we'll focus on a lot of that later on. Things are going to be tricky when you use these data structures for indexes, because not only can you have threads accessing the physical contents at the same time, you can have threads modifying the logical contents at the same time. That may not make sense right now, but when we talk about transactions you'll understand what I mean. And so again, how we handle multiple threads doing operations on our data structures and accessing them at the same time is going to be one of the things we have to keep in mind as we go forward. So again, the definition of a hash table should not be new to anyone here, but the high-level idea is that it's an unordered associative array that maps keys to values. We don't care what the keys are, and we don't care what the values are, other than, to make our lives easier, we assume that the combination of a key and value together will be fixed length, because it makes things a lot easier for us.
And there's going to be some hash function specified for the hash table that we use to compute an offset into this array, which allows us to jump to the particular key value pair that we're looking for. Now, it's not always going to be a direct mapping, right? The hash function may take us to a location in this array that may not actually have the data we're looking for, but it at least gets us to roughly the right location. Then we can do some extra steps to look around and try to find what we actually need, right? We can do a little extra work to figure out whether the thing we're looking for actually exists or not. The space complexity we're going to have for our hash table is O(n), where n is the number of keys that we actually want to store. We'll see in practice, though, that the actual space usage is more like 2n to 4n, because we may allocate two to four times extra space relative to the number of keys we want to have, depending on what hashing scheme we're actually using. But on paper, it's O(n). And the different schemes are going to have different trade-offs for how they handle collisions. The time complexity for our operations is, on average, O(1), and in the worst-case scenario it's O(n): we need to do a linear scan looking at every single key. So again, if you're coming from an algorithms course, this all sounds fantastic, right? Because there they're worried about exponential or polynomial time, and here we're saying we can do things in average case O(1). But I will say that in practice, in the world of databases, we care about constants, right? So even though it's O(1), if we can shave off a couple of milliseconds for each operation that we're doing, that's going to be a huge win for us. In their world, they don't care about constants; in our world, shaving constants makes money, so we care, right?
So, you know, we'll see in different situations, as we go along, that there may be a good trade-off where we pay a little extra storage overhead to get reduced computational cost for the operations we want to do, okay? All right, so let's talk about the simplest hash table you could ever possibly build. There's a giant array that we malloc in memory, and we ignore storing things on disk for now. And we're going to have one slot in this array for every single element that we want to store. So the only thing we need to do now, when we want to find a particular key, is take the key that we want, hash it, mod it by the number of slots that we have, and that gives us a pointer to some additional storage — we're not defining what that is yet — that's going to have the key value pair that we want, right? So it's the simplest thing: we take some hash function (we don't have to define what it is yet), run the key through it, that produces some new integer, we mod it by N, and then we jump right to where we want to be, right? What's the problem with this? Yes? He says there's only one slot per value. Yeah, so we're assuming that there's only one slot per key, and so we have to know the number of keys ahead of time to map exactly to this. What are some other problems? I heard collisions, right? So we're assuming that there are no possible collisions: every key is unique, and after hashing it we get a unique hash location, right? So we've basically covered all of it, right? Every key is unique, we know the number of keys ahead of time, and it's fixed, meaning they say, here's your one million keys, I'm never going to take any away, I'm never going to add any more, so go to town, right?
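The "simplest possible" table described above can be sketched in a few lines. This is a Python stand-in for the in-memory array the lecture describes, and it bakes in every unrealistic assumption just listed: one slot per key, a fixed key count known up front, and no collisions (the test below forces that by using the identity as the hash function).

```python
class DirectHashTable:
    """Toy direct-mapped table: hash(key) % N jumps straight to the slot.
    Assumes no collisions ever happen — unrealistic, as the lecture notes."""

    def __init__(self, num_slots, hash_fn=hash):
        # One slot per expected key; each slot holds a (key, value) pair.
        self.slots = [None] * num_slots
        self.hash_fn = hash_fn

    def _index(self, key):
        # Hash the key, mod by the number of slots -> slot offset.
        return self.hash_fn(key) % len(self.slots)

    def put(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._index(key)]
        # Still store the key so we can check we found the right thing.
        if entry is not None and entry[0] == key:
            return entry[1]
        return None
```

If two keys ever hash to the same slot, `put` silently overwrites — which is exactly why real schemes need a collision-handling strategy.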
And then we're also assuming that we have what is called a perfect hash function, meaning for every single key that we give it, the hash function is guaranteed to produce a unique hash value. Does this exist in practice? No, right? It exists in theory: there's theoretical literature that discusses, hey, you could build a perfect hash function this way, but when you actually read how they do it, they use a hash table, right? So you basically need a hash table for your hash function for your hash table, right? That's why nobody does this in practice. So this approach is unrealistic. We're not always going to know exactly the number of keys that we're going to have — sometimes we do, sometimes we don't. We're not guaranteed that every key is going to be unique — sometimes they are, sometimes they're not. And we're definitely never going to have a perfect hash function. So the design decision we have when we choose a hash table for our database comes down to two choices. The first is: what is the hash function we want to use? Again, that's some function where we take some arbitrary key and map it into a smaller domain of integers. We'll lose any notion of order preservation from the key to the hash value, but that's okay. And of course there's going to be this trade-off between how fast our hash function is versus our collision rate. What's the fastest hash function you could ever build? The identity? Even faster. A constant? Yeah, what constant value? One, exactly, yes. No matter what key you give me, I'm going to give you back one. It's going to be super fast, but the collision rate is going to be terrible, right? And the other extreme would be the perfect hash function where, again, I do all those extra lookups, but then I can guarantee that I'm going to give you a unique hash value. So we're going to be somewhere in the middle, right?
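The speed-versus-collisions trade-off can be made concrete. This sketch compares the "always return 1" function from the lecture against Python's built-in `hash()` as a stand-in for a real hash function; the bucket count of 64 and the synthetic `key0…key999` strings are just illustrative choices.

```python
def bucket_counts(keys, hash_fn, num_buckets):
    """Count how many keys land in each bucket under hash_fn % num_buckets."""
    counts = [0] * num_buckets
    for k in keys:
        counts[hash_fn(k) % num_buckets] += 1
    return counts

keys = [f"key{i}" for i in range(1000)]

# The fastest possible "hash function": every key collides in bucket 1.
constant = bucket_counts(keys, lambda k: 1, 64)

# A real hash function spreads the keys roughly uniformly (~15 per bucket).
real = bucket_counts(keys, hash, 64)
```

With the constant function every lookup degenerates into scanning all 1000 keys; with a decent hash function, each bucket holds only a handful.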
The next choice we have to make is the hashing scheme. This determines how the hash table is going to handle collisions, right? Because we can't assume that for a given hash value there won't be any collisions, or that we'll land in a slot in our array that's unoccupied. We can't guarantee that. So the question is how we actually deal with those scenarios. And of course there's going to be another trade-off, between how large we pre-allocate our hash table versus the additional instructions, the additional steps, we have to take when we have a collision. If I allocate an infinitely sized array, I will never have a collision, or it's very unlikely I'll have a collision. But of course that's not realistic. So how do we deal with our constrained environment? The combination of these two decisions is essentially what defines a hash table for us in our system. So today's lecture is about, again, how do you build a hash table to solve all these problems that we talked about? The first thing we'll cover is hash functions. And I'll basically talk about what the state-of-the-art functions and implementations are. I don't care about building one; database people usually don't care about building one. You take whatever the fastest one is and just use that. A lot of the old databases that came out in the 80s and 90s implemented their own hash functions. Nowadays there are very good open source implementations; we just want to use those. Does anyone know what the fastest one is, or the best one that everyone uses today? It is not SHA-256; we'll talk about why we don't want to use that. So it's actually XXHash, out of Facebook, right? The same guy that did Zstandard, that compression algorithm I mentioned before, has a hash function, and we'll see in a second that it crushes everything right now. And it's, in my opinion, still state-of-the-art.
All right, so again, for the hashing schemes: static hashing and dynamic hashing. Static hashing is when you know the number of elements in your slot array is fixed ahead of time, and dynamic hashing is when you incrementally grow and shrink the size of the hash table as needed, okay? All right, so the hash function. Again, as I said, it's a function that, given some arbitrary key of any length — we don't care what it is — returns an integer, either 32 or 64 bits, that represents that key, right? As I said, the fastest hash function you can have is to just always return one, but in practice that would be bad. So he brought up SHA-256, which is part of this class of cryptographic functions called SHA-2. We don't want to use this. We don't care about cryptographic properties in our hash function, because we're building this hash table internally inside of our system. We don't care about leaking keys or anything like that, right? No one's ever going to see this data on the outside. This is something that we're building internally in our system. So we don't want to pay the computational overhead of something very expensive like SHA-256; we want something faster. Yes? So someone always asks this every year: do I care about denial of service, where someone could skew the keys in a certain way that causes a huge collision rate and therefore blows up the computational overhead of the system? No, because if the database system is used internally, it's assumed that whoever's using it has access to the system. For Amazon or Google, whoever the database vendor is, if you load a data set that you've set up to have this huge collision rate, they don't care, because you're going to pay the computational overhead. They're charging you for the resources. They'll gladly take your money if you try to do something crazy like that.
So since we're not giving raw access to the database system to anyone, we don't care about this. Again, we want something that is fast and has a low collision rate. So this is just a smattering of some of the more common hash functions that are out there. CRC goes back to the 1970s and is used for networking; modern CPUs now have CRC instructions, so sometimes you see some systems use those. The modern era of hash functions, I think, came around in 2008. Again, this is from the perspective of someone who isn't a cryptography or hash function researcher; this is my perspective from databases. So some random guy on the Internet decided, hey, I'm going to build my own hash function and put it on GitHub. People started using it, and it turned out to be really good. That's MurmurHash, and it was designed to be a fast, general-purpose hash function that you could use for any possible domain. Then the Google people forked MurmurHash in 2011 and designed CityHash to be faster for shorter keys. XXHash, as I said, is the state-of-the-art one; this is from the same guy that did Zstandard compression. I think they're up to XXHash3, which again is really fast. And then FarmHash is an extension of CityHash that is designed to have better collision rates. I think there's also HighwayHash from Google, but that's for cryptographic stuff; we don't care about that. And there's also CLHash, which relies on new hardware instructions from Intel. In general, though, XXHash is the way to go. So this is just a quick benchmark that I ran on my personal machine, forked from a benchmarking framework this guy has on GitHub. And what I'm going to show you here is the difference in performance. The y-axis here is the number of bytes per second that these different functions could hash. So the bottom one is MurmurHash.
The red one, I think, is what you guys are using for project one in BusTub — CityHash — it's not high performance, but we don't care about that. And then the green line here is XXHash3. This is an older version, even; the newer version might be even better, but it just crushes everything else. You might wonder why there's a sawtooth pattern, where it sort of goes up and goes back down. That has nothing to do with hash tables; this is just taking raw bytes and seeing how fast you can hash them. Yes? Yeah, so he points it out: in order to be aligned to cache lines, you want to be at exactly 64 or 128 bytes, so they pad the keys out, and that padding is counted in this calculation here. That's why it sort of goes up and then goes back down, right? But XXHash3 crushes everything, at the larger sizes, and even at the smaller ones it's still better. So, okay, this is my opinion: this is the right way to go. CityHash and FarmHash don't use SIMD or vectorization instructions, because they said this hurts portability. I think XXHash3 does use SIMD, and that's part of the reason it's getting better performance. Yes? So his question is: is using a faster hash function always the best thing to do, given there's a trade-off between collisions and performance? As I said, return 1 is always the fastest hash function, but its collision rate would be terrible, right? So yeah, I'm just showing performance here. A great segue to what you're talking about: there's a benchmark from the guy that maintains SMHasher. In the same way I'm obsessed about databases, this guy is obsessed about hash functions, which is fantastic. His page is huge: he takes every single hash function he knows about, puts it in this benchmark framework, and ranks them in terms of their throughput, which I'm showing here, and their collision rate. Right? So, what does he say?
His results show that XXHash3, in my opinion, has the right balance between collisions and performance. Yes? The question is: is this guy's metric enough to determine the quality of a hash function? If you really care about hash functions, maybe, I don't know, but from my perspective on a database, yes. And the answer I'm telling you is: just use XXHash3, right? That's it. That's the end of the discussion. Use XXHash3. I'm just saying there are different trade-offs to these different things. Postgres, I think, still rolls their own hash function. I don't know what MySQL does. I think SQLite rolls their own. But if you're building a new system today, a lot of the newer ones just use XXHash3. Okay? All right. So, that's it for hash functions. One more question? Yes. So, the question is — and it's not just XXHash3 — are these functions operating on strings, or are they operating on any binary data? It's just binary data. They don't know and they don't care, right? Whatever the binary input is, you get an integer as output. And that makes sense for a database system, because, as we talked about with how we represent different values, it's all just bits. We just throw bits at it. Okay. So, first we're going to talk about static hashing, where we have to specify the number of potential locations we want in the hash table ahead of time. And then we'll talk about dynamic hashing, where we allow the hash table to grow and shrink to support more and fewer keys over time. Right? So, we're going to start with linear probe hashing. And I apologize ahead of time: there's linear probe hashing, and there's also linear hashing. They're different. Linear hashing will be a dynamic scheme. Linear probe hashing is a static scheme, and it's the most common one. So, we're going to start with this.
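The "bytes in, integer out" interface is the same for all of these functions. XXHash3 isn't in the Python standard library, so as a stand-in this sketch uses CRC-32 (mentioned above, via stdlib `zlib.crc32`) just to show the shape of the interface: the function never knows whether the bytes were a string, an integer, or a tuple — it only sees bits.

```python
import zlib

def hash_key(key_bytes: bytes, num_slots: int) -> int:
    """Arbitrary binary input -> 32-bit integer -> slot offset.
    CRC-32 stands in here for a real choice like XXHash3."""
    h = zlib.crc32(key_bytes)      # unsigned 32-bit integer
    return h % num_slots

# The function doesn't care what the bytes represent:
string_slot = hash_key(b"hello", 1024)
int_slot = hash_key((12345).to_bytes(8, "little"), 1024)
```

In a real system you'd serialize the key's in-memory representation to bytes and feed that straight into the hash function.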
And then we're going to talk about Robin Hood hashing and cuckoo hashing, which are variants of linear probe hashing. The spoiler is that most database systems are going to implement linear probe hashing, because it's just super fast. Right? So, linear probe hashing — the textbook might call this open addressing; I'd double-check, but sometimes you'll see it called open addressing in different systems. The idea is basically that the actual address of a key may not be the address specified by the hash function. Right? Keys can sort of float around, and this is why it's called open addressing. I'll also say, when you use a dictionary in Python, you're essentially getting a hash table like this. So, the way it works is it's just a giant table of slots. We hash our key, mod it by the number of slots — since we know the number ahead of time — and that jumps us to some location in the array. If we land in a location and it's empty, then it's ours: we put our key in and we're done. But if we land on it and something else is in there, meaning we have a collision, then we scan through one slot after another in linear fashion, looking for the next free slot. Once we find one, we put our key in there. If we reach the bottom and there are no free slots, we loop back to the top and start over. So you can think of the array as a giant circular buffer. Right? And if we keep scanning and we don't find a free slot, then the table's full. The only way to handle that is to take a latch on the entire data structure — the entire hash table — allocate a new hash table that's double the size of the previous one, and then rehash everything and put it over there. Right? Yes? Sorry, your question is what? Yes. For this question, I'll do an example.
The question is, as I'm scanning along, will I end up seeing keys that aren't related to what I need? Yes. That's the trade-off we're making. That's why, again, it's O(1) if we land exactly on what we want — best-case scenario — and in the worst-case scenario we have to scan everything, O(n). Right? All right. So, let's say we want to insert all these keys, A through F. For the first one, the table's empty. We hash key A, mod it by the number of slots that we have, and the slot is empty, so we can go put our value in. The contents of the slot are going to be the key and the value. We always need to store the original key, because we need to determine whether the thing we're looking for is actually the thing at a given slot. Right? And again, the key value pair has to be fixed length, because when we hash mod n, it's just simple arithmetic to jump to the location that we want. That's why it has to be fixed length — it's similar to the dictionary compression stuff that we talked about before. All right, B, same thing: hash B mod n, I land at this location, store my key value pair. Now I get to C. When I hash that, it lands on this slot, but that's already occupied by A. So all I need to do is jump down to the next free slot that I find and put C there. Right? Same thing with D: D is where C wants to go, but it can't go there, so we jump to the next location and put it below. E wants to go where A is, but it can't, so we just keep going down until we find the next free slot. Likewise, F wants to go here, so we just jump down like that. If we had another value that wanted to go where F wants to go, we would just loop back around and start from the top. Right? Yes? Your question is, when I'm inserting something, is it guaranteed that this pointer here jumps to a particular location? No, no, it's defined by the hash function.
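The A-through-F walkthrough above can be sketched as a small linear-probe table. This is an in-memory Python toy, not the on-disk layout the lecture alludes to; the test below pins each key's "ideal" slot with a lookup-table hash function so the collisions play out exactly as in the example.

```python
class LinearProbeTable:
    """Linear probe hashing: hash, mod by slot count, and on a collision
    scan forward (wrapping around) for the next free slot. Fixed size."""

    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots    # each slot: (key, value) or None
        self.hash_fn = hash_fn

    def put(self, key, value):
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for i in range(n):                 # treat the array as circular
            idx = (start + i) % n
            if self.slots[idx] is None:
                self.slots[idx] = (key, value)
                return True
        return False                       # table full: caller must resize

    def get(self, key):
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for i in range(n):
            idx = (start + i) % n
            entry = self.slots[idx]
            if entry is None:              # empty slot: key can't be here
                return None
            if entry[0] == key:            # stored key confirms the match
                return entry[1]
        return None                        # scanned everything: not found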
So, the hash function gives me a random integer, right? I mod it by the number of slots, and then I know how big the slots are, and I do the math to jump there. Yes? So, this question is: the search stops when you either find the key... You find the key you want, yes, or? Find an empty location. Right, you find an empty location, meaning you know the thing you're looking for can't be there, or you actually loop back around and come back to your starting location, because then you know you've scanned everything, the table's full, and the thing you want isn't there. Right? Do you have a question as well? Oh, same problem. Okay. Right, pretty straightforward. Deletes make this hard, though. And I'll also say, before we jump to deletes: just assume this is in memory, so I'm not showing page boundaries, I'm not showing how this is actually broken up. We may write this out to disk and so forth. You can imagine easily chunking this up into eight-kilobyte pages, right, with so many slots per page. You'd do the same simple arithmetic we talked about before to jump to the right page offset. All right, let's see how we want to handle deletes. So, I want to delete C. Again, I hash C mod n. That takes me to where A is. I see that C is not equal to A, so I know this isn't the key that I want. I jump down to the next location, and voila, I see C, and that's what I want, right? So I go ahead and delete it. What happens now? Is this good or bad? Yes? Yeah, so the thing is, if I look for, say, D, D would map here, and I'd find an empty slot. And, again, my protocol says, if I see an empty slot and I haven't found the key that I'm looking for, then the key's not there. So I would get a false negative in this case, right? So how do we want to handle this? Well, the first approach is just to do movement: basically, take all the keys that came after the C that we deleted here and just slide them up.
This is a bit of an oversimplification, but basically what you would do is look at all the keys that came after the thing you deleted, delete them, rehash them, and put them back in. Right? So now when I do a get on C or a get on D, I'd find the thing that I'm looking for. Is this a good idea or a bad idea? Yeah, he says nobody actually does it — so yeah, but why? Right, so he says it's too costly to move all the elements. Exactly right. What you would actually need to do is scan down all the keys that came after the deleted one until you get to the next empty slot, which could be way back down here, and delete them and put them back in. And that's very costly, because again, you have to protect this data structure with a latch, or these pages could be out on disk, and now you're going to fetch them in and then write them back out. Right? So, as I said, nobody in practice does this. And B there, you may need to slide it down as well, right? So the better approach, what most systems do if they're going to support deletes in their hash table — and again, some hash tables don't need to do deletes at all, so we don't have to worry about this — but if you do, this is the most common technique. Instead of physically removing the key that we deleted, we're just going to logically delete it. What I mean by that is, instead of putting an empty space here where we're deleting C, we're going to put this little tombstone marker in that slot. And that tells anybody else that comes along next and looks inside this hash table that, yes, there used to be a tuple here, but logically it's not there anymore. So therefore I can treat the slot as if it is occupied and keep scanning if I'm looking for something.
But I know that I can ignore any of the bits that are actually in there. So now if anybody does a lookup for C again, even though it may land here, where C used to be — the bits might still be there, but we don't care — we see the tombstone and we know we can ignore anything else in that slot. Right? And then later on, say somebody does a lookup for D: they see the tombstone, skip that slot, and go down like before. Now someone might come along and want to store G, and G hashes to where C used to be. We see the tombstone and say, aha, okay, I'll take that over and put G there. And that doesn't require us to do anything down below for any of the other keys — the linear probing scheme still works correctly. Right? Yes? The question is, how do you represent a tombstone? You store an extra bit in the header, right? Though with byte alignment, that may be an extra byte. Yes. So his question is, how would I actually store the tombstone — the concern is that I would actually store an extra byte per key. Does that matter if I have a billion keys? Yeah, it's not trivial, but relative to the size of the data it's usually much smaller, because I also have to store the key and the value. Again, it could be like the slot array approach: if you organized it as pages, you could have a bitmap on the page for all the slots that you have. So now, instead of storing a byte per entry, it's a bit per slot, and it's not that big of a deal. Yes? So this question is, how does the page layout work if we're backed by disk pages — does that make it harder to do this hash table? All right, so I'm just not defining here how this is actually being organized physically.
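The tombstone protocol just described can be sketched as follows. This Python toy marks deleted slots with a sentinel instead of emptying them, so lookups keep probing past them (avoiding the false negative for D) and later inserts (like G) can reuse them. The slot layout and the lookup-table hash function in the test are illustrative, mirroring the A/C/D/G example.

```python
TOMBSTONE = object()   # sentinel: "a key used to live here"

class TombstoneTable:
    """Linear probing with logical deletes via tombstone markers."""

    def __init__(self, num_slots, hash_fn=hash):
        self.slots = [None] * num_slots
        self.hash_fn = hash_fn

    def _probe(self, key):
        # Yield slot indexes in linear-probe order, wrapping around once.
        n = len(self.slots)
        start = self.hash_fn(key) % n
        for i in range(n):
            yield (start + i) % n

    def put(self, key, value):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None or entry is TOMBSTONE:   # reuse tombstones
                self.slots[idx] = (key, value)
                return True
        return False                                  # table full

    def get(self, key):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:          # truly empty: key can't exist
                return None
            if entry is TOMBSTONE:     # logically deleted: keep scanning
                continue
            if entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:
                return False
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[idx] = TOMBSTONE   # logical delete only
                return True
        return False
```

No keys ever move on a delete, which is exactly why this beats the rehash-and-slide approach in practice.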
Meaning, for simplicity, assume I malloc'ed one giant array, right? But it may be the case that I want this to be backed by disk, because I may want some part of the hash table in memory and some of it on disk. Therefore I want to organize it as pages, because that's how we do things with the buffer pool and the disk manager. So what I was just saying is that you might have, say, eight slots per page, and when you go get a single slot, you're actually fetching the whole page, like the stuff we talked about before. Yeah. The logic of it doesn't matter for now. And then his statement is: okay, if I'm storing a tombstone per empty slot, isn't that wasted space? And the way to amortize that is, instead of storing a byte per entry per slot, you have a bitmap per page, so now it's a bit per slot, and it's not that much overhead. All right. So the one thing we still have to talk about is non-unique keys. Right? And this is definitely going to come up when we use hash tables for joins. There's two basic approaches to this. The first is that you have a separate linked list, where in your hash table, for a given key, you have a pointer to some other auxiliary data structure, like a linked list, that holds all the values that correspond to that given key. So now if I want to do a lookup and see whether my key/value pair actually exists, I have to follow the pointer to the other structure and find what I need. And I do this because I want my values to be fixed length. The more common approach, which is more wasteful but easier to implement, and this is what everyone typically does, is that you just store redundant key/value pairs together in the hash table. Right?
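To make the two options concrete, here is a tiny hypothetical sketch in Python; the column value `"foo"` and the record IDs are made up for illustration:

```python
from collections import defaultdict

# Approach 1: the hash table's value is a pointer to a separate list
# holding every record ID that matches the key.
value_lists = defaultdict(list)
value_lists["foo"].append(101)   # record IDs of rows where the column = "foo"
value_lists["foo"].append(205)

# Approach 2: store redundant (key, record ID) pairs directly in the table;
# a flat list stands in here for the hash table's slots.
flat_pairs = [("foo", 101), ("bar", 300), ("foo", 205)]

# "Does any 'foo' exist?" can stop at the first match; removing one entry
# means finding that exact (key, value) pair.
assert any(k == "foo" for k, _ in flat_pairs)
foo_rids = [rid for k, rid in flat_pairs if k == "foo"]
```

Approach 1 keeps the values fixed length at the cost of a pointer chase; approach 2 duplicates the key per value but keeps everything in one structure.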
And again, I don't care about the ordering of the keys in my hash table, because the hash function destroys all that; it's really about when something got inserted into a particular spot. Right? So now if I just want to know whether a key exists, I can scan through here and stop at the first matching key I find. If I want to actually remove a key, I need to remove all the matching entries, or a particular key/value pair. There's another trick to make keys unique: say I want to index on a column foo; what you'd actually store as the key is column foo plus the record ID, and the value is just the record ID as well. That guarantees uniqueness of the keys. We'll cover that later. Okay? All right, so let's talk about variants of linear probe hashing. Again, if you read Hacker News, one popular one is called Robin Hood hashing. And if you know who Robin Hood is, it's an old English story about some guy in the woods who went gangster and stole money from rich people and gave it to poor people, right? So the idea in Robin Hood hashing is that we're going to steal slots from rich keys, and I'll define what rich means in a second, and give them to poor keys. What we're going to do is, for every single key/value pair in the slot array of our hash table, we're also going to store the number of positions it is away from its original, ideal location. Meaning, when you hash the key, you land at some offset; that's the ideal location, and you record how many steps away from it you are. So now when you insert a new key and there's a collision, you check whether the incoming key is richer or poorer than the key it would replace, and if it's poorer, it takes that slot and the rich one has to move down further.
So the closer you are to your ideal location, the richer you are, and the idea is that on average everyone ends up an equal distance from their ideal location. Right? So let's go back here. We hash A; it lands at this location, nobody's there, so we just take it, and we store the number of jumps we are from the ideal position. We're exactly where the hash function told us to go, so our counter is zero. Same thing with B: B lands at its location by itself, so it gets zero. C comes along; C wants to go in the same location where A is, but at this point, since C is zero hops away from its ideal position and A is zero hops away from its ideal position, they're equal, so we leave A alone, and C jumps down here, and now we set its counter to one because it's one hop away from where it should have been. D comes along; D is zero hops away, C is one hop away, so we're going to leave C alone, because again, the higher the counter, the poorer you are, and zero is less than one. So we leave C alone, and D goes to the next line.
All right, now E comes along. At the very beginning, E's counter is zero and A's counter is zero, so we leave A alone. Then we come down here: E's counter is now one, C's counter is one, they're equal, so we leave them alone. But now when we get here, E's counter is two, because it's two jumps from where it should have been, but D's counter is one. So we're going to go gangster on it: we steal D's slot, put E there with a value of two, and then D jumps down here with a value of two. Right? And again, the idea is that we're amortizing things to make everyone roughly an equal distance from where they should have been, so we don't have to do these long scans. F comes along; two is greater than zero, so we leave that alone, and F goes here. Yes. So the comment is that with linear probe hashing, even more than Robin Hood hashing, you may end up doing a complete sequential scan of all the keys to find the thing you're looking for. Does anyone pre-compute an additional filter in front of the hash table, like a Bloom filter? Not everyone's going to know what a Bloom filter is; we'll cover it in a few more classes, but it's a set membership data structure: does something exist or not? It doesn't tell you where it is; it just tells you whether it exists. So could you put that in front of this hash table to avoid having to do this scan? Yes, we'll see this when we do hash joins, absolutely. Yes? Oh, you're saying flooding, that you could have this sort of cascading effect where you have to move everything? Yes, and this is why, in the modern research literature, this is considered a terrible idea; you don't want to do this. I know of one database system, when they came and gave a talk, the guy said, oh yeah, we use Robin Hood hashing, and we asked him why, Dave Andersen asked him why, and he said, oh, because we saw it on Hacker News. Right? It's not a good thing to say, but yeah.
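The insertion rule walked through above can be sketched as follows; this is hypothetical Python (the function name and slot layout are mine), and `hash_fn` is pluggable so the demo can use an identity hash to make the walkthrough deterministic:

```python
def robin_hood_insert(slots, key, value, hash_fn=hash):
    """Insert into a Robin Hood table; slots hold (key, value, dist) or None."""
    n = len(slots)
    i = hash_fn(key) % n
    dist = 0  # how far the incoming key currently is from its home slot
    for _ in range(n):
        entry = slots[i]
        if entry is None:
            slots[i] = (key, value, dist)
            return
        # If the resident key is "richer" (closer to its home slot) than the
        # incoming key, the poorer incoming key steals the slot, and the
        # displaced resident continues probing with its own distance.
        if entry[2] < dist:
            slots[i], (key, value, dist) = (key, value, dist), entry
        i = (i + 1) % n
        dist += 1
    raise RuntimeError("table full")

slots = [None] * 8
for key, val in [(0, "A"), (1, "B"), (8, "C")]:
    robin_hood_insert(slots, key, val, hash_fn=lambda k: k)
# Key 8 arrives at slot 1 with distance 1; the resident key 1 (distance 0)
# is richer, so 8 steals the slot and key 1 shifts down to slot 2.
```

The swap fires exactly when the resident is closer to home than the incoming key, which is the rich-versus-poor comparison from the lecture; on average, distances even out across the table.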
In practice this is a bad idea, because, as you said, you're doing all this extra shuffling on every insert just to maybe make reads go faster. Now, there could be a trade-off: if you're very, very read-heavy, then yeah, this could be a good idea, but in a lot of database scenarios it's a bad idea. Any questions? Okay. Right, especially if everything's in memory, because now you're incurring branch mispredictions and doing extra copying. Linear probe hashing is so simple, and again, it's sequential operations, which are great for modern superscalar CPUs. We're not going to talk about CPU caching and instruction caches here, but if everything's in memory, you can rip through linear probe hashing much faster than this thing. Okay. So the one that actually is used, and that gets at the issue he brought up, where in the worst case I have to do a complete linear scan over everything: there's a way to avoid that. The Bloom filter approach is one of them, but another one is to essentially use multiple hash tables, so that you're more likely to find a free slot when you do an insert, and lookups are bounded. This technique is called cuckoo hashing, and it's used in a lot of systems. The basic idea is that we're going to have multiple hash tables, each with its own hash function, and every time I want to do an insert, I hash the key multiple times and pick any table that has a free slot. If no table has a free slot, then you steal a slot from somebody: that key comes out, you rehash it, and maybe put it into another hash table. And again, you could have this cascading effect where you keep inserting something, stealing a slot, and reinserting the evicted key, and you have to check whether you've wrapped around and are just doing the same thing all over again. But in practice, if you size the hash table large enough, you can avoid this. So the lookups and deletions are always going to be O(1), because whenever you hash, you're guaranteed to either find the key immediately or know that it doesn't exist; you don't have to do that complete linear scan. It's the puts that become expensive, because you may ping-pong things back and forth. This is named after the cuckoo bird, which is a type of bird that steals the nests of other birds and lays its eggs in them; that's the back-and-forth. And as far as I know, the best open source implementation is actually from CMU, from Dave Andersen: libcuckoo, and they still maintain it. We were using libcuckoo in the hash tables of the database systems we were building at CMU, but again, as far as I know, this is the best open source one. All right, so let's see an example. For simplicity I'm going to show two hash tables; in libcuckoo the default is three, and again, there are theoretical guarantees about whether you're going to have to resize based on how many hash tables you have. So we're going to have two hash functions now. It's going to be the same implementation of the hash function, meaning the same xxHash or CityHash that we're going to use; we just provide a different seed value, so they're not guaranteed to hash keys to the same slots in the two tables. All right, so the first thing we do is put A: we hash A twice and do a lookup to see which table has a free slot. Now, I'm showing this in parallel; in practice you would do one after the other in a single thread, but for visualization purposes we'll do it together. So here they both have free slots; we'll pick this one first and put A there. Next we
want to put B into our hash table. Same thing: we hash it; in the first table, we recognize that A is already using that location, so we leave that alone, and instead we see that on the other hash table it's empty, so we put B there. Right. Now we have C, and this is where we get the thrashing back and forth: we have to move values around as one value steals from another. So we hash C: A has taken its slot over there and B has taken its slot over there, so we pick B as our victim. We steal the slot, and now we've got to take B out and put it back in the other hash table. We use the first hash function to figure out where it goes over here, but remember, in the beginning B hashed to the same slot that A did, so B steals from A now, and now we've got to put A back on the other side. We hash it with the second hash function, it lands over here, we have a free slot, and we're done. Right? Yes. As he says, yes, you have to keep track of whether you're back where you started at the beginning, because then you know you're stuck in an infinite loop, and you break out. And in that case, the way you handle it, and this is not just for cuckoo hashing but also Robin Hood hashing and linear probe hashing, all these static hashing schemes: when you realize the table is full or you're stuck in an infinite loop, you basically allocate a new hash table that's double the size of the previous one, and then you reinsert everything. His question is, could you just create a new hash function with the same size table and insert into that? That actually might work; I don't think libcuckoo does it by default, but yeah, it may work. Yes, so you flip a coin to pick the victim, yeah, it doesn't matter. Yes. So now do a get on B. Again, this is why we need to store the key and value together: I'm looking for B, both candidate slots are occupied, but this is the one that I actually want, and I'm done. Yes. So the question is, what if I look up A? Going back up here, A hashes to these two locations. Okay, so down here, after all the movement back and forth, A is over here, but the first hash function would hash here. What's the issue? We know we're looking for A; the first hash function lands us here, we see the key that's stored there, and we ask, oh, is it equal to A? No, this is not what I want. A's second hash lands over here; A equals A, so this is what we want. Now, if there was something else in this slot here, then we would know that, since A wasn't here and the key occupying the slot isn't A either, A doesn't exist, so we can stop. His statement is: if we create a new table, as in his proposal, isn't that considered dynamic? I mean, it's not doing it incrementally, right? So the answer is, at a high level, yes: it is dynamic in the sense that it can make a new one, but it's a coarse-grained approach; I allocate a whole new thing. How do I put this: if my car catches on fire and the only thing I can save out of it is the cup holders, and I build a new car and put those old cup holders in it, is it the same car? There's the old Greek thing, the ship in Greek philosophy: if I have a ship and over the years I've replaced every single piece of it, is it still the same ship? At a high level, sure, it's a ship; is it exactly the same one? Well, the bits and pieces are different. That's basically what he's saying; the ship metaphor is better than the car one. Okay, cool. Yes.
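The insert-and-evict dance above can be sketched in a few lines of hypothetical Python. The two hash functions below are simple stand-ins for two differently seeded hashes, the eviction loop alternates tables the way the example bounces B and A back and forth, and libcuckoo itself is a far more sophisticated concurrent implementation:

```python
def cuckoo_insert(tables, hashes, key, value, max_kicks=16):
    """Insert into a two-table cuckoo hash; returns False if a cycle is hit."""
    # First, try to find a free slot in either table.
    for t, h in zip(tables, hashes):
        i = h(key) % len(t)
        if t[i] is None:
            t[i] = (key, value)
            return True
    # Both slots taken: evict victims, alternating tables each kick.
    which = 0
    for _ in range(max_kicks):
        t, h = tables[which], hashes[which]
        i = h(key) % len(t)
        if t[i] is None:
            t[i] = (key, value)
            return True
        (key, value), t[i] = t[i], (key, value)  # evict victim, take its slot
        which = 1 - which  # the victim tries the other table next
    return False  # wrapped around: caller should rebuild with bigger tables

def cuckoo_get(tables, hashes, key):
    """O(1) lookup: the key can only be in one of two slots."""
    for t, h in zip(tables, hashes):
        e = t[h(key) % len(t)]
        if e is not None and e[0] == key:
            return e[1]
    return None

tables = [[None] * 4, [None] * 4]
hashes = [lambda k: k, lambda k: k // 4]  # toy stand-ins for seeded hashes
for k, v in [(1, "a"), (5, "b"), (21, "c")]:
    cuckoo_insert(tables, hashes, k, v)  # inserting 21 evicts key 1 to table 2
```

Note how `cuckoo_get` probes at most two slots, which is the O(1) lookup guarantee; all the cost has been shifted onto inserts.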
So the question is, can we assume that the bytes for the keys and values are memcpy-able, which is not always the case in C++? I mean, yeah, it's just a bunch of bytes; we can do whatever we want with it. Right. Yes. So he's asking what the behavior would be with non-unique keys. You could do the thing I mentioned before, where you store the record ID as part of the key, and therefore it is a unique key. Actually, for hash tables that won't work, because if I'm trying to find the record that has this key and I already had the record ID, then I wouldn't need the hash table, right? You can do that trick in B+ trees; you can't do it here. You'd basically have to have an auxiliary data structure as the value, and then you look in there for the thing you want; it's another abstraction layer. So, the thing I was mentioning before about packing in the record ID: B+ trees are going to allow you to do prefix searches, meaning I only have the first portion of the key, not the whole thing, and that's why you can pack in the record ID and get that uniqueness trick. You can't do that in a hash table, because you hash the whole thing, and if I only have a portion of the key, it's not the same key. So I misspoke on that one; you would use the pointer approach I described before. So, all these previous hash tables required us to know the number of elements we want to store ahead of time. Again, in the case of cuckoo hashing, there's no guarantee that I'm going to get complete occupancy of the table, because of the way I'm moving things back and forth. At least in linear probe hashing, if all the slots are full, then the table is full, and I have to basically build a whole new one. Again, the standard practice is to build a new hash table that's 2x the size of the original one. That's what I was saying before: the space complexity isn't always just O(n); it could be something much larger than that, because you want to allocate something with enough free space to handle a certain amount of growth. So the dynamic hash tables that we're going to talk about now are able to resize themselves on demand and still be the same hash table, without having to reallocate everything all over again. The most common one is called chained hashing; it's sometimes called bucketed hash tables. This is usually what people are most familiar with when we talk about hash tables; I think this is what you get with Java's HashMap class, this is what it does underneath the covers. The idea is that the slot array in our hash table is now going to point to buckets that hold the key/value pairs, and we can keep extending the bucket chain for a given slot as we add new elements. We just do a linear search within the buckets to find the key/value pair we're looking for. Right? So the way it resolves collisions is that if keys hash to the same slot, we just keep appending them to the list. One way to think of this is that you're partitioning the linear probe hash table we talked about before into these smaller buckets, so the linear search portion is not a scan of the entire table; it's localized to just your bucket. Right? So our hash table now looks like this: we have our bucket pointers, which are equivalent to the slot array we talked about before, and then we have our buckets stored somewhere else. So now when I do an insert of A, I hash it, mod by the number of bucket pointers I have; that gets me to some offset here, which has a pointer to the starting location of the first entry in the bucket chain for this slot, and I find the first empty spot and put A there. Same thing for B: I hash it, it takes me to this top bucket chain up here, and I find
the first slot, and I store my data. So now if I have C, C takes me to the same starting point as A, just like before; I check whether the space is occupied, it is, so I jump down to the next slot and store C there. But now if I want to store D, again I land at the first bucket in my chain, and both A's and C's slots are occupied, so instead I allocate a new bucket that has free slots, and I put D there. Yes, the question is whether the size of the bucket is chosen by us, us being the database developers: yes. In this case they're short for all these examples, because I have to fit them on the slides; in practice it's going to be the page size if it's backed by disk, and if it's in memory, you might do one megabyte or something larger. Right. So now I want to insert E: same thing, I look at A, C, it's full, then I go to the D bucket and E goes next to it. So this is a good example of the thing you said before: in the bucket pointer table up here, I could maintain a filter that says whether the key I'm looking for even exists in this bucket chain, and if not, I know I don't even need to bother going to look. That's a very common optimization. Yes: in practice, would it still make sense to reallocate the hash table after a while? Right, because the statement is that even though this thing can grow dynamically, if everything hashes to the same slot, then this bucket chain could grow super, super long, and then you have to do a linear scan on that. Yeah. So the statement is, if I don't choose the right number of bucket pointers ahead of time, then keys may all go to the same chain, and after inserting a bunch of stuff it gets too big, and it may be the case that I want to resize. Right, and this is what the other two schemes we'll look at solve for us. Yes. So the statement is, if this chain grows arbitrarily long, is there any way to say, all right, I want to insert F, but instead of putting it here, put it down in another chain? But then how would I know it's there? No, how would you do it? So think about it. His statement is, I want to put F here; actually F does map here, but say this chain gets too long, so maybe I want to put F down here instead. And my pushback is, how would you actually find it? You have two options. One is that if I reach the end of this chain when looking for F and it's not there, I go check the next one; but that sucks for everything else, and it sucks for F as well. The other solution is to have an auxiliary data structure up here that says, oh, by the way, if you're looking for F, go here, not here. And what would that auxiliary data structure be? It's a hash table, right? So you'd have another hash table for your hash table. Well, you're actually getting at linear hashing; we'll get to that in a second. Yes? Sorry: within a bucket itself, are these slots sequential in memory, rather than a linked list with arbitrary locations? Yes, yes. Again, if this is backed by disk, we don't want a bucket size of one, because that's one 8-kilobyte fetch to go get one slot; we want to pack things together as much as possible. That's the whole point of this. Yes. So the statement is, this sounds very similar to the separate linked list we maintained for non-unique keys. Yes, same thing. So, twenty minutes to go; let's get to the two hardest hashing schemes. It's always a good time. So extendible hashing is an extension of chained hashing where we're going to split buckets, instead of letting the linked list grow forever, which is the problem he was trying to deal with. The key
about this, though, is that now, in our bucket pointer slot array, different locations in that array can point to the same bucket, even though the hashed values may actually be different, and we're going to reshuffle and reorganize the bucket entries any time we have a split, by expanding the number of bits we look at in our hash values. And so the advantage we get from this is that we can reorganize the hash table, but the data movement is localized to just the part of the hash table that actually has to get split; we leave everything else where it was located originally, without having to move it at all. That's why this is better than creating a whole new hash table that's double the size of the previous one and loading everything back in: we only have to move data for a small portion of it. All right, so the way it's going to work: these are our bucket pointers here, and we'll have this thing called a global counter that tells us the number of bits we have to look at in the hash value for a given key, and then every single bucket chain has a local bit counter. Now, the local one you don't actually need to maintain; it's here for illustration purposes. You do need the global one. You can think of the local counter as telling us the number of bits of the hash value, in the binary representation here, that determine membership in that bucket. In the case of the first one here, we only look at the first bit, because the local counter is 1; so no matter whether the hash values are different, as long as that first bit is 0, we know we go to this first bucket here. For the other two buckets down here, we look at two bits, right, since the local bit counters are set to 2: 1, 0 goes to this bucket here, and 1, 1 goes to this bucket down here. Right? So this is how I do a lookup on A: I look at my global counter, it says to look at two bits, that takes me to here, and that jumps to the bucket right there that has what I want. Right? Now I want to put B: again, same thing, I look at the first two bits, because that's what my global counter says, that maps me to this location in my bucket pointer array, I jump to that bucket, find the first free slot at the bottom, insert it, and I'm done. Now I want to put C: same thing, I look at the first two bits, that maps here, which takes me to this bucket, but now this bucket is full. So instead of doing what I did in chained hashing, where I just extend the linked list of buckets over and over, I'm going to split this bucket. I'm going to increase my global counter to three, because I need to look at more bits to do the split, and then I'm going to double the size of my pointer array, and now I keep track of three bits. But for some of the buckets I only need to look at one bit, for some of them two bits, and for other buckets three bits. So I'm going to create a new bucket here, and I'm going to rehash the keys that were in the original bucket; some will stay in the original bucket and some will go to the new one, based on looking at three bits instead of the two we were looking at before. So now I've got to figure out how to update all the pointers here: for the ones that start with zero, I look at the first bit, so all of those map to that same bucket at the top; looking at two bits takes me down to the bottom; and for the three-bit buckets, one set of pointers goes to the old original bucket and one goes to the new bucket. Yes. So the statement is, could you, instead of looking at exactly the number of bits defined by the local counter, always look at the global number of bits? You can do that, but you'd have to update this and make it larger. Yeah, you could do that.
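The split machinery just walked through can be sketched as below. This is a hypothetical, simplified Python version (all names are mine): the bucket size is 2 so splits happen quickly, as on the slides, and the directory is indexed by the low-order bits of the hash, with multiple directory entries sharing a bucket until its local counter catches up to the global one:

```python
BUCKET_SIZE = 2  # tiny, so splits happen quickly (slide-sized)

class Bucket:
    def __init__(self, local_bits):
        self.local_bits = local_bits
        self.items = []  # list of (key, value)

class ExtendibleHashTable:
    def __init__(self):
        self.global_bits = 1
        self.dir = [Bucket(1), Bucket(1)]  # directory: 2**global_bits pointers

    def _index(self, key):
        # Use the low-order global_bits bits of the hash as the directory index.
        return hash(key) & ((1 << self.global_bits) - 1)

    def get(self, key):
        for k, v in self.dir[self._index(key)].items:
            if k == key:
                return v
        return None

    def put(self, key, value):
        # Assumes the hash eventually distinguishes keys, so splitting terminates.
        while True:
            bucket = self.dir[self._index(key)]
            if len(bucket.items) < BUCKET_SIZE:
                bucket.items.append((key, value))
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_bits == self.global_bits:
            self.dir = self.dir + self.dir  # double the directory (pointers only)
            self.global_bits += 1
        # Create a sibling and redistribute by examining one more hash bit.
        bucket.local_bits += 1
        sibling = Bucket(bucket.local_bits)
        bit = 1 << (bucket.local_bits - 1)
        old_items, bucket.items = bucket.items, []
        for k, v in old_items:
            (sibling if hash(k) & bit else bucket).items.append((k, v))
        # Repoint the half of the directory entries that should see the sibling.
        for i in range(len(self.dir)):
            if self.dir[i] is bucket and (i & bit):
                self.dir[i] = sibling
```

Note that doubling the directory only copies pointers; the key/value movement is confined to the one bucket being split, which is the locality advantage described above.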
Yes: why do this instead of what he proposed? Because if this bucket was getting full and just kept growing and growing, then it's basically a linear search to go find the key that I want, right? If I split it up, then I get the benefit of the hash table; it's a divide-and-conquer approach. I do the hash, but the key point is that I only had to move these two buckets around; I didn't have to touch the ones at the top or the bottom. Yes: we're assigning more buckets to the hot portion of the key space so that it's split up more evenly? Yes. The question is, how do you go from the global to the local mapping? So, this bucket deals with two bits; you increment that to three, so you find all the directory entries where the third bit differs and repoint those, and you leave alone all the ones that still map to the old bucket. The question is, how do you know that 1, 0, 0 maps to the first bucket? You start from empty, and as you insert, you're doing the bookkeeping to keep track of where things are actually going; as you build it up, you update the counters and you know where things go. Yes. The statement is, what if I double the size of the full bucket? Or, the statement is, what if you basically had a hash table hierarchy? So again, at a high level, is it the same? Yes, but at what point do you stop? You say, okay, my bucket is too large, I'll build another hash table; but what if you still have a hot spot, do you build another hash table? This scheme handles that for you. Yeah. Charlie's statement is that if you squint, this looks like the perfect hashing from the beginning. The question over here: yes, the local counter is for illustration, for bookkeeping; you would have this metadata over here. All right, let me get through the last one, linear hashing, because this is another approach; to me this one is even more complicated, but I think it's super clever. All right, so the hash table is going to maintain this pointer, this cursor, that keeps track of the next bucket to split, and when any bucket overflows, I'm going to split whatever that pointer points to, even if it's not the one that overflowed, and in the meantime I'll just keep extending the overflow chain. The idea is that we want to smoothly grow the size of the hash table over time; we'll eventually get to the hot bucket and split it, but we also want to grow so that we can accommodate growth in the future. The extendible hash table only splits the bucket that fills up, right away; here the idea is that I can split other buckets in preparation for them becoming the hot spot in the future. All right, so it looks like this: we have our slot array, we have all our buckets, and when we overflow, we have this pointer that tells us which bucket we should actually split. So we have a split pointer here that points at slot zero. There will be two hash functions; the first will be the key mod n, and we'll start off with that. So say I want to get 6: I hash it with the first hash function, that takes me to this location, and I'm done. All right. Now, unlike in extendible hashing, where I would split a full bucket immediately, when an insert overflows a bucket, I allow myself to build a new overflow bucket for it, and then I split whatever the split pointer points at. In this case it points at the top bucket, slot zero, so I'm
going to go over ahead and split this I'm just going to add one new entry in my slot array and a new hash function here now it's mod 2n so I know that if anything is 2n otherwise above it I'll just use n and then I'll rehash the keys that are in my split bucket that I'm splitting and then rebalance that portion there and then I move the split pointer down by one let me keep going sorry so then I want to get 20 so again at this point here I know that the hash value for this one it would be mod by four so that takes us to the location that we want to get there if I want to do a look up on sorry to take it back I would look here recognize that it is above my split pointer so therefore I don't want to use the first hash function I'm going to use the second one because then that'll take me to location that I want down below so now if I say I do another a get like this the get is below the split pointer or at the split pointer so therefore keep splitting my buckets until the split pointer moves down then I'll eventually split the one that overflowed yes yes so his statement is I increased the size of the hash the slot array by one in practice you would double the size I'll get a new one right in anticipation of splitting everything so so his question is how at this point I want to get 20 how do I know I need to use the first hash function of the second one because I have this marker line that says where the split pointer is anything above that I've already split so therefore I need to use the second hash function anything at that line or below it I can use the first hash function within within obviously here so the question is what would be the third hash function I would use it would be mod4n yeah yes his statement is 8 is still on bucket 0 yes but hash when I did the split I rehashed everything with this you still still say the same spot yes yes so the question is why do I add the 4 because this thing got overflowed so whatever the split pointer points at I'm going to 
add a new bucket and split the bucket the pointer points at, even though it's not the one that overflowed. The idea is that eventually I'll get to it; it's basically planning for the future. Like, if that second bucket chain is super hot and I keep overflowing, eventually the split pointer will get to it. Yes?

The statement is: why not use extendible hashing? I mean, this is a different approach, a different way to solve the problem. The question is whether this is better than extendible hashing, whether one is easier to code. Extendible hashing is potentially easier to code, yes.

Yes? So the statement is: if bucket 1 overflowed again, would I have to rehash bucket 0? No. If 1 overflowed right now, I would split whatever the split pointer points at; it points at 1, so I would split that and rehash it with mod 2n. And if it overflows again after that, the split pointer would move down to 2, I would split 2 and rehash it, and if it keeps overflowing I'll eventually come back around and get to it again. As he said, it's sort of like preparing for the future. Yes?

So the statement is: when I say rehash everything, it's whatever the split pointer is pointing at; you rehash that with the new hash function, and that decides whether each key stays in the old bucket or goes to the new bucket. Yes. And the question from just before this example: yeah, so the statement is that for illustration purposes, when I split and add a new hash function, I added one slot because I ran out of space on the slide. I tried to simplify it, but you're right: you would double the size, so you would have 8 slots, not 5.

All right. So in practice, as far as I know, most people don't implement extendible hashing or linear hashing. They do linear probe hashing and pay the penalty of doubling the size. So the magic is: can you set the original n value to be good enough so that you don't have to,
you know, rebuild the whole hash table. When we talk about query optimization, this is something the DBMS is going to try to figure out for you automatically, based on, say, if you're using it for joins, the number of keys you're going to have to put in it, so you avoid having to rebuild everything. Yes? Chained hashing is still common in practice; for things like joins, it's typically going to be hashing. Yes?

Her statement is: every time I split, do I have to get a new hash function? It's the same hash function implementation, but I'm giving it a different seed, so it's not always going to map a key to the same location; hash one would have a different seed. Yes? Okay, I'm going to skip deletes, because that's another hornet's nest.

All right, so let's finish up. Hash tables are fast data structures; they're going to give you amortized O(1) lookups. Again, as I said, linear probe hashing is probably the most common one for databases, followed by chained hashing. And again, there's a trade-off between speed and flexibility, and not having to latch the whole data structure and rebuild it if you run out of space. But even though we can use hash tables for table indexes, they're not going to be the most common data structure there. I've already said what the most common data structure is going to be: B+trees, or variants of them. Sometimes in the literature you'll see them called B-trees; Postgres says it's a B-tree, but in practice it's always going to be a B+tree. And again, this is the greatest data structure of all time. I don't care about splay trees, I don't care about other things; it's this one. Okay.
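The linear hashing scheme walked through above can be sketched in code. This is a minimal illustrative sketch, not the course's implementation: the bucket capacity, the initial bucket count, and the mod-based hash family h_i(k) = k mod (2^i * n) are all assumptions chosen to match the lecture's example, and real systems would store buckets as pages, not Python lists.

```python
CAPACITY = 4  # keys per bucket before an insert triggers a split (assumed)

class LinearHashTable:
    """Illustrative linear hashing: h_i(k) = k mod (2^i * n)."""

    def __init__(self, n=4):
        self.n = n        # original number of buckets
        self.level = 0    # completed rounds of splitting
        self.split = 0    # split pointer: next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _bucket_for(self, key):
        # Try the current-level hash function first.
        i = key % (self.n * (2 ** self.level))
        # Buckets above the split pointer were already split this round,
        # so rehash those keys with the next-level function.
        if i < self.split:
            i = key % (self.n * (2 ** (self.level + 1)))
        return i

    def _split(self):
        # Split whatever the split pointer points at, even if it is not
        # the bucket that overflowed; the pointer reaches it eventually.
        self.buckets.append([])
        old, self.buckets[self.split] = self.buckets[self.split], []
        h_next = self.n * (2 ** (self.level + 1))
        for key in old:
            # Each key either stays put or moves to the one new bucket.
            self.buckets[key % h_next].append(key)
        self.split += 1
        # Once every bucket in this round is split, start the next round.
        if self.split == self.n * (2 ** self.level):
            self.split = 0
            self.level += 1

    def insert(self, key):
        i = self._bucket_for(key)
        self.buckets[i].append(key)
        if len(self.buckets[i]) > CAPACITY:
            self._split()

    def lookup(self, key):
        return key in self.buckets[self._bucket_for(key)]
```

Note how `_bucket_for` encodes the rule from the lecture: a slot above the split pointer has already been split, so its keys must be found with the second hash function; at or below the pointer, the first one still applies.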
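On the question about seeds: one way to read the answer is that a single hash implementation is reused at every level, with the level only changing the modulus. A sketch of that idea, where the choice of blake2b and the seed value are my assumptions purely for illustration:

```python
import hashlib

SEED = 12345  # assumed fixed seed; any value works

def hash64(key, seed=SEED):
    # Same hash implementation everywhere; the seed is mixed into the input.
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def bucket_at_level(key, level, n):
    # Level i takes the same 64-bit hash mod (2^i * n). Doubling the
    # modulus means a key either stays in its bucket or moves to exactly
    # one new bucket, which is what the split step relies on.
    return hash64(key) % (n * 2 ** level)
```

Because the level-i+1 modulus is an exact multiple of the level-i modulus, rehashing a split bucket can only keep a key in place or send it to the single newly added bucket, never anywhere else.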