We started off at the beginning talking about the relational model and SQL, and now we're going from the bottom to the top of the stack, describing all the things you need to build at each individual layer. The idea is that these abstraction layers inside of our database system are going to allow us to change how we implement different parts of the system without generally affecting the other layers. So we're at the middle now, where we're going to start talking about access methods. Access methods are the way that the other parts of the system are going to be reading, writing, or otherwise accessing the actual underlying data. These access methods are going to use the buffer pool manager to get memory pages, and the buffer pool manager uses the disk manager to move things on and off disk. So for this week we're going to talk about two different types of data structures. Today's class will be about hash tables, and then on Wednesday we'll talk about trees. It's important as we go along to understand that these different classes of data structures make different trade-offs and different design decisions, and those have different implications for how we can use them inside of our database system. These data structures are used in four different areas of the system. First, we can use them for internal metadata: internal information about the database system, keeping track of what queries are running, what data is stored, what indexes it has, things like that. This is essentially the extendable hash table that you are building for your first project. This is not exposed to the application; it's something the database system maintains on the inside to keep track of what pages it has. Then there are what are called core data structures, which are the way we're going to organize pages, for example.
We showed this before when we said you can maintain your heaps as a bunch of linked lists; that is essentially a data structure maintaining those pages. The next class of usage is what are called temporary data structures. This is where the database system will build and instantiate a data structure on the fly to aid in doing some kind of operation. For example, if you need to join two tables together, one particular type of algorithm will build a hash table on the fly, populate it with the data from the tables you're trying to join, and then when that query is done, it throws that hash table away. It rebuilds it every single time you run that join algorithm. We can use temporary data structures for other things like doing aggregations, finding distinct elements, and so on. The last one, which people are probably most familiar with when they think of data structures inside of database systems, is using them for table indexes. I think of this as the index in the back of your textbook that allows you to find a particular keyword and jump to the exact location in the book where that keyword is referenced. So this is the same kind of thing we're going to have in our database system: data structures, whether a tree or a hash table, that allow us to jump to the pages that have the tuples with the values that we're looking for. For today's class, we're going to focus at a high level on how to build these hash tables. We're not going to talk about particular instances of when you want to use them. Again, using the extendable hash table for the page directory in the first assignment is one instantiation of using a hash table, but as we go through the semester they'll come up over and over again in different scenarios, where we can use them to help us find the data that we need or organize our data.
So when we build a data structure in our database system, or choose a data structure to use, there are two different design decisions we have to consider. The first is roughly how we're actually going to organize the data: how we're going to lay out the data in such a way that it's efficient for us to access it. Because we're focusing on a disk-oriented architecture in this course, if the data structure needs to be backed by disk, then we need to organize our data in terms of pages. If it's an ephemeral thing that's only going to sit in memory, then we don't really care as much, because memory is byte-addressable. Either way, we're going to lay it out based on what assumptions we have about how the data structure will be stored. Then there's a whole other class of design decisions we have to make to deal with concurrency. If we have multiple threads running at the same time inside the database system accessing the same data structure, we want to make sure that they don't interfere with each other and that we don't end up with an incorrect or inconsistent state in our data structure. It's actually even more complicated than that, because not only do we need to worry about threads possibly reading and writing the same regions of memory in our data structure at the same time, we also need to worry about the logical contents of the data structure if we're using it for transactions, to make sure that they have a consistent view of the database. Again, this won't be an issue for what we're going to talk about in this class; it will come up later when we talk about concurrency control and transactions. But the way to think about this is that there's the physical structure of the data, the physical contents of the data structure, and then there's also the logical contents.
So in this lecture and this week, we'll focus on the physical structure and make sure we protect that. We'll talk about the logical stuff later in the semester. In fact, I think we have a whole lecture just dealing with concurrency control inside of indexes, because it's a very complicated thing and most courses actually overlook it. All right, so today's class is focusing on hash tables. At a high level, the way to think about a hash table, and if you've taken any basic CS intro course or data structures course you should be familiar with this, is that a hash table essentially implements an associative array API that allows us to map keys to particular values. The way we're going to do this is that we'll have some kind of hash function, which I won't define now but will define later, that takes that key, does some math on it, and computes the offset in the array that we're looking for. So the most basic type of hash table you can have is what's called a static hash table. Think of the static hash table as mallocing a giant array in memory, where each offset in that array will be one particular element in your hash table. For a given key, if you want to find it, you just take the hash of the key and mod it by the number of slots you have in your array, and then you can jump to the exact location that you're looking for, right? So in this case here, I could put in elements like this. Here I'm assuming that they're all fixed length, so I know I need to allocate, say, eight bytes for every element I want to store in my table. But this only works if your elements are fixed length. If they're variable length, then each slot in your array will essentially just be a 64-bit pointer to some other memory location that'll actually have the real value that you're looking for, right?
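To make the static table concrete, here is a minimal Python sketch of the idea just described: a fixed-size array indexed by `hash(key) mod n`. This is purely illustrative (it is not the course project's implementation), and it deliberately has the same limitations we're about to discuss: a fixed slot count, no collision handling, and an assumption of unique keys.

```python
# Minimal sketch of a static hash table: a fixed-size array where each
# key maps to exactly one slot via hash(key) mod n. No collision
# handling -- a colliding key would simply clobber the occupant.
class StaticHashTable:
    def __init__(self, n):
        self.n = n
        self.slots = [None] * n  # one fixed-length element per slot

    def _offset(self, key):
        # "Do some math" on the key: hash it, then mod by the slot count.
        return hash(key) % self.n

    def put(self, key, value):
        self.slots[self._offset(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._offset(key)]
        # Store the key alongside the value so we can detect a mismatch.
        if entry is not None and entry[0] == key:
            return entry[1]
        return None
```

Lookups and inserts are a single array access, which is why nothing beats a static table for speed when its assumptions hold.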
So this is at a high level essentially what we're trying to do. Then we'll show different data structures that can handle different scenarios and make different trade-offs to do some things better than others. So with the static hash table, what are some obvious problems with this? What's that? Sorry? Hash collisions. Hash collisions is one; what else? Scalability? Scalability in terms of what, though? Actually, this is perfectly scalable; this is the best you can actually do. Space utilization, yeah, that's a big part of it, yes. What else? Exactly, the number of elements is static, and there's one more. Okay, in this I assume that all the keys are unique, right? So there are three assumptions in the static hash table that are problems. The first one is that we have to know all the elements we're gonna have ahead of time. In some cases you will know this, and that's okay; in other cases you don't. And so he was saying, if you make the assumption that you know the number of elements, in this case here say n is infinity: if I just malloc the biggest array that I could ever have, and I only have four elements in it, then I'm just wasting all that space. The second assumption we made is that each key is unique. Now, in some instances of a hash table, this is okay, right? In the page directory in your buffer pool manager, every page ID is guaranteed to be unique, so this is fine. But because we want to use these for possibly table indexes in our database, keys are not always guaranteed to be unique, and so this won't work for us because we'll have collisions and we won't be able to handle them. The third assumption is that we have what is called a perfect hash function: not only can we have collisions on the same key, but we can also have collisions where two different keys both get mapped to the same slot.
So in this particular example here, I have what is called a perfect hash function, which guarantees that if you have two keys that are distinct, then their hashes, for whatever hash function I use, will also be distinct. But we'll see that this is hard to do if you don't know the number of elements you're gonna have ahead of time. So people spend a lot of time coming up with different hash functions to try to reduce this collision rate, getting close to being a perfect hash function without actually being one. So this is essentially what we're gonna try to solve today, right? The static hash table is actually the fastest way you can do this, but it wastes memory and doesn't handle collisions and all these other issues. So we're gonna first start talking about the chained hash table, which is sort of the second most basic hash table you can have. Then as we go along, we'll discuss the different problems you can have with each of these and refine them to get better and better approaches. And then if we have time at the end, we'll talk about how to actually choose your hash function. The way to think about this is that the first three will be sort of fixed-size hash tables, where you know the number of elements you're gonna have. The second two will be dynamic hash tables, where you don't know the number of elements, and there's a sort of intelligent way to grow or shrink the hash table based on the number of things you have. All right, so the first hash table, the first sort of static hash table we can have, is a chained hash table. Chained hashing is essentially what everyone thinks of when you think of a hash table, right? This is what you get in Java if you allocate a HashMap; it essentially uses the same data structure here.
And the idea is that you're gonna have an array of pointers, and at each offset those pointers are gonna point to a linked list of buckets, and these buckets can have multiple entries where you're gonna store all the elements that hash to that particular bucket. If you wanna handle non-unique keys, then you need to be careful how you do this, and there's two approaches I'll show at the end. But the basic idea is that if you hash to a bucket and you wanna do an insert, then you just insert it at the end of the bucket. If you don't care whether the key is unique, you just insert it at the end; if you do care that it's unique, you have to do a linear scan to find whether the key's already there. And a lookup is basically the same thing. So a really simple example looks like this. Again, we have an array of slots, and these slots are just 64-bit pointers to some buckets, right? And these buckets usually are the size of a page, because you may want to spill this to disk. And then if I wanna do an insert into a bucket that's already full, I just extend the chain out, add a new bucket, and then I can insert my new entry there. So now the key thing to understand here is that inside of each element in a bucket, we need to store both the key and the value, right? And the reason we have to do this is because we can have two different keys get mapped to the same bucket, even though they're distinct. And when we're doing our lookup to find our entry, we need to make sure whether we're actually dealing with the entry that corresponds to our key, right? Same thing if, say, we wanna delete a key: we wanna go look in our buckets and do our matching to make sure we actually delete the right value, right? This is pretty straightforward, this is pretty easy. The downside, though, of chained hashing is that the hash table can essentially grow infinitely along the bucket list, right?
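The chained table just described can be sketched in a few lines of Python. This is a hedged illustration, not how any real system lays it out: `BUCKET_SIZE` is a stand-in for "entries per page", and the page-sized buckets, latches, and disk spilling are all omitted so that only the chaining logic shows through.

```python
# Sketch of a chained hash table: each slot holds a chain (list) of
# buckets, and each bucket holds up to BUCKET_SIZE (key, value) entries.
# A full last bucket gets a new bucket appended to the chain.
BUCKET_SIZE = 4  # stand-in for "how many entries fit on a page"

class ChainedHashTable:
    def __init__(self, num_slots):
        self.chains = [[] for _ in range(num_slots)]

    def insert(self, key, value):
        chain = self.chains[hash(key) % len(self.chains)]
        # Extend the chain with a fresh bucket if the last one is full.
        if not chain or len(chain[-1]) == BUCKET_SIZE:
            chain.append([])
        chain[-1].append((key, value))  # non-unique keys just append

    def find(self, key):
        chain = self.chains[hash(key) % len(self.chains)]
        # Linear scan across every bucket in the chain, matching on the
        # stored key (distinct keys can land in the same chain).
        return [v for bucket in chain for (k, v) in bucket if k == key]
```

Note that `find` has to compare the stored key in every entry, exactly because two distinct keys can hash to the same chain.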
If everything keeps hashing to the same bucket, because you have a bad hash function, then that bucket chain keeps growing forever, right? And that makes your data structure essentially the same thing as a sequential scan or a linear search, because you would hash the key, land at the head of the bucket chain, and then need to do a linear search through every single bucket to find all your keys. So if every single element hashes to the same bucket list, it's the same thing as a sequential scan and you're building a hash table for nothing. To make sure this is protected when we have multiple threads, all you need to do is take a latch on each bucket and then make whatever change or do whatever operation it is that you wanna do on it. You don't need to latch the entire thing. So to handle non-unique keys, there's two ways we can do this. The first is that we can maintain a separate linked list for every key, holding all its different values. So let's say that we have our bucket and we have two keys, XYZ and ABC. Instead of actually inlining the different values in each element in the bucket, each entry is just a 64-bit pointer to some other page that has all the values for the corresponding key, right? So now if I wanna, say, find whether a key exists, I only need to look at one entry in my bucket. All right, sorry, I still need to scan the bucket list, but I don't need to go through all the individual values, right? So this will make searching much quicker. What are some downsides of this? Yes? Right, he says you might have to follow a pointer to a different page. Exactly right. What's another problem? If I have only one value for a particular key, then I'm allocating an entire page for just that one value.
Now you may try to do something a bit tricky and say, all right, maybe what I'll do is inline the value in the bucket, and then if I go to two values, I'll split it off and set a little flag bit that says, go look in this value list. But now you're making this all more complicated, because you're adding conditional rules to make this thing work. So the other approach is just to actually store the duplicate keys and values together in the bucket. In this case here, for XYZ and ABC, the duplicate keys are just interleaved with their different values. The downside of this is that we're storing the key multiple times; if it's a large key, then we're wasting space. But the advantage over the other approach is that now you don't need to follow pointers to potentially go find all the values for a particular key. And you're not really wasting that space, because presumably your buckets will be mostly full. So this is another example of the trade-offs that show up everywhere in computer science, especially in databases: there's no one exact answer that solves all your problems. There's always gonna be downsides and positives for each of these things. As far as I know, most systems do the bottom approach; nobody really does the top one, but you could if you wanted to. Yes? So the question is about how we find each entry within a bucket. And his point is that it's not a bitmap: you need to know what the offset of each entry is, or you need an offset array.
Yeah, so correct, yes: you could do this by either storing an offset array, or you could just scan literally, where the beginning of each entry is prefixed with the length of the value so you know how to jump over it to find the next element. And yes, that's stored in the bucket itself, correct. Okay. All right, so another alternative to the chained hash table, another basic hash table implementation, is called open addressing hashing. Basically, instead of having the array point to different bucket lists, you just have one giant array, and you inline the actual key-value pairs inside that single array. The way you handle collisions is that when you hash your key, you'll land at some offset in this giant array, and if something's already there and it doesn't have the same key that you're looking for, then you have to keep jumping down the array, looking at the next element, until you find either the entry you want or an empty slot, or you loop back around and know you've scanned the entire thing, right? So let's look at this example. Again, we have our array with these offsets for our hash keys, right? And say for X, it would go here: just hash it and it lands at some offset. Y goes there. But now let's say we wanna insert Z, and it hashes to the same location where X is. When we land there, we see this slot's already taken in our array, so we just jump down to the next one and take that one, right? And so now if you wanna do a lookup on Z, you do the lookup and you land on X; it's not the thing you're looking for, so you jump down, you find Z, and that's the thing you want. Now let's say that there's W, and W maps to the same location.
You look up W, you land on X, it's not the same, skip that; go to Z, it's not the same, skip that. And then the next slot would be empty, so you know that W doesn't exist. So who thinks this is a good idea versus who thinks this is a bad idea? Okay, go ahead. Okay, so assuming your load factor is good, this is gonna be faster. So his statement is, assuming your load factor, the fill factor, is good, then this would be better than the chained one. Right, because you don't have to follow pointers along the bucket list. And we won't get into this now, but in every database this matters: you're reading things that are in the same cache line, so that's really, really fast, right? Yes, correct, you have the problem that if the key is not in the table, you basically do a linear scan across all the elements just to find out that it didn't exist, right? So the way you sort of have to handle that is that you wanna allocate your hash table to be a little bit larger than the actual number of elements you think there's gonna be, so that you at least have some empty gaps that can short-circuit the search. Okay, so again, this is everything that we said before: to reduce the number of wasteful comparisons, we wanna avoid collisions in the hash keys, because again we'll have X and Z mapping to the same location even though they're actually distinct values. So typically what you need to do is allocate your hash table to be approximately two times the number of elements you expect there to be, and that makes it likely that for every slot there'll be some gap that comes after it. Yes? We heard the InfluxDB guys say that they do something like this, but then they use a bloom filter.
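The linear-probe scheme walked through with X, Z, and W above can be sketched as follows. Again this is a toy illustration under stated assumptions: one flat array, no resizing, and no deletion support (the tombstone problem raised later in the discussion is deliberately left out).

```python
# Sketch of open addressing with linear probing: one flat array of
# (key, value) slots. On a collision we scan forward, wrapping around,
# until we hit our key or an empty slot. An empty slot on lookup means
# the key is not in the table, which short-circuits the search.
class LinearProbeHashTable:
    def __init__(self, capacity):
        self.slots = [None] * capacity

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:
                self.slots[i] = (key, value)
                return True
            i = (i + 1) % len(self.slots)  # probe the next slot
        return False  # looped all the way around: table is full

    def find(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:
                return None  # empty slot: the key cannot be here
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None  # scanned the whole (full) table
```

The fill-factor point from the discussion shows up directly here: the fuller the array, the longer the probe sequences get, which is why you over-allocate.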
Right, so his statement is that the InfluxDB guys talked about how they do something similar to allocating 2x, right? But then they also use a bloom filter to probe before they check the hash table, to see whether something actually exists. Correct. So we'll see the bloom filter approach when we talk about hash joins. You still need a hash table behind the bloom filter, but you can check the bloom filter ahead of it, correct. Yeah, a bloom filter can give you false positives but will never give you false negatives, right? So we'll cover that when we talk about hash joins. Okay, so we have this idea that we can allocate a hash table to be two times the number of elements we expect to be there. But we can actually push this a little bit further and be a bit smarter about how we allocate this memory, and maintain two separate hash tables with two separate hash functions. In the open addressing case, I use a single hash function, I jump to some location, and then I hope that either I find the thing I'm looking for or I hit an empty slot so I know I can terminate my search right then and there. But if we just add a separate hash function, then we can improve the likelihood that we're gonna find an empty slot, or find the element that we're looking for, in less time. So this is essentially what cuckoo hashing does for us. With cuckoo hashing, we're gonna maintain multiple hash tables, and we're gonna use a different hash function for each of the hash tables, and then what'll happen is we'll check both hash tables for a particular key to see whether either one of them has the thing that we're looking for. For our example, I'm just gonna assume we only have two hash functions and two hash tables, but there's no reason you couldn't have three, four, or more. In general, though, most of the time people only use two.
So again, on insert, we'll check both tables, see whether one has a free slot, and we'll store the entry there. When we wanna do a lookup, we do the same thing: we hash with both functions and see whether either table has the thing that we want. We're always guaranteed in our lookups that when we hash a key, the entry will either be in the first table, or the second table, or not there at all. We don't need to keep searching or do the scanning the way we do in open addressing. The issue is, if we do an insert and the slots in both tables are occupied, we need to pick one occupant as our victim: we'll steal its slot, and then rehash it and put it somewhere else in the other table. So let's go through an example. Again, I have two hash tables, and they have two separate hash functions. The first thing I wanna do is insert X, so I'll hash it with both my hash functions, hash one and hash two, and they land at different locations in these different hash tables, and then I'll flip a coin and pick the first one to store my value. Now I wanna insert Y. Same thing: I'll hash it twice using the two different hash functions. In this case here, with the first hash function it maps to the same location as X, which is the last thing I just inserted, so I don't wanna put anything there. But with the second hash function it maps to an empty slot in the second hash table, so I can go ahead and insert Y there; that's fine. So now, if I wanna do a lookup on, say, Y, I'll apply both these hash functions again, and in the case of the first hash table, since I'm storing the key in there, I'll see that it's not the key that I'm looking for, so I know that's not what I want. But if I go to the second hash table, I'll find the actual element that I'm looking for. All right. So now let's see what happens when we have collisions: we try to insert an entry and both slots are already taken.
So say I wanna insert Z, and both hash functions end up mapping to X's location in the first hash table and Y's location in the second hash table. So I have to flip a coin now and pick one of them to evict; essentially Z is gonna steal its slot. Let's say that I choose to steal Y's slot in the second hash table. So I'm gonna put Z in there, and then come back and apply the first hash function on Y to flip it back over here. This is where you get the name cuckoo hashing, because you're sort of like a cuckoo clock going back and forth between the two different hash tables. So I apply the first hash function on Y, and it lands back here in X's slot. And because I know that in the case of Y I've already taken it out of the second hash table, so it has to go in this first hash table, Y is allowed to steal X's slot. So now Y goes there, and now I need to hash X, but using the second hash function, to put it on the other side, and it lands in a free slot there. So again, the lookups are gonna be really, really fast, because I apply the two hash functions and I'm guaranteed to have at most those two places to look to find what I'm looking for. In the case of inserts, the worst-case scenario is that I basically reshuffle everything, because if I take something out and try to put it in the other hash table, that slot may already be taken, so then I gotta take that entry out and put it into yet another slot, and so on. Right, so the question is: what if I have a new entry, I hash it, and I get the same slot as Z in the second hash table? Right, so there's no guarantee that you're gonna end up with a different slot in the first hash table, but if you choose a good hash function that has a low collision rate, then probabilistically it's unlikely that you're gonna keep mapping to the same things. It can still happen, right, but mathematically it's unlikely, right?
So the thing we have to be mindful of is exactly the scenario that you're talking about: if you basically hash everything and you realize that, oh, I'm back where I started, this is the thing I actually tried to insert the first time, then you know you're stuck in an infinite loop. When this happens, you end up having to basically rebuild the entire table: you double the size of the hash table and you rebuild it. Question? So the question is, what about duplicate keys? Right, so you can do basically the two things we talked about before: you can have the slot be a pointer to a linked list of buckets with the values, or you can try to pack them together in the same slot. Yeah, with cuckoo hashing it's a little bit more tricky. Yes, correct, yes. But then what about deletes: if you insert X and then insert Y, and Y lands in the next slot after X, and then you delete X, what happens? Oh yeah, yeah, I see what you're saying. You have to reshuffle, move them up. Yes, deletion is hard, yes. I'm glossing over the details a bit there; there is some extra metadata you store to say where an entry should have been, right. So he brings up a good point that I didn't mention: some hash tables can support deletion, but it's slower to do, and therefore you may not want to use them for everything. If you're using one for a table index, you certainly insert and delete values all the time, so maybe open addressing isn't the way to go. But if you're using one to do a join, a hash join, you build the hash table once with all the tuples in your table and then you just probe it nonstop; you never delete things, so open addressing actually turns out to be the best way to do that. So that's a good point. Any other questions?
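The insert-with-eviction behavior from the Z example can be sketched like this. The two multiplicative hash functions here are illustrative stand-ins (assuming integer keys), not anything a real system prescribes, and instead of tracking the original entry to detect a cycle exactly, this sketch uses the common shortcut of capping the number of displacements before declaring a probable cycle.

```python
# Sketch of cuckoo hashing with two tables and two hash functions.
# An insert that finds both candidate slots taken evicts one occupant
# and re-inserts it into the other table; too many displacements
# signals a probable cycle, which would trigger a full rebuild
# (raised here rather than implemented).
class CuckooHashTable:
    def __init__(self, capacity, max_displacements=16):
        self.tables = [[None] * capacity, [None] * capacity]
        self.max_displacements = max_displacements

    def _slot(self, which, key):
        # Two simple multiplicative hashes (assumes integer keys).
        mult = 2654435761 if which == 0 else 2246822519
        return ((key * mult) >> 8) % len(self.tables[which])

    def insert(self, key, value):
        entry, which = (key, value), 0
        for _ in range(self.max_displacements):
            i = self._slot(which, entry[0])
            if self.tables[which][i] is None:
                self.tables[which][i] = entry
                return
            # Evict the occupant and retry it in the other table.
            self.tables[which][i], entry = entry, self.tables[which][i]
            which = 1 - which
        raise RuntimeError("probable cycle: rebuild with a larger table")

    def find(self, key):
        # Guaranteed at most two probes: one per table.
        for which in (0, 1):
            e = self.tables[which][self._slot(which, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None
```

The two-probe guarantee in `find` is the whole appeal; all the complexity lives on the insert path.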
So as I said, we need to keep track of, when we do an insert, what was the thing we actually inserted, so that if we come back around and try to reshuffle that same thing again, we know that we're stuck in an infinite loop and we essentially have to rebuild the entire thing. There is some math involved to say what's the probability that we're gonna have to rebuild, based on the fill factor of the hash table. With two hash functions and two hash tables, the math says you're probably not gonna have to rebuild the table until it's about 50% full. With three hash functions, you probably don't need to rebuild until it's about 90% full. And again, rebuilding a hash table is really expensive, especially in this case here, because essentially you're allocating another huge block of memory and making another copy of all the entries in the first hash table and putting them into the second hash table. And depending on what you're actually using your hash table for, this may be a bad thing, because you may be trying to process queries that are updating the table at the same time you're trying to rebuild it, and you may end up losing entries if you're not careful. Question, yes? You have two choices of victim: the one in the slot specified by hash function one, and the one specified by hash function two. Right, so in this case here, for inserting Z, we have two locations, and so two possible evictions; over a chain of n evictions there are two to the power n possible paths we can take. So do they rebuild the hash table only after checking all two to the power n? So his question is: in this case here, I need to insert Z and I decided to evict Y, but if I had evicted X instead, it may end up with a different sequence of evictions, and that sequence may not have a cycle. Good question.
So I think what they probably do, for simplicity of software engineering, is instead of flipping a coin you just always try to evict from the same side and don't worry about the other path, right? I think in general it makes things too complicated to say, all right, I went down this path the first time, now I wanna go down the second path. Most of the time you just flip a coin, and if you come back around, then that's good enough. Actually, I don't have a link, but the best open-source implementation of a cuckoo hash table is actually from here at CMU, by Dave Andersen. If you search for libcuckoo, it's the one written by Dave Andersen's students. It's not lock-free, but it's pretty good. Okay, so again, if you have multiple hash functions, depending on how many functions you have in your cuckoo hash table, the math will tell you the likelihood that you have to rebuild. Okay, so in the three previous hash table implementations that I talked about, we have this problem that any time we need to grow or shrink the hash table, we basically have to reshuffle everything: we have to increase the number of slots, and when we rehash everything, things that used to be in the same bucket or the same location may now be in different locations in the new hash table. So essentially we're doing a memcpy of the entire contents of the table all over again. The way to avoid this, if you're using these three hash tables, is to try to estimate the number of elements you're gonna have ahead of time so that you don't go above it, right? I said this earlier in the lecture, but in some cases you actually do know the number of things you wanna put in your hash table. If you're doing a hash join, for example, say I have this simple query here where I join tables A and B: the join operator is just getting all the tuples from the left side, from table A, and building a hash table on all those entries.
So the database system always knows exactly the number of entries it has in each table. So you know exactly the number of elements you want to put in your hash table. It's never going to grow, it's never going to shrink. In that case you can use open addressing and not worry about deleting things or growing too big. You may still have collisions, because that depends on the hash function and the values in your table, but you'll never exceed the number of elements you have. And as I told you earlier, there's another instance of a hash table in a database system where you may not know the number of entries — the one you're building in the first project: the page directory in your buffer pool manager. There's nothing preventing the application from inserting things over and over again, increasing the number of pages you have. When the database system first turns on, there's nothing in it — the number of elements is zero, the page directory is essentially empty — but over time it gets big. Question, yes. So the statement is: let's say you don't need to increase the size of the hash table — you're not increasing the number of entries you're storing, you just want to reduce the number of collisions — so you choose a different hash function to randomly permute the entries. Are we still talking about cuckoo hashing? So his question is: say I first build my cuckoo hash table with two hash tables and it gets too full. When I rebuild, instead of growing it, could I just pick a third hash function and a third hash table, and maybe that reduces the number of collisions? Yeah, I suppose you could do that. I don't actually know what libcuckoo does — I think it always just doubles the size of the hash table when it gets too big.
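Since the build side's cardinality is known up front, a hash-join build can allocate one fixed-capacity open-addressing table and never grow it. Here is a minimal linear-probing sketch under that assumption — the class name, the fill factor, and the probe scheme are all illustrative choices, not any real system's implementation.

```python
# Fixed-capacity linear-probing table for a hash-join build side.
# We know the build side's size ahead of time, so we size the table
# once (with slack so it stays at most half full) and never resize.

class JoinHashTable:
    def __init__(self, num_entries, fill_factor=0.5):
        self.cap = max(8, int(num_entries / fill_factor))
        self.slots = [None] * self.cap

    def insert(self, key, value):
        i = hash(key) % self.cap
        while self.slots[i] is not None:      # linear probe on collision
            i = (i + 1) % self.cap
        self.slots[i] = (key, value)

    def probe(self, key):
        # Return all matching values (join keys may repeat). With no
        # deletes, scanning until the first empty slot is sufficient.
        i, out = hash(key) % self.cap, []
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                out.append(self.slots[i][1])
            i = (i + 1) % self.cap
        return out

# Build phase on table A's tuples, then probe with B's join keys.
build = [(1, "a1"), (2, "a2"), (1, "a3")]
ht = JoinHashTable(len(build))
for k, v in build:
    ht.insert(k, v)
```

Because the table never grows, there is no reshuffling and no resize latch to worry about — the simplification the lecture is pointing at.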
Yeah, that's a good question. We should talk to Dave about that — I don't know what they actually do. Okay, so again, not all your hash tables will be statically sized. Some of them will actually grow and shrink. To handle this, you want what are called dynamic hash tables, which can grow and shrink on demand. We'll first talk about extendable hashing, which is in the textbook, and then linear hashing, which is another approach that I don't think is in our textbook — it's covered in older or other database textbooks used in other classes. Okay, so with extendable hashing, the way to think about it is that it's like chained hashing, where we have buckets. But instead of letting the bucket chains for each slot in our offset array grow indefinitely, we're going to split buckets incrementally, reorganizing only the entries of the bucket we want to split. So instead of reshuffling the entire table, we make a localized change to just the bucket that got too big. So let's look at an example. Here I have four slot offsets and three buckets. The first thing is that we have this thing called the global counter. The global counter tells us how many bits of the hash function's output we want to use to figure out the location of the key we're looking for. Again, think of the hash function — the simplest thing is to take an integer representation of the key, mod it by the number of offsets in the array, and that gives you a location. Now, the offsets point to three different buckets, and each bucket maintains what's called a local counter. This corresponds to how many bits of the hash output were used to get to that particular bucket.
So in this case here, for the bucket at the top, the local counter is one. That means you only have to use one bit at the beginning of the hash key to find that location. Here the global counter is two but the local counter is one, so we're just using the zero in the first bit position. That's why zero-zero and zero-one both map to the same bucket: this bucket hasn't gotten full yet, so we haven't split it. Now in the case of the other two buckets, their local counters are two, which matches the global counter, and that tells us we need to use two bits to figure out how to get to those buckets. So let's look at an example. Say I want to find X. We run X through our hash function and it produces some bit sequence. We look at our global counter, which says we need to look at two bits. So we look at the first two bits of the hash output, and that directs us to zero-one. Then we just follow the pointer, and that takes us to the bucket with our entry. Yes? His question is whether you look at the most significant or least significant bits. It does not matter — you just need to be consistent in your implementation. I think the textbook goes from left to right; other examples online go right to left. It doesn't matter. All right, so let's say I want to insert Y. Again, Y hashes to some bit sequence, our global counter says two bits, so we use the first two bits to map to some offset, and we go ahead and insert our entry. All right. Now I want to insert Z. Same thing — first two bits — and it ends up hashing to this location here. But we see that the bucket is full. So now we actually do want to split this bucket, and we set the global counter to three. Essentially what happens is we double the size of our directory.
Now we include three bits to compute the offset into the slot array, and we reshuffle the entries of just the bucket we split. The global counter is now three because we have two buckets whose local counter is three. But we still have the bucket at the top with a local counter of one, and the bucket at the bottom with a local counter of two — those are unchanged after the split. Any questions? What's that? His question is: it may be the case that splitting once isn't enough. Right — in my example here, you're going to split this bucket. With the global counter at two, the first two bits of all its entries are the same: one-zero. If you go to three bits, the first entry here is one-zero-one, the second is one-zero-zero, the next is one-zero-one. So in that case, yes, they end up getting split into separate buckets, and there's a free slot where I insert the new entry. But it may be the case that with three bits they're still all the same, and with four bits they're still all the same. So you may end up splitting over and over again until you reach a point where they're actually distinct. In the simple case we don't worry about that — it's just a recursive operation. In the back, yes. What exactly is the function of the local counter? The local counter is just internal metadata that says: when I want to split this bucket, how many bits should I now use from the hash function to reassign the entries? So in this case here, I only look at the first bit; if I bump the local counter to two, I now need to consider two bits and assign entries to buckets that way. Yes? When you extend the directory, you instantiate an entirely new array? Correct.
So his statement is: here I double the size of the directory, and I need to instantiate pointers that point to the potentially new locations of the buckets. You're correct. But in general, while it is a memcpy, we're not copying all the buckets — that's usually the most expensive part. These are just 64-bit pointers in an array. Making a copy of that array and then atomically making it the master array for the hash table is not a big deal. Copying the buckets and the actual keys and values inside them is the more expensive operation. But that's a good point, yes. So again, extendable hashing is, in my opinion, pretty straightforward. The key thing is to always read the bits you're hashing on in the same direction to figure out where entries go. In the first project, I say don't worry about shrinking the hash table — only worry about growing it. Shrinking is a bit more complicated: you have to recognize that a bucket is now empty, decrement the local counter to reduce the number of bits you should look at, and remap whatever slot array entries point to it to point to another bucket. And if you're trying to do this in a more sophisticated manner using fine-grained latches instead of one giant mutex over the entire hash table, getting the ordering of those operations right can be tricky. So we decided not to have you worry about that for the first assignment. Yes? Say it again? Correct, yeah — I'm only showing the key here. Okay. So the problem with extendable hashing is that occasionally you're doubling the size of the directory.
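The global/local counter mechanics described above can be sketched compactly. This toy version uses the low-order bits of the hash (as noted, the direction doesn't matter as long as it's consistent); the class names, the tiny bucket capacity, and the recursive retry-after-split are all illustrative assumptions, not the project's actual interface.

```python
# Minimal extendable hashing sketch: a directory of 2^global_depth
# pointers into buckets, each bucket with its own local_depth and a
# small fixed capacity. Splits are local: only one bucket's entries
# get redistributed; the directory doubles only when the splitting
# bucket's local depth already equals the global depth.

BUCKET_CAP = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendableHash:
    def __init__(self):
        self.global_depth = 1
        self.dir = [Bucket(1), Bucket(1)]

    def _index(self, key):
        return hash(key) % (1 << self.global_depth)   # low-order bits

    def find(self, key):
        return self.dir[self._index(key)].items.get(key)

    def insert(self, key, value):
        b = self.dir[self._index(key)]
        if key in b.items or len(b.items) < BUCKET_CAP:
            b.items[key] = value
            return
        if b.local_depth == self.global_depth:
            self.dir = self.dir + self.dir            # double the directory
            self.global_depth += 1
        # Split just this bucket: bump its local depth and redistribute
        # its entries between it and a new image bucket by the new bit.
        b.local_depth += 1
        new_b = Bucket(b.local_depth)
        mask_bit = 1 << (b.local_depth - 1)
        old_items, b.items = b.items, {}
        for k, v in old_items.items():
            (new_b if hash(k) & mask_bit else b).items[k] = v
        for i in range(len(self.dir)):                # repoint directory slots
            if self.dir[i] is b and i & mask_bit:
                self.dir[i] = new_b
        self.insert(key, value)   # retry; may split again recursively
```

Note the recursive retry at the end: that is the "you may end up splitting over and over again" case from the question above, when the redistributed entries still all land in one bucket.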
And again, I said it's not that big of a deal because it's just 64-bit pointers, but the growth is exponential, and at some point you may be doubling it to a really large size. And to do this atomically, you have to take a read-write latch on the array any time you access it, because you don't know whether another thread is doubling its size and reshuffling things. So with linear hashing, the idea is that we're going to be a bit more "smooth" — I'll use that word in quotes — and gradually increase the size of our offset array by splitting things incrementally. The way we do this is by maintaining a separate pointer that keeps track of the next bucket we want to split. Every time a bucket gets too full, we split whatever bucket that pointer is pointing at — even if it's not the one that actually overflowed — then move the pointer down and keep proceeding until we reach the end, at which point we loop back around and do the same process all over again. In linear hashing, the criterion for when a bucket is considered too full and has overflowed is left up to the implementation. It could be based on space utilization, or on the length of the overflow chains — it's up to you to decide when it's time to do a split. So let's walk through an example. Here we don't keep track of any local counters. The only thing we maintain is this extra split pointer, which points to the next bucket we want to split whenever any bucket in our hash table overflows. And, like in cuckoo hashing, we're also going to have multiple hash functions. In the very beginning the split pointer is at the first offset in our array, so we only need one hash function, but as we go along we'll end up having two of them.
So let's say I want to do a find on 6. We just run it through our hash function — for simplicity, say it's just the integer representation of the key mod the number of slots we have. So in this case it's 6 mod 4, which gives us 2, and we jump down and find our entry in the third bucket. Now I want to insert 17, but when we hash it, we end up at a bucket that is full. Instead of splitting that bucket, we just add an overflow bucket, maintaining the linked list we saw in chained hashing, extending out the entries for that slot location. But then we split whatever the split pointer is pointing at — in this case, the first bucket. So we make a new slot entry and use a second hash function, which mods the key by double the number of slots, and that reshuffles the entries of the first bucket: some stay in the first bucket, some go to our new bucket. In this case, 8 stays at the top and 20 comes down to the bottom. And again, we have the same problem mentioned before: with a bad hash function, we may still end up putting everything back in that first bucket. But that's fine — we don't care, because the first bucket didn't actually overflow anyway. All right, so now I have two hash functions.
So now if I want to do a find on 20, applying the first hash function maps it to the first slot, slot zero. But because I know I've done a split and the split pointer is no longer pointing at the first location, I need to check whether the first hash function mapped me to a slot that comes before the current location of the split pointer. Think of the split pointer as a threshold in your offset array: anything below it has not been split yet, so I can still use the first hash function; anything above it has been split, so I need to apply the second hash function. So for find 20: the first hash function maps to the first slot, which is above the split pointer, so I apply the second hash function, which maps me down to this slot here, where I find the entry I'm looking for. This keeps going: if another bucket overflows, I apply the same split operation, keep the second hash function, and move the split pointer down. I keep doing this over and over again. At some point I reach position three, the end of the original slot array, and by then I will have extended the directory for all the other buckets I had to split to get there. Then I loop back around, reset the split pointer, delete the first hash function, and start over with the same process. Question. His question is: when you split a bucket, do you rehash anything above it? No, you only rehash whatever the split pointer is pointing at, because the second hash function would already have been applied to any bucket that came before it. And the hash function — in this case it's mod 2n — will map a key to either its original location or the new one.
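That lookup rule — apply the first hash function, and if it lands above (before) the split pointer, apply the second — can be sketched like this. The names, the keys-only buckets, and the overflow criterion (bucket length) are illustrative assumptions, not any particular system's implementation.

```python
# Minimal linear hashing sketch. One split pointer; two hash
# functions per round: key mod low and key mod (2 * low). Buckets
# before the split pointer have already been split this round and
# need the larger modulus. An overflow anywhere splits whatever
# bucket the pointer is at, then advances the pointer.

BUCKET_CAP = 2

class LinearHash:
    def __init__(self, n=4):
        self.n = n              # buckets at the start of the first round
        self.level = 0          # completed rounds (doubles the modulus)
        self.split = 0          # next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _mods(self):
        low = self.n << self.level
        return low, low * 2     # first and second hash moduli

    def _bucket(self, key):
        low, high = self._mods()
        i = key % low
        if i < self.split:      # already split: use second hash function
            i = key % high
        return i

    def find(self, key):
        return key in self.buckets[self._bucket(key)]

    def insert(self, key):
        i = self._bucket(key)
        self.buckets[i].append(key)    # overflow just extends the chain
        if len(self.buckets[i]) > BUCKET_CAP:
            self._split_next()

    def _split_next(self):
        low, high = self._mods()
        old = self.buckets[self.split]
        self.buckets.append([])        # image bucket at index split + low
        self.buckets[self.split] = [k for k in old if k % high == self.split]
        self.buckets[self.split + low] = [k for k in old if k % high != self.split]
        self.split += 1
        if self.split == low:          # end of round: drop the old hash fn
            self.split = 0
            self.level += 1
```

Notice that `insert` may leave the overflowed bucket's chain long — exactly the behavior in the example, where the chain persists until the split pointer eventually reaches that bucket.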
His question is: every time you split, do you need a new hash function? No — you always have the second hash function, created the first time you split, and that stays the second hash function no matter how many more times you split. When you reach the bottom and loop back around to the top, you delete the first hash function and the second one becomes the main one. So again, this seems odd — it seems like the wrong thing to do, because in my example the bucket that overflowed is not the one that got split; we split one that wasn't full at all. But in practice this works out, because you don't need to maintain extra metadata saying here's the hash function for this bucket that's been split and here's a separate one for the bucket that hasn't. The math works out: as long as a bucket hasn't been passed by the split pointer yet, you can still apply the original hash function, and you don't have to reshuffle everything — you only reshuffle things incrementally. Eventually the split pointer will cover all the buckets, the bucket causing all your overflows will get split, and then you loop back around and do it again. So it's not doing the obvious thing, but the trade-off for not immediately splitting the bucket that overflowed is less metadata and less overhead in figuring out where something is located. Linear hashing is a clever idea — I think it was developed in the early 1980s, and it's used in quite a number of systems. The split pointer can also move in the reverse direction: if a bucket becomes essentially empty, you can coalesce it and move the split pointer back up. All right, so in the remaining time, I want to talk about hash functions.
Everything I talked about today — and I think somebody brought this up earlier in the semester — doesn't work at all if you have a terrible hash function. The worst possible hash function is one that maps every key to zero: every possible key hashes to the same value. So you want something with a low collision rate, meaning distinct keys are unlikely to be mapped to the same value. The other key thing to point out is that when we say hash function, we're not talking about cryptographic hashing like SHA-1 or SHA-256. We don't need the security guarantees those give you — nobody is adversarially engineering collisions against our hash table, and we're not protecting the integrity of messages. We just want some random-looking bits for any arbitrary key, as fast as possible, with that low collision rate. So we're not talking about things you would use for cryptography, and that means we want this thing to be really fast. If you take an algorithms class or a class that covers hash functions, they'll talk about the math of how these things work. From a database perspective, we don't care — we'll take whatever actually works. So these are the four probably most popular or well-known hash functions out there today, at least the ones newer systems are using. Older systems like Postgres, SQL Server, DB2, and Oracle — if you look at their code, at least Postgres's — implement their own custom hash functions. But most of the newer systems, the ones that have come out in the last 10 years, are probably using one of these.
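For a feel of what a simple non-cryptographic hash looks like, here is FNV-1a — a real, classic (if dated) design. It's much simpler and slower-mixing than MurmurHash or CityHash, but the basic shape is the same: fold every byte of the key into a running state with cheap arithmetic, no security guarantees.

```python
# FNV-1a, 32-bit: a classic, very simple non-cryptographic hash.
# Real systems prefer MurmurHash / CityHash / FarmHash for speed
# and distribution, but the structure is the same idea: mix each
# byte into a running state with an xor and a multiply.

FNV_OFFSET = 0x811C9DC5   # 32-bit FNV offset basis
FNV_PRIME = 0x01000193    # 32-bit FNV prime

def fnv1a(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # keep it a 32-bit value
    return h
```

Note there is nothing to "invert" here — the output is just well-scattered bits you can mod by your table size.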
So MurmurHash was the first hash function to gain notoriety as a fast, general-purpose hash function for these hash tables. Then the Google folks in 2011 basically took the second version of MurmurHash and optimized it to work better for shorter keys — that's CityHash. The first MurmurHash was written by, I think, some random person on the internet, and people picked it up and said, oh, this is really good, this solves the problem we wanted. So the Google folks took it and tweaked it to handle keys shorter than 64 bytes, because that was a use case they cared about in their systems. Then in 2014 they came up with a newer version of CityHash called FarmHash that had even better collision rates. And then there's a newer hash function written by a professor up in Canada called CLHash. I don't know where it's being used just yet — I'll show some numbers in the next slide. What's interesting about it is that it relies on what's called carry-less multiplication, a different way of doing arithmetic to compute the hashes, which I think is kind of interesting. There's a link to the Wikipedia article if you want to learn about it. So this slide shows a benchmark framework I ran a year or two ago comparing a bunch of different hash functions across different key sizes. The way to read it: the y-axis measures the amount of data they can hash per second — we're not measuring collision rates here, just how fast they can hash. Along the x-axis is the size of the keys being hashed — think of this as the number of random characters in a string.
So here we're comparing std::hash, which you're using for your first project, against MurmurHash3, Google's CityHash, and FarmHash — I'll show CLHash in the next slide. What you see is that MurmurHash and std::hash essentially plateau right around the middle, doing about the same. With std::hash you see this sawtooth pattern: it gets faster, then slower, faster, then slower as the keys get bigger. With FarmHash and CityHash you see a very interesting pattern where it spikes up, goes down, spikes up, goes down, and the sawtooth becomes more elongated as you go along. Does anyone want to take a guess at what these spikes correspond to? Byte boundaries — exactly right. The first spike is at 32 bytes, then 64, 128, 192. Why is it doing this? I'll also say, this is all in memory — we're not reading anything from disk, just taking an in-memory string, hashing it, and producing a hash value. So why does it do this? The question is whether SIMD is being used — for these, yes. [A student offers a partially inaudible explanation about word alignment and the extra work needed when data doesn't fit the register boundaries.] Right, okay, let me try to restate what he said. CPUs prefer to operate on data that is word-aligned, meaning it fits the boundaries of registers, cache lines, or individual SIMD registers. We won't talk about SIMD in this class — we cover it in the advanced class.
But essentially what happens is — again, we're measuring throughput, the amount of data processed — at these peaks, you're able to pack data that aligns exactly with the registers the processor uses to invoke the special instructions that crunch data in parallel. That's why you get spikes at those offsets. These registers are only so big, so when you get to larger key sizes you can't pack multiple keys in at once — you're processing different sections of the same key at the same time. You can't pack as much data into the SIMD operations, and that's why the up-and-down pattern gets longer here. As I said on the last slide, CityHash and FarmHash are optimized for keys under 64 bytes, and that's why they're much, much faster than all these other ones there — that's the best-case scenario for those hash functions. So now we throw CLHash in. These are the same results for the other hash functions from the last slide, but I've changed the y-axis. CLHash is slower in the beginning, but when you get to much larger keys it does much better. And you see the same sawtooth pattern, corresponding to whether things are word-aligned or not. There's a link here to a little microbenchmark framework — again, written by some random person on the internet — that takes all these hash functions and runs these experiments, so you can regenerate these graphs if you want. I sent him a patch to run the CLHash one and throw it in the mix. Okay? So, hash tables — everyone should be aware of what they are.
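The methodology behind those charts is easy to reproduce in miniature: hash in-memory strings of increasing size and report MB/s per function. This toy version compares zlib's CRC-32 against SHA-256 — a checksum-style hash against a cryptographic one, echoing the earlier point about their relative cost. Absolute numbers from Python mean very little, and this is not the framework from the slides; only the shape of the experiment is the point.

```python
# Toy hashing-throughput microbenchmark: hash in-memory byte
# strings of increasing size and report MB/s per hash function.
# A rough analogue of the methodology behind the charts; Python
# overhead dominates, so treat the numbers as illustrative only.

import hashlib
import time
import zlib

def throughput(hash_fn, key_size, iters=2000):
    key = b"x" * key_size
    start = time.perf_counter()
    for _ in range(iters):
        hash_fn(key)
    elapsed = time.perf_counter() - start
    return key_size * iters / elapsed / 1e6    # MB/s

if __name__ == "__main__":
    for size in (8, 64, 256, 1024, 4096):
        crc = throughput(zlib.crc32, size)
        sha = throughput(lambda k: hashlib.sha256(k).digest(), size)
        print(f"{size:5d} B   crc32 {crc:9.1f} MB/s   sha256 {sha:9.1f} MB/s")
```

A serious version of this would test many key sizes at fine granularity (to expose the alignment sawtooth) and also measure collision quality separately, the way SMHasher-style suites do.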
They're really great because they do O(1) lookups in some cases, but typically hash tables are not what we want when we talk about table indexes. This is because, compared to the tree data structures we'll talk about on Wednesday, they can't do range scans or other operations where you're not looking for an exact key. That's what an order-preserving tree can do for you. So in the remaining time, I'll give a quick demo of this in Postgres — which has been disconnected, awesome. Okay, so I have a simple table with about 27 million email addresses from a well-known Canadian dating website — if you follow the news, you know what I'm talking about. So we have 27 million unique email addresses. In Postgres, when you create your index, you can specify that you want a hash index. I'll just write it from scratch: CREATE INDEX idx_emails ... ON emails USING hash. By default, without that clause, Postgres always gives you a B+tree. But if I add the USING hash keyword, I'm telling the database system I want a hash table. I think this should take about 14 seconds — everything should be in the buffer pool. So again, with the hash index, I can only do single-key lookups — I can only find entries where I have an exact match. I can't do range scans or other kinds of predicates. What's that? So his statement is: if the data is naturally clustered, then — no, no, hold up — I'm asking whether you can use the hash table to do range scans.
The answer is no, because the hash function randomly scatters the keys to different locations. Even if everything was inserted in sorted order, the hash table jumbles it. Okay. All right, so I've built my hash table now. I've prepared this before — this is looking up somebody who registered an account with this email address; alphabetically, it's the last email. So I can run EXPLAIN, which tells me how it's going to execute my plan. Here you can see it's telling me it's going to do an index scan using that hash index I just built. So when I actually do the lookup, it comes back really, really fast. But now let's say I want to run a query that finds email addresses starting with "andy". Here you see that the database system says it can't do an index scan — even though I built that hash index, it can't use it, because there's no way of scanning the keys in order. It's just going to sequentially scan all the pages. So if I run this — I don't know how long it's going to take, we'll see — the database system does a sequential scan to find the entries I'm looking for, and there's a bunch of people. So now, instead of using the hash table, I can tell it to build a B+tree, which is what we'll be talking about on Wednesday. This might take about a minute. With these tree data structures, since they preserve the order of the indexed keys inside the data structure itself, the database system can use that to find the entries it needs for certain types of queries.
So in my case, when I do a lookup saying find me all emails where the key starts with "andy", we can jump into the tree to the offset where all the entries starting with "andy" are located, and then scan across the leaves to find all the elements we want. With the hash table we can't do that, because there's no ordering preserved for the keys themselves. Yes? The question is: if you have both a tree and a hash table, could you use the hash table to find the first entry and then scan from there? So if you had both the hash index and the B+tree, could the database system use the hash table to find the first entry and then jump into the B+tree? How would that work? The statement is: the value in the hash table could be a pointer to a location in the B+tree. So think about this — let's talk about why this is a bad idea. As we'll see when we talk about B+trees, the B stands for balanced: they're self-balancing indexes, meaning as threads insert and delete entries, the tree is splitting and merging and reorganizing the locations of those pages. This is an example of what I said at the beginning: there's the physical organization of the data — where the blocks are, what's pointing to what — and then there's the logical contents. The B+tree could get reorganized so that the locations of different elements in it change, and now I'd have to go figure out where the hell they are in my hash table and update the entries to point to the new locations. So that would be a bad thing to do. And there's nothing the hash table gives us there that a quick traversal down from the root won't. So now if I go back here and run that same query with EXPLAIN — oh, it still wants to do a sequential scan. No, never mind.
No. Let's try another example. This is because it's a varchar, and that's what's tripping it up — that's fine. Anyway, bad example, but all right. So that's it for today. The main thing I want to drive home from this example is that when you call CREATE INDEX in SQL, most database systems are not going to give you a hash table. By default they all build B+trees or some other order-preserving tree index. That's because hash tables are really fast for single-key lookups, but when you want to do range scans — not on varchars, apparently, whatever — or other kinds of queries, you're still always going to end up with a sequential scan, because the hash table only supports equality predicates: does something equal something. You can't do more complicated things. Okay, any questions? Yes. Why doesn't this work? We'll find out afterwards. Okay, all right, we're done, thank you so much.