Let's get started. OK, let's, again, give it up for DJ Drop Tables. Thanks as always. How was your weekend? It was good. You know, I was recruiting. I went to TOC, but they don't have any DJ jobs. They don't have any DJ jobs? That's hard, right? Actually, I found out Salesforce has, in the lobby of their main building in San Francisco, a DJ every morning. And it rotates. Oh, that's crazy. Do you want me to figure out how to get you that job or what? You want to focus on it? Can you put in my resume? OK, yeah, I'll see what I can do. OK, so I think most people are at the TOC today. That's why it's a low turnout, which is unfortunate, because this is one of my favorite lectures, hash tables. We have a lot to discuss, so let's get right to it. Real quickly, reminders for what's on the docket, what's due. Project one is due next week, on Friday the 27th, at midnight. And then homework two, which we'll be releasing later today, will be due Monday the 30th, after the project. Any quick high-level questions about project one? Say it again, the what? The question is, when will the autograder be released? It's live on Gradescope now. You can submit things today. We're not giving you the source code for the tests, obviously, because that's what we use to grade. So it should be live. If you submit and it doesn't work, please post on Piazza, OK? Any other high-level questions? OK, so where we're at now in the course: we've spent the first couple of weeks starting at the bottom of the stack of a database system architecture and working our way up. We've discussed how to store data on disk, as pages on disk. Then we talked about how to bring those pages into memory in our buffer pool or buffer cache, having a policy decide when it's time to evict something, and how to pin things when we do writes. So now we're going above the buffer pool manager, and we're going to start talking about access methods. An access method is essentially the way we're going to read or write the data in our database that's stored in the pages out on disk. So today we're starting a set of lectures on data structures that we maintain internally inside the database system, and we're going to split it into two discussions: hash tables and order-preserving trees. Each of them has different trade-offs. You've taken an algorithms course by now, so you understand the implications of both of these, but we're going to describe what matters to us in the context of database systems. Because for a tree versus a hash table, you may understand how to do proofs on it or write algorithms to interact with it; now let's talk about what happens when we actually put it inside a database system and try to use it. So data structures are used throughout the database management system for a variety of purposes. One thing we've seen so far is using data structures to maintain the internal metadata about what's in our database. We talked about there being a page table or a page directory, and that was a hash table to do a lookup from a page ID to a frame, or from a page ID to some location on disk. The next thing we can use them for is the core data storage of the database itself.
So what I mean by that is, instead of just having an unordered heap of pages, we can organize them at a higher level into a hash table or a B+tree or some tree data structure, and have the values in the data structure actually be tuples. This is very common: Memcached, for example, is essentially a giant hash table, and MySQL's InnoDB engine is just a B+tree where they store the tuples themselves inside the leaf nodes of the tree. We can also use data structures to maintain temporary data. This would be like if we're running a query and we need to compute something very efficiently: we could build a data structure on the fly, populate it with whatever data we need, finish executing the query, and then just throw away that data structure and be done with it. And the last one, which is probably the one you're most familiar with, is using these data structures for table indexes. Essentially, we build something like a glossary over keys inside of our tuples that allows us to do quick lookups to find the individual elements we want, rather than having to do a sequential scan over the entire database. So for all these purposes, we need good data structures. And the things we're gonna care about in how we design our data structures are the following two things. The first is the data organization: how are we gonna represent the key-value pairs, the elements of the data that we're storing, either in memory or in pages on disk, and do it in an efficient way that supports fast reads and writes without having to do a major overhaul or restructuring of the entire data structure every single time. The second issue is how we're gonna allow multiple threads, or multiple queries, to access our data structure at the same time without causing any physical violations to the internal representation of the data. What I mean by that is we don't want one thread to update a memory address while another thread is reading that address, such that it sees some torn write or some corrupt version of that address that now points to some invalid page or invalid memory location, and we end up producing incorrect results. We'll talk a little bit about how we handle this as we go along today, but we'll spend a whole lecture on concurrency control inside of these data structures. For our purposes today, we can simplify the discussion and just assume we only have a single thread. This is also gonna matter later on when we talk about transactions, because for the things we talk about here, we'll use latches to protect the physical data structure, which prevents us from reading invalid memory addresses or invalid page locations. There's also a higher-level concept, the logical correctness of our data structure, that we need to care about as well, and that'll come later in the semester. Essentially, what I mean by that is: say I have an index and I delete a key. If my thread comes back and tries to retrieve that key again, I shouldn't get it, because I know it's been deleted. Even though the physical bits may still be there, because I'll do some background garbage collection to clean things up later on, logically my key should be gone even though physically it's not.
So this topic is very complicated, and we'll touch on it a little today, but we'll mostly care about the physical integrity of the data structure rather than the logical one. Okay, today, again, we're gonna focus on hash tables. A hash table is an abstract data type that provides an unordered associative array implementation, or API. All that means is that we're gonna map arbitrary keys to arbitrary values. There's no ordering to this thing, unlike what we're gonna see in trees. And the way we're gonna do these fast lookups to find the elements we want is with a hash function that takes in our key and computes an offset to some location in my array, and that's gonna tell me either exactly the element I'm looking for, or roughly where to look around close to where I land to find the thing I'm looking for. So the hash function isn't always gonna get us exactly where we want, but it at least gets us to the right neighborhood, and we know how to look around from there. Again, none of this should be new; you should have seen it in an algorithms class. The space complexity of a hash table in the worst case is O(n). That means that for every single key we wanna store, we have at least one entry for it in our hash table, so we have to allocate that amount of space. The operational complexity is interesting, because on average we're gonna get O(1) lookups, meaning in one step, in constant time, we can find exactly the thing we're looking for. Worst case, and we'll see why this happens in a few seconds, we'll get O(n), meaning we'll have to do a linear search and look at every single possible key to find the key we're looking for. So you may be thinking, all right, this is great, any hash function or any hash table will do, because I'm gonna get O(1) for the most part in practice. But in the real world, where money's involved, constant factors actually matter a lot. And we'll see this when we look at hash functions: there'll be some hash functions that are twice as fast or three times as fast as other hash functions. You may say, all right, for hashing one thing, who cares. But if I'm hashing a billion things and my crappy hash function takes a second longer than the fastest one, now I'm spending a billion extra seconds doing lookups. So when there's real money involved, when we're looking at large scale, the constant factors actually matter. In your algorithms class, it's all O(1), we don't care about anything else, the constants don't matter; in our world they do. All right, so let's look at the simplest hash table you could ever build. All it is is just a giant array, just a big chunk of memory, and we're gonna say that every single offset in our array corresponds to a given element. For this to work, we're gonna assume that we know exactly the number of keys we're gonna have ahead of time, and we know exactly what the keys' values are. So now, to find any key in my hash table, I just hash the key, mod it by the number of elements that I have, that gets me to some offset, and that is exactly the thing I'm looking for.
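To make that concrete before the worked example, here's a minimal sketch of this assumption-heavy design in Python (hypothetical class and helper names; Python's built-in hash stands in for the hash function):

```python
# Minimal static, direct-addressed hash table (a sketch, not a real design).
# Assumes: the number of keys is known up front and no two keys collide.
class DirectTable:
    def __init__(self, num_keys):
        self.slots = [None] * num_keys      # one slot per expected key

    def _offset(self, key):
        return hash(key) % len(self.slots)  # hash the key, mod by table size

    def put(self, key, value):
        self.slots[self._offset(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._offset(key)]
        return entry[1] if entry is not None and entry[0] == key else None
```

If either assumption fails, two keys can land on the same offset and put silently overwrites; everything after this point in the lecture is about dropping those assumptions.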
So let's see how this works. Let's say that we have three keys: ABC, DEF, and XYZ. I can just take ABC, hash it, and that'll tell me that offset zero is exactly the thing I'm looking for. Now, this is not exactly what our hash table would actually look like, because this is just storing the original keys. In practice, what we're gonna store is pointers to some other location where the original key is located. Think of this as like a table index: I don't wanna store the keys themselves in my hash table, I wanna store a pointer to where the key is found. All right. So what are some problems with the assumptions we made for this kind of hash table? Yes, in the back. That we know the number of elements in the first place. Right, he said that we know the number of elements ahead of time. That's one. What's the second assumption? The values for the keys are all near each other. He says all the values are near each other in the cache. For this purpose, that doesn't matter here. There's no collision between keys. Perfect, he says there are no collisions between keys. So what is a collision? They hash to the same spot. He says they hash to the same spot, exactly. So this is a really simple hash table. This is actually the fastest hash table you could ever possibly build, but you have to make these assumptions to make it work. The first, as he said, is that we know exactly the number of elements ahead of time, so we know exactly how many slots to allocate in our array. In practice, that's not always gonna be the case. If I'm using my hash table as a hash index on a table, when I create the table I don't have any data in there in the first place, and as I start inserting things, the number of slots I need actually grows. The other assumption that we made was that every key is unique, which is what he means by no collisions. We're assuming that every time we hash a key, it always lands in a unique slot for that one key and only that key, so we can exactly find the thing we're looking for. And because we know all the keys ahead of time, and because we know they're unique, when we hash them, this is using what is called a perfect hash function. A perfect hash function is this theoretical thing; it exists in the research literature, but in practice nobody actually does this, because it's impractical. A perfect hash function just means that if I have two keys that are not equal, then the hashes I generate for them are also not equal: for every unique key, I generate exactly one unique hash value. And again, you can't actually do this in general; there's no magic hash function that exists today that can guarantee this. The way you would actually implement a perfect hash function is to use another hash table to map each key to its hash value, which is kind of stupid, because now you have a hash table for your hash table. So nobody actually does this in practice. What we're gonna talk about today is how we actually build a hash table in the real world without having to make these assumptions, so we can use it in a database. So when people say, I have a hash table, they essentially mean a data structure comprised of two parts.
The first is the hash function, which is the way to take any arbitrary key and map it to an integer value in a smaller domain. So I can take any string, any integer, any float, it doesn't matter; I feed it to my hash function, and it's gonna produce a 32-bit or 64-bit hash value integer. Not a unique one, sorry, just a hash integer. There's gonna be a big trade-off in what kind of hash function we use, between how fast it is and the collision rate. Because again, if we have different keys mapped to the same slot, that's a collision, and now we have to deal with that in our hashing scheme. So what's the fastest hash function I could ever build? What's that? He said mod a prime number; even faster. What's that? He said the value itself. You're close, but what does that mean? If I have a string, how do I return back that value and put it into my slot? Even faster. He said mod the bits of memory; there's still a mod in there. A constant: one, right? No matter what key you give me, I return back the number one. That's gonna be impossibly fast. But your collision rate is terrible, because everything always goes to the same slot. The other end of the spectrum is that perfect hash function, but I said I need another hash table to make that work, so that's like the worst case for speed. My collision rate is zero, but it's the slowest. So we want something in the middle, okay? All right, so the next piece is the hashing scheme. The hashing scheme is essentially the mechanism or procedure we're gonna use when we encounter a collision in our hash table. So again, there's this trade-off between memory and compute, which is the classic trade-off in computer science. If I allocate an impossibly large slot array, like two to the 64 slots, because that's all the memory I have on my machine, then my collision rate is gonna be practically zero. Of course, now I can't do anything else in my database, because I've used all my memory for my hash table that's barely even full, but my collision rate is gonna be amazing. If I have a slot array of size one, my collision rate is gonna be terrible, and I have to do a bunch of extra instructions to deal with those collisions, but my storage overhead is minimal. So again, we wanna be somewhere in the middle. We wanna balance the amount of memory or storage we're using for our hash table against the extra instructions we have to execute when we have a collision. All right, so today we're gonna spend the beginning talking about hash functions, just to show you what's out there, the modern ones that people are actually using. And then we'll talk about two types of hashing schemes. The first is static hashing, where you have an approximation of the size of the key set you're trying to store. Then we'll talk about dynamic hashing, where you can have a hash table that incrementally grows without having to reshuffle everything. Again, the combination of a hash function and a hashing scheme is what people mean when they say, I have a hash table. All right, so again, a hash function is just this really fast function that takes any arbitrary byte array, any arbitrary key, and spits back a 32-bit or 64-bit integer. So can anybody name a hash function, maybe one they've used before? SHA-256. He says SHA what, SHA-256? That's one. Can anybody name another one?
Yes? MD5. MD5, perfect, all right. This is actually a great example. So he said SHA-256, he said MD5. SHA-256 is a cryptographic hash function: it's designed for security purposes, to make it hard for anyone to invert the hash or engineer collisions. He said MD5, which takes any arbitrary key and spits back a 128-bit hash, 32 hex characters. That's also supposed to be a one-way hash, and while you can't literally reverse it, people have effectively cracked it with collision attacks and rainbow tables. So in our database system, we do not care about cryptography when we're building hash tables. You can encrypt the data when you store it on disk or on public cloud infrastructure, but when we're doing our hash join or building our hash table, we're not gonna care about cryptography, we're not gonna care about leaking information about our keys, because we're just trying to build a hashing data structure. So we're not gonna use something like SHA-256, because one, we don't care about the cryptographic guarantees it provides, and two, it's super slow. MD5 is essentially a one-way hash, and that's something we could use for our hash function. We don't, because it's also slow, and we'll see other ones that are faster. And since people have rainbow tables to reverse it, it doesn't even have good cryptographic guarantees anymore. All right, so again, we care about something that's fast and has a low collision rate. This is just a list of some of the hash functions that people are using today. CRC is used in the networking world; it was originally invented in 1975. I don't remember whether it was 32 bits or 16 bits back then, but now if you wanna use CRC, there's a 64-bit version, and that's what you would use. This will produce something with a reasonable collision rate, but it's gonna be super, super slow, so nobody actually does this in practice. MurmurHash, from a database perspective, is where the era of modern hash functions begins, and these are the ones we're gonna care about. MurmurHash came out in 2008. It was just some dude on the internet who posted his general-purpose hashing code on GitHub, and people picked it up and started using it. Google then took MurmurHash in the early 2010s, modified it to be faster on shorter keys, and released something called CityHash. And then later, in 2014, they modified this again to produce FarmHash, which has a better collision rate than CityHash. FarmHash and CityHash are pretty common in some systems. What is now considered the state of the art, the fastest with the best collision rate among hash functions today, is Facebook's xxHash. And not the original one from 2012: there's xxHash3, which is under active development and came out, I think, in 2019. Right now this is the fastest and has the best collision rate of all these hash functions. So if you're building a database system today, you wanna be using xxHash. Again, we don't care so much how this is actually implemented; this is not an algorithms class, I don't care what the internals are. All I care about is how fast it is and what the collision rate is.
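As a point of reference, here's a minimal sketch of what calling a modern hash function looks like from Python, assuming the third-party xxhash package (pip install xxhash); note the seed parameter, which comes up again later with cuckoo hashing:

```python
import xxhash  # third-party binding for xxHash (assumed installed)

# One-shot 64-bit XXH3 hashes of an arbitrary byte string.
h1 = xxhash.xxh3_64_intdigest(b"my key")           # default seed of 0
h2 = xxhash.xxh3_64_intdigest(b"my key", seed=42)  # different seed, so a
                                                   # different hash value
slot = h1 % 1024  # mod by the slot count to turn a hash into an offset
```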
And there are benchmarks to measure the speed and the quality of the collision rates of all of these algorithms. This is a micro-benchmark I run every year, an open-source framework that I took and modified, that scales up the size of the keys you throw at the hash function and sees how fast it can compute hashes. Here we're looking at one to eight bytes for the key sizes, which is pretty small, essentially up to a 64-bit integer; keys like emails or URLs will be bigger than this. At the smallest sizes, CRC is actually the fastest here. But then, of course, as you scale up, FarmHash, CityHash, and Facebook's xxHash get much better. What really matters is when we look at larger key sizes, which is something like this. Now you see the CRC hash sucks: no matter how much bigger the key is, the throughput is essentially the same. But you see these really nice spikes at 32 bytes and 64 bytes for FarmHash, CityHash, and xxHash3, and this is because the key they're computing on fits within a single cache line. With a single fetch into memory, I'm bringing 64 bytes into my cache, and for that single fetch I'm operating on all the data within that cache line. That's why you see this sawtooth pattern here. And then beyond 64 bytes, I think CityHash and FarmHash switch to a different algorithm, so you see slightly different properties, whereas xxHash still does quite well. I'm not showing the collision rate here, but there are benchmarks online that show that even though xxHash is the fastest here, it still gets as good a collision rate as CityHash and FarmHash. So in our own system today, we're using xxHash as much as possible. Any questions about this? Again, a hash function just takes an arbitrary key and spits back a value. Yes. Why don't we care about cryptographic properties? Because if we're storing user-provided data, the user could give us a bunch of data that's designed to collide under our hash function if they know it, and then our database will run really slowly. So his statement, which is one I've never heard before, is that it's possible someone could give us data in such a way that they know the values always hash to the same thing, and therefore you have a potential denial-of-service attack, because you're causing the collision rate to be super high, and now it takes longer to run your queries. All right, let's talk about this. One: in the database world, at least for what we're talking about here, the users are trusted, meaning if I'm running Postgres or whatever system on my own hardware, whoever gave me access to it has already vetted me and trusts me, so I'm not gonna be malicious. Two: you also provide a seed when you do this hashing, and since it's hard-coded, an attacker may not know exactly what it is.
And then three, you may say, all right, what if I'm on a cloud system and someone is malicious that way? Well, Google doesn't care and Amazon doesn't care, because you're the one paying for the hardware. If you give me keys that all hash to the same thing and your collision rate is super high, now your query takes a long time and they're just collecting your money. So I'm sure you could think of an attack that does this, but for what we're talking about here, in the internals of the system, nobody cares. Again, there are databases that will encrypt the data at rest on S3 or EBS buckets; that's a whole separate thing from this. Is there another question, or no? Okay. Again, there are cryptographic databases, and people are spending a lot of money to worry about these things, because data breaches are a big deal. I don't care at this point in my life. I'll get to a point where I'll care, but right now I don't care. All right, so again, we're not writing hash functions. We're just gonna take one of these three, and in general we want to take xxHash, and that'll be good enough, okay? Don't write your own hash function; it's not worth your time. All right, so now let's talk about how we use our hash function in our hashing scheme to deal with collisions. What we're talking about here doesn't depend on which hash function we're using. It could be the slowest one, it could be the fastest one; all these hashing schemes work the same, because this is what we do after we've hashed and jumped to some location, when we have to figure out how to deal with collisions or how to find the thing we're looking for. We're gonna start with the most basic hash table you can have, called linear probe hashing, and then we'll talk about some variants that improve on it, called Robin Hood hashing and cuckoo hashing, but they're all roughly based on linear probing. And again, these are all static hashing schemes, meaning we have to be told at the beginning, when we allocate memory, here's the number of keys I expect to store. In some cases, you actually can guess what this is. When we do query processing and we're using a hash table to do joins, I roughly know, or I hope to know, how many keys I'm gonna have to put in my hash table, and I can allocate accordingly. If our hash table gets too full, and we'll see what that means, essentially that we hit an infinite loop or all our slots are filled, then we have to increase the size, essentially double the size of the hash table, and take all the keys in the first hash table and copy them over to the second one, which is obviously super expensive to do. So ideally, we have a good approximation of the upper bound for our hash table size, so we don't have to do this regrowth or rebuilding. All right, so linear probe hashing, sometimes called open addressing, is the most basic hash table you can have, and all it is is just a giant array of slots, and we use our hash function to jump to some offset, some slot, in that array. If you use Python and you allocate a dictionary, this is essentially the data structure you get underneath: the dictionary is an open-addressing hash table along these lines.
So the way we resolve collisions is: if we hash to a slot and find something already there when we're trying to insert, we just keep scanning down to the next position, and keep going until we find the first open slot, and that's where we insert the entry we're trying to add. Now, when I want to do a lookup, I land at the slot where I should have been and keep scanning down until I either find an empty slot, meaning the thing I'm looking for is not there, or I find the thing I was looking for. Pretty basic, pretty straightforward. So let's say these are the keys we want to add, and we have some hash function that maps these keys to slots in our hash table. The first one: we hash A and it lands here. Inside this slot is a key-value pair: the original key that we inserted, plus whatever we want the value to be, so a pointer to a tuple, a pointer to another page, or some other arbitrary value, it doesn't matter. The reason we have to store the original key is that when we start doing lookups and have to scan down, looking at multiple entries, we need to know whether the entry in the slot is the key we actually want, because it's not always guaranteed to be exactly where we hashed into the table. So we hash B, B lands here. Now we hash C. C lands here, but A already occupies the slot where C wants to go, so all we do is jump down to the next position and insert our entry there. Same thing for D: D wants to go where C is, so we put it here. E wants to go where A is; it can't because A is there, can't go where C is, can't go where D is, so it ends up here. And the last one, F, goes down here. Pretty straightforward, and this is actually really fast. I'm not showing the division between pages here, but you can look at this as: I've allocated a bunch of pages, and I know how to go from one position to the next; if I'm in the last slot of my page, I know what the next page is to jump to to continue the search. Yes. If I want to insert another key, for instance at E's position, can I use the one below the B? Yeah, because this is a circular buffer. His question is: say I want to insert G, and G wants to go where E is. It can't, so it scans down, can't go there either, loops back around, and continues from the top. Yes. You said that you scan down till you find an empty slot, and that means it's not there. Yes. What if I delete something, then? The question is, what if I delete a value? Boom, next slide, excellent. Okay, so let's say we want to delete C. What do we do? Again, we hash it, and we land where A is. That's not what we want, and this is why we have the exact key in there: we can say A is not equal to C, this is not what I want, scan down, ah, C equals C, that's the one I want to delete. So say I do something really simple and just remove it. What's the problem with this? You're gonna see an empty value and say, oh, it's not there. Exactly. I do a lookup on D, I land here, I see an empty slot, and I think, all right, my search is done, the thing I want isn't there, even though it's in the very next slot down. So there are two ways to handle deletes.
The first is that you just add a tombstone marker. You take wherever C used to be and put a little tombstone there that says: there's no logical entry here, but physically consider this slot occupied. So when I do a lookup and land here, I say, well, there's no data here, but it's not really an empty slot, let me jump down to the next one, and there's the thing I wanted. Of course, what's the problem with this? Now we're wasting space, and we have to go clean this up eventually, so this is gonna count against our fill factor. The other option is to do data movement: recognize that I have an empty slot here and just move everybody up one, so that lookups land exactly where they want to go. In this example that works fine, because E then maps exactly to where it would be found, and F maps exactly to where it would be found, but remember, I said it's a circular buffer. So technically B might actually wanna go here, because logically it comes after F even though physically it doesn't. In that case, if I end up moving B around, it's gonna be incorrect, because B hashes to that location. Had I moved it here, I would then do a lookup on B and find nothing, because as I scan down I'm going in this direction, and I would not know to loop back around and look at the previous entry. So in practice, most people just do tombstones, because this data movement thing is actually complicated. This is another good example of why you wanna keep the original key in the slot: in order to figure out whether it's okay for me to move an entry up by one, I need to be able to hash it and check whether the location where it should be is at or below where I wanna move it to, because if I move it above where it hashes, I'll get false negatives: I'll hash the thing and not actually find it. For some hash tables in our database system, we don't have to worry about deletes at all. If I'm building a temporary data structure for a query, I'm not gonna have any deletes; I just scan my input data, populate my hash table, and start using it. If we're using it as a hash index, though, then we can have deletes, and we have to account for this; tombstones are probably the easiest way to do it. Yes. So with the movement approach, I guess that's the worst way? You can't just move everything forward or backward. Yeah, his statement is that the movement approach is probably the worst way to handle this, because I can't just blindly move things up. Why not? Because not all of those entries collided at the same spot. But that's okay: if you go back to when we first inserted these, F wanted to go here, but E was there, so it's okay to move F up by one. E wanted to go here, but it couldn't, so it's okay to move E up by one. And D wanted to go here and couldn't, so it can move up by one. So in my toy example, it's perfectly safe to move everybody up by one. But the point I was trying to make is we can't move B, because B actually hashes to where it already is. Physically it's not contiguous with the others; logically it is. So exactly as you said, I have to hash each entry and check: is it safe for me to move it? In B's case, no, because the hash actually wants to go there.
So as I go down one by one, I have to ask: is it okay for me to move this one up? Yes. Yes. So when we're inserting, what's the ideal number of slots, since the number of keys actually makes a difference? Yeah, the question is: in my super simple example, I only have six keys, so I can estimate how many slots I need; in practice, how do you estimate how many slots you need? In practice, it's 2n: you allocate 2n slots, where n is the number of keys you're gonna put into it. We'll see that cuckoo hashing is slightly different, because it has two hash tables, but in practice it's 2n. And then when it fills up and you resize, you double the number of slots; it goes up by a factor of two. Yes. For the movement, could you also just track the number of shifts for each entry, and count how far each one is from where it started, so you know whether it's safe to shift it? All right, his statement, and you guys are amazing at segues: he's saying, couldn't I also just record how many steps I am away from my original position and use that to determine whether it's safe to move an entry? Yes. This is called Robin Hood hashing, but we'll get to that in a second. All right, the last thing I quickly wanna talk about is non-unique keys, and then we'll get to his point about Robin Hood hashing. In your algorithms class, when you discussed hash tables, you probably just assumed all the keys were unique. For primary key indexes this is fine, but in real data sets we can't assume the keys are unique, so we need to handle duplicates in our hash table. There are two ways to do this, and the two ways I'm describing can be used with any of the hashing schemes we're talking about today. They're not specific to linear probe hashing; you can use them with anything. The first approach is to maintain a separate linked list with all the values: you have your key in its slot in the hash table, and instead of pointing to the underlying tuple or whatever it would normally point to, it points to a separate linked list containing all the values that have the same key. So if I want to say, give me all the key-value pairs for the key XYZ, I just follow this pointer, and I know everything in that list has that key. The other approach, which is probably the most common, is just to store redundant keys. Now, in your slot array, you just duplicate the keys over and over again: the key XYZ or ABC appears multiple times, each with a different value, and I just record it multiple times. In the case of linear probing, everything still works: if I'm looking for something, I do my lookup and keep scanning down until I hit an empty slot, at which point I know my search is done. If I say, find me one entry with key equals XYZ, I can jump here and find exactly what I want, but if I want all of them, I gotta keep scanning down until I hit an empty slot. In practice, everyone does the second one, even though it slightly wastes storage, because you're repeating the key multiple times, whereas in the first one you store the key only once.
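To tie the linear probing mechanics together, here's a minimal sketch in Python (hypothetical names; fixed size, tombstone deletes, no resizing):

```python
# Linear-probe hash table sketch with tombstone deletes.
TOMBSTONE = object()  # slot is logically empty but physically occupied

class LinearProbeTable:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def _probe(self, key):
        # Yield slot indices starting at the hash position, wrapping
        # around like the circular buffer described above.
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx] is TOMBSTONE:
                self.slots[idx] = (key, value)  # keep the key for comparisons
                return
        raise RuntimeError("table full; a real system would resize here")

    def find(self, key):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:       # truly empty slot: the search is over
                return None
            if entry is not TOMBSTONE and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:
                return
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[idx] = TOMBSTONE  # keep probe chains intact
                return
```

Storing duplicate keys for non-unique indexes works with this exact layout: insert the same key multiple times, and have find keep scanning instead of returning on the first match.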
All right, so let's talk about what he was proposing, but in a slightly different form: rather than deciding how to do bulk movement of keys through our hash table, let's look at how to use these position counts to move individual keys. Robin Hood hashing was proposed in 1985. It's one of those papers that came out and no one really paid attention to, and then in the last decade or so it showed up on Hacker News a couple of times, and now people are trying it out in different systems. Robin Hood is the folklore tale from England about a rogue who would steal from the rich and give to the poor in medieval England. That's essentially what we're doing in our hash table: we're gonna have poor keys steal slots from rich keys, where I'm defining poor versus rich as the number of positions you are away from where you should have been when you first hashed into the table. The basic idea is that we're trying to balance things out across the entire hash table, to minimize the likelihood that one key ends up really far away from where it should have been, so that overall everybody is roughly equal. So let's do this with the same six keys. A goes here, but now, as he was suggesting, we're also gonna store the number of jumps each entry is from its original hashed position. Our table was empty at the beginning, so when we hashed A, it landed exactly where it should have been, and we set its number of jumps to zero. Same thing with B: B hashes here, lands in its spot, so its counter is zero. Now we insert C, and A occupies the slot where it wants to go. A is zero jumps from its optimal position, and at this point C is also zero jumps from where it wants to go. Since zero equals zero, we leave A alone and make C go down to the next slot and take it, and we update its counter to one: it's one step away from where it should have been when it first hashed into the table. Now we do this with D. D wants to go in this slot, but C occupies it, and C's counter is one, which is greater than D's zero. A higher counter means you're more poor, meaning you're farther away from where you wanna be, so C would be farther from its home than D would be if D took this position; we don't let D take the slot, and it goes down here with a counter of one. Now look at E. E wants to go where A is; again, zero equals zero, so we leave A alone and E moves down. One equals one, so we leave C alone too. But now E's counter is two, because it's zero, one, two jumps away from where it wants to go, and two is greater than D's one, so E is considered more poor than D. So it shoots D in the head, steals its slot, and inserts itself here, and the insertion continues: D goes down to the next slot, and we update its counter to two. So before, we had A, C, D, E; now, under Robin Hood hashing, E is closer to where it wants to be and D is farther away than it would have been, but overall we're more balanced. Same thing with F: F would go where D is, but D's counter of two is greater than F's zero, so D stays where it is and F goes down here.
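Here's a minimal sketch of that insert logic in Python (hypothetical; each slot stores the entry's probe distance alongside the key and value, and an incoming entry steals a slot whenever it is poorer than the resident):

```python
# Robin Hood insert sketch: slots hold (key, value, dist), where dist is
# how far the entry sits from the slot its hash pointed at.
def robin_hood_insert(slots, key, value):
    idx = hash(key) % len(slots)
    dist = 0  # how far the incoming entry is from its ideal slot
    for _ in range(len(slots)):
        if slots[idx] is None:
            slots[idx] = (key, value, dist)
            return
        r_key, r_val, r_dist = slots[idx]
        if dist > r_dist:
            # Incoming entry is poorer: steal this slot, then continue
            # probing to re-home the displaced (richer) entry.
            slots[idx] = (key, value, dist)
            key, value, dist = r_key, r_val, r_dist
        idx = (idx + 1) % len(slots)  # next slot, wrapping around
        dist += 1                     # one step farther from home
    raise RuntimeError("table full; a real system would resize here")
```

Running the lecture's key sequence through this reproduces the slides: E displaces D (counter two versus one), and the displaced D continues down with its counter bumped to two.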
In the back, yes. So you'd rather have two keys that are each one position away than one at zero and one at two? Yeah, her statement is that under Robin Hood hashing, the algorithm says it's better to have two keys each be one position away from where they should have been, rather than having one key be two positions away and one key be zero positions away. Yes. I'm not saying this is the right thing to do; I'm saying this is one approach that handles collisions a different way. You're essentially trading off reads for writes. Now, when I do a lookup on any of these keys, there's not gonna be one key that has to wrap all the way around; everyone is, on average, the same distance from home. But to get that, writes, or inserts, become more expensive, because now I have to write more things. When I did that stealing, say there's a page boundary right here: I update this slot on the first page to install E, that's one write, and then I come down here and do another write to insert D into this page. Had I left them alone, like in regular linear probe hashing, I would only do one write, to one page. So this seems like a really nice idea, but the research, at least the modern research, shows that especially for in-memory data structures, you pay a big penalty for branch misprediction, because you have more conditionals doing these checks of whether one entry should steal from another, and you're doing more writes, which means more cache invalidation. So in practice, linear probing still crushes everything; it's still the fastest way to do this. I think with disk it's the same story. Another approach to dealing with collisions, instead of linear probing where we keep scanning down, and possibly swapping things as in Robin Hood hashing, is to just have multiple hash tables, and decide which hash table to insert our key into based on whichever one has a free slot for us, so we don't have to do these potentially long scans. That's what cuckoo hashing is. I've always mistakenly said cuckoo hashing was named after the cuckoo clock, where the hand goes back and forth. It actually has to do with the cuckoo bird: the cuckoo bird is known to take over the nest of another bird, and that bird then has to go displace something else. We'll see how that works in a hash table. So lookups and deletions in cuckoo hashing are always gonna be O(1): when we do a lookup, we jump into our hash tables and see exactly whether the thing we want is there or not. We don't do any additional scanning. But inserts are gonna be more expensive, because we may have to ping-pong and move keys all around. So let's look at an extremely simple example with two hash tables. In practice, most people just use two; there are some people that use three; beyond that, it's impractical and unnecessary, so two is pretty much the right number. Let's say I want to insert A. For every hash table in my cuckoo hashing setup, I have a separate seed for my hash function. So I'm going to take this key and hash it twice. It's the same hash function, like MurmurHash or xxHash, but I give each table a different seed, so that for a given key it produces a different hash value.
So I hash A twice with my two seeded hash functions; the first one lands in this position and the second one lands in this position. At this point, my hash tables are empty, so I can insert into either one. For our purposes, we'll just flip a coin to decide; let's insert it into the first hash table here. In practice, you can do more complicated things: you can ask, what's the fill factor of each hash table, and always choose the one that's less full, or if you have metadata about the collision rates of these tables, you can make a better decision. As far as I know, everyone just flips a coin, and that's good enough; random is actually very, very good for a lot of things. All right, now say I want to insert B. Same thing, I hash it twice. The first one goes to the slot where A is already stored, but the second one goes to an empty slot. In this case, my choice is obvious: I always want the one that's empty, because I don't have to move anybody. I insert it there, and I'm done. Now let's insert C. Same thing, I hash it twice. Well, now the first hash function maps to the slot where A is, and the second maps to where B is. So I need to decide which one I want to kick out. Again, let's flip a coin, that's good enough, and to make the demo work I'll pick this one. So we steal that slot from B and insert C. Now I've taken B out, and I gotta put it back in the other hash table. I hash it with the first hash function, and that tells me where to insert it. But as we saw when we first inserted it, it wants to go where A is, so now we steal A's slot, put B there, and put A in the other table. We hash A, it comes over here, and it lands in an empty slot, so our insert of C is done, because everybody has landed in a free slot. Yes. His question, to which the answer is absolutely yes, is: can this have cyclic behavior, can you get stuck in an infinite loop? Absolutely yes. In that case, you have to recognize where your starting point was. If you come back around and see, wait a minute, I've seen this slot before, there's something here and I can't put anything in it, then I'm stuck in an infinite loop, and that's when you resize. Okay. So again, in practice, everyone does just two hash tables, and you want to allocate them in such a way that the likelihood you have a cycle is minimized.
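Here's a minimal sketch of that insert loop in Python (hypothetical; hash((seed, key)) stands in for a seeded hash function like xxHash, and a displacement cap plus alternating eviction stands in for real cycle detection and the coin flip):

```python
# Cuckoo hashing sketch: two tables, two "seeded" hash functions.
MAX_DISPLACEMENTS = 32  # crude stand-in for detecting a cycle

def _h(seed, key, size):
    return hash((seed, key)) % size  # stand-in for, e.g., seeded xxHash

def cuckoo_find(tables, key):
    # O(1): the key can only live in one of its two candidate slots.
    for t, table in enumerate(tables):
        entry = table[_h(t, key, len(table))]
        if entry is not None and entry[0] == key:
            return entry[1]
    return None

def cuckoo_insert(tables, key, value):
    evict_from = 0
    for _ in range(MAX_DISPLACEMENTS):
        # Prefer whichever table has a free candidate slot for this key.
        for t, table in enumerate(tables):
            idx = _h(t, key, len(table))
            if table[idx] is None:
                table[idx] = (key, value)
                return
        # Both candidate slots are taken: evict one occupant (alternating
        # tables here) and carry on by re-inserting the victim.
        idx = _h(evict_from, key, len(tables[evict_from]))
        (key, value), tables[evict_from][idx] = \
            tables[evict_from][idx], (key, value)
        evict_from = 1 - evict_from
    raise RuntimeError("probable cycle; a real system would resize here")
```

Usage would be something like tables = [[None] * 8, [None] * 8], then cuckoo_insert(tables, "A", some_value).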
Okay, so now, all of the hash tables we've talked about so far are static hash tables, meaning I need to know approximately the number of keys I want to store ahead of time, so I can allocate the table to be large enough to minimize collisions and avoid infinite loops or a completely full table. As was pointed out before, if you have to resize, either growing it larger, which is more common, or shrinking it to reduce the size, you essentially have to rebuild the hash table entirely. There is consistent hashing, which we'll talk about later in the semester when we cover distributed databases, where you don't have to rehash everything, but for the hash tables inside our database system, we're gonna have to rehash everything and rebuild from scratch. That's because we changed the number of slots we mod the hash value by, so things that were in one slot before could now belong in another slot, and everything gets out of whack. This is what dynamic hash tables are trying to solve: they can resize themselves on demand without rebuilding the entire thing. The most basic one is the chained hash table, which is what most people think of when they think of a hash table, but we're also gonna talk about two more complicated schemes from the 1980s that are still used today: extendible hashing and linear hashing. All right, so a chained hash table, or bucket hash table, is a dynamic hash table where we maintain a linked list of buckets holding the keys that hash to the same slot. When you allocate a HashMap in Java on the JVM, you get one of these; it's the default data structure they use. The way they deal with collisions is to just keep appending to the end of the bucket list, so each bucket chain can grow forever: you just keep adding more and more buckets, and the linked list gets longer. Of course, this can degenerate into essentially a sequential scan: if all my keys map to the same bucket chain, the chain grows forever, and then I'm just doing a linear search and I'm no better off than scanning the table. Insertions and deletions are pretty straightforward, because you're just modifying the buckets; you're not modifying the slot array. So it looks like this: we have our slot array, the slots map to buckets, and any time I wanna insert into a bucket chain and the last bucket is full, I just allocate a new one and keep appending until I run out of space, then allocate the next one. You can think of these buckets as just pages, like in a heap file: I allocate new pages and chain them together using page IDs to figure out how to traverse the chain. This is pretty easy to implement, and it's actually pretty easy to make thread-safe too, because all I do is take a latch on either the slot, which is the easiest thing to do, or the individual page, any time I'm modifying it. So let's look at the more complicated schemes. With extendible hashing, we take the chained hashing approach with buckets, but instead of letting the linked lists grow forever, we split them incrementally. The key difference between splitting and rebuilding is that we only split the chain that overflowed, rather than the entire hash table. To make this work, we allow multiple locations in our slot array to point to the same bucket chain; it'll make more sense when I show it on the next slide. The advantage, again, is that when we have to move data around, we're only moving the bucket that overflowed and not all the other buckets. All right, so it's gonna look like a chained hash table, except I'm gonna add some additional information.
The first thing I'm gonna have is a global counter that corresponds to the number of bits of the hash value we have to look at to figure out which slot to go to. In this example, we'll start with a global counter of two. Then, for each bucket, we have a local counter that corresponds to the number of bits that were used to get to that bucket. In this case, the first bucket has a local counter of one, meaning we only need to look at one bit to address into it. And this is why, if you look at 00 and 01, both of those slots map to the same bucket: their first bit, zero, is the same, and because this bucket hasn't overflowed, we haven't had a split yet. Whereas the other two slots, 10 and 11, point to buckets whose local counter is two, which says we have to look at two bits. The global counter is what you need to figure out how many bits to look at on a lookup; the local counter is internal bookkeeping about how you got to the bucket you're at. You don't use it for the lookup through the slot array, because obviously you can't know what it is until you've already followed the slot array to the bucket. All right, so let's say I wanna find some key. Again, I hash it and produce some bit sequence for my hash value, and I look at my global counter, which tells me how many bits of my hash value to examine to decide where to jump. My global counter is two, so I only look at the first two bits: 01. I do my lookup in the slot array at 01, follow the pointer, and land in the bucket I want, and now I just do a sequential scan to find the entry I'm looking for. Now let's say I wanna insert a key. Again, the global counter is two, so I look at the first two bits, land here, follow the slot array, and this bucket has a free entry, so it's safe for me to just insert it; that's not an overflow. But now I wanna insert C. The first two bits are 10. I follow this, I land here, but now I see there are no more free entries in this bucket: I'm gonna overflow, so I have to split it. The splitting process: I look at my global counter, which is set to two, and since the overflowing bucket is already using all two bits, I increase the global counter to three, meaning I'm gonna examine three bits. So I double the size of my slot array to account for three bits. This operation is cheap, because it's just an array of pointers: I take a latch on it to protect it, resize it, and put it back. It's not like I need to move around any of the data, which is the expensive part. Now my global counter is three, and I split the overflowed bucket by examining three bits instead of two to figure out which bucket its entries belong in. This bucket slides down; I restructure things to split the data that was stored in that single page, and I remap everybody based on the local counters. For the bucket up here, we still only care about one bit, so there are two slots that map to it up there and two slots down here that also map to it, wherever the first bit is zero. Now I go back and try to insert C again. I look at three bits, which tells me to look at this position; I follow the pointer, and now I'm able to insert into it.
So again, this movement looked kind of expensive, like stuff is sliding around, but all I'm really doing is splitting the one page I had before into two pages. So it's two page writes, plus the update to the slot array. In the back, yes. Isn't it expensive to remap the slot array to all the new pages? The question is whether it's expensive to remap the slot array. No, because the buckets are still at the same page IDs in the files on disk. I'm just updating the values in a single array so they point to where the data is actually stored. That operation is cheap; moving pages is expensive. Yes. What if the first bucket fills up, the one with local counter one? The question is: say the first one fills up, what happens? Well, now I would split it, and since its local counter is one, I increment that to two. Now it splits on two bits: the slots whose prefix is 00 would point to one bucket, and the slots whose prefix is 01 would point to the other. What about deletes? Okay, a few more slides and we'll get to that. Deletes basically reverse this process. In the back, yes. Are you storing the entire page in the slot array, or just a page ID? It's just a page ID; each bucket would be a page. Yes? What's the relation between this hash table and the buffer pool? The question is, what is the relation between a hash table and the buffer pool? At a high level, I've been ignoring that, but in practice, depending on whether you want it to be durable on disk, you would allocate a page in your buffer pool, just as you would for a table with slotted pages, and store a bucket page in there. Same thing. The buffer pool doesn't know and doesn't care. You just say, give me a page, here's the page ID, it hands you back some memory address, and you write your data in there. It doesn't know whether that page is part of a table or part of a hash table like this. So all the same eviction algorithms apply, and now you can kind of see that how we jump around accessing a hash table is gonna look a lot different from how we do sequential scans over a data table, so maybe you want different caching or eviction policies for them. Yes. Are you using three bits to map hash values to slots? The statement is: am I using three bits at this point to map hash values to slots? Yes, and that tells me what offset to jump to, and within that slot in the array I have a page ID that I can follow to get to the bucket. How does the mapping from page IDs to buckets work? I thought it should be one-to-one. It is. I'm using the term bucket instead of page because this could be in memory or backed by disk, it doesn't matter, but if it's backed by disk, then these are just page IDs: the bucket is a page, they're synonymous. So the only thing I need to store in the slot array is page IDs. So can multiple slots share the same bucket even under the bigger global counter?
Absolutely. So the statement is, going back up here: my global bit counter is two, but I haven't split this first page, so its local counter is one. Even though a lookup makes me look at two bits, in practice I only care about the first bit for this page. That's why these slots hold the same page ID, or bucket ID, whatever you want to call it. They can map to the same location because that bucket has not split yet.

The next statement, which is correct, is that after we reach a page, we just do a linear scan to find the thing we want. Absolutely, yes. Isn't that expensive? Again, if I have a billion tuples, then doing a linear scan within a kilobyte-sized page is nothing. And if you say, all right, I wanna be more crafty, more smart, well, maybe you store a filter or some little pre-computed summary at the top of the page that lists the keys it contains. It doesn't tell you where they are, it just tells you whether you have them, so you can do a quick check to see whether the key is there at all. But that linear scan is gonna be super cheap compared to reading the page from disk or doing a sequential scan over the entire dataset.

Okay, so the other dynamic hashing scheme is called linear hashing. So one problem with extendible hashing, well, it's not a huge problem, is that we have to double the size of the slot array. Again, computationally it's not that expensive, but while I'm doing that resizing, I have to take a latch on it and prevent anybody from reading or writing until I reallocate everything. So that can become a bottleneck if everybody's trying to get into my hash table at the exact same time. With linear hashing, the idea is that we localize the resizing to just the bucket that overflowed, so we don't have to take a global latch that locks everybody out of the hash table.

The way this is gonna work is that we maintain multiple hash functions, the same way we did in cuckoo hashing. Again, it's the same hash algorithm, just different seeds, that tell us how to jump to the right bucket for a given key. And we maintain this new thing called a split pointer that keeps track of the next bucket we want to split. Then we incrementally increase the size of our slot array. How we decide when to split is left up to the implementation: it could be when we run out of entries in a bucket and that triggers a resize, or it could be when the size of a bucket is larger than the average size of all the buckets. They all have different trade-offs.

All right, so it looks like this. Again, we have a slot array that points to buckets, just like in our chained hashing. And we add this new split pointer that keeps track of the next bucket we wanna split whenever we have an overflow anywhere in our hash table. So we'll split the bucket pointed to by position zero whenever any bucket overflows, not just when bucket zero overflows. If any of these other buckets overflow, we will split bucket zero, even if it's not the one that overflowed. At this point, our split pointer is at the beginning of the slot array, and we only have one hash function: we just mod the key by the number of buckets that we have.
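Before walking through the example, the lookup rule is worth pinning down. Here's a minimal sketch of it, under the common convention that the family of hash functions is h_i(key) = key mod (2^i * n), where n is the original number of buckets; the function and parameter names are mine, not the lecture's.

```python
def bucket_for(key, n, level, split_pointer):
    """Linear-hashing lookup rule: which slot does this key live in?"""
    i = key % (n * 2 ** level)            # first hash function, h_level
    if i < split_pointer:
        # Everything above the split pointer has already been split this
        # round, so rehash with the next function, which mods by twice
        # as many buckets.
        i = key % (n * 2 ** (level + 1))
    return i

# With n = 4 and nothing split yet, 6 lands in bucket 6 % 4 = 2:
print(bucket_for(6, n=4, level=0, split_pointer=0))   # -> 2
```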
So let's say I do a lookup on 6. I hash it, mod by four, jump to that location, and do a linear scan until I find the key I want. That works just like before with chained hashing. But now let's say I wanna insert 17. I hash it, 17 mod four, and I land at this position here, but now I don't have any free entries left in my bucket. So I have to create an overflow bucket, essentially just extending the chained linked list, and that's where I insert 17. But because I overflowed, that triggers a split wherever the split pointer is pointing. So even though bucket zero can still take entries, I overflowed somewhere, and that causes a split.

The way this works is that we add one new entry to our slot array, now position four, and we have a new hash function that now mods by 2n. The idea is that as we keep splitting down and down, we keep adding new entries until we're at 2n, twice as big as we were before we started splitting. The split pointer keeps track for us of whether we wanna use the first hash function or the second hash function; it tells us how far along the slot array we've split. Right, so in this case here, we add the new entry four, we create a new bucket, and this is where 20 gets moved to. And we move the split pointer down by one.

So the split pointer is essentially this demarcation line here. It basically says that whenever I wanna do a lookup, I first hash with the first hash function. Say I wanna do a lookup on 20: I hash it with the first hash function, and that takes me somewhere between zero and three. If the slot I land on is above the demarcation line of the split pointer, then I know the bucket I'm looking at has already been split, so now I need to use the second hash function to decide where I really wanna go for this data. So when I do a lookup on 20, I use the second hash function: I mod by eight, because that's 2n from where I started, and that tells me I wanna jump down here. Same thing if I do a find on 9: 9 lands here, which is below the split pointer, or above, depending on how you draw it, but it hasn't been split yet, so I only need the first hash function to find the thing I'm looking for. So is this sort of clear?

Yes? So the question is, back up here, when I inserted 17, wasn't that the bucket that overflowed? Why did I split the first one and not that one? Because that's the way linear hashing works. The algorithm splits whatever the split pointer points at, no matter whether that was the one that overflowed or not. Because eventually, if this bucket keeps overflowing, that keeps moving the split pointer down by one, and eventually I will get to it, I will get to everyone, split it, and then loop back around and start over again.

When we split that one, we'll copy the overflow entries into the split buckets, right? So the statement is, if I end up splitting this one here, say the split pointer moves down and now I split it, I will then use the second hash function to decide how to redistribute the entries. Correct, you redistribute all the key instances that map to that bucket, including the overflow chain. And then delete the overflow bucket? And then delete the overflow bucket, yes.
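Here's that insert-and-split behavior as a small runnable sketch, with a variant of the bucket_for rule from above rewritten against module-level globals so the whole thing stands alone. It's hedged: buckets are plain Python lists, a list growing past CAPACITY stands in for an overflow chain, and splitting on any overflow is just one of the policies mentioned earlier.

```python
CAPACITY, n = 4, 4                     # bucket size; initial bucket count
level, split_pointer = 0, 0
buckets = [[] for _ in range(n)]

def bucket_for(key):
    i = key % (n * 2 ** level)             # first hash function
    if i < split_pointer:                  # already split this round?
        i = key % (n * 2 ** (level + 1))   # then use the second one
    return i

def insert(key):
    i = bucket_for(key)
    buckets[i].append(key)
    if len(buckets[i]) > CAPACITY:         # an overflow anywhere...
        split()                            # ...splits the pointer's bucket

def split():
    global level, split_pointer
    buckets.append([])                     # one new slot-array entry
    old, buckets[split_pointer] = buckets[split_pointer], []
    split_pointer += 1
    for k in old:                          # redistribute with the second hash
        buckets[k % (n * 2 ** (level + 1))].append(k)
    if split_pointer == n * 2 ** level:    # finished a full round:
        level, split_pointer = level + 1, 0    # the second hash takes over

for k in (8, 12, 16, 4, 1, 5, 9, 13, 17):  # 17 overflows bucket 1...
    insert(k)
print(split_pointer, buckets)              # ...but bucket 0 gets split
print(bucket_for(20))                      # 20 now rehashes mod 8 -> bucket 4
```

Note how insert(17) lands in bucket 1's overflow and stays there; only the bucket under the split pointer gets redistributed, exactly as in the question just answered.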
Now it may be the case that, say this overflow chain was at the bottom here: I kept inserting into it, I kept overflowing, and I kept triggering splits, and by the time the split pointer gets down to split it, it may have overflowed again. That's just how the algorithm works; you loop back around and do it again. So the worst-case scenario is everybody inserting into the same bucket, and it takes a long time for it to get split. In practice though, with a good hash function and a low collision rate, this shouldn't happen.

Okay, so again, splitting at the pointer basically means that although we're not splitting the bucket that overflowed, eventually we will get to it, eventually everyone gets split, and everything balances out. Now, in this example I only showed inserts, but the pointer can also move backwards if you start doing deletes. And in the same way, you can reverse extendible hashing if you wanna start coalescing buckets as you delete things. In practice though, this is quite tricky.

So let's go back to where we were before. The split pointer was pointing here, and we only split the first entry, position zero in the slot array. Now if I wanna delete 20, I hash with the first hash function, and that takes me to zero, but this position is above our demarcation line for the split pointer, so I need to hash again, and I land here, and now I can find the entry I want. So now I delete this entry, and now the page is empty. I could just leave it alone and assume that later on I'll fill it up again. But if I wanna start compacting, if I wanna start reclaiming memory, then I just do all the same steps I did before in reverse: I blow away the bucket, blow away this pointer, move the split pointer back up by one, and now this slot goes away, my second hash function goes away, and I've reclaimed the memory. Just doing all the steps in reverse order.

Yes? The question is, do we eventually remove hash functions once we've split all the buckets? Yes, after you get all the way down to the bottom, the second hash function becomes the first one, so at most you ever have two hash functions live at a time.

The next statement is, instead of deleting 20, what if I delete 11, so that there's space for 20 to come back? Sorry, I'm missing what you're saying. No, you can't redistribute keys like that, because the hash function has to be deterministic: the same key should always produce the same hash value, so we know exactly where to find it.

In the back. Her statement is, going back here, if I deleted 6, I'd have to leave that slot as it is? Correct, down there we're back to one hash function, so if I remove this entry, I can't resize this bucket down here. I just leave it empty, yes.

So let me ask, I sort of flashed it already, but what's the obvious problem if I actually deleted the page, removed it, and moved the split pointer back up by one? You'd just keep going back and forth. Exactly. If the very next thing I do is insert 21, now I overflow, now I've gotta split whatever the pointer is pointing at, and I do all the same crap over again, right?
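Continuing the sketch from above, reversing a split when we compact is just the same steps backwards. This is a hedged toy that reuses that sketch's globals; real systems, as just discussed, are usually lazier about when they trigger this.

```python
def coalesce():
    """Undo the most recent split(): the delete-side reversal."""
    global level, split_pointer
    if split_pointer == 0:
        if level == 0:
            return                         # nothing left to undo
        level -= 1                         # step back into the prior round
        split_pointer = n * 2 ** level
    split_pointer -= 1
    for k in buckets.pop():                # drop the newest slot-array entry;
        buckets[split_pointer].append(k)   # its keys rehash back under the
                                           # first hash function

coalesce()            # merges bucket 4 back into bucket 0, pointer moves up
print(split_pointer, buckets)              # -> 0 [[8, 16, 12, 4], ...]
```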
So again, this is what I was saying: when you decide to split or coalesce, maybe you don't wanna do it at the exact moment you insert something into an overflow chain or empty out a bucket. Maybe you wait until the thing overflows again before you split, or until two buckets become empty before you start reversing. People have spent a lot of time making inserts go fast; deletes are harder to do, and in practice, in some cases it might just be better to rebuild the entire index. But you can do incremental deletes with these data structures, okay?

All right, so we spent today talking about hash tables. Again, these are fast data structures that on average do O(1) lookups to find keys, and we're gonna use them all throughout the internals of the database system: as we execute queries, for page tables, and for intermediate data structures. In practice though, at the application level, a hash table is usually not what you wanna use for a table index. The database system will let you do it; some systems let you say, when I call CREATE INDEX, I want a hash table. But when I call CREATE INDEX in most systems without specifying what data structure to use, I'm not gonna get a hash table. I'm gonna get an order-preserving index. And this is because a hash table can only do exact-key equality lookups, meaning if I wanna see whether a key exists, I have to have the entire key to do the lookup. I can't do a partial key lookup, and I can't say, find me all the keys less than my given key, because the hash function can't do that for you. So in practice this is not what you wanna use, and we'll do demos next time with MySQL and Postgres to show the performance implications of this; there's also a toy sketch of the difference below.

So instead, what you mostly get when you call CREATE INDEX is the beloved B plus tree. It was called the ubiquitous B-tree back in 1979, and 40, 50 years later, it's still the best data structure out there, in my opinion. Pretty much every single database system is gonna have some kind of B plus tree implementation. Except for the systems like memcache that are just a hash table entirely, every single major database system is gonna use something that looks like a B plus tree, or a straight-up B plus tree. Now, they differ in how they actually store things and do searches in some ways, but at a high level, what we'll talk about on Wednesday is the canonical B plus tree. Again, it's everywhere. Okay? So any questions?
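To make that last point concrete, here's a toy contrast between an equality probe and a range predicate. The sorted list plus bisect is just a stand-in for an order-preserving index like a B plus tree; none of this is any real system's code.

```python
import bisect

keys = {7, 42, 99, 1000}           # a hash "index": exact match only
print(42 in keys)                  # equality probe, O(1) on average -> True
# WHERE key < 100? A hash table can't answer this without scanning everything.

sorted_keys = sorted(keys)         # an order-preserving "index"
lo = bisect.bisect_left(sorted_keys, 10)
hi = bisect.bisect_right(sorted_keys, 100)
print(sorted_keys[lo:hi])          # WHERE key BETWEEN 10 AND 100 -> [42, 99]
```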