All right, let's give DJ Japhiel a round of applause. How you doing? I'm doing well, how are you? How's your girlfriend? That was a free ask for the season, at least. She mad? I mean, yeah, it's her birthday next weekend, so yeah, she's mad. I think you've got to fix that. Yeah, of course. OK, all right, lots to discuss today, so I'm going to plow right through everything. So again, a quick reminder of upcoming events, now with the correct dates. So today at 4:30 over Zoom, we're going to have the co-founder of PostgresML, which is basically a modified version of PostgreSQL... well, PostgreSQL itself isn't modified: they're building off the extension system and API in PostgreSQL to support machine learning frameworks and large language models directly inside of PostgreSQL. He's going to talk about the stuff that they've been building. Next week, we have probably one of the bigger vector database companies coming to give a talk, Weaviate. And then after that, it's going to be Featureform. And after Featureform, it's FeatureBase, right? Same name, different systems, OK? Again, this is optional; by all means, attend if you want. And then, as I posted on Piazza, I'll be sending out an email to the database companies, companies like these, with everyone's CVs, if you uploaded it by yesterday, OK? All right, so last class, we talked about hash tables. And we talked about how this is an important data structure that's going to give us this nice O(1) average time complexity to do lookups matching keys to values. And we spent time talking about how there is this differentiation between static and dynamic hashing schemes. Static meant a sort of fixed size, fixed number of slots, whereas extendible hashing, chained hashing, and linear hashing all had the ability to grow incrementally over time to accommodate more keys than you originally envisioned. So the main takeaway from last class should be, we spent most of our time talking about how to deal with conflicts, right?
If you have two different keys hashed to the same location, what do you do? And then we mostly talked about how these hash tables are primarily going to be used in most systems for internal data structures, like your page table in Project 1 or the page directory or the other things keeping track of the state of the database system itself while we're running. We'll see hash tables again when we talk about how to join efficiently. But again, for the most part, these are primarily used for internal data structures. So today's class, we're now talking about B plus trees. And these are going to be the default choice when you want to have an index in our relational database systems. So if you call CREATE INDEX, 99% of the time in most systems, you're getting something that looks like a B plus tree. So we'll first go over a high-level overview of what a B plus tree looks like, and what's the plus in B plus tree versus the regular B tree. And then we'll talk about some basic design choices of how and when we actually want to build one. And then we'll finish up, to the extent that we have time at the end, talking about all the different ways you can actually optimize and improve performance, and examples of what real systems are actually doing for that. OK? So the first thing we've got to discuss is: what is a B plus tree? And so the B plus tree is in a category of data structures called B trees. And what's confusing about this in the database literature or different database systems is that there's the class of data structures called B trees, and then there's a specific data structure called a B tree. And then some database systems are actually using B plus trees, but they're going to call them B trees. So if you go look at the Postgres code, they're going to refer to their data structure as a B tree.
But as far as I can tell, it's a B plus tree with some modern techniques, like from the B link tree. So when you say B plus tree, it's typically going to mean a bunch of these other different things. So there is no original paper on the B plus tree. Or actually, well, the one that everyone cites is this one from 1979 talking about what they call the ubiquitous B tree. And in this, it's described, hey, there's these different variants, but the most common one that's going to be most useful for our database systems is going to be the B plus tree. And then they cite some kind of IBM tech report that I have not been able to find. I didn't look that hard, but it doesn't show up right on Google. That's the one where they talk about the original B plus tree. The original authors of the B tree work, Bayer and McCreight, never actually defined what the B means in B tree. Typically, people say it's for balanced, broad, bushy. The guy's name is Bayer, B-A-Y-E-R, so he could have named it after himself. This data structure was actually developed at Boeing, like the airplane company, so it could have been Boeing tree. Nobody really knows. But typically, when you say B tree, people typically mean balanced. There's another variant called the B link tree. And as I said, there's the sort of classic B plus tree, but nobody implements it exactly as it's defined there. People are going to borrow bits and pieces of it. And in particular, what they're going to borrow is some ideas from this B link tree paper that actually came from here at CMU in 1981, by a guy named Phil Lehman. That dude still works in the dean's office at CMU on the fifth floor. And if you go look in the Postgres source code in the directory where they talk about their B tree, notice they say B tree instead of B plus tree. And they say it's an nbtree because it's a non-balanced B plus tree. But we'll get to that later.
But yeah, right here in the source, they say, oh yeah, this is a correct implementation. Correct is always important. Of Lehman and Yao's B link tree paper from 1981, right? So that's kind of cool. But again, we're going to focus primarily on this one. We'll see if we have time; I'll talk about the B epsilon tree at the end. And then the BW tree is a thing out of Microsoft, a lock-free version of a B plus tree. We actually implemented that here at CMU and it was not easy. And we have an open source implementation of that. OK. So the B plus tree is going to be a self-balancing ordered tree that's going to allow us to do searches and sequential access and insertions and deletions all in log n time, right? Because log n is going to be the height of the tree. And so the difference between what we'll describe here in the B plus tree versus a sort of generic binary search tree is that the nodes in our data structure can obviously have more than two keys. And the reason why we're going to want this is because, again, we want to minimize the amount of random IO we're doing. So we want to maximize the amount of sequential IO. And the B plus tree is going to be perfect for us to do this. Because when we land in a node, that's essentially fetching a page, and we cram as many keys as we can inside that before we have to move on to grab the next page from disk, right? So again, thinking back way back to the 1970s when hardware was terrible, you had a minimal amount of RAM, and your disk was also super slow. So a B plus tree is going to allow you to convert these lookups from random IO into sequential access. Because once you follow the tree down to the leaf nodes, once you're down there, you never go back up. Not entirely true, but for our purposes now, we'll assume that's the case. And then I can scan along the leaf nodes to find the data I'm looking for, right?
So more formally, we could say that a B plus tree is going to be an M-way search tree with the following properties. First is that it's going to be perfectly balanced, meaning that every leaf node in our tree structure is going to have the same depth, meaning the same number of levels down from the root to that leaf node. And again, Postgres is going to violate this a little bit; some people do. But for the very beginning, assume that's the case. We also have a rule that every node other than the root has to be at least half full. So if I can have M keys in a node, I need to have at least half the maximum number of keys, up to the maximum number, right? If I go below that threshold, if I go below being half full, then I'll have to do some merging. Again, we can tweak that requirement later on. And then the root is going to be a special case, so we can ignore it for now. And then every inner node with K keys is going to have K plus 1 non-null children, meaning I could have some slots for possible pointers to nodes below me when I'm an inner node, but I don't have to have the max number. So that's all the math. Let's look at an example. So here's a really simple two-way B plus tree. And we can define the root node at the top, and then the inner nodes, because we only have three levels, are just the ones in the middle, and then the leaf nodes are the things at the bottom. So within a node itself, we're going to have this alternating pattern between a pointer to another node and then a key. And then in the leaf nodes, there'll be the value that we're trying to store for a given key. And for now, we're not going to define what the value is, but you can think of it as potentially the record ID to point to the actual tuple, with some page number and offset. Or in the case of MySQL or SQLite, it could actually be the tuple itself. But for now, we can ignore that.
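The formal properties above — perfectly balanced, at least half full, K keys meaning K plus 1 children — can be checked mechanically. Here's a toy sketch in Python; the dict-based node shape and the function name are invented for illustration, not how a real system stores pages:

```python
def check_bplus_invariants(node, max_keys, is_root=True, depth=0, leaf_depths=None):
    """Recursively verify the textbook B plus tree invariants on a toy node
    shape: inner = {'keys': [...], 'children': [...]}, leaf = {'keys': [...]}."""
    if leaf_depths is None:
        leaf_depths = set()
    keys = node['keys']
    # Every node other than the root must be at least half full.
    if not is_root:
        assert len(keys) >= max_keys // 2, "node below half-full threshold"
    children = node.get('children')
    if children is None:
        leaf_depths.add(depth)  # record this leaf's depth
    else:
        # An inner node with K keys has K + 1 children.
        assert len(children) == len(keys) + 1, "child count must be keys + 1"
        for child in children:
            check_bplus_invariants(child, max_keys, False, depth + 1, leaf_depths)
    if is_root:
        # Perfectly balanced: every leaf sits at the same depth.
        assert len(leaf_depths) == 1, "leaves at different depths"
    return True
```

Running this on a tree shaped like the slide example (root, one inner level, leaves) passes; a lopsided tree trips the balance assertion.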
And so the way to think about these numbers here in the inner nodes and the root node is that these are essentially guideposts that tell you which path you want to go down. So at the root node here, we only have one key; it's 20. So you're going to go left for any key value that's less than 20. And then we'll go right for greater than or equal to. Same thing for the next one: I have 10 here. Less than 10 goes here. Greater than or equal to 10 goes on the other side. So what I'm describing so far, this is a basic B plus tree. But what I was saying before, that they're borrowing ideas from other papers like the B link tree, is that the nodes are also going to have sibling pointers at every level. So I think the textbook might only show them at the leaf nodes. Postgres puts them in the inner nodes as well. And I think the original B link paper had them at the inner nodes as well. And so the reason why this matters is, again, if I'm doing a search, like give me all the keys greater than or equal to 6, I can traverse down this side of the tree, get to the bottom, and now I can rip through along the leaf nodes and never have to go back up. We won't talk about multithreading just yet, but having to get locks or latches on your parents in order to scan along, that's going to be expensive. If I can just stay down at the lowest level I need to go, then I can move more quickly. And again, if it's random IO, this sucks. But if I get down here, assuming all these pages are sequential or contiguous with each other on disk, then that's all sequential IO to scan across. Yes? What's the point of having them at the inner nodes? It helps you when you do splits and merges. If I want to steal something, so say 10 gets deleted, instead of having to reorganize the entire tree, I could follow a sibling pointer, maybe take a key from this guy and bring it over. Yes?
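That traversal — follow the guideposts down, then rip along the leaf level through the sibling pointers without ever going back up — can be sketched like this in Python. It's a toy in-memory model with invented names, not an on-disk layout:

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # guidepost keys (inner) or real keys (leaf)
        self.children = children  # len(keys) + 1 child pointers (inner only)
        self.values = values      # values parallel to keys (leaf only)
        self.next = None          # right sibling pointer, B link style

    @property
    def is_leaf(self):
        return self.children is None

def range_scan(root, lo, hi):
    """Return all (key, value) pairs with lo <= key < hi."""
    node = root
    # Descend: bisect_right sends keys < guidepost left, >= guidepost right,
    # matching the rule described in the lecture.
    while not node.is_leaf:
        node = node.children[bisect.bisect_right(node.keys, lo)]
    # Scan along the leaf level via sibling pointers; never go back up.
    out = []
    while node is not None:
        for k, v in zip(node.keys, node.values):
            if k >= hi:
                return out
            if k >= lo:
                out.append((k, v))
        node = node.next
    return out
```

With real pages, each `children` entry and each `next` would be a page ID fetched through the buffer pool rather than a Python reference.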
What is the node pointer? The question is, what is actually a node pointer? Yeah. Like these red lines? No, not the sibling pointer. It's the node pointer at the top. So the node pointer would be like, so I'll show the next slide. You wouldn't lay this out on disk exactly as I'm describing, but the thing of it is, I have the key here, 20, so that's this part, and then the node pointer is saying, if you're looking for a key that's less than 20, follow this pointer and you go down here and find it. And in our world, in a database system that's on disk, it's just a page ID. But we also have to store the pointer to the other side, right? Correct, yes. OK. Yeah. So this is the visualization people usually show. So the nodes themselves are going to be basically arrays of key value pairs. And again, the keys are going to be derived from whatever attributes the index is based on from the table. So say I built an index on table foo, columns A, B, C. The keys themselves will be copies of the values of A, B, C for every single tuple. So you can sort of think of the index as like a replica of the table that you're trying to index. And it's organized in such a way, in an ordered manner, to allow you to do these efficient log n lookups. In the relational model, the tables could potentially be unsorted. We'll violate that in a few more slides. But the table could be unsorted, and so this index is a way to have fast sorted access. And of course, underneath the covers, the database system, and we'll see this later in the semester, has to make sure that your index is in sync with the table. Meaning if I update or insert a tuple into my table, I want it to automatically update my indexes. And the database system will do this for you and make sure everything is consistent and in sync. But again, we won't focus on that in this class. The values can differ depending on where they are.
Sorry, the values will differ depending on whether they're in an inner node or a leaf node. If it's an inner node, the value is a pointer to some page below us. If it's a leaf node, then it's going to be either, again, the pointer to the tuple, and I'm using "pointer" not in the memory address sense, I mean a page ID, or the record ID, or it could be the actual tuple itself. The arrays within the nodes themselves are typically kept in sorted order, but they don't have to be. And then there's this issue of how do you deal with null keys, because again, if the index you're trying to build using a B plus tree is non-unique, there could be null values. We have to put the nulls somewhere. So typically, you either put them all at the end or all at the beginning. And actually, when you create indexes in some systems, you can define where you want them. Do you want them to be first or last? Because depending on what your query is, you may not want to see the nulls first. You may want to see them at the end. And it depends on the application. Another important thing, too, is that going back here, there are only sibling pointers and pointers going down. There are no pointers going back up. And the reason why, and we'll see this more next class, is when we start taking latches on these nodes, we don't want to have one thread going this way from the top down and another thread going from the bottom up, because that's going to cause deadlocks. Now, the sibling pointers are going to have this issue too, and we'll see how to handle that. But by avoiding having the pointers going in two directions between different levels, it's one less thing we have to worry about. Because of the way we're going to do splits and merges, it's not like an AVL tree where you have to do rotations and all that. We're not going to do any of that. So I may say it's easier, but not really. They both suck; both are hard. So here's what our node looks like.
Again, assuming this is a page, we're going to have this array of key value pairs, our keys and values. And then we'll have these pointers here that'll be just a page ID to the previous one and to the next one along our level. The key value pairs could either be stored one after another. If it's an inner node, then the values would just be pointers, again, page IDs. We could also store them separately. And this is probably the most common approach. You would have the keys in one array, and then the values stored separately in another array. And then whatever offset you are in the key array corresponds to some offset in the value array, right? It's almost like the column store stuff we saw before. You can do simple arithmetic to decide how to jump around. Varchars mess that up, but you just maintain an offset table to keep track of these things, right? And then there's additional metadata to keep track of, like, here's the number of slots I have left in my page, and what level am I at. And that way, as you're traversing down, you can just look in the page and say, where am I in the tree? It's also useful for recovery as well. So I've already said this, I'm just repeating myself, but the leaf node values themselves could either be record IDs, which are a page number and offset to some location, or it could be the tuple data, as in the case of the index organized storage that we talked about before. SQLite and MySQL do this by default. But in SQL Server and Oracle, you can say CREATE TABLE and say I want it to be index organized, right? And it'll make a B plus tree, and then the leaf nodes will be the actual tuples themselves, right? For number two, you only do this for the primary key index; otherwise you're duplicating data, and you don't want to do that. Yes? I just want to clarify some terminology. Yes. For record IDs? The question is, if it's storing record IDs, is it a leaf node or an inner node?
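The layout just described — keys in one sorted array, values in a parallel array at matching offsets, plus sibling page IDs and a little header metadata — might look something like this in Python. All the field names here are made up for illustration; a real page would be raw bytes with offsets:

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class LeafPage:
    level: int = 0                               # header metadata: level in the tree
    keys: list = field(default_factory=list)     # sorted key array
    values: list = field(default_factory=list)   # parallel value array (e.g. record IDs)
    prev_page_id: int = -1                       # sibling pointers, stored as page IDs
    next_page_id: int = -1

    def find(self, key):
        """Binary search the sorted key array; the matching offset
        indexes straight into the parallel value array."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def insert(self, key, value):
        """Insert in sorted position, keeping both arrays aligned."""
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.values.insert(i, value)
```

Keeping keys and values in separate arrays is the column-store-style trick from the lecture: fixed-size keys let you binary search with simple arithmetic, and variable-length data gets pushed out of the way.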
It's a leaf node. Because again, we'll bring up an example in a second, but kind of back here, the only keys that actually correspond to what's in your table are found in the leaf nodes. So we'll see this in our demo when we delete and insert keys. A key that was deleted may actually still exist in an inner node, right? So you couldn't have it be a record ID pointing to something, because that record may not exist. In this case here, I have 35 in this inner node, but there's no 35 in the leaf nodes, meaning at some point the key 35 got inserted and then got deleted. But because of the way the algorithm maintains the balance of the tree, I didn't end up removing 35. So it's still there. So it's the thing of, all the inner nodes are guideposts, traffic signs or street signs, to tell you how to get down to where you need to go in the leaves. Yes? The statement is, is it true to say that every inner node has one key but every leaf node can have multiple keys? So in this example here, I'm showing two keys per node because I've got to make it fit on PowerPoint, right? There's nothing about the B plus tree that says you can only have two keys. You can have multiple keys. In this example, yeah, you only need one, but in a real B plus tree, you wouldn't have one key per node, right? You can have a lot more, right? And actually, and we'll get to this in a second, the slower the disk, the bigger the node you want, because that's more sequential IO. So you could have hundreds of keys. There's a limitation on what I can show in PowerPoint, but we'll bring up the demo in a second. Other questions? Okay, so I made a big deal of, okay, we're talking about B plus trees, not B trees. You may not know what a B tree is.
So the original B tree from 1972 had all the keys and values stored throughout the tree, sort of like an AVL tree, for example. And it's more space efficient, because you never have keys that don't correspond to anything actually in your data set. Like I said before, I could delete key 35, and it'll get removed from the leaf nodes, but it may end up staying in one of the guideposts. Or I could have multiple copies of the key going down my inner nodes to the leaf node, and that's potentially wasting space. So in a B tree, a key only appears once anywhere in the entire tree. But the problem with that is that the values, I mean the record IDs pointing to the actual tuples, can be anywhere in the tree. And so now, if I want to scan along sequentially to get all the keys I need in sorted order, I may have to traverse up and down, because I basically have to do a breadth-first search. And again, we're not gonna talk about latching just yet, but think of it like, I basically have to latch the entire tree as I'm going up and down. Whereas in a B plus tree, because the leaf nodes are the only places where values actually are, consider that's the exact copy of what's in the table, once I get to the leaf nodes, I don't have to maintain any of the latches from upper parts of the tree, and I can just scan along the leaves and let other threads do whatever they want at the top, as long as it doesn't interfere with what I'm doing. Right? So the advantages for a B plus tree over a B tree are that we're gonna have better concurrent access, and we're gonna maximize or improve the amount of sequential IO we're doing over random IO. Yes. The question is, if the inner nodes are only guideposts, why do we have sibling pointers? Because when you're doing splits and merges, you may need to borrow things or merge with your neighbor. And if you have the same parent, instead of going back through the parent, you can go across and get to them.
You don't need it; it's just an optimization. Postgres does it. I actually don't know whether MySQL does. Okay. So let's see how we wanna do our basic operations. So to do an insert, the goal is to find the correct leaf node. So we're gonna traverse down, following the guideposts, get to the leaf node where our key should be, and if it has enough space, great. We insert it in sorted order in that leaf node and we're done. If there's not enough space, meaning the number of keys in the node is full, then we're gonna have to split whatever leaf node we're trying to insert into into two nodes, divided in half, with half the keys going to one side and half the keys going to the other side, and then you're gonna copy up the middle key in that list of keys to your parent. And now you have a new guidepost and a new pointer down to the new leaf node you just created. And of course this happens recursively, right? If I promote up the middle key that I split on to the parent and that parent is full, well, now I gotta split the parent, and that can cascade all the way to the top. Making slides to show this in PowerPoint is kind of a pain, so I'm gonna bring up this visualization. Up here, let's check the size, here we go. All right, so this is a website, and again, the link on the slides takes you to the wrong one. I'll update it, but if you search B plus tree visualization, you'll get this. So I'm gonna do a demo of a B plus tree with degree two. So the maximum number of keys per node is two and the maximum number of pointers is gonna be three. So the first thing we'll do, we're gonna insert two. That lands in our root node. Can I make that bigger? Let's try this. Height 200 and then now I can maybe do this. Is that better? Okay. All right, so then we're gonna insert six, right?
So again, we only have a root node; it can hold two keys, so nothing changes here. So now we're gonna insert four. In this case here, we try to put three keys in our root, we can't do that, so it decides to split on four, makes two new leaf nodes, and then the middle key is four, so anything less than four goes on this side, greater than or equal to four goes on the other side. So two is on this leaf node over here, and then the four and six go on the other side. And in this implementation, they only have the sibling pointers go in one direction; some systems do that, Postgres does both directions. Again, it's not wrong, it's just done differently. So now I insert five, right? So I follow four: five is greater than or equal to four, so it goes down to that node, but I already had two keys in there, so I had to split that leaf node, I made two nodes, and then I put five up there. Okay, so far so good? Okay, so delete is essentially the reverse of this, where we start at the root, go down until we find the leaf node where the entry we wanna remove should be. If it's not there, then we don't do anything, right? Because you can't delete a key that doesn't exist. If it is there, then we go ahead and delete it. And if the leaf node we just modified is still at least half full, then we're done, right? We pop out and we don't have to do anything. But if, by deleting that key, the leaf node goes below our threshold, right, M divided by two minus one, where M is the number of keys per node, then the first thing we try to do is redistribute, meaning follow the sibling pointers, find another node at the same level as us, and steal one of their keys, as long as they don't become unbalanced; that's okay. We may have to tweak, in the parent node, the guidepost, the split point.
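The split we just saw in the visualization — an overflowing leaf divided in half, with the middle key copied up to the parent as a new guidepost — can be sketched in Python like this. It's a toy over plain lists, ignoring values and parent bookkeeping, with an invented function name:

```python
import bisect

def insert_into_leaf(leaf_keys, key, max_keys):
    """Insert `key` into a sorted leaf. On overflow, split the leaf in half
    and return (left, right, separator), where `separator` is the key that
    gets copied up into the parent as a new guidepost. When there's room,
    return (leaf_keys, None, None). A sketch of the algorithm only."""
    bisect.insort(leaf_keys, key)
    if len(leaf_keys) <= max_keys:
        return leaf_keys, None, None
    mid = len(leaf_keys) // 2
    left, right = leaf_keys[:mid], leaf_keys[mid:]
    # In a B plus tree the separator is *copied* up and stays in the right
    # leaf; in a plain B tree the middle key would move up and out entirely.
    return left, right, right[0]
```

With max_keys of two, inserting 4 into a leaf holding [2, 6] splits it into [2] and [4, 6] and copies 4 up, exactly as in the demo. The recursive case is the same move applied to the parent when it overflows.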
But again, it's not that expensive, because we would already have the latch for it, which we'll cover next time. If we can't redistribute, then we have to merge the leaf with one of its siblings, combine those keys, put them together, and then update the parent accordingly, right? And again, this is recursive. So if I merge two nodes together and then I delete a guidepost key in my parent, and now the parent is less than half full, then the merge will cascade up. Yes. When does the redistribution fail? Like, if you have two keys and I have two keys, I delete a key and I'm like, oh, I'm less than half full, and I go try to steal one of your keys, but if I do that to you, then you're less than half full, so I can't. All right, so let's go back to our demo, right? So, we can go ahead and delete. So let's go ahead and delete eight, right? Eight's at the far end here. Oh, it was six, oh yeah, sorry. Six, right? Goes over here, delete six. That's fine, we're still balanced. Again, it's two keys per node, and the half-full delete threshold in this case is one. We can go to degree four, if that would look better. But now let's say I delete key four in the middle, right? It basically propagates up and removes it from above. So let's go to degree four, because then you can start seeing the stealing better. So let's do the same thing. So let's do insert one, insert two, four, five. Little split, now we've got there, six, eight. All right, and then we'll do nine. All right, so we're gonna go ahead and delete five in the middle here. And again, at this point here, this leaf node will become less than half full. So the first thing it's gonna try to do is steal from one of its siblings. So let's go ahead and delete five. Huh, it didn't do that.
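The delete path described above — remove the key, and if the leaf drops below half full, try stealing from a sibling before falling back to a merge — can be sketched in Python as follows. Toy lists again, with invented names, and it assumes `sibling` is the right-hand neighbor:

```python
def delete_from_leaf(leaf, sibling, key, max_keys):
    """Delete `key` from `leaf` (a sorted list). If the leaf drops below
    half full, first try to redistribute by stealing from `sibling`; if
    the sibling can't spare a key without going below half full itself,
    merge the two leaves. Returns (leaf, sibling, action). A sketch of
    the algorithm only; parent guidepost updates are left out."""
    min_keys = max_keys // 2
    if key not in leaf:
        return leaf, sibling, 'not-found'   # can't delete a key that doesn't exist
    leaf.remove(key)
    if len(leaf) >= min_keys:
        return leaf, sibling, 'ok'          # still at least half full, done
    if len(sibling) > min_keys:
        # Redistribute: steal the sibling's closest key. The guidepost in
        # the parent between the two nodes must be updated afterwards.
        leaf.append(sibling.pop(0))
        return leaf, sibling, 'redistributed'
    # Merge: combine both leaves. The parent loses a guidepost, which can
    # push it below half full and cascade the merge upward recursively.
    return sorted(leaf + sibling), [], 'merged'
```

The redistribution-failure case from the question shows up directly: if the sibling is sitting exactly at the half-full minimum, stealing would unbalance it, so the code merges instead.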
Yeah, this, what's that? Yeah, I think it's a four. So I delete four, let's see what it does. Yeah, there it goes, there it steals. So this animation doesn't follow the textbook exactly, but it's not wrong; there are just different ways to do things, how aggressive you wanna be on certain optimizations. But as long as you get the high-level idea that you can steal, but you still have to update the parent when you steal, because that's gonna change the boundary points. Yes? I'm seeing it in this way. Yes, so as I said, in this example here, they have sibling pointers going in one direction. Postgres and other systems go in both directions. Doubly linked list, yeah. Yes? So you said that sibling pointers... Yeah, the question is, actually, I thought I had an example of that. Well, I know I do, like in the middle here. These guys, yeah. So I think for leaf nodes, you definitely want pointers to your siblings along this level, because you need to be able to go along the leaf nodes. If this thing got even bigger, could two nodes at the same level have pointers to each other, even though they don't have the same parent? You could; I actually don't know what Postgres does or other systems do. Yes? Yeah, so at this point, sorry, if you have the same parent or? So if you have different parents, then you don't need to have the sibling pointer, because you're always gonna merge with a sibling under the same parent. Everything goes up to the root. So, maybe in case you have to reorganize everything, right, that may help, but at that point, you're latching the whole tree, so who cares? Yes? You said you don't store the parent pointer, how do you send the data back to the parent? The question is, if you don't store a pointer to the parent, how do you send data to the parent?
We will discuss this next class, because basically, as you're going down, you keep track of the stack of the nodes you visit, and you keep track of which ones you have to keep latched. So I can go down, like, if I'm traversing down here, and this guy here, I gotta split. When I come down and get here, I would recognize, hey, I'm gonna have to split, so don't release the latch on my parent. That way I still have a pointer to it, I still have it on my stack, and I can get back to it. So it's the internal bookkeeping of the worker as it goes down the tree. That splitting can go all the way back to the top. Yes, again. And then, so, we'll talk about what it means to do safe or unsafe traversals. Like, as you go down, say I'm trying to delete something. So as I go down, if I know that, no matter whether or not the key I need below me is there or not, I won't have to do a merge or split on this node I'm at right now, then once I go past it, I can release the latch on it, because it's considered safe. No matter what happens below you, it'll never get reorganized. So you don't need to maintain the latch for it. That we'll cover next class. Okay, so those are the basic operations for splits and merges. So the B plus tree, or just trees in general, but the B plus tree in databases, is gonna let us do a bunch of other stuff we couldn't do with a hash table. With a hash table, the only operation we could do is: does something equal this key, right? Is the hash key equal to the key I'm looking up? We can't do less than, we can't do greater than, and we can't do any partial key lookups. You have to have the entire key, right? So if I built an index on columns A, B, C, and I only have columns A and B for my key, I can't hash that and jump to anything meaningful, right? Because the hash is completely random.
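The bookkeeping described a moment ago — remembering the stack of ancestors as you traverse down, so a split or merge can propagate back up without parent pointers — looks roughly like this in Python (toy dict-shaped nodes and an invented function name, for illustration only):

```python
import bisect

def descend_with_stack(root, key):
    """Walk from the root to the leaf that should hold `key`, recording the
    (node, child_index) pairs visited. A split at the leaf can then walk
    this stack back up to fix each parent, instead of the tree needing
    parent pointers. Toy node shape:
    inner = {'keys': [...], 'children': [...]}, leaf = {'keys': [...]}."""
    stack = []
    node = root
    while 'children' in node:
        # Keys less than the guidepost go left; greater than or equal goes
        # right, matching the guidepost rule from the lecture.
        i = bisect.bisect_right(node['keys'], key)
        stack.append((node, i))
        node = node['children'][i]
    return node, stack
```

In the real latch-crabbing version covered next class, each stack entry would also hold the latch you decided not to release because the node below might still split or merge into it.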
But in a B plus tree, we can do a bunch of tricks where we can have only parts of the key, or a certain number of the attributes that the key is based on, and still use it for queries. So again, say the index is on A, B, C. Obviously I can do A equals one and B equals two and C equals three. That's the same thing as the hashing: I have equality matching for all of the keys that the index is based on. I also can do what's called a prefix search, where I only have A and B and not C, right? And we can do the lookups to find all the matches based on A and B without C. But we also, and not all systems do this because it's trickier to do, we can also do a suffix lookup, where we don't have the prefix, but we have the suffix of the keys. So I don't have A, but I have B and C, and I can potentially still use my index and do that lookup. Very few systems do this. This is hard. Postgres doesn't do this. Oracle, and I think maybe SQL Server, can do this, right? And again, if it was a hash index, we have to have the entire key, and it always has to be an equality predicate. Yes? It's hard to implement. It's hard to implement, yes. Yeah, because you basically need to have potentially multiple threads at the same time going down and everyone coordinating. There might be a patent from Oracle that prevents people from doing this. Wouldn't surprise me. I don't know. They're called skip scans in Oracle. And that might be why nobody else does this. All right, so again, say I have my index on A and B. If I want to find key one and two, again, I use my guideposts, and then I'm essentially just looking at the parts of the key in sequential order. So first I check: is one less than or equal to one? And then I check the second part: is two less than or equal to three? And that tells me that I want to follow down this node here, and then I can find the entry I'm looking for.
If I'm doing a prefix search, meaning I have the first part of the key but not the rest, so I have A but not B, you basically look at the part you do have and say, is one less than or equal to one? Yes, follow down here. Then I scan along and keep evaluating my predicate against all the keys in the leaf nodes until something violates it, meaning I know I can't traverse any further. Because of the sort order, I know there'll never be anything after this point where the first part of the key, A, equals one, since entries are sorted first on the first part of the key. For the last one, the skip scans, basically at every single node you evaluate, okay, what part of the key do I have, and then determine what portions of the tree below you have to look at. In this example, since I don't have the first part of the key, I essentially have to look at everything. And again, I think in Oracle, they can have multiple threads go down different parts of the tree in parallel and then combine the results together at the end, so it looks like you're doing a wildcard search. We'll see this a little bit in the demo at the end. There's a trade-off, and the system may decide: all right, I know something about the statistics of the keys my index is based on, so it's probably still worth going to the index. Or it may decide: I don't know anything about what you're asking me to do, so the index isn't gonna help me; I'm just gonna do a sequential scan across the entire table. And that actually may be faster than doing multiple probes down into the index and combining results together. Again, it's not something the programmer has to deal with; the system takes the SQL query and tries to figure this out on its own.
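The three lookup styles above can be sketched with a plain sorted array standing in for the B+ tree's leaf level. This is a toy model, not a real node layout; the names (`entries`, `prefix_search`, `skip_scan`) are mine. The point is that sorted order makes prefix matches contiguous, and a skip scan is one probe per distinct leading value.

```python
from bisect import bisect_left, bisect_right

# Sorted leaf entries of a composite index on (a, b, c).
entries = sorted([(1, 2, 3), (1, 2, 9), (1, 3, 1), (2, 1, 4), (2, 2, 2)])

def full_key_lookup(a, b, c):
    """Equality on the whole key: a hash index could also do this."""
    i = bisect_left(entries, (a, b, c))
    return i < len(entries) and entries[i] == (a, b, c)

def prefix_search(a, b):
    """Prefix search on (a, b): only possible with sorted (tree) order.
    Entries are sorted first on a, then b, so all matches are contiguous."""
    lo = bisect_left(entries, (a, b))                 # first entry >= (a, b, *)
    hi = bisect_right(entries, (a, b, float("inf")))  # just past the last match
    return entries[lo:hi]

def skip_scan(b, c):
    """Oracle-style skip scan sketch: one probe per distinct leading value,
    instead of scanning the whole leaf level."""
    results = []
    for a in sorted({e[0] for e in entries}):
        i = bisect_left(entries, (a, b, c))
        if i < len(entries) and entries[i] == (a, b, c):
            results.append(entries[i])
    return results
```

With a hash index, only `full_key_lookup` would be possible; the other two rely on the sort order.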
Again, we'll cover this after the midterm. Yes? Is it up to the programmer to decide which indexes to make? So the question is, is it up to the programmer to decide what indexes to make? In most systems, yes. This is an old problem in databases. I talk about how great the abstraction of the relational model is, that you don't have to worry about how things are physically stored. But at the end of the day, someone's gotta decide what indexes you actually need. And so there's a long line of research, myself here at CMU, but going back to the 1970s, where people try to figure out automatically what indexes you need. The commercial systems have built-in tools that can help you figure this out. MySQL and Postgres do not have that. Postgres will build whatever indexes you tell it you want; if you tell it you want a thousand indexes, it'll do it, right? In SQL Server on Azure, what they'll actually do is spin up a separate instance of your database system, try out basically some kind of machine learning algorithm to figure out what indexes you actually need, and then suggest them to you. SQL Server does other interesting things too in the query optimizer, which we'll talk about after the midterm: your query shows up, it starts planning it based on whatever indexes you have, but at some point it can also say, man, it'd be really great if I had an index on this table right now. It can potentially build that index just for that query, but it'll also come back and tell you, hey, by the way, if you gave me this index, I'd run a lot faster.
The reason you could build it just for the query and then throw it away is that it only affects that query; you may not wanna build it and keep it around because, as I said, you gotta maintain it and keep it in sync with the table. You don't want your database system to start adding a ton of indexes that make all your inserts, updates, and deletes go slower. Also, they take storage space and memory space, so there's a cost on the physical hardware as well. That's a whole other hornets' nest we could get into. Okay. All right, so the next thing we gotta deal with is: at this point, we know how to do inserts and deletes, and we know how to do basic lookups to find the keys we're looking for, whether prefix searches, full key searches, or the skip scans. The next challenge is how do we handle duplicate keys? There's two approaches. Again, the issue is gonna be: since I wanna always be log n, how do I handle inserting duplicate keys over and over again without violating that requirement? The most common approach is to maintain a sort of hidden column or hidden attribute in the key, containing the record ID of the tuple it's pointing to. That guarantees that every key ends up being unique. If you have a key on four and I have a key on four, but you have a separate tuple and I have a separate tuple, then if we put our tuples' record IDs as part of the key, in addition to the column the index is based on, your four and my four end up being unique. And because I can do a prefix search, where I don't have to have all the elements of the key to do lookups, this scheme still works. The other approach is overflow leaf nodes.
The basic idea is that if a leaf node gets too full but I know I'm inserting the same key, then I just keep building a linked list and go down in the depth of the tree. But that violates our log n guarantee. All right, so here's how we do the record ID approach. The key now isn't just the number; it's the value followed by the record ID, which is the page number and offset. So if I want to insert six and six already exists: even though in SQL you might call insert six, what the database is gonna do is convert that to insert six followed by the page number and slot number. So now when I get down here, since this node's full, I'll just do a split, things slide over, and now I can insert six right here. Superficially it looks like there are duplicate keys, but the actual bits themselves are unique. Of course, if I now want to do a delete on six, internally the database system is gonna know it's a delete on six followed by this record ID, the page number and offset. Yes? What happens when the key is not a number? What do you mean? Is six the actual record or the key? Oh, six is the key. Oh, okay. How does the database know what to insert? Because I insert into a table, and the table has an index on column A, and the tuple I'm inserting sets that attribute's value to six. But instead of just putting six into the index, it's gonna say: all right, I've already inserted into the table, now I have a record ID, so when I insert into the index, it's the six appended with the bits for the record ID. That guarantees that no matter how many times I insert six, it's unique. Now, if it's a unique index, like a primary key index, or I declare it as a unique index, then I don't wanna do this, but the mechanism still works. Yes?
So this is very similar to having a hidden attribute that makes the keys unique, right? Where the hidden attribute can be thought of as a primary key. The statement, and he's correct, is that this is just essentially a hidden attribute that guarantees that duplicate keys are physically unique, because it's the record ID, yes. That's the trick. Yep. Right, so the overflow leaf nodes look like this. I insert six, I recognize my leaf node is full, but I also recognize the thing I'm trying to insert already exists in here and therefore it's a duplicate, so I just make this overflow page and insert it down here. And as I keep inserting new things, I keep appending along like this. So this looks sort of similar to the chained hash table from before: instead of having a hash function tell me where I land at the start of my linked list, I have a tree structure in front of it, but it's essentially doing the same thing. Because this again violates the log n bound, we have to deal with: what if we actually wanna split and merge? When do we move things? This makes things way more complicated than the record ID approach, and I don't think any real system actually does this. Yes? The question is, what's the benefit of this approach? Well, now I don't need to store the record ID as part of every single key. It's potentially easier engineering. Actually not really, it makes it harder. Yeah, this is a bad idea, don't do this. You could do it, I'm just saying it's bad. All right, next I'm gonna talk about clustered indexes. It's good for you guys to at least know this exists, but I don't wanna spend too much time on it.
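The record-ID approach can be sketched in a few lines, again using a sorted array as a stand-in for the leaf level. This is a toy model with names of my own choosing: each entry is `(key, page, slot)`, so two inserts of the same key value are still physically unique, and a prefix search on the key alone finds all of them.

```python
from bisect import insort, bisect_left, bisect_right

# Sorted-array model of a leaf level where duplicate key values are made
# unique by appending the tuple's record ID (page number, slot number).
index = []

def insert(key, page, slot):
    # The stored entry is physically unique even when `key` repeats.
    insort(index, (key, page, slot))

def lookup(key):
    # A prefix search on the key alone still works, so duplicates are cheap.
    lo = bisect_left(index, (key,))
    hi = bisect_right(index, (key, float("inf")))
    return index[lo:hi]

insert(6, 3, 0)   # two tuples with key 6, on different pages
insert(6, 7, 2)
insert(4, 1, 5)
```

A delete would likewise be issued internally as a delete of the full `(key, page, slot)` entry, so the system always knows exactly which copy to remove.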
The basic idea is that some database systems let you define what are called clustered indexes on tables, where, even though the relational model is unsorted, you can say I want the physical tuples on disk to be sorted in the order defined by some index. And in this case here, if I have a true clustered index, then no matter where I insert a new record, the heap files themselves are guaranteed to be in that sort order. You can sort of think of this like the MySQL or SQLite approach, where the leaf nodes are actually storing the tuples: that's automatically a clustered index. But in database systems where it isn't an index-organized table, you can have the sort order be enforced by this kind of index. The advantage is that when I wanna start doing scans, assuming I'm not doing index-organized storage, when I scan along the leaf nodes to find all the tuples I'm looking for, I'm guaranteed to get the pages in sorted order as defined by the key order. So as I scan across, I get all my entries, I get all my pages, I just rip through them sequentially, and things go fast. If you don't have a clustered index, then you sometimes end up with a bunch of random I/O. The leaf nodes can be stored sequentially on disk, but when I start doing lookups to get the actual data the leaf nodes are pointing to, that could end up being random.
And so if I do something really stupid, like say I have one free frame in my buffer pool, and I fetch the pages in the order they come out of the index, I may end up doing a bunch of redundant I/O: I fetch a page in, process it because that's the key I'm pointed at right now, then I throw it away, get the next page, but then a few keys later I go fetch the same page I had before. So a really simple optimization is that you don't retrieve the tuples as you scan along the leaf nodes. You do the scan of the leaf nodes first, get the list of all your pages, then sort them by page ID, and then go retrieve them in that order. Now you still have to do the bookkeeping to make sure that you're producing the tuples in the order defined by the index, if that's what you care about. But again, this is a way to get more sequential I/O and reduce the amount of random access. Yes? Wait, why couldn't we just keep track of what we've fetched while scanning? So his statement is, why couldn't I just keep track of what pages I've already fetched and not fetch them again? I was giving a toy example where I only have one frame, so I can only fetch one page, throw it out, and get the next page. But in a real system, don't think of one page; think of, say, 10 gigabytes of buffer space while the table is one terabyte, so you're running out of space. You want to sort the page IDs and access the pages in the order they exist physically on disk, then still do the bookkeeping to make sure the order of the results you're generating matches the sort order of the index. Again, to reduce the amount of wasted I/O. Okay, so I want to quickly go through some design choices here, how to handle certain things.
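The two-pass trick above can be sketched as follows. This is a toy model, with `read_page` standing in for a buffer-pool fetch and entries of the form `(key, page, slot)`; the names are mine. Pass one collects record IDs from the leaf level, pass two fetches each distinct heap page once in sorted page-ID order, and the bookkeeping at the end restores the index's key order.

```python
def fetch_in_page_order(leaf_entries, read_page):
    """Scan the leaf nodes first, then fetch heap pages in page-ID order
    for sequential I/O, while emitting results in index key order."""
    # Pass 1: which (page, slot) does each result position need?
    rids = [(page, slot) for (_key, page, slot) in leaf_entries]
    # Pass 2: fetch each distinct page exactly once, in sorted (on-disk) order.
    pages = {pid: read_page(pid) for pid in sorted({p for p, _ in rids})}
    # Bookkeeping: reassemble tuples in the order the index produced them.
    return [pages[p][s] for (p, s) in rids]
```

A wrapped `read_page` that records its calls shows the pages are fetched once each, in ascending order, while the output still follows the leaf order.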
And so a lot of these ideas come from this book, which is considered the Bible of B plus trees, by Goetz Graefe. He invented a bunch of the various techniques we'll discuss this semester. It's called Modern B-Tree Techniques, and again, he's talking about B plus trees, but he calls them B-trees. If you just Google the name of this book, it came out in 2010, it's a great book, covers a lot of these techniques and even more; you'll find the PDF. If you like this kind of stuff, it's a good read, because it's not just theory, it's here's actually how to implement it in a real system. All right, the first question is: what's the node size gonna be? You can assume in all our diagrams that one node corresponds to one page in our database files in our buffer pool. But in some systems, like IBM DB2, you can actually configure the size of a database page for different tables and different indexes. And so depending on what your hardware is, you may want to set the page size of your B plus tree nodes differently. The slower your disk, the larger the page you typically want, because that's better for maximizing sequential I/O. So on an old spinning-disk hard drive, you want a page size of around one megabyte. The number of keys you can fit on a one-megabyte page will depend on how big your keys are; if they're all eight-byte integers, you can store a lot of them. For a fast SSD, roughly eight to ten kilobytes is considered the right size. And if you're in memory, 512 bytes is considered the right size, on the order of a few cache lines, to keep things very efficient. We talked about how some systems can actually violate the requirement that every node has to be half full.
Obviously you can't have more keys than you can actually store, because you run out of space. But you can recognize: okay, maybe I don't want to split or merge my nodes all the time, and I can go below that half-full threshold temporarily to see whether something's gonna get inserted that puts me back above it, and avoid doing the reorganization prematurely. This is why Postgres calls their B plus tree a non-balanced B plus tree: they can violate this requirement. Next question is how do you want to handle variable-length keys? I think somebody brought this up. We could try to approach it like a column store, where we want everything to be fixed length. One way to do that is to not store the keys themselves in every node, but just store a pointer to the key, like the record ID, because that's always gonna be either 32 bits or 64 bits. And this will save space too, because if my keys are all really big, I'm not storing them; again, the B plus tree is a copy of what's in the table, so I only have to store the pointer in the nodes. Is that a good idea or a bad idea? He's right: it's not a good idea because it causes a lot of random I/O. Absolutely, yes. Think of it like this: as I'm traversing down my nodes and I gotta figure out whether to go left or right, I don't have those guide posts in my node; I gotta follow the pointer to go get that tuple on that page, then do the lookup to find what I need. And while I'm doing this, I'm holding latches on my data structure, and that's gonna be really slow. So nobody does this in a disk-based system. The variant of the B plus tree that did this is called T-trees. I forget what the T stands for; I think it stands for the dude's name.
In the diagrams, the nodes look like T's, but I think when asked, he said, no, it's the guy's name, but whatever. In-memory databases did this in the 80s because they wanted to save space: you don't have to duplicate keys in your B plus tree, and they didn't have a lot of memory. But nobody does this now in a real system, because it's so expensive to do that extra lookup. It's easier to just copy the key. You could support variable-length nodes, where the size of a node can vary within the index. You'd do this because you wanna have the same potential number of keys in every single node, but you may not have enough space to store all those keys within a fixed-size node. As far as I know, only academic systems do this; nobody does this in the real world. Padding is another approach, as we talked about before with column stores; again, I think this is rare. What most people do is essentially a slotted-page approach, like we saw with table pages, where you have an array of pointers: either offsets within the page you're looking at, or pointers to another overflow page. It's just like the overflow values we saw before: you know that the thing you're looking for is not in your page, so you gotta go somewhere else and get it. All right, now we gotta talk about how we actually find keys once we land on a node. We bring it into memory, and we're looking for a key to decide whether to go left or right, or whether we have the match we're looking for in our leaf node; we gotta decide how we do that match. The easiest approach is to just do a linear scan. Think of it like an array: doesn't matter if it's sorted or not, I just start at the beginning and scan along until I find the thing I'm looking for. In this case here, I'm looking for key eight. It's simple, it's dumb, it works. We can do a little better, though, with SIMD.
Actually, who here has taken 418 or 618? Nobody, okay. Who here doesn't know what SIMD is? Okay. SIMD stands for single instruction, multiple data. It's a class of instructions on modern CPUs that basically give you vector registers: you put a bunch of values in one vector, a bunch of values in another vector, and with a single instruction you can, say, add them together and put the output into another vector. We'll cover this when we talk about query execution, but it's a very common technique in modern systems. This is what made Snowflake special ten years ago. So instead of looking at every single key one after another to try to find eight, I can use a SIMD intrinsic, in this case an x86 one that does an equality comparison on 32-bit integers in 128-bit registers. I'm looking for eight, so I store four copies of eight in my SIMD register as four lanes, and now with a single instruction I can compare those four eights with four keys in my array, and I get a bit mask back: zeros where there's no match, ones where there's a match. So now it's a single instruction to do that evaluation, way more efficient than going one after another. In this case I don't have a match in the first batch, so I slide over and look at the next one. I have to recognize that I only have three keys left and not four, so I gotta play little tricks to make sure I don't end up with a false positive. But now I have eight equals eight in the first lane, and I have a match. It's still linear, but I'm doing it in batches, and the hardware supports that. The next approach, obviously, is binary search.
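To show the shape of the technique without actual intrinsics, here is a scalar Python model of SIMD linear search: compare four keys per "instruction" (like the SSE compare on four 32-bit lanes), produce a bitmask, and inspect the mask instead of each key. The function name and lane count are mine; real code would use the x86 intrinsics directly.

```python
def simd_find_eq(keys, target, lanes=4):
    """Scalar model of SIMD linear search: one 'vector compare' per batch
    of `lanes` keys, yielding a bitmask with a 1 in each matching lane."""
    for base in range(0, len(keys), lanes):
        chunk = keys[base:base + lanes]    # the final chunk may be short,
        # which also sidesteps the false-positive problem a real SIMD
        # implementation must handle by masking off the unused lanes.
        mask = sum((k == target) << i for i, k in enumerate(chunk))
        if mask:                           # any lane hit?
            return base + (mask & -mask).bit_length() - 1  # lowest set bit
    return -1
```

The search is still linear overall, but it touches the keys in batches, which is what the hardware accelerates.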
Assuming the node is sorted, this is easy: jump to the middle, and if the key I land on is greater or less than the one I'm looking for, jump to the corresponding side, and so forth until I find my match, then I'm done. This is what most systems will do, yes. It depends on the hardware. In Postgres it'll be eight kilobytes, but the number of keys you can store in that node is gonna depend on what the type of the key is. So binary search is the most common one. You could also do interpolation search; I don't think anybody outside of academia does this. It works if you know there are no gaps in your keys and they're always in monotonically increasing order, like a primary key that's an auto-increment value: plus one, plus one, plus one. Assuming I don't have any gaps, it's just simple math to figure out exactly where to go within my array: I know the min value, the max value, and the number of keys I have, and I can do a simple formula to jump exactly to the offset that I need, and I'm done. This is the fastest approach, faster than binary search, faster than SIMD, but again, you can't have gaps, so it's rare. All right, we have 20 minutes left to get through all these optimizations; let's see how far we can go, okay? Some of these should be pretty obvious. The pointer swizzling and the buffered updates are probably the most important; dedup, too. All right, so just like in a column store, we should recognize that the keys in our B plus tree are gonna be in the same value domain, because they're all coming from the same attribute. Furthermore, they're sorted, which is even better for compression. So there's a bunch of things we can take advantage of, recognizing that the values are gonna be very similar, to reduce the size of the keys we actually have to store.
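The interpolation formula described above can be written out as a short sketch. This is an illustrative implementation under the stated assumptions (sorted, dense, monotonically increasing keys); the loop handles the general case where the estimate is slightly off.

```python
def interpolation_search(keys, target):
    """Interpolation search: estimate the offset from the min/max values
    and the number of keys, instead of halving the range like binary search.
    Assumes sorted keys with no gaps, e.g. an auto-increment column."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        # Linear interpolation between the endpoints gives the position.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo] or 1)
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On a truly gapless array like `range(100, 200)`, the very first estimate lands exactly on the target, which is why this beats binary search when its assumptions hold.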
So in this case here, we can do what's called prefix compression. We identify that we're gonna have a bunch of keys that are very close to each other in lexicographical order, with overlapping portions. So instead of storing complete copies of the keys, we just store the common prefix, in this case "rob", and then the remaining suffix that's unique for each key. That's pretty easy, that's nice. The next technique is deduplication, and the idea is that we're gonna have a bunch of keys that end up with the same value over and over again in the same node. This is ignoring the trick of putting the record ID at the end, because the system knows it's doing that and can pull that piece out. If I have a bunch of non-unique keys that end up in the same node, then just like prefix compression, I store the duplicate key once, then have a posting list of all the values that correspond to that key, and now I'm only storing one copy of that key. Postgres added this, I think in Postgres 13, and it's a pretty significant drop in the size of your nodes. Yes? How does the system know how to interpret the prefix and reconstruct the keys? Yeah, so the question is, how do we know how to interpret this, that these are values and not keys? This is just a mock-up; you wouldn't store the page exactly like this. You'd obviously have lengths and the number of elements you're storing; I'm just not showing that. You have additional metadata to know where the offsets are. We can also do suffix truncation. Because the inner nodes don't have to be exact copies of the keys, since those keys might not even exist in the leaf nodes, we maybe don't have to store the entire key. We just need enough of the key, the key's prefix, to discriminate whether we need to go left or right.
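Both compression ideas fit in a few lines. This is a toy sketch, not a real node layout (a real page stores lengths and offsets as metadata, as noted above); the function names are mine.

```python
import os

def prefix_compress(keys):
    """Prefix compression for one sorted node: store the longest common
    prefix once, plus each key's unique remaining suffix."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

def deduplicate(entries):
    """Deduplication: store each repeated key once with a posting list of
    all the values (record IDs) that correspond to it."""
    posting = {}
    for key, value in entries:
        posting.setdefault(key, []).append(value)
    return posting
```

For keys like "robbed", "robbing", "robot", prefix compression stores "rob" once plus three short suffixes; deduplication turns repeated key six entries into one key with a two-element posting list.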
So in this case here, I have keys ABC up to K, and then O up to V. The only thing that really matters here is, say, just the first three characters of those two strings. So my inner nodes only need to store the minimum prefix we need to decide whether to go left or right. Of course, the challenge now is that if I insert a key that falls between them, I maybe gotta go back and get the original keys to decide what the prefix should be. But in some environments, this might be the right thing. All right, so pointer swizzling is a common technique that minimizes the number of lookups we have to do in our buffer pool's page table. Because when we are traversing the tree structure, what I keep calling pointers are really page IDs. So I gotta go to the page table and say, okay, if this page exists, give me the pointer to it. So if I wanna find keys greater than three, I start at my root node, look at the keys, and decide whether I wanna go left or right; in this case, left. But the value in this node is a page number, say page two. So now I gotta go down to my buffer pool and say, okay, give me the pointer to page two. Likewise, when I'm at the bottom and I wanna scan along the sibling nodes, going from page two to page three, I gotta go back to the buffer pool. So the idea of pointer swizzling is: if you pin a page in the buffer pool, saying this page cannot be evicted, then in any page that points to that pinned page, you replace its page-ID reference with the actual pointer in memory. And now when I'm traversing my tree, I'm not going to the buffer pool to do a lookup for this page; I have a pointer to exactly the thing I'm looking for.
So for something like the root node, everyone's always going to it in your B plus tree, and then they're always gonna go down to the next level. Instead of doing two page lookups in the buffer pool to get down to the next level, I have the pointer and know where to go directly. Obviously, you don't wanna store this raw pointer on disk: if the page gets flushed and you load it back in, you'd have a pointer to nowhere, and that would be bad. So there's bookkeeping you have to do: when a page goes to disk, you undo whatever got swizzled and make sure nothing still points at it. Also, you don't want a swizzled page to get evicted, with the swizzled pointer now pointing at some other page that got swapped into that frame and isn't part of the B plus tree. Then you'd get a segfault, because you'd start interpreting bytes in a way you shouldn't. The reason I talk about this for the B plus tree and not for the hash table or the regular heap stuff is that we already have this hierarchy in our tree structure: we know that if we swizzle anything below a node, we want to make sure that node doesn't get unpinned before its children get unpinned, and that way the pointers are always valid. Think again about when you're building Project 1: all the work you have to do to look up a page, find the frame isn't there, go evict something, update the LRU-K stuff. You skip all of that by just going directly through the pointers. Of course, you lose the metadata about the access patterns for how these pages are being used, but if it's important enough to pin it and swizzle it, it probably should stay in memory anyway.
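Here is a toy model of the swizzling idea, with class and function names of my own invention. A node's child slot holds either a page ID (an integer, which costs a page-table probe per traversal) or, after swizzling, a direct reference to the pinned in-memory node.

```python
class Node:
    """Toy B+ tree node: `children` holds either page IDs (ints, needing a
    buffer-pool lookup) or swizzled direct references to Node objects."""
    def __init__(self, children=()):
        self.children = list(children)

class BufferPool:
    def __init__(self):
        self.page_table = {}   # page ID -> pinned Node
        self.lookups = 0       # counts the probes swizzling avoids
    def get(self, page_id):
        self.lookups += 1
        return self.page_table[page_id]

def resolve_child(pool, node, i):
    child = node.children[i]
    return pool.get(child) if isinstance(child, int) else child

def swizzle(pool, node):
    """Replace page-ID references with direct pointers to pinned pages.
    (A real system must un-swizzle before evicting or flushing the target.)"""
    node.children = [pool.page_table[c] if isinstance(c, int) else c
                     for c in node.children]
```

Before swizzling, each traversal through the root pays a page-table lookup; afterwards, the same traversal touches the page table zero times.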
All right, to do inserts quickly, the most common trick is to just pre-sort everything, which we'll cover, I think, next week. You sort your keys and lay them out as leaf nodes with the sibling pointers, and then build the data structure from the bottom to the top. This is different from inserting the keys one after another, where I'd start from the top, go down, and have to do all the splits and so forth. I skip all of that by pre-sorting and then building the scaffolding on top. This technique is very common. All right, so the last optimization I wanna talk about: we make a big deal about how the nice thing about the B plus tree is that it's balanced, everything's always log n, and our lookups can be really fast, because it's log n to get to the leaf node and then we try to get as much sequential I/O as we can. But the challenge is that updates are gonna be expensive, because we have to maintain that balance property. Any time a thread comes along and inserts or deletes, it may draw the short straw and be responsible for reorganizing the entire data structure. So ideally what we want is a way to delay the updates to the data structure, accumulate them, and then at some later point, when we have enough, say: okay, let me go ahead and apply all my changes in a batch. Yes, I may have to reorganize things, but I'm doing it all at once and can amortize the cost of making those changes. And now your writes could potentially be faster, because you don't have to worry about splitting every time you insert something new. So there's a line of work on what are called write-optimized B trees, or B plus trees. These are sometimes also called B epsilon trees; you'll see it written with a little epsilon symbol.
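The bottom-up bulk-load trick can be sketched like this. It is a toy model with names of my own choosing: sort the keys, chunk them into leaves, then build each inner level from the smallest key of every child until a single root level remains, so no top-down splits ever happen.

```python
def min_key(node):
    """Smallest key under a node (inner entries are (key, child) pairs)."""
    entry = node[0]
    return entry if not isinstance(entry, tuple) else entry[0]

def bulk_load(keys, fanout=3):
    """Bottom-up bulk load: pre-sort keys, lay them out as leaf nodes, then
    build the scaffolding of inner levels on top until one root remains."""
    ks = sorted(keys)
    level = [ks[i:i + fanout] for i in range(0, len(ks), fanout)]
    tree = [level]                     # tree[0] is the sorted leaf level
    while len(level) > 1:
        # Each inner entry: (smallest key in child, child position below).
        seps = [(min_key(child), i) for i, child in enumerate(level)]
        level = [seps[j:j + fanout] for j in range(0, len(seps), fanout)]
        tree.append(level)
    return tree                        # tree[-1] is the root level
```

Loading seven keys with fanout three yields three sorted leaves and one root node, with every key in sibling order, versus seven top-down traversals with possible splits if they were inserted one at a time.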
There's a commercial variant from the Tokutek guys called fractal trees, but that's basically branding; it's the same idea. And the idea is that now every root node and inner node has a mod log, just like MySQL had for their pages when doing compression. Any time a new update comes along, instead of propagating the change all the way down to the leaf nodes, I violate the B plus tree property that the leaf nodes are where the actual values live, and I insert my entry into the mod log. So if I wanna insert seven, instead of traversing down to figure out where seven should go, I just put it in the root. Same thing if I wanna delete ten: instead of going down, deleting it, and potentially doing a merge, I just put the delete in the mod log. Now if a query comes along and wants to find ten, as I traverse down, I look in the mod log and ask: is the thing I'm looking for here? In this case, I deleted ten, it's in my mod log, so when I do my lookup, I find the entry there and I'm done; I don't need to go down to the bottom to actually see the change. Of course, what's the obvious problem with this? The buffers get full, right? When that happens, you cascade the changes down, but the idea is you're doing this incrementally and in batches; you don't have to apply any modifications to the structure until things reach the leaf nodes. So if I insert forty, I push my previous changes, the insert of seven and the delete of ten, down a level, and put insert forty in the root's mod log. If I keep going, these things cascade down, and when a buffer at the bottom gets full, then I go ahead and apply my changes. Yes? Could this potentially make reads really slow, because the logs can build up?
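The buffered-update idea can be sketched with a two-level toy tree. All names here (`BeNode`, `mod_log`) are mine, and the cascade/flush step is omitted: the point is just that a lookup consults the mod log top-down, so a buffered insert or delete is visible before it ever reaches the leaves.

```python
class BeNode:
    """B-epsilon tree sketch: inner nodes carry a mod log of buffered
    (op, key) records; leaves (children is None) hold actual keys."""
    def __init__(self, keys=None, children=None):
        self.mod_log = []          # appended in arrival order
        self.keys = keys or []     # separators (inner) or payload (leaf)
        self.children = children

def insert(root, key):             # O(1): just buffer the change at the root
    root.mod_log.append(('ins', key))

def delete(root, key):             # no traversal, no merge: buffer it
    root.mod_log.append(('del', key))

def lookup(node, key):
    # The most recent buffered op for this key wins over anything deeper.
    for op, k in reversed(node.mod_log):
        if k == key:
            return op == 'ins'
    if node.children is None:
        return key in node.keys
    return lookup(node.children[0 if key < node.keys[0] else 1], key)
```

A delete of ten followed by a lookup of ten is answered from the root's mod log without visiting the leaf, exactly the scenario described above; a real implementation would flush the log down a level once it fills.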
Yeah, so the question is, could this potentially make reads really slow, because as I'm traversing down, I potentially have to do sequential scans within the mod log to see whether the thing I'm looking for is in there. Yes, but these different variations of B-epsilon trees will have bloom filters in front of these things to say, is the key I'm looking for actually even in my mod log? If yes, then go look for it, right? And bloom filters are cheap to maintain, and they're not very big. So this is an old idea — old as in 2003 old. The B+tree is from 1972, so maybe that's not that old. And what does this mod log also look like? Log-structured storage, which we talked about before, right? Same idea: you append log entries, batch them up, and then apply them at some later point. So like I said, you guys will see this idea over and over again. So Tokutek rebranded their implementation of a B-epsilon tree as fractal trees, and they had a storage engine for MySQL that I think got bought by Percona and, I think, got deprecated last year. So they had probably the most robust implementation from a few years ago, and that's dead. SplinterDB is an embedded key-value store from VMware, written actually by somebody who was here — he was a researcher at VMware while he was getting his MBA at Tepper here at CMU, and the other guy working on it is now a professor at Cornell — and this is basically a super-optimized version of this idea. And then RelationalAI is a relational knowledge-base, sort of graph, veneer on top of a database system that implements B-epsilon trees to do fast updates. So this is not that common, but it's something I suspect we'll see more and more of in the future. I mean, RocksDB doesn't need this, because RocksDB already is a log-structured merge tree, and that essentially gives you the same idea, the same properties. All right, so we have six minutes left.
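A toy bloom filter like the one just described — sitting in front of the mod log so that a definite "no" lets a reader skip scanning the buffer entirely — might look like this. The hash construction here is my own illustrative choice, not what any of the systems mentioned actually use.

```python
import hashlib


class Bloom:
    """Tiny bloom filter: k hashed bit positions per key.
    'might_contain' can return false positives but never false
    negatives, so 'False' means the mod-log scan can be skipped."""

    def __init__(self, m=1024, k=3):
        self.bits = bytearray(m)  # m one-bit slots (stored as bytes here)
        self.m, self.k = m, k

    def _positions(self, key):
        # Derive k positions by salting the key and hashing with SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))
```

The filter only needs `m` bits of space however many entries sit in the buffer, which is why it's "cheap to maintain and not very big," as said above.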
So let's pop open Postgres and do a quick demo. Here I'm gonna demonstrate — let me turn the lights off — the difference between a hash index and a B+tree on data, and then we can see what the database system is gonna choose to run queries. So the data set I'm gonna use is, I think, 27 million email addresses from a few years ago. And I'm not gonna run this in real time, but basically I've created two indexes. I created one here as a B+tree on the emails, and the way it works in Postgres is you say CREATE INDEX with the name of the index, ON this table, and then USING, where you specify what data structure you want. By default, if you don't have the USING clause in Postgres, you get a B+tree, but here I'm explicitly saying I want a B+tree — and the index already exists, so I don't need to run that. And then I built the same thing on the same column, but using a hash table; I just say USING hash, and again, I already have that. So let me turn off a bunch of other optimizations in Postgres that we don't need to worry about just yet, so I can do queries like this: select min(email) from emails, and I get some random thing like this. But now if I put the EXPLAIN keyword in front of it, Postgres will tell me what the query plan is, right? So here it's gonna tell me it's gonna do an index-only scan using the B+tree, and it tells me what the condition is. Now, we didn't talk about the index-only scan — sometimes called covering scans or covering indexes — but basically Postgres recognizes that all the data, all the columns I need to answer this query, can be found in the index.
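The two index-creation statements from the demo look roughly like this; the table and index names are my guesses at what was on screen, not verbatim.

```sql
-- B+tree index: "btree" is Postgres's default access method,
-- so USING btree is redundant here but makes the choice explicit.
CREATE INDEX idx_emails_btree ON emails USING btree (email);

-- Same column, but backed by a hash table instead.
CREATE INDEX idx_emails_hash ON emails USING hash (email);
```

Because btree is the default, `CREATE INDEX idx_emails_btree ON emails (email)` would build the same structure.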
So even though it's storing the record IDs in the leaf nodes, I don't need to actually follow those record IDs to get the data for the actual tuples; everything I need for this query can be answered from the index, right? Because going back to my query, it was just select min(email) from the emails table, and the index is on email, so once I go all the way to the left side of the tree, that has all the data I'm looking for. So that's why it told me it can do an index-only scan. So that's fine. So let's see now, if we do something like this — we say, give me all the emails where the first letter is 'a', but with a LIMIT 1 — you get somebody like this. And I can run EXPLAIN to see what it actually decides to do. So in this case, even though I have that B+tree index that is sorted on emails, Postgres decides to do a sequential scan, because it recognized that the thing I'm looking for — all emails that start with 'a' — begins at some middle point in the tree, and since the predicate is unbounded — I'm not specifying the end of the scan range — it says, okay, you'd have to scan to the very end. Now, it's not smart enough to recognize I have a LIMIT 1 there, right? What it really should have done is just probe down the index, find the first match, pop out, and be done, but in this case Postgres wasn't smart enough to figure that out. And it can't pick the hash index, because again, I'm doing a wildcard search; I don't have the actual full key, right? And you can see this in other ways.
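The two queries from this part of the demo would look roughly like this (table and column names assumed; the comments record what Postgres chose in the demo):

```sql
-- Everything needed is in the B+tree itself, so Postgres
-- reported an index-only scan for this one.
EXPLAIN SELECT min(email) FROM emails;

-- Prefix match with LIMIT 1: the demo showed Postgres still
-- picking a sequential scan, since the range is unbounded and
-- the planner doesn't exploit the LIMIT here.
EXPLAIN SELECT email FROM emails WHERE email LIKE 'a%' LIMIT 1;
```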
So in this case, I wanna find all emails greater than 'andy', and again it decides to do a sequential scan. But now if I do something like this — find all emails where the string starts with 'zzz' — now I'm on the right side of the tree, and Postgres recognizes, based on the distribution, okay, you're far enough along the tree that the index scan is gonna touch less data than a complete sequential scan, so it's okay to do the index scan, right? And at no point did Postgres decide to use the hash index, because again, I'm doing less-than, greater-than, or wildcard matches. But if we do something like this, where we find emails with exact equality matches using the IN clause, now you see Postgres decides to use the hash index, right? The bitmap index scan — I should explain that — it's not a real bitmap index. You can see it better if, instead of using IN, you use a bunch of ORs, right? And now what you see is that Postgres has multiple entries in the plan where, for each of the email addresses I had in my WHERE clause — something equals something OR something equals something — each of those is gonna be a separate probe into the hash index. And then they maintain a bitmap of, I think, the actual tuple IDs — the tuple offsets, not the pages — a bitmap of the matching values for each of those index probes, and then they OR them together, and that produces the exact tuples they actually need. And this is similar to what I was saying before, where you figure out what pages you need from the index first, then go actually get them. So they're doing that here: they do a bunch of probes into the index without actually fetching the real data from the tables, OR the bitmaps together, and then you have the list of the records you actually need to go get, right?
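The equality-probe queries from the demo, roughly (names and literal values are mine, purely illustrative):

```sql
-- Equality matches via IN: Postgres can use the hash index,
-- one probe per value.
EXPLAIN SELECT email FROM emails
 WHERE email IN ('a@example.com', 'b@example.com');

-- The equivalent form with explicit ORs makes the plan clearer:
-- each predicate becomes its own bitmap index probe, and the
-- resulting bitmaps are combined under a BitmapOr node.
EXPLAIN SELECT email FROM emails
 WHERE email = 'a@example.com' OR email = 'b@example.com';
```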
Okay, so the next thing I wanna do is talk about clustering tables. Postgres supports the CLUSTER command, and it will sort your table, but it doesn't actually maintain the sorting, because the table isn't index-organized. So with the CLUSTER command, Postgres will sort your data, but as you start modifying the table, it can get out of sorted order. I'm not gonna do this live, because it would take too long and I realize we're a little over time, but the command would be basically like this: you say CLUSTER, the table you wanna cluster, and then the index you wanna have it clustered on. This takes about a minute to run, so I already ran it before class; I'm not gonna do that, but we can go look at what's in the first page. Again, the ctid is the record ID — page and offset — in Postgres. So you do that and — well, actually that's the unsorted one, so that looks all random — but if I change the name of the table to the clustered one, there we go. Now you see that in the first page, at the first offsets, the tuples are actually sorted in order, right? Based on the index. So if I go ahead and delete one of these entries — I delete the very first person and their fake email address — and I go back and do the same scan on the table, Postgres didn't fill in that first slot, right? It's empty. But now if I insert this fake person back in and do the same scan, they're still not in that first page, right? So to go find them, we do select ctid from the clustered emails table where email equals this. Now they're in page 299 — I don't know where that is, some random place — but again, it's not in sorted order, because Postgres can't maintain the sorted order; it doesn't have true clustered indexes. Okay, so: B+trees are important. Probably the best choice for your database system. Tries are pretty good.
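The clustering steps above, roughly as they'd be typed (table and index names are assumed stand-ins for the ones in the demo):

```sql
-- One-shot physical sort of the heap according to the index.
-- Postgres does NOT keep it sorted as the table is modified.
CLUSTER emails_clustered USING idx_emails_btree;

-- Peek at physical placement: ctid is (page, offset),
-- so on a freshly clustered table the first page is in key order.
SELECT ctid, email FROM emails_clustered LIMIT 10;
```

Running the same `SELECT ctid …` probe after a delete-and-reinsert is what shows the row landing on some later page rather than back in its sorted position.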
You can have B+trees over tries, as I said before, but with B+trees there's a bunch of ways of optimizing to make them faster. So next class, we'll talk about how you actually make your B+tree thread-safe while threads are traversing down, even when you're doing splits and merges, okay? Hit it.