 we're already getting emails and fan mail. We actually got something for you this time. Really? Yes. So this is I love your class, I love DJ Mushu. I think you know this person. Can I just tell him that I forgive him for everything he did. I've changed my ways to I don't care if he's not letting me settle down. Can you tell Timothy to? It's just DM me on Twitter like it says little pill I don't know who that is that's that's his This is like you know pin name Okay, all right, um Okay, so for you guys in the class Hover to is gonna be do this coming up on the 25th Project one again still do it on October 2nd tonight. We'll have the Q&A session I posted on piazza the link is on zoom again the idea is that will present sort of rough high-level outline of what the project is about Maybe some gotchas and things to watch out for and then there'll be a little chance for you ask, you know some questions at the end, okay? And so it's also just one you also say I'll remind this one project to comes out project to project one It will be much easier than project to Project to be very hard. We're not hard, but it'll be harder Okay All right, and then we'll have the again the office hours will be on Saturday October 1st the the day before it's due, okay? all right, so Last class we talked about hash tables and we talked about how they Were a data structure that's providing sort of an average on average or one access to two items But we saw talked about how we have to deal with collisions how we actually want to find the data that we're looking for When you know Because we can't have an infinitely sized hash table We talked about how they can be used for internal metadata like the page table and your buffer pool I can use them as like the core storage of the database system temporary data structures Well, we'll see these when we do hash joins and we talked a little bit about table indexes so today We're focused most our time on these B plus tree data structures and they'll be most often going to be used for for table indexes So it's remind everyone what table index essentially is it's it's a third thing as a Is a replica or a copy of some portion of a tables attributes that are stored in a Specialized data structure that can be sorted or unsorted or organized in a certain way that allows going to find to do efficient look up sufficient access to find particular tuples that That have certain attributes with a certain values right and The it's up to the data systems job to make sure that the contents of the table And the contents of the index are synchronized And I'll do this for you automatically like if you're building index if I update a table It'll automatically propagate those changes to the index But you make sure that this is the case because we don't want to have any false negatives or false positives If we use an index to find data, we don't want to delete something and then go look at the index and tell us that item's still there right So the in for all to be a transactional workloads Best case scenario is 99 or 100% of your queries will be using an index in some way if you ever have to fall back to do a Scratchel scan that's gonna be super slow So if you want really fast response times you want to build indexes For analytical workloads you use indexes primarily for speeding up joins and you'd use these for the You know smaller tables not not not the your most largest tables So we're not going to describe how the data is it's going to figure out what index to build or use for a query Right, that's a whole another problem. What we talk about query optimization, right for now We can assume that something in the system magically pick pick the index that we want to use for our query Now what we're also sorry in our data system also too In most daily systems, they're not going to pick the indexes for you Use the administrator use the user has to Figure out what indexes you actually want to have Now there is sort of this line of work of autonomous databases the stuff that I work on There are tools that use machine learning and other techniques to find what indexes you should have based on your workload All right This is one of the oldest problems in databases. It goes back to the 1970s My advisor advisor wrote one of the first papers on this in 1976. He's dead Actually, his daughter is actually an ISR Jessica hammer but anyway, so like this is a really really old problem to and So progress isn't is it automatically to figure out what indexes you need You have to tell it what the next is you want the enterprise systems and other tools can do this for you And there's gonna be this classic trade-off in computer science again We see this this recurring theme over and over again of storage versus computer overhead, right? So I can have a lot of indexes and I can make almost you know Every query use you know re really fast because I could I could do lookups real quickly But then I have to maintain them There I said that the the data system is going to make sure that the index is synchronized with the actual table So every time I up if I have a thousand indexes on a table every time I update that table I have to update a thousand indexes so come up the right trade-off for for the speeding up queries versus the maintenance problem again This is this is sort of outside the script but then what we're talking about today We were talking about like if you call create index, what do you actually get and nine times out of ten It's going to be a a b-plus tree All right, so first I'm going to have an overview on what a b-plus tree actually is And then we'll talk about how we can use that in any systems the different design choices we have We're actually building it and if we have time we'll finish up with some Some more common optimizations you can you can do a girl being on what the I think what the textbook discusses and what the you know the the literature back in the 70s discusses certainly So the first thing we have to address is what the hella actually is a b-plus tree So sort of the confusing thing is that there is a specific category of data structures called b-plus trees But then there's actually one data structure called Sorry, there's a category of data structures called b-trees And then there's one data structure that is specifically a b-tree But the b b-plus tree is in this family of a b-tree data structures, right And a lot of times in database systems when you look at like documentation and the literature of what a data system says They actually support they'll say they're going to be using a b-tree But they're just looking at it. They're almost always using a b-plus tree If you go look at the progress code, they talk about using a b-tree But it really is actually a b-plus tree when you look when you look at the documentation right So the original b-tree goes back to 1971 the the b-plus tree that we'll talk about here There isn't an exact date. It's a literature that this is from 1973 So there's this paper that everyone else cites everyone cites You want to like to consider what the original b-plus tree paper is this one? You ubiquitous b-trees again, they're using b-tree, but they're actually describing b-plus trees And this is from IBM people and they mentioned in this that there's a some IBM tech report from like 73 that describes the original b-plus tree But I'm not able to find that And then there's these other ones like b-link tree the b star tree There's the b epsilon tree so much of these other b trees And the modern incarnation of a b-plus tree is actually going to borrow bits and pieces from from all of them Right, so the b-link tree was actually invented here at CMU by phil leman who's then he's still in the He's he works in the dean's office in the cs department or scs So his tree the b-link tree actually has sibling pointers between the the leaf nodes The original b-plus tree doesn't have that but most b-plus trees are going to have that now Right, so again, you'll see this as we go along I'll try to say here's the bits and pieces that are borrowed from different data structures, but it At a high level b-plus tree will be what I'm describing here, but it's not the original definition back in the 1970s right so the other thing too is like the the original guys at IBM that invented uh Invented the b-tree the b-plus tree never actually defined what b is actually take that back The b-plus tree was not invented at IBM. It was other dudes um They never explained what actually be with the b meant right they actually built the first version act bowing in their research labs Uh, so some say army means it means bowing It could mean balance broad bushy Bayer the guy's name right like it's never been defined, but I usually think of it as a balanced tree All right, so again, we're going to be focusing today on a b-plus tree But with sort of accoutrements from from the other ones So a b-plus tree is going to be a self balancing tree data structure That's going to keep the the the keys that it's storing in sorted order And we're not going to describe what what that order is, but that's called a correlation Right, so sometimes characters in different languages sort of different order for our purposes here We're going to assume english alphabetical ordering And so it's going to allow us to do efficient lookups deletes and assertions in always log n Right meaning we can always find to verse the tree and find whatever we're looking for and do whatever is that we need to do in log n Um, so it's a generalization of a binary search tree In that we're going to allow us to have more than two children per node Right in the original binary search tree like it's always what's binary it's two right and the data structure is going to be designed to Optimize the amount of sequential access we can do on large blocks Remember this is invented back in the 1970s memory was super constrained disks were super slow And there was a huge difference between random access and sequential access So they were trying to design a data structure that Maximize the amount of sequential access you could do when you when you did lookups right And but it turns out even though it's in modern in the modern era where SSDs are getting You're much faster random access is is not as that much slower than Than sequential access it's still the right data structure to use In most cases So the the way for more formula to say is that b-plus tree is going to be an m-way search tree with the following properties So first is that it's going to be perfectly balanced meaning the Every leaf node in our data structure will be always the same distance the same number of hops from the root That's why we can guarantee log n lookups for everything um And the way it's going to work is we're going to say that the way this it's going to say that every uh Every node out of the root has to be at least half full uh meaning the the You take the number of keys you have if you divide it by and either take the ceiling of the floor The exact the exact version doesn't matter, but you're guaranteed that you least have half you You have at least half a number of keys filled in your node at every level The root you could be a bit loosey-goosey with this And again depending on the implementation some might might choose different things in the end It doesn't actually matter as long as we're guaranteeing that it's it's log n all the way down Right All right, so let's look at a really simple example here So this would be a two-level b-plus tree And we have at the top we have or the leaf nodes we have the inner node in this case here It's only two levels. So this is also the root node And you're going to have these simply pointers of the leaf nodes that Once we get down to the bottom we can actually go back and forth Along scan along the leaf nodes without potentially having to go back up Right because this will matter when in next class when we talk about how to do multi-threading operations on this We want to avoid threads going in Opposite directions because then we have we can have deadlocks Now if you want to go along leaf nodes that you can have deadlocks We'll see how do I handle that next class? And so the composition of the of each node is these sort of key value pairs Where you would have a within a node you would have this array of key values and then The between each key value pair you would have this pointer to another node That would say which direction you would go down below Right so along at this top node here. We have five and nine So before five we have a pointer to the node the leaf node down here uh, and so the Since the key is five The pointer would point us down to a node where we're guaranteed that it would only contain keys With that are whose value is or keys who the key is less than five I don't say value because the value could be you know a pointer to it to a to To point it to the tuple we're going to skip that for now Right and then after five you have another pointer. It says less than nine So that's guaranteeing that any keys found down here and to the right of it Uh, so to the left of it will be also less than less than nine And then at the last one uh, you say it's anything greater than equal to nine Right pretty straightforward Yes, sorry. Sorry. Is this what sorry? Oh your question is like I'm showing this as an array of size three. So you would say it's a Officially we say this is a a b plus tree with a degree of four Meaning he has three keys and four pointers or outer edges coming from it. This is a really simplistic example to make it fit on Powerpoint. Yeah, in actuality the size we'll get to this It could be a 10 megabyte node could be You know kilobytes and the number of keys will depend on the size of what the keys are actually storing yes Her statement is her question is everything less at this point here everything, uh For this point over here going down is guaranteed to be the less than nine. Yes So the again the composition at the leaf nodes now since you don't have to have pointers to other nodes You actually have now a pointer to the value that corresponds to this key Right And if you're using this the table index, it's going to be the record ID like the page number The page number at an offset or slot number Yes The question is if I have a value for five where we're to go in this case here Uh, it would go here Right because this is everything less than five if I actually had the key five here we go here Yeah, so so she's picking up a good point I have I have five here Uh, but it's actually not a key in the data structure I think it's like a guide guidepost up above That says if you're looking for this key or looking for keys in this value go left or right of me But the b plus tree does not require you to have the values Uh, like any value that's in an inner node above the leaf nodes does not have to exist in the leaf nodes so in this example here Like again, I it's a pretty set up for you for demonstration purposes But in order to have five there, I would have to insert it five And then let some later point I deleted it So I'd have to delete it from the leaf nodes to say it no longer exists, but it could still exist up in the inner nodes Yes His question is in my example here if I insert four wherever to go It would have to go here because it's less than five Yeah, but I just I meant to say here like for this edge right here. This is less than nine Right, uh, but like the way you would do the do the comparison You would go from left to right. So you would look say is four less than five. Yes, go down this edge Yes, it's three four because there's four there could be at four at most four outer edges Coming out of each node. So in my example here, I have three right So these little like thinner bars here. These are the pointers going out. So you have one two three four Yeah, so you want to have to change the the root node you have to change what what guidepost keys it has in it But yeah, it could potentially look like that. Yes This question is does every leaf node have to be the same size or could be very big size So the same size in terms of like the number of keys that could store uh Does it I don't know I don't know if it has to be but you you want it to be because it makes your life way easier All right, because then you don't have to deal with like I deallocated page. I got I have fragment fragmentation of my desk or memory All right, so every node itself is going to be this probably these key value pairs And as I said, the keys are derived from whatever the attribute I want to build the index on so if I call create index on on column abc The the keys that are being stored are the values of abc for every every tuple right And then the values could differ based on whether it's classified an inner node or leaf node as I said to her question Like I could delete a key And so that I have to remove it from the leaf node because I if it's in the leaf node that it has to exist in my set That i'm trying to try to map But I couldn't maintain it in the upper nodes All right There's sort of could be a remnant of a key that used to exist in the upper parts because it's For the time being it allows to figure out where we want to go left or right And typically the arrays of the key value pairs we sort of in key order doesn't have to be But you can do this It simplifies things when you look up because you can do binary search to find what you're looking for very quickly We'll talk about index and currency control next class But think of I basically have to take a latch on the entire page of a leaf node Or at any node while I'm traversing it. So I want to be in and out as quickly as possible Yes It's David is So we generally think about this as like every node is a size of a page. Yes And that's why you you can have in the the enterprise databases You can specify that you want larger page sizes For index the index buffer pool because sometimes you want a larger page size of indexes All right, again, here's the here's the inner contents of a of a leaf node So you have again your your your key value pairs of Of key values And at the at the end here you have a pointer to the to the previous node and a pointer to the next node Um And then you get again just key value pairs that are alternating. So this is one way to do this. Um, You can also organize it as separate Separate arrays Uh, typically it's done this something like again, there's trade-offs for dean doing this On one hand, it's an extra hop to go go from here to here to find the key that you're trying to looking for Otherwise or is it this is an extra hop? But if you put them in line or one after another alternating then like you kind of screw up your cash lines We're gonna know all that for now But you also store some additional metadata like what level you are in the tree Uh, how many free slots you have and then the next and left border different data systems You know organize pages in different ways The trade-offs for all these things So the values themselves, uh Could either be the record eddies as I said page number and offset to go Allow find the find the two but you're looking for Um, but it could actually actually be the the tuples themselves So in some database systems the leaf nodes will actually be Like the slotted pages that we talked about before we're actually going to store tuples So when I traverse the index When I get to a leaf node when I try to find the two I'm looking for I don't have to do another look up to the page table and say give me that page number and offset Or give me that page number the data you want is already there All right, so I I I call these index organized tables Um For number two you'd use this for the primary key indexes because otherwise you'd be Duplicating data and it's unnecessary So different database systems do different things So the first approach where the the value is actually the record ID That's the process does and and and you know sequel server dv2 and oracle But having the the data actually the leaf nodes of the index That's what my sequel does by default. They're probably most famous one that does this Sequel light does it as well In sequel server and oracle you can actually specify for each table. What do you want to be index organized or not? The default is it's not But they do allow support for this Right, so again in my sequel when I do a look up on the primary key I'm going to traverse a b-pustry like we're talking about here But when I land on that leaf node, there's the tuple ID immediately Right, and I don't have to do an extra look up And there's pros and cons for for for these for these different approaches All right, so one common be um for any secondary index. That's not the primary key in my sequel The value is actually not going to be a record ID. The value is actually going to be the entire primary key So I so I do look up at my sequel on the secondary index I get to the leaf node It gives me the primary key then I do another probe in the primary key index to get the tuple I want Where in postgres if I do a secondary index look up I land at the bottom I get the record ID And then I can just go jump immediately get the data that I want Yes You can only have one index that uses it typically the primary key Yes, so that's my example if I in in my sequel if I have secondary indexes the value will actually be the primary key value So I do a secondary index look up I get the primary key then do a second probe in the primary key index to get the tuple I want Yes His statement is statement is what what is the downside of the first? What is what is the possible downside of the first approach versus the second one? Yeah, so I mean, so this is this is the problem with databases the answer is it depends right for everything If I only do primary key look ups the my sequel way is better Because I don't have to do the extra look up Right if I do a lot of secondary key and look ups then the postgres one is back to better There's actually a blog article I can post it on piata We'll make cover this later in the semester From from uber engineering where they They were using postgres that they hired somebody who really like my sequel So they switched their database over to my sequel and then because of this particular problem here They had to go switch back to postgres Right, so they paid millions like Instead of paying millions of dollars for engineers to switch it they couldn't yes me and I said use postgres I don't know enough like Like this is a good example I'm not trying to pick on them. They're having really smart people, of course Like it's a good example where you don't understand the return of the database systems You may make design decisions for your particular application or workload like That are just completely wrong and things get worse than you think it is There might have been other reasons why they wanted to switch to my sequel, right But for their particular workload as they describe it in that blog article postgres is the way to go All right, so I talked a little bit at the beginning. What's the difference between a b-tree and a b-plus tree So in the original b-tree from 1972 they restored the key and the values In any node in the tree both inner and the leaf nodes, right? So this is more space efficient because every key will appear once and only once in your data structure the downside is that If you have to do a sequential scan now Or a range scan in sort of order based on the key now You're basically jumping up and down the tree and you're jumping across different pages over and over again Where is in a b-plus tree since the the values are always in the leaf nodes Once we traverse and get to the bottom We can then potentially scan just scan along the leaf nodes using the sibling pointers to find all the data that we're looking for There may be some scenarios we'll see in a second where you may actually want to sort of do a Almost like a breadth first search you sort of get to one level go down and come back up But in most cases you can get to the bottom and just scan across Right Yes I think so statement is why would you have to bounce around in a b-tree? If you just if you're looking if you're doing a point query meaning finding one thing then yes, you go find one thing If you're doing a range scan then you have to jump jump around All right, so let's see actually how we build one So to do the insert we're going to find the correct leaf leaf node that we want Again doing that traversal looking at the the guide posts to tell us whether we want to go left or right Excuse me Then we get to the leaf node We want to start the the new key entry the new value in in sorted order So if our leaf node has enough space then we just insert it we're done If it doesn't have any more space then we have to split it And create a new node Right And there's sort of two approaches for this one we can try to to copy You know move some some Half of the nodes over Uh, maybe reorganize a little bit But when we do this because we now another leaf node We have to update our parent leaf or parent node to now include a new guide post Key so that with the pointer so that other people can find now where we're going right And then when we want to split an inner node you basically do the same thing But you always push it up pushing up the middle key and splitting things in half so the Drawing this in in power point would be a huge pain So i'm actually going to use this this this demonstration from This other professor, but this is actually a fork of it That has has changed some things so let's start really really simple here So we're going to start one goes like that Start three goes like that I'll say also we're max degree three So that means we have three you have three edges coming out of it every node But we're stores two two keys per node Right, so now i'll start two What's going to happen here? It's going to sit I only have one level so it says I have this The first node is where I need to go, but because I've already had two keys in there. I can't insert it So we have to split it right And so it picked the middle key two And put all values less than two to the left of it, which is just one All values greater than equal to two to the right of it, which is two and three Right So far so good All right, so let's keep going let's put in a six Right that would go here that's too big we have to split that right so it bubbled up three So now again the the first key of the root is two so all values Less than two go to the right of it so the left of it then the next key of the root is three So all values less than three go to the right of the left of that Which is just two and then the last edge would be all values greater than equal greater than equal to three So three and six So let's keep going. I won't won't start on the uh on the right side of the three So we'll start nine the way to split that Right That was a little I go back that was a little fast What's that? Is that here All right, so we're gonna start Yeah, I turned out the animation. Yeah, we're gonna go slow. So we're gonna start nine and so Again, we're gonna check the root to say all right nine is greater than greater than equal to three So we've got to go down here. This is our leaf node this we know this is where we want to insert But we already have two keys in there so we can't insert so now we got to split it But when we split it we're going to move up six because that's the middle key But when it goes up here and our and our node above it that's going to be full So now we got to split that And then that's going to create a new level for us Uh because we have to create a new new uh new parent above above above the uh original root So we insert nine Goes down here can't do that that pops up can't go there that pops up and three goes the root Yes This question is can the root be one or two out more elements? Uh, or can it be more It would be to his question before it'd be the same size as all the ones So in this case here a degree of three means I have two keys per node. So I have I have two keys It depends on the size of like it turns to the degree that you set up for your notes Yes The question is uh, the degree is three meaning uh, when the size is three it splits the degree mean It's a like from graph three degree means like the number edge is coming out of it So I get most at most three edges coming out of it So but so I try to start another key I need four edges coming out of it and that that violates my degree of strength Yes, so when I say when a degree has when a when a When a node has to be at least half full Do I mean it in terms of degrees or number of keys the degrees is number of keys plus Is number of keys plus one per node it says it's basically the same thing So like again starting with degree three meaning two keys per node like If you have one key it's it's half full and that's enough, right? So we can switch this to be degree of four you have three keys We can see what that looks like when we do splits later. Yes Yeah, so it's questions if your degree is even like degree four you could have arbitrary middle keys What would you choose again? This is where like the the exact one you would pick doesn't matter But I say as long as you're consistent in your implementation, which most systems would be It should be yes Sure, she says she wants to see a larger key. All right, so let's do We do one Oh with a slash all right clear that sorry one Two let me jack up Four right degree of four means like I have three keys now. I'm going to assert six It's it's not going to have any space or it's going to have to split it and it's going to make make a new level down below right Keep going eight Let's pick three right so now assert three three is less than four should should should go to the one on the left Right, so zero it's certain zero it's going to go zero is less than four So it's going to go to the left once I insert it where the Which says no zero it says one two three it's not going to have any room there So it's going to have to split it And now it's going to pick the middle key and then move that up to the root Keep going are we good say it again? It's balanced right now Which I mean definition of skew is what that every node is not Every node is what sorry no no no no its question is can you ever have A In balance where the distance of a leaf node to the root is not the same. No by definition. I can't Yeah, so saving is like for other data structures. You have to do rotations get things in balance No, like it's always going to be if you follow the protocol of the algorithm how do you maintain a b-plus tree? It's always balanced They did it's self-balancing Yes There's two his questions your question the same thing this is how do I if I have four elements? How do I say which one is the middle key? It doesn't matter Right pick pick the the first one or the second one as long as you're consistent doesn't matter So his question is do I Is the purpose of the sibling pointers to do redistribution? No, the purpose of the sibling pointers to do scans along leaf nodes Because otherwise I got to go back up Yes Yes, it does it's a perfect question. So his question is does the order of uh insertion Effect the construction of the tree. Yes So I like It maybe the case I insert things in one order versus another I will have it's no guarantee that certain keys will be in the same nodes Tries actually do have that property. So no matter what order you insert the keys in a try They will always be the same like physical layout in the data structure in b-plus trees. It's not the case It's like the hash tables, right? If I insert this key versus the other one if they're oh, they both hashed the same slot Whichever one I insert first we'll land in that slot first Ignoring robin hood and all that other stuff Yes, so you say if I split oh, I'm sorry Yes Oh, yeah, so the statement is if I divide the side of the split on one instead of two before then I would have a Uh In this case to read three No, because you wouldn't do that because then this would have only one key in it and beyond Yeah, you make yes If it is you need to make sure that the property that that the after a split the two two leaf nodes you're creating Are both have at least half half the number of keys Yes Well, the complexity guarantee is always low gap no matter what Right the I would think it not so much to save it as if I if I If I like we'll see a bulk load in a second if I insert my keys first and then insert them in sort of order What what do I get from that? so in There is a notion of like the fill factor of the nodes meaning Uh within a page of a go fetch from this how many keys are I going to have in there? Right and because we're trying to keep things at least half four and like we're we're we're doing this on the fly on on average I think in in real live data sets the occupancy of a vplus tree node is like 69 so that means you have roughly 21 of the data is as empty So there are commands and sequel called optimize like you say you can call optimize on an index Now all this is will support this that basically does compaction that gives you the minimum size bplus tree you would need So you could so with bulk loading you could do that you could you could Not also support this but you could bulk load the if you sort all the data ahead of time Then to build the bplus tree and you actually build it from the bottom to the top right from the top to the bottom Um, if you do that then you can at least have a hundred percent Fill rate occupancy of all your nodes yes His questions. Can we can we can we see duplicate how they handle duplicate keys? Let's get that in a second this one. I just do it nothing nothing nothing fancy what happened right Yeah, it was right I don't know. I don't know how to do the comparison uh In a we'll see how we actually would do this in in in a real news isn't There's no perfect. This is the best people people see visualization. I can find yes If there's a node with just a single number, how do you get an upper node? What do you mean? We'll come back to do look at values in a second. All right, let's keep going all right, so deletes Same idea we we started the route do traversal looking at those guide post markers We get to a leaf node you find the entry you want to delete if it exists If the leaf node is at least half full we delete it and we're done If it's now less than half full Then we have to we have to Rebalance things right And the first approach is you could try to steal uh Steal steal a key from from your neighbors Right, as long as they have the same parent node as you um, and you may have to update the the the The the splitter uh Key up above in your parent if you can't do that then you have to pick one of your siblings Uh to merge it with And then remove remove one of the out out degrees or out edges coming out from it from the node above you right so the The challenge about this is that the merge could actually propagate all all the way back up to the root And you got to collapse the height of the of the of the of the tree It's almost it's almost like rebuilding it from scratch All right, so let's go back here So let's delete let's go back Again that extra two so let's delete three right for the middle Now when I do this, uh The middle node will be unbalanced now Right if we have less than half half a number entries So I think in this visualization they're going to try to steal from their neighbors So they're going to try to maybe steal from the the the rightmost one with four six eight Right, so we go ahead delete two. Ah start from that guy It's not wrong, right just didn't didn't chose chose to go from the left instead of the right All right, I think it doesn't try to Yes, I say yes, like yeah, so if I delete here if I I'm going to delete three This node is now unbalanced because it's going to have less than less than half full So instead of having to uh I mean instead of having to like is this merging or is this is stealing right No, sorry. It's stealing. So it takes from this guy No, I think back it's merging. So it decided it's going to this thing needs this two now needs to go over to here That has has update the the key up above Right, so just do this again Delete three three goes away. It decides to merge those two Right So now let's keep going. So if I played delete four That's fine right Oh, it was very aggressive. I try to try to try to rebalance everything Yeah, because the root you can you can play games What what? All right, so why don't put it so again like this Whatever this thing is actually implementing. It doesn't follow the textbook, right? Because I know because I showed before it had you know It has to have at least two entries But I let the root with one entry they decided let's go packing more up in there Yes So here go back here when I when I merge This one here So if I deleted three But you'd have to update this thing too Yes Yes Again from the implementation standpoint, you would actually already have the latch on your parent anyway So so we'll talk about this next class like you have the latch on your parents So you already have you know you already have exclusive access to it So it doesn't cost you anything to go update it because you've already you're already blocking for anybody from touching it Yes Let's come let's come to that next class. Yeah Yes Yes, David is if I steal from my sibling Uh, could that could my sibling have a comment on balance? Yes, so therefore you don't want to steal from them and you just you just merge Actually in this case here if I if I if I store actually negative one Right that should go over here Now this now if I delete three Yeah, whatever Again, this is the best idea. Okay, so if I delete if I'll eat three now it should steal from the neighbor All right, let's ignore this. This thing's fucked up Go with the textbook I said, I This one's a fork So the the original one is from this guy, San Francisco I don't know who this dude is And I I use this one because he made a drop down I thought he improved he made a drop down instead of having like Whatever checkboxes the original we can take this offline. There's the original guy is the is This dude here, so find his webpage. He has the original one Anyway, all right, let's keep going All right, so doing lookups now So I showed examples where We have these guy posts that tells us where something's less than or greater than equal to and that tells us to go left and right Uh, and that's pretty simple if you have one one key or one attribute in your key But you can you actually do some tricky things having partial lookups when you have multiple keys or multiple columns in your key So simple example, I have an index on columns a b and c at my table So I could do a conjunction where I say a equals one and b equals two and c equals three And I know how to do that traversal again using those kind posts to find the data that I'm looking for But you also can do partial key lookups So I can do a lookup on a and b without c And that still works because we just assume c is a wild card and we'll take anything along as a and b matches But then I also can do Suffix lookups as well Where I could have just b or just c or just b and c without a and I can still do do that lookup Now that that last one's kind of tricky and not everyone supports supports that one But in general, I mean you have to have the I have all the keys do the lookup um You can have either quality predicates or you know range range predicates on partial some of the keys But only not having the the You know the first set of keys in your in your list of attributes Again, not not everyone supports that And so this differs from the hash index stuff that we talked about last class If we sorry hash tables are using as an index whereas I have to take the key and I hash it and that tells me What where where to jump to in in my hash table? And so if I don't have all the attributes that I need to do my lookup on then my hash is essentially meaningless Right So this is one advantage you would have using a b plus tree over over a hash table for indexes Of course b plus trees can handle like non-quality predicates range scans not not equals and you can't do that in a hash index Yes Yeah, we'll come to that right so let's look at some examples So here's the first one we have we have an index on A and B And so we want to do a lookup we'll find find key where we're sorry. We have two columns column one column two I'm going to do a lookup where The first column equals a the second column equals equals b, right? So even though there's not multiple keys We just do comparison in order one after another So the first one we say is a less than equal to a And is b less than equal to c. Yes, so we know we just go down this edge here Find the data that we're looking for Right If we have now a followed by a wild card Right, so in this case here, we know a is less than equal to a so we just traverse down to the leaf node here Then now we scan across the the leaf nodes until we find a The key that violates the predicate is is is less than equal to a Right, so once once we reach the first b Since a is like a a star is less than b star Less than equal to b star, we know we can stop here and we don't need to scan all the way down Yes How do you build up the value that the key is pointing to what do you mean? Uh, I'm not I'm following what you're saying. Sorry Yeah, I'm just thinking like again like Uh, but column zero equals five me five me five me all keep following all keys where column zero or column one equals a And then take anything for column two, right? So as long as I see Column a column one equal a I keep scanning until I see something where column one does not equal a then I stop because again They're in sort of order at the leaf nodes So I know that once I find the first thing where it doesn't equal a then there can't be anything else And another leaf node that could could match my predicate therefore I can stop right Yes, uh Questions why is ac in the second node because because uh ac This ac is less than equal to this ac These are values a and c are values of column. Yeah column one column two. I should be should be more clear All right. So the last one is this one here where I I only have a partial Right, so I'm trying to find where I don't have it. I don't know the value column one, but I want column Uh Column two to equal a Now this one's pretty tricky or pretty stupid because it's it's only a two level tree, right? You basically have to search for everything Right But you actually can't do uh the sequential scan that we saw before you actually need to do multiple traversals Down to the leaf nodes, uh, and then scan just that node to find what you're looking for All right Because you don't know you don't know when to stop in this example here again, it's stupid because it's only a two level tree Uh You just you could just jump to the the leftmost node and scan across to find everything which is the same thing as a sequential scan So let me do a more more sophisticated example here So now I have an index on column one column two column three Uh, and then the values that can have partially have for these columns or are either a b c or d Right, and I'm going to do a lookup now where column equals b or column two equals b So this is a gross, uh Diagram it has small Again, this is why it's hard to show these things in in powerpoint because things get a little packed But the main takeaway I want to show you is that I started here at the top And I know that I want to do a lookup where b were column two equals b So I've just go down to the first the next level And I would do my look up here now i'm trying to find where a a a Uh inclusive up to a b a exclusive And that would tell me that I know that it could never be anything down here because i'm trying to find where Uh Where for b column two equals b therefore I don't need to look down here But for this one here says I do have something with b in the second column. I do need to look at this one So we started again the the level right above the leaf nodes I'm trying to figure out could there ever be anything below b and if not then I can just skip it I just do keep just doing this down all the way down for all these Uh skip this one And just traverse all the way down Right, so the idea is again you're just using these markers up here to decide whether I need to jump down down below Yes, um, I mean the sibling pointers So we're the sibling pointers. Oh, so your question is Uh, maybe go back to here This one didn't have siblings your question is is there an example where I would jump down to the bottom and follow along the leaf nodes here In reverse direction. All right. The question is um I've always showed he's going let's go to left and right. Would we ever go right to left? Yeah, so if I'm trying to find all values, uh Less than some key look up Right, like a between range it may be better to start from here and then go backwards Your question is is that controlled by the data system? Like is the data system going to figure that out for us? Yes, that's like a few more weeks. That's like Sequel query shows up. What how do I translate that into a query plan? I want to execute We're still like sort of like, okay, assuming something else figured out how they want to use us How do you suppose this data structure to them? Yes So questions are the sibling porters between internal nodes? Typically no because you don't you you Although it could help some scenarios. Well, I say It it complicates things because of deadlocks like I could be going this way you could be going this way And so if you just restrict it to be on the leaf nodes You you avoid like you have to handle it But you know there's not some kind of weird thing of people going in different directions up and down This gives can Yes Yeah, yeah, so so yeah, so the question is in this example here Because I I actually want to look at all of these And do the same sort of analysis How do I get back? How do I get from here to here? You would go back up. I should take it back. You would restart right because you never You don't want to go back up unless you already hold the latch for something And so if you like the optimization for scans you try to release the latch as soon as possible cover this next class And so if you don't have the latch for your parent you just start over and start from the root So this question is To do a range scan would I traverse once find the starting point traverse again to find the stopping point? And then just figure out this scan along those. Why would I need to know the stopping point? Right I traverse once and then I have some some predicate that says stop when this thing's no longer true And I scan along till I stop till I stop All right, there's a lot of things that we want to cover and we're short on time We have to overflow into next class. That's okay. All right Uh came up with how do we handle duplicate keys And I mentioned this last class and I misspoke I said you can do this for hash tables And you can't because you need to have the whole key for the hash for the hash Did you use a hash table for it for an index? But a really common trick you would use is the vplus tree since we can do these prefects lookups where we just have a portion of the key um If you if you want to support unique that or sorry non unique keys then all you do is just you just put the Uh, the record ID of the tuple you want to point to as the last key Uh, or last last attribute in the key of your of your index So I have an index on on column one If I want to make sure that's not that it can handle non unique keys It's underneath the covers. It's really column one Followed by the the record ID that can possess those that cti think d thing and that guarantees that No matter if you insert the same key over and over again As long as it's not for the exact same Tuple it's always going to be unique And because I can do record ID lookups or sorry, I can do partial key lookups or prefixes I can just you know, I can just do my lookup on that that first column And I'll get back all the uh, I'll get back all the tuples that that match it Another approach which is not not as common But you could do this is do over overflow leaf nodes. You basically just support duplicate keys But instead of having to change the Um Change the the guideposts and the internals above you you sort of append them as a link list going going down And that sort of violates the the log n stuff that we talked about before Yeah, but But it's one way to do it Again, it's a constant factor. Does it matter? Yes, but like if if you don't have if you're If your key distribution is not super super skewed We're all you know one billion attributes have the exact same, you know key then it might might be okay All right, so let's see how to do that the the first approach with the pending the record ID Right, so now the and our values was always the key by the record ID But now for all of our Anytime we insert something the real key the real key we're going to assert is always going to be the You know the key itself followed by the record ID So now what if I want to start six I already have six I'm not showing that we're adding the the record ID But now the the new six we're inserting is is is treated as unique Because the wreck the the page number the slot number will be the same Uh for the new entry Pretty straightforward. I think this is how postgres does it so all right, so if I have now a uh If I would do overflow nodes, I want to start six again Basically, I find I traverse the tree find the leaf node It should be in and then I just tack on a little pointer says oh by the way Here's another page that has additional additional keys that should map to this location All right, and as I said if I keep inserting keys that could at the same location Uh, it just gets filled filled up like this At some point if I started a new key that maybe it would be within this range Like 6.5. I would recognize that that value is distinct From the duplicates I have in here and therefore now I should split following the protocol from before Yes, the question why don't keep the what sorry? Oh these questions. Why don't I keep the overflow page sorted? Uh, does that help you hurt you? What do you know it would make lookups faster? Oh, it would definitely make it could make uh Making surgeons slower because I may have to like cascade right like if I insert 6.5 this is long Now everybody's gonna get a little bit over by one So yeah, so if you're doing this you you want to keep it unsorted Yes The question this question is if what if I insert six why don't do a split because like The think of these is sort of is like Logically these are sort of one leaf node right, uh And because I'm not depending the record ID here that there is I would have this problem where I could have six up above And it could be actually on the left and the right I mean, I mean the statement is like because I have six seven eight I could could split and put the two sixes on one leaf node together Yes, you could but I'm first I'm just trying to vision demonstrate that with this overflow thing actually does The back yes This question is if I split the leaf nodes do I split the overflow nodes as well? Yes So maybe the case that after a split the new leaf node might be also overflow as well Yes, correct. Yes, so statement is the keys When we use a b plus tree for indexes That the keys are the attributes within the table And then the value is a is either a pointer to the tuple or the actual tuple itself. Yes so statement is if So statement is basically the I showed the example before of the Composition of the Of the page This guy back here, right So I had the the values and I assume it was one automatically from a key to a value Uh, and then this is just an array. So you're basically saying I could have the key point to an array Within my page that here's all the values for this you could do it that way too. Yeah But at some point it could overflow as well All right, so let's go a little bit deeper now. What else we can do with the b plus trees So, uh, you have a notion what's called a clustered index Where this is you're sorting a table Based on some specific order of the primary key Right, this guy to be the the the mysql approach Right said where the the leaf nodes actually store the tuples themselves Right, but since you get that for free Uh, or you could tell the data system Okay, the this table is was originally unsorted, but now make it sorted based on this given index At any time you do an update or delete or an insert the data system We'll try to maintain the uh, that it has to enforce that that sort sort order Right So not all data systems support this, uh We can see an exam on postgres in a second postgres has the cluster command You can tell it to cluster the table based on an index and it'll do it once for you But it won't actually actively maintain it over time Yes Uh, your question is is there a specific primary key that would be better or worse? What do you mean by that? Every table can only have one primary key Is there a type of primary key that would be better? Uh Like based on what the attribute types are are the integers or strings or whatever Uh It's not specific to cluster indexes, but certainly having indexes on integers will be way faster than integers on strings Right, so that's one way to think about it So some data systems support, uh Support, uh Indexes that aren't the primary key. Yeah, there's secondary indexes. Yeah What do you when you say what is the is the is the order the tuple sort of based on that? No, you can only if there's if you have a clustered index for your table the table can only be sort of one key Right or whatever whatever that index is if you have a uh You can build a secondary index and that can be sort of anyway on any actually do you want? Actually any direction like sending or descending Yes Yeah, go for it. Yes This question is is the primary is the entire tree it'll be sort of able to be certain memory That's why we can have fast lookups I mean fast from asymptotically log in it's fast, but there's no guarantee that those pages will be in memory Right, my example. I said beginning class like nothing The data system won't won't stop me if I say build an index build a thousand indexes on the table It'll do it for me, right Whether it all fit in memory depends on with the hardware depends on the access patterns of the queries There's no guarantee In in a in a production system you want it all to be in memory Right, so you you you pay for dba or some kind of automated tool to try to figure that out that balance for you Yes This question is is every single index you build it's a to a table Uh, it basically can can you have multiple? This question is basically asking can you have multiple? Can an index reference multiple tables? How would that work? But what's in the leaf nodes? I have values to the point of what? But for what table? But how would I know what table you'd have to put some metadata to say what table i'm looking up on? uh yeah From what we're talking here doesn't make sense the way We're not going to talk about inverted indexes. Basically some of the new sequel guys You can almost think of like a search engine find me all the tuples that have this string in it Uh, if you just put everything in a giant table, then it would work like what you're describing But in for the relational model that doesn't make any sense Oh, so this question is if I if I built an index on every single column my table, would that be the same thing as index storage? Maybe so When you talk about this class, there's a notion of what's called a covering index a covering index is um For my query if I can get all the the data I need for that query if I can get it from the index Then you don't have to go look at the original tuple Right, so it'd be a shitty way to do the index organized table But you could potentially do it that way But when you update that when you update that tuple, you're starting to go you're starting to have to go update the original table so that's like You could do it's a bad idea. I mean, but it depends All right Since we're not going to get through uh I mean actually so let me take it to show Not this Right, so basically a cluster b plus tree looks like this again. We have our table pages. They're unsorted Um, and we have some some b plus tree up above Right, and this is going to guarantee that the the ordering of the the tuples and the pages will be based on the Based on whatever the key specifies. All right Or it's sort of the index that we're specifies You don't have to use a A a clustered index to get this benefit as well Um, so this is sort of the example he was asking could I have an index could I have multiple indexes that have their own different Sort of noting that's different than how the The the table is actually sorted Uh, the answer is yes, but the challenge is going to be that If I blindly follow the the pointers that the of the values that the index is pointing to If I just go fetch them or like, you know In order that they show up at the index Then they actually may not be that efficient. This is getting a little into a query execution But this is sort of a simple optimization we could talk about now So I have my index is I'm scanning across the leaf nodes But now my pointers point to different random pages Right because the index is based on a different key than maybe how my my tuples are sort of them Uh, physically in the in the pages So if I blindly follow the the pointers in the order that that the davis sees them As it scans along I may end up doing something really stupid where I go fetch a page Uh, and then immediately throw it away because I'm running out of memory space And then I end up in my in this same query go fetching the page over and over again Right redundantly, which is wasteful So I really saw optimization is I just do my scan first figure out all the pages I'm going to need to touch Sort them Based on the page ID and then then scan through in that order So that guarantees that I'm only going to ever go fetch One page wants and only wants for this query All right, so in this since we only have 10 minutes left We'll have to cover those up next class. Let's Let's do a quick demo, which is always fun all right, so um For this demo, I have a table that's already preloaded that has um A bunch of email addresses so, um So this is actually, uh, it's 27 million emails from the ashley madison Hack breach or whatever you want to call it? From a few years ago. They made it available on bit. Torrent. So here it is Um, so I've preloaded Uh, what's that? Sorry, I've I've already created two indexes. I have a hash index and a At a and a b plus tree index on the emails. So in uh, postgres Uh, if you just say um, well sure Uh, you call create index you specify what table you want to build the index on and what columns you want to build along um And in postgres if you add this using b tree calls then you're you're you're guaranteed to get get a b tree Yes, if you tell it not to so by default if you just write this I want to build a index on on email Again 99 of the databases that are out there will give you a b plus tree But you can you can specifically tell possess. I want a b plus tree Um, if you want a hash index in postgres You say this and you add you add using hash at the end and this is about a hash table Again for 27 emails, it's going to take a little while to do this. So I've already preloaded it Um, so let me also turn off a bunch of other optimizations if we don't care about now um So if I want to get the the minimum email Right, it comes back pretty quickly and I cannot explain To figure out what it did. So in this case here Uh This is actually the covering index I talked about before so it's doing index only scan meaning it only looks at the index to find The answer question I need or the the data that I need again The beauty of the relational model in sequel is that the data system knows my select query all they want are emails And so this there's an index on the b b plus tree the b plus tree index already has the email So I never have to go look at at your original tuple Right, so this is super super fast So is this one new that all I need to do an index scan to traverse down to The left side of the tree because it's that sort of order and I can immediately produce the answer that that i'm looking for All right But now if I want to try to do something more complicated Like I want to find uh an email that begins with foo Right, it takes kind of a while to run And produce a bunch of results We don't care about the results, but we can see what the query plan is was that Are these emails in there maybe uh, they're actually some cmiolas in there. Um, it's not me Uh, right. So this case here I'm doing you know Trying to look up where find me all emails where it starts with the prefix foo I have a b plus tree that I could do this on right But it chose to do a sequential scan because the data system has this It's optimized to try to figure out what would be the cost of doing a sequential scan versus index scan And it decided that foo is somewhere in the middle of of the values the better I'm better up just scanning through the entire table and trying to use the index for this Right, let's do another one here. We can do a with a Uh, this one to try to say where email equals zero zero with some other fake thing And then some other random dude in hotmail, right In this case here post-process size is now going to use the hash index Now we have a disjunctions with an equality predicate with the entire value of the key that we're looking for So it's going to do the first look up on the hash index with the first for the first predicate Then it runs it again for the second predicate And then this is called a bit bitmap the next scan but ignore the bitmap part Basically, it did a probe once find the entry that it wants did another probe again to find another entry And it now knows those are the two pages that has the data. I need to go fetch That's sort of that optimization I saw before and then it says, all right Take takes the uh takes the ore of them finds all the the pages I need to touch And then goes touches them and just find the exact data that we're looking for right So that's a good example where again where the the data system can can within a single query can probe it multiple times And in this one here, it chose to use the hash index because You know, it's on average o one. So it's that's much faster than using the b plus tree So now we do another one where you want to say Uh find all emails without uh were greater than andy I think we need the Like this In this case here, it chose to do the sequential scan again because we're on the the left side of the tree And we had to do all these leaf node scans and it's just faster to do the sequential scan So I try to pick something on the the uppermost portion of the of the key space Like zzz Now I recognize aha the thing you're looking for is actually Uh, the range can actually be much smaller because I know that I know roughly the distribution of values So I'll jump to the left the right side of the tree and I have to scan across and the index scan is faster So it chooses that so Let's see now without a clustering index So I'd go ask it. What's the minimum ctid of of this table and it's zero one is page zero offset one right slot one But if I go look actually what those values are Let's just keep it simple Let's say let's get the first four values right where ctid is zero one zero two zero three So go find the first the first page zero and give me the first four values Right, you can see it's much a random email address, right? So this is showing us that the data is actually not insorted like I I randomize or I shuffled the Randomly sort of the data before I inserted it and postgres doesn't sort of in the way in the order that they were Inserted them into the pages in the order that they appeared in the file that I loaded right So there's a command in postgres As I said before called cluster I'm not going to run it because it's going to take it's going to take a while right It's basically going to copy out the It's going to make a new version of the table And then it's going to sort them or do a do a rank Do the leaf node scan and sort them back into the table Based on the order that they appear in the index So I already have one that's already prepopulated You know clustered or email sorry Right that has has the same values as before with all all 27 million 27 million emails So now if I do that same query that I have before I said go get me the first four entries in the first page On my clustered one I should know so that that's the unclustered Let's do the same thing on the clustered Now you see because I ran that cluster command it it reinserted the back of the table in that order But as I said some data says will maintain things in that sort order postgres does not when you call cluster command It's a one-time operation and it won't won't maintain things in the sort order going forward And I can test this because I'll delete the first entry Right this fake email address with zeros right And if I run that the check to what's in when the first four slots Now you see that again zero one is missing. It's empty. Right. It hasn't been used yet By now and sort it back in And I do my look up before This it's not there So now I want to go get it Ctid Star from emails clustered Where email equals this guy, right now it's some some page 299 on offset 146 Right, so even though I told postgres I want to I want to use a clustered table or cluster my index table in this index It's not maintaining it for me Yes This question is this mostly to speed up reads going forward. Yes This question is this the postgres property do others not have this problem? Correct. Yes, uh We could load the data up in my sequel since my sequel Actually, if we build the primary key on the email addresses Then my sequel would would guarantee it's always in sort of order Yes Yes, save it is and again, this is the important thing to understand it. I said at the beginning this table is Even though logically the email address should appear at the the right left left side of the tree because that's the smallest value Postgres is not guaranteeing that's going to appear in that order On the pages correct again the index is like a replica of the table It can have its own physical properties and do whatever you want with it That that's clearly separate than what the table is and but always has an underlying table like you because that's that's the sort of the That's the primary value or the the the master values of those tuples uh, so so save it is Even though they're logically the same the cluster command guarantees that the The the the table pages with the tuples will be in the same sort order as the index. Yes But in case in my seek or in a certain case of postgres, they won't maintain it for you So sometimes you see people running like like cron jobs every saturday They're rolling the cluster command every night they run cluster command for the next for the next day It's great save it It seems very expensive to you insert if you want to maintain the cluster If you're not storing the the tuples in the leaf pages like my sequel is The answer is yes So especially if you want to get good performance for a sequential read Uh, you want your cluster index Well, no like you could do this for uh Depends on what the query is right So let's actually look we have time to quickly do this Let's do Here we go All right, so the statement is If you're doing a lot of sequential reads, do you need a cluster index? No, because it depends on what the command is and if it's OLAP, you're doing nothing but sequential You know sequential reads anyway, and you want to look at everything anyway All right, so I have two terminals here on on this side. We have sequel light on this other side We have duct db duct db is a column store That has vectorized engine it has a whole bunch of the stuff we'll cover later in the semester But sequel light you guys used for In first homework, right? So I think this one is query four Right So just trying to get the name of the counter people that appeared uh based on number of appearances Right, so sequel light on my box here takes about 17 seconds theory But let's use duct db Same query Right half a second So The in this case here, I'm basically doing an entire sequential scan on the table or with the join Uh, there's other optimizations that'll make duct db faster before Sequel light before the even the clustered index There's nothing in my in my query here that that would take advantage of the clustered index anyway Right So I said what is the exact reason duct db is faster? I think it's the combination of column store. It's vectorized. We'll cover that in a second. It's doing, uh I think they're also doing query compilation and it actually is probably doing parallel execution as well So everything So is the clustered index going to make a huge difference for uh for olap? Probably not column store would be a bigger difference Okay All right, so there's a bunch of things we didn't get through uh We'll cover that we'll cover the beginning next class So we'll say next class talking about the different design choices and optimizations we can do okay All right hit it