We have seen how to access records from a file. What we have seen so far lets us sequentially read records from a file: we can go to the first block of the file, then read all the blocks and all the records, which is fine if you want to read the whole relation. But most online transaction processing workloads do not do that; a transaction will pick up a particular student's record and retrieve it. So, how do you fetch a particular student's record? You could always do it by scanning the entire student relation, but that is going to be very slow. What we want is an index which, given a search key value like the roll number, tells us where the record is, or where the records relating to that search key value are, in a particular file, the student relation for example. So, the question is: how do you build such indices? We are going to start with some concepts before we move on to the most widely used index structure, which is the B plus tree. Given that we have very little time, I am not actually going to try to explain the B plus tree algorithms in detail, but just give an overview of the structure and some intuition about how B plus trees work. Again, people here who have taught this course know all this stuff. So, what I want to do is go over it quickly, and if there are questions, anything you found weird, let us discuss those. The basics: there is the notion of a search key, which is what you use to store data in the index and to retrieve records from it. An index in general consists of index entries of the form (search key, pointer). There are certain special indices which, instead of storing the pointer, actually store the record itself, the contents of the record, but the typical index only has a pointer to the record stored somewhere else.
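The (search key, pointer) idea can be sketched in a few lines of Python. The roll numbers and record layout here are made up for illustration, not from the lecture; a list stands in for the file and a list offset stands in for the pointer.

```python
# A sketch of the (search key, pointer) idea: each index entry pairs a
# search key (here, a roll number) with a "pointer" (here, an offset into
# a list standing in for the file). The data is illustrative only.

records = [
    ("2019CS101", "Amit"),
    ("2019CS102", "Bina"),
    ("2019CS103", "Chandra"),
]

# Index entries of the form (search key, pointer).
index = {roll: pos for pos, (roll, _) in enumerate(records)}

def fetch(roll):
    """Follow the index pointer instead of scanning the whole relation."""
    pos = index.get(roll)
    return None if pos is None else records[pos]

print(fetch("2019CS102"))  # ('2019CS102', 'Bina')
```

The point is only that the index maps a key value to a location; the scan of the whole relation never happens.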
And there are two broad kinds of indices, ordered and hash. We are going to skip hash indices and focus on ordered ones. In fact, in the real world hash indices are not used that widely for disk data storage. They are used for indexing data in memory, but on disk B plus trees are pretty much universal. Some databases like Oracle do support hash indices; Postgres supports them too, but does not encourage you to use them. They say that they have seen the performance and it does not give any benefit; stick with B plus trees. That is the general wisdom of the people who have actually implemented these things. So, now, how do you evaluate an index? There are several metrics, including the time to access a record given a search key, the time to access multiple records given a search key range, the time to insert a record, the time to delete a record, how much space the index takes, and so on. We are going to focus on ordered indices, where the index entries are stored sorted on the search key value. Some more terminology: there is what is called a primary index, which some people confuse to mean an index on the primary key. It is not an index on the primary key. A primary index is for a sequentially ordered file: if the file is ordered, the primary index is an index whose search key specifies the sequential order of the file. So, if the file is sorted on student ID, an index on student ID is a primary index. An index on student name would be called a secondary index. Now, a primary index is also called a clustering index. Why clustering? Because of what happens if you have multiple records with the same search key value. Now, you do not have multiple students with the same ID, but let us say we have the takes relation, which says which courses a student has taken. If that relation is sorted on the student ID, all the records for a student are going to be clustered: they are going to be consecutive in the file, because all of them have the same ID value. And this is very important.
Consider a registration system which needs to find out all the courses a student has taken. If this set of records is scattered across the relation, then when a particular student's courses have to be displayed, you have to retrieve data from many places. Now, how many courses would a student have taken, if you want to display everything they have taken since they joined your institute? Typically, students take, what, 6 or 7 courses a semester including labs, times 8 semesters, say of the order of 50 different courses. If these 50 records are at 50 locations on disk, you have to do one IO for each. Now, how much time will this take? At 10 milliseconds per IO, that is half a second to retrieve the records for one student. Now, suppose you have online registration and 1000 students are coming in, not even in an hour, say in a few minutes, just retrieving and displaying their data. They are not even registering; they are just seeing what they have registered for. Half a second times 1000 is 500 seconds. That is a lot of time, and this is just one thing they are doing. They are actually doing many other things, all of which take time. So, it is very critical for performance, when you are retrieving multiple records, to have the records clustered together. This can make a huge difference in performance. And how do you choose how to cluster a relation? Well, that depends on what the queries on that relation are. So, for example, the takes relation would probably ideally be clustered on ID. A secondary index is one whose search key is not the clustering key. So, if you do not cluster the takes relation on ID, then the records for a student will be scattered, and an index on ID will be a secondary index. There is also the notion of an index-sequential file, an ordered sequential file which also has an index. Now, an index has entries, and a dense index has one entry for every search key value which appears in the file.
So, a dense index on ID has one entry per ID. In this case, the instructor relation has one record per ID, so the index has one entry per record. However, if a file has multiple records for the same search key, you may have a dense index which, for example, has one entry for Comp. Sci., even though there are three instructors in Comp. Sci. This works provided the file is actually sorted on department name, in which case you only need to store the pointer to the first record; the remaining ones are consecutive. If this file were sorted on instructor ID, you could not do this; then you would have to have one entry per record. In contrast, a sparse index has only a few entries, and it is applicable only if the file is sorted on the search key. The key idea is that I store only a few representative values. Now, if I search for a record, say with key 12121, and that key is greater than one index entry but less than the next, I follow the pointer from the first entry and then search sequentially until I find the record. How far away can that record be? Anywhere between the two entries. So, as long as I have enough entries, I do not have to search through too much of the file to access a record. That is the key idea of a sparse index. Now, this is still a single level index. But the question is, how sparse should it be? I can choose to have one entry per five records, or one entry per 100 records. How do I choose? The typical solution, a good trade-off, is to use whatever is the size of the blocks which the database system uses. What is the meaning of a block? When the database fetches data from disk, it is going to fetch a whole block of data. Whether you want the rest of the data or not, it is going to fetch that whole block into the buffer. So, once the block is fetched, it is available, and it is cheap enough to read what is in the block.
So, it is usually a good idea to have just one index entry per block. Anyway, you are going to read the block, and then it is cheap enough to scan through it. So, this index will have one entry per data block; this is a sparse index. Now, what if the index itself becomes very big? You can build an index on top of the index, which gives a multi-level index. Here is a pictorial representation. These are the actual disk blocks, and the file is sorted on the search key. Then you have the level zero or inner index, which has one entry per block. This index itself is divided into multiple blocks, and the next level of index has one entry per block of this level. So, this is a multi-level index, and this is how indices were designed in the early days of databases. I will repeat the question. Go ahead. Okay. So, the question is: what if I delete a record here, like 3234, which was actually used in the index at this level? What you could do is replace that entry by the next record here, so that the index stays consistent with the file. It is also possible to not do this; you can have an index which contains values which no longer exist in the file. We will actually do that in the B plus tree case; we allow it. It is simpler to not bother, but if you wish to, you can update the index. Yeah. The next question is regarding multi-level indexing: if you increase the number of levels, is it going to affect efficiency? Yes, it will be slower; you have to traverse through all the levels. So, how do you justify that? It is a trade-off. You can use larger blocks here: you can either use the disk block size as the database block size, or you can use a different notion of block which is bigger, and have fewer levels. Database implementations actually play around with these things, and the block size for the index is often larger than the minimum block size.
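The sparse-index lookup described above can be sketched as follows, assuming a made-up block size and integer keys: one index entry per block, holding that block's smallest key; find the right block by the index, then scan only within it.

```python
from bisect import bisect_right

# Sketch of a sparse index with one entry per block. The block size and
# the keys are made up for illustration.

BLOCK_SIZE = 3
file_keys = [10, 20, 30, 40, 50, 60, 70, 80]  # the sorted file
blocks = [file_keys[i:i + BLOCK_SIZE]
          for i in range(0, len(file_keys), BLOCK_SIZE)]
sparse = [blk[0] for blk in blocks]            # smallest key of each block

def lookup(key):
    """Find the last sparse entry <= key, then scan just that one block."""
    b = bisect_right(sparse, key) - 1
    if b < 0:
        return False                            # key below the whole file
    return key in blocks[b]                     # cheap in-block scan

print(lookup(40), lookup(45))  # True False
```

The in-block scan is the "search sequentially till I find it" step; since the block is in the buffer anyway, that scan costs no extra IO.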
But there are some tricks there which I do not have time to get into; there is an interesting problem in there, actually. You do have to live with multi-level indices, because the inner index can become too big to store in memory, whereas the outer index may be small enough. Okay, in the interest of time, let us move on. We have seen that on a primary index you can have a sparse index, and then a multi-level index. A secondary index does not give you this benefit: you have to have an entry per key value. If you have multiple records with the same key value, you have a couple of options. One is to have a bucket of pointers, with a pointer to each record with that search key value. There is also the option of having multiple entries in the index, one per record. Now, the key thing is that this index itself is a very large file, and you will have a multi-level index on it. And this file happens to be sorted: even for a secondary index, the index file is sorted on the search key, which allows you to build a sparse index on it and create a multi-level index. That is, conceptually, how indices are built. Now, come back to this picture: turn it around mentally, rotate it 90 degrees, and it looks like a tree structure. In fact, it is a tree structure. Now, there are some limitations with this kind of an index. What if you delete a record here? What if you delete multiple records? A block may become almost empty. And what if you insert a record? This file has to be sorted, so if you insert a record, the block it belongs in may become full. Now, what do you do? You have to have an overflow block. But then that block does not have a pointer from the index; you have to deal with that. One way is to keep just one pointer and have a linked list of blocks which you have to search. But if a lot of insertions happen here, this linked list can become very big. So, that is a bad idea.
A better idea is: if a block fills up, you split it, moving some of the records into one block and the rest into another. But now, if you split this block, you need an entry at the next level of the index pointing to the newly created block. And then that index node may become full too, and you have to split it as well. There is a tradeoff with splitting: if a node is full and you split it, each of the resulting nodes may be only half full, because there are not enough records. You have 100 records in the block, you split it; the two blocks together have space for 200 records, but you only have 101 records. So, each of the blocks is going to be only partly full. That is the tradeoff which B plus trees accept. You say: yes, we will live with this, let blocks be only partially full. The benefit is that when I insert records at the leaf, I can split it and add more entries at the next level. And this is recursive: if that level is full, I split it and add entries at the level above, and this can cascade up. This is the key idea of a B plus tree. So, from a basic multi-level index, we now move to the B plus tree structure. B plus trees are going to be shown rotated. Here is a B plus tree. What does it look like? This is the underlying file containing the records, and the file in this case is sorted on the instructor ID. But the B plus tree's key is the instructor name. Now, if you look at this index, the bottom layer is actually sorted: you have B, C, E, G, K, M, S and finally W. If you look at the names, they are sorted. And the next level up has pointers to these nodes; the next level up is actually a sparse index on this level. Now, each of the key values at the leaf level has a pointer to the corresponding record in the file. There are no duplicate names here. We do not have time for it, but there are ways to deal with duplicate names; we will skip that for the moment. The next level up is the sparse index.
So, what does the sparse index need to do? Our earlier avatar of the sparse index had to say which is the minimum key in this node, which is the minimum key in that one, and so on. Now, if you look at this, it is actually doing something a little different. This level has two key values but three pointers. Our original sparse index would have had three key values and three pointers: an entry saying the minimum here is Brandt, with a pointer to this node; the minimum here is Einstein, with a pointer to this node; the minimum here is Gold, with a pointer to it. B plus trees make a small optimization: for the leftmost child, I am not going to keep the minimum. I do not care what the minimum is; as long as the key is less than Einstein, I have to follow that pointer. So, there is no point keeping track of the minimum there; it is irrelevant. That is why you have two key values and three pointers in this particular node. In this particular B plus tree, how many pointers would you have at maximum? Four, and at most three key values. What is the minimum you will have? If you allow nodes to have too few pointers, there is a lot of wasted space, and the tree can also become very tall. In this case, if the maximum number of pointers in an internal node is n = 4, the minimum is the ceiling of n by 2, which is two. And this particular node is actually at the minimum: it has two pointers and one key value, which serves to distinguish between the two sides. If the value is less than Srinivasan, you go to this side; if it is greater than or equal to Srinivasan, the records are on that side. The next level up in the tree is a sparse index on this level. Here, there are only two pointers, and we need a value which helps us decide whether to go left or right. But now, there is another issue here.
If I store the minimum value here, what is the minimum value in this node? Is it Srinivasan? If you look, physically the value here is Srinivasan. But conceptually, there is one more value which we have thrown out, and that is really the minimum. Since I have thrown it out, I cannot find it from this node, but I can certainly traverse down the leftmost pointer and find the minimum value there, which is Mozart. That is the value I keep here. So, if I am searching for Mozart, I need to go to the right. There I find Srinivasan; Mozart is less, so I follow the left pointer, come down, and I find Mozart. If I am searching for a value less than Mozart, say Kim: it is less than Mozart, so I go to the left. Now, is Kim greater than Gold? Yes, so I go to the right. In that node, I search through the values, and yes, I find Kim. What if I search for a value which is not there? Say I search for Mohan. Is Mohan less than Mozart? Yes, 'h' is less than 'z', so I go down here. Is Mohan greater than Gold? Yes, so I go down here. Is Mohan in this leaf? No. So, if I want to insert a new record for Mohan, the first step is to search for Mohan, and Mohan is not here. What do I do? I have to insert the entry for Mohan at this leaf, along with a pointer to the Mohan record. Of course, I also have to insert the Mohan record into the file; that is separate, I am only worried about the index here. Now, if I have to insert the Mohan entry here, what happens? The node is full; there is no space. So, I have to split it. When I split it, I have to add a pointer in the node above to the newly created node, along with a value which decides whether to go to the old node or the new one. In this case, there is space in the parent, so we are okay. But if the parent were full, it itself would need to be split, and we would have to go up. The splitting can cascade like this all the way up to the root.
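The search walk described above (Mozart, Kim, Mohan) can be sketched as follows. This is a simplified two-level tree using the lecture's instructor names, not the full structure from the slides; the node layout is an assumption for illustration.

```python
from bisect import bisect_right

# A hand-built two-level B+ tree sketch: internal nodes hold separating
# keys and children; leaves hold just the keys. Simplified for illustration.

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted search-key values
        self.children = children  # None for a leaf

def search(node, key):
    """Descend from the root to a leaf, then scan the leaf."""
    while node.children is not None:
        # Follow the child whose key range contains `key`:
        # child i covers keys[i-1] <= key < keys[i].
        node = node.children[bisect_right(node.keys, key)]
    return key in node.keys

leaves = [Node(["Brandt", "Califieri"]),
          Node(["Gold", "Kim"]),
          Node(["Mozart", "Srinivasan", "Wu"])]
root = Node(["Gold", "Mozart"], leaves)

print(search(root, "Kim"))    # True
print(search(root, "Mohan"))  # False: ends at the Gold/Kim leaf, not found
```

Note that the failed search for Mohan still lands at exactly the leaf where Mohan would have to be inserted, which is why insertion starts with a search.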
If the root is split, what happens? You get two nodes, and you have to add a new level to the B plus tree above, with pointers to the two nodes. That is the basic intuition of how insertions happen in a B plus tree. Now, when all these insertions happen, you will see that the tree stays balanced in terms of height. If you look at this tree, all the leaves are exactly two levels below the root. When I do an insert and splits create a new root, all the leaves will be three levels below the root. So, every leaf is at exactly the same depth. There is another property: every internal node has at least n by 2 pointers; in this case, the minimum is two pointers. So, what is the height of the tree? With a fanout of two, it would be log to the base 2 of the number of entries at the leaf level. Now, log to the base 2 of a thousand is 10, and of a million is 20. If I have to search through 20 levels of a tree, and potentially each level is in a different disk block, I am in trouble; it is going to take a long time. So, in fact, these nodes are going to be very big. Typically, the maximum number of pointers would be 100 or 200, and the minimum would be 50 to 100. Now, if, let us say, the minimum is 100, what is log to the base 100 of a million? It is 3: 100 cubed is a million. So, the tree will have at most three levels, and at most three disk IOs to get down to the leaf level, which would be 3 times 10 milliseconds, or 30 milliseconds. That is a lot better than 20 times 10 milliseconds, which is 200 milliseconds. So, B plus trees are basically short, fat tree structures which are balanced. There is a lot of material here giving the details of the occupancy, more details about the structure of a node, the properties of leaf nodes and of internal nodes. For lack of time, I am skipping all these slides. Then there are several examples with different n; we saw n equal to 4.
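The fanout arithmetic above can be checked directly, computing the height by multiplying out the fanout rather than taking floating-point logs:

```python
# With a minimum of 100 children per internal node, how many levels are
# needed to cover a million leaf entries? (The lecture's back-of-the-
# envelope estimate; 10 ms per IO is the assumed disk latency.)
entries = 1_000_000
fanout = 100

height, capacity = 0, 1
while capacity < entries:
    capacity *= fanout
    height += 1

print(height)             # 3 levels, since 100**3 = 1,000,000
print(height * 10, "ms")  # ~30 ms of IO, vs ~200 ms at depth 20
```

The same loop with fanout 2 gives height 20, which is the binary-tree case the lecture warns against.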
This one is another tree with n equal to 6: six pointers from an internal node. And you will notice that when n increased, the height of the tree decreased; earlier our tree had three levels, now it has only two. Then there is the query algorithm, which I described informally; the formal algorithm is here, and the details of the logarithmic time bound are on this slide, so I am skipping those. Insertion, again, we have described the intuition. It is presented here, the same intuition, including how to propagate splits. All those details are here; because we are really out of time, I have skipped them. But here is an example, and I will take a couple of minutes to work through it. This is the tree initially, and now I am inserting the value Adams into it. Is Adams less than Mozart? Yes, I go here. Is Adams less than Einstein? Yes, I go here. Now, this leaf is full, and I have to insert Adams into it. So, I split it in two. Basically, I take the sorted list of keys, including Adams, and move half the entries into one node and half into the other. So, I get Adams and Brandt here, and Califieri and Crick there. Now, when I split this node into two, I need an extra pointer to be added in the parent, and I also need to know the value which differentiates the two nodes; in other words, the minimum value in the new right node. So, what I do is add an entry for Califieri in the parent node; Einstein and Gold are still there. The left child of Califieri is the left node; the right child of Califieri is the other node. The two nodes resulting from the split are separated by Califieri, and the other entries in the parent, Einstein and Gold, remain unchanged. So, in this case, the split happened here, but it stopped here; it did not propagate further. Now, here is another one: into the original tree I insert Lamport. Where does that go?
Lamport is less than Mozart; Lamport is greater than Gold. So, it goes here. Now, the leaf with Gold, Katz, Kim is full, so I have to split it, and that results in Gold, Katz and Kim, Lamport. When this splits, I have to add an entry to the parent. But the parent, which was the result of a previous insertion, is full. Now, what do I do? I split the parent. There were four pointers, and now there is one extra, for a total of five pointers. So, three of those pointers go to one node and two to the other: the first three go here, and the last two go there. But there are some details about the intervening values. If you see, Califieri and Einstein remain the same; they are the intervening values for the first three pointers. Then you had Gold, which had a pointer to this leaf. But that leaf has now split into two, and the two resulting leaves are separated by Kim. So, this new node has Kim as the separating value and pointers to the two leaves. What happened to Gold, though? Gold is now the separator between these three pointers and these two. But Gold no longer stays in this node; it actually has to go up to the parent. Because these three pointers land up here and these two land up there, the parent itself needs to know when to go to this node and when to go to that one, and the value separating them is exactly Gold. So, Gold has moved up to the parent when the split happened, and it separates these two pointers: the pointer to the old node has become two pointers, one to each of the new nodes, and Gold, the separating value, has gone up to the parent. Is this clear? So, this is how splits propagate up. Okay, what are these horizontal arrows? At the leaf level in our structure, there is one extra pointer, because at the leaf level I have one pointer to a record per key value.
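The leaf-split step from the Adams example earlier can be sketched as: sort the overfull key list, cut it in half, and hand the smallest key of the right half up as the separator. This covers only the leaf-level step; updating the parent (and any cascading splits) is omitted.

```python
# Sketch of a B+ tree leaf split. `keys` is the sorted, overfull key list
# after the new key has been added. The left half keeps the ceiling.

def split_leaf(keys):
    """Return (left_node, right_node, separator) for an overfull leaf."""
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]   # separator = minimum of the right node

# The Adams example: Brandt, Califieri, Crick was full, then Adams arrived.
full = sorted(["Brandt", "Califieri", "Crick", "Adams"])
print(split_leaf(full))
# (['Adams', 'Brandt'], ['Califieri', 'Crick'], 'Califieri')
```

The separator Califieri is exactly the entry that gets added to the parent in the worked example.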
I can use this extra space for something useful, which is to find the next leaf. If I want to scan the leaves, I can use this pointer. Actually, many implementations skip this pointer, because strictly speaking it is redundant: I can always look at the next level up, find which are the children, and use that to scan them. Instead of going up and coming down, the scan can proceed sequentially along the leaves. However, it turns out that if you follow leaf pointers, then each time you read a page you must process it before you know which page to read next. So, after reading every page, you have to wait for the next page to be fetched. What optimized implementations do is look at the parent node and, for a sequential scan, issue prefetches on all four child pages at once. While you are processing the first page, the remaining three are being prefetched; by the time you need them, they are hopefully already in the buffer. And when you are done with them, the scan moves to the next child of the parent (it knows where it came from; it does not really have to climb the tree) and prefetches the next bunch of pages from there. So, even for a sequential scan, the next-leaf pointer is not what you want for prefetching; you want to use the pointers from the parent. What is the difference between a B tree and a B plus tree? Both are balanced. The difference is the following. What we have used is a B plus tree, and you will notice that key values in internal nodes do not have any pointer to an actual record. In a B tree, a key value in an internal node may also store a pointer to the actual record. The original presentation of B trees actually allowed record pointers from internal nodes. Afterwards, people realized that this just made the implementation more complicated and did not give any benefit; in fact, it had some drawbacks.
What is the drawback? By storing record pointers in internal nodes, you are reducing the fanout of the node. If you reduce the fanout, the height of the tree increases. Now, you might say that some records can be accessed directly from an internal node, but unfortunately, most of the records are at the leaf level, not at the internal levels. So, it turned out that it was not a good idea to store record pointers at internal levels, and what everybody uses today is what are called B plus trees. In the book, for example, we have de-emphasized B trees. We mention them somewhere for historical interest, but you do not want to waste time actually dealing with B trees. Just know that they existed and what the difference is, and that is it; do not spend time implementing them or dealing with them any further. The B plus tree is what is used, and if you look at database manuals, they will not say B plus tree; they will just say B tree. But by B tree they mean this one, not the original B tree; they mean this variation. Any other questions? A question: I do not know how much it is related to this presentation, but regarding red-black trees, what is the necessity of representing the nodes in red or black, given that they are also balanced trees? Yeah, there are many different balancing schemes. Red-black trees are one example; 2-3 trees are another. In fact, 2-3 trees are a special case of B plus trees: a B plus tree with a maximum fanout of three and a minimum fanout of two is a 2-3 tree. Now, there are different ways of balancing a tree: there is height balancing, there is weight balancing, and red-black trees are one particular implementation. That is probably more relevant to a data structures course than to a database course. The key problem with all those other structures is that they tend to be binary trees.
And for this kind of storage, certainly for magnetic disk, a binary tree is not at all efficient; you want a much fatter tree. You might ask, what about flash storage? Even for flash storage, access is a page at a time, so a binary tree is a bad idea there too. Next, you may ask, what about in-memory indices for a database system? Why not use a binary tree there? It turns out that memory today is beginning to look a little page-oriented as well, not page-oriented exactly, but cache-line-oriented. What do I mean by this? Processor speeds today are of the order of a gigahertz or a few gigahertz; they have more or less stopped at two to three gigahertz. The cache which is in the processor runs at this speed. But if you try to access memory, memory systems today still take of the order of 50 to 100 nanoseconds, meaning it takes around 50 processor cycles to read even one byte from the memory subsystem. So, instead of wasting time reading one byte at a time, every processor today reads multiple bytes at one go: it reads a whole cache line. How big is a cache line? 16 bytes is a common size; the processor reads 16 or 32 bytes at a time from the memory subsystem. Now, if you look at those 16 or 32 bytes, they really look like a tiny disk page. And it turns out that B plus trees with very small fanout are actually better than binary trees even for an in-memory index. Many databases play around with this; for an in-memory index they may use hash indices, but if they use B plus trees, they use trees with tiny fanout. And there are tricks which combine the two: a page of a B plus tree is itself organized as a set of tiny pages, arranged as a B plus tree inside the page. So, there are a bunch of tricks people use to get faster access to data within a disk block, which again amounts to a B plus tree with a smaller fanout. Other questions?
So, there are a bunch of slides on how to do insertion. Then there are some details on deletion: where we had splits on insertion, we will have merges on deletion. But it turns out that when a node goes underfull, less than half full, you may not be able to merge with either of your siblings, because they may be full. So, deletion requires a new operation, called redistribution, or borrowing of values from a sibling. Again, the details are on the slides and in the book; for lack of time, I am going to skip them and just show one example. I am deleting Srinivasan from this tree. What happens? I delete Srinivasan. This leaf node should have at least two entries; it has a maximum of three. If I delete this, it has only one. Therefore, in this case, it tries to merge with a sibling, and it can merge with this one. So, Mozart, Singh and Wu is what is left here. Now, when these merge, the dividing value here, Srinivasan, is now useless; the two pointers merge into one pointer, which points to the merged node. So, what has happened to this internal node? It no longer has two pointers; it has only one pointer, and in fact, it has no key values. So, it is underfull. Now, what do I do? Can I merge with a sibling here? Unfortunately, no. The sibling is already full; it cannot take even one extra pointer. So, at this point, I need to redistribute pointers. This node has only one pointer, the pointer to the merged node. So, what do I do? I borrow a pointer from the sibling. When I borrow a pointer, I also need to know what value separates the borrowed pointer from the existing one. How do I know what that value is? I could search down the tree and find it, but instead of doing that, note that the value is already there in the parent node: it is Mozart.
So, now, what I do is copy the borrowed pointer into this node and move Mozart, the separating value, down into it. What I get is Mozart, with a pointer to the Gold node and a pointer to the Mozart node. Now, there is another thing that has happened. The sibling now has only three pointers, and this node has two. So, the separating value between the two nodes has changed. It used to be Mozart; now, what is the separating value? It is actually Gold, which was in the sibling. So, Gold goes up to the parent. It is kind of a rotation: Gold goes up, Mozart comes down, and that is the resulting tree. Gold now separates this pointer and this pointer, and Mozart separates these two. Is this clear? Something else can happen too; let me take an example. Suppose I now delete Gold from this tree. If I delete Gold, what happens? This leaf now has two entries. Is it underfull? No: the maximum is three and the minimum is two. So, I do nothing; I just delete Gold, and that is it. But note something interesting: Gold is still in the internal node. Do I need to do something about it? I do not. I do not care that Gold is no longer down at the bottom; it does not affect anything. It does not affect search, it does not affect insertion. Whether Gold is there or not, I still have to search down to the leaf, and there I will find it or not find it. So, how does it matter that Gold is still in the internal node? I cannot look at the internal node and conclude that Gold is present. In a B tree, on the other hand, this could be an issue; but in a B plus tree, we do not care. We will leave entries around which correspond to values that have been deleted and are gone. There is another example of further deletion, which shows the height of the tree actually shrinking; for lack of time, I am going to skip that example too. Now, there are some issues on how to handle duplicate search keys.
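Before moving on to duplicates: the merge-versus-borrow decision from the deletion examples above can be reduced to entry counts. The minimum-occupancy rule here is the simplified ceiling-of-half rule from the lecture, and the function name is made up for illustration; a real implementation works on nodes, not bare counts.

```python
import math

# Sketch of the deletion decision: if a node falls below the minimum,
# merge with a sibling when the combined entries fit in one node;
# otherwise borrow (redistribute) an entry from the sibling.

def fix_underflow(node_count, sibling_count, max_keys):
    min_keys = math.ceil(max_keys / 2)   # simplified minimum-occupancy rule
    if node_count >= min_keys:
        return "ok"                      # no underflow, nothing to do
    if node_count + sibling_count <= max_keys:
        return "merge"                   # both fit in a single node
    return "borrow"                      # sibling too full: redistribute

print(fix_underflow(1, 3, 3))  # sibling full -> borrow
print(fix_underflow(1, 2, 3))  # fits together -> merge
```

In the Srinivasan example, the leaf merge corresponds to the second case, and the underfull internal node with a full sibling corresponds to the first.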
So far, we kind of assumed search keys are unique. What if search keys are duplicated? There are broadly three different solutions. One solution is to just let search keys get repeated in the B plus tree. If you do that, the search algorithm and the insertion algorithm have to change a little bit. Actually, insertion does not change much. The search algorithm has to change a little bit. Some details are there in the book. I should note that the first printing here had a mistake in the exact way in which this was handled. So, the errata has a correction. So, look up the errata or the slides. These slides have the correct form. So, you can either look at the slides or look up the errata, which is on the website, for the case of how to handle duplicates. If there are no duplicates, the algorithms, as stated, do not have any problems. It is only for the case of duplicate search keys that the algorithms have an issue. But it turns out that handling duplicates this way is not a good idea in any case. Yes, you can modify the algorithms a little to handle duplicates, but you do not want to do it, for a reason which I am going to cover in just a minute. So, the next alternative is to have buckets. That is, the search key is stored only once, but instead of storing a pointer to the record, it stores a pointer to a bucket which stores pointers to the records. What is wrong with this? It adds one more level of indirection, one more IO. Now, that bucket may be under-full or over-full. You have to deal with that. It is a headache. So, people do not actually implement that in practice. Yet another way is you have a key only once, but store a list of tuple pointers with that key. This is actually implemented by a few systems. But again, this has some overhead. What if this list becomes very big? How do you deal with that? It may span multiple blocks. 
Then you have to deal with those situations where it is too big, and then the biggest problem still is, if you have a deletion, how do you delete that record? I will find the search key, I will go through that entire list till I find the record ID and then delete it. Now, there is a problem here. All the other B plus tree operations so far, without duplicates, had a nice guarantee. Insertion is logarithmic. Logarithmic in what? To the base of the minimum occupancy. So, if the minimum number of children of an internal node is 50, the height of the tree is the number of entries in the tree, log of that to the base 50, which is a very small value. So, it is very nice to have these logarithmic bounds on insertion, deletion and search. But if you have duplicates and you follow one of these two schemes we have seen so far, you do not have that guarantee; the problem is we have to look at all the records with that search key to find which one to delete. Now, what if half the records in the relation have a particular search key value? If I delete one of those records, I am going to search potentially through a huge number of records to find the one to delete. That is very bad. So, a much better way, which also simplifies implementation of B plus trees, is to not use the original search key which has duplicates, but to add on to that search key something which makes the record unique. Typically a record ID. What is a record ID? A pointer to the record, or a primary key, or some other sequence number stored with the record which lets us uniquely identify a record. The moment you do this, what you have is a composite key, a key with two parts. The composite key can be used just like an ordinary key; the sort order is straightforward, it is a lexicographic order. So, there is no problem with having a composite key in a B plus tree. The benefit now is that when I want to delete a record, I know its record ID or its unique key value. 
So, I will search in this B plus tree using the search key comma record ID and I will find exactly the record I want in logarithmic time, exactly the leaf which I need in logarithmic time, and then I can delete that entry and I am done. So, deletion of a record can be done in logarithmic time, provided I make the search key unique by adding on something, and that is what most implementations typically use. Now, the B plus tree so far assumed the records are stored in a separate file. Now, it is often useful to cluster the records in that file in a meaningful fashion. Take the takes relation, for example; I would like it clustered by the student ID. But what if there are a lot of insertions and deletions to the takes relation? This happens when students register or change their registration; there are insertions and deletions. How do I make sure the takes relation stays sorted? That becomes an issue. There are two solutions. One is I will periodically re-sort the relation and in between I will use overflow blocks or something like that, and that is the traditional solution which pretty much all databases used till some time ago, and PostgreSQL uses that even today. I can ask PostgreSQL to cluster a relation, which is to sort the relation and store it in sorted order. But the moment I do insertions and deletions, the sort ordering goes for a toss. How do I maintain the sort ordering now? That becomes a problem, and the solution is actually to use a B plus tree again. The trick is, instead of storing record pointers in the B plus tree, I am going to store records in the B plus tree. So, here is an example illustrating this. This is a B plus tree on some search key; I am just using single letters to represent it, and the records, on the other hand, are there at the leaf. If you look at the leaf, it is not storing a pointer. It is storing an actual record. A4 is an actual record, and now I may have three records in this leaf, two in this. 
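The composite-key scheme for duplicates described above can be illustrated with a sorted list standing in for the leaf level of the B plus tree. This is only a sketch (a real B plus tree pages its entries, and the names here are mine): the point is that indexing on (search key, record ID) makes every entry unique, so deletion is a single binary search rather than a scan over all duplicates.

```python
import bisect

# A sorted list of composite keys (search_key, record_id) stands in for
# the leaf level of a B+ tree; lexicographic tuple order is the sort order.
index = []

def insert(key, rid):
    bisect.insort(index, (key, rid))

def delete(key, rid):
    # The composite key is unique, so binary search lands exactly on the
    # entry to delete: no scan through all duplicates of `key`.
    i = bisect.bisect_left(index, (key, rid))
    if i < len(index) and index[i] == (key, rid):
        index.pop(i)

def lookup(key):
    # Range scan over all records sharing `key`; (key,) sorts just
    # before every (key, rid) tuple.
    i = bisect.bisect_left(index, (key,))
    out = []
    while i < len(index) and index[i][0] == key:
        out.append(index[i][1])
        i += 1
    return out

for rid, key in enumerate(["Wu", "Singh", "Wu", "Mozart", "Wu"]):
    insert(key, rid)
delete("Wu", 2)          # delete one specific "Wu" record by its record ID
print(lookup("Wu"))      # -> [0, 4]
```

Searching for a single duplicate among many "Wu" entries costs a binary search here, mirroring the logarithmic bound the lecture wants to preserve.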
So, the records are actually stored here. So, this is called a B plus tree file organization. Now, this looks simple enough. In fact, it is simple enough. So, then you may ask why did most databases not implement this initially? Why did they implement it only recently, and PostgreSQL has not implemented it yet? DB2 implemented it like a couple of years ago. Oracle implemented it maybe three, four years ago. Now, why is it such a recent thing? Part of the reason is that initially they thought there was not enough demand for it. But the second issue is, the moment you allow records to move around like this, the notion of a record ID is a problem. Note that records are moving. Why do the records move? If a leaf becomes full, I split. Half the records move to a new page. Now, I have been assuming there is something called a record ID, which is typically a page and an offset in that page. Now, a record ID completely changes the moment a record moves. So, I cannot use a physical record ID anymore. And the problem was, most databases, when they were implemented, had the notion of a record ID hard-coded in their code. If you allowed records to move, record IDs were no longer meaningful, and that caused a lot of problems. So, in order to implement relations where records can move, you should not be using a record ID elsewhere. What do you use instead? If I cannot have a record ID, what do I do? The trick is that this B plus tree file organization is on some key. That key itself, let us say it is on the primary key; that is a common case. But whatever key it is on, that key has to be unique and uniquely identify the record. So, the trick is, wherever I used record IDs earlier (why do you use record IDs? I use them in other indices, other structures), the record ID which I should now use will be this search key. And the search key will include something to make the record unique. 
Now, earlier I said the record ID makes it unique, which is the physical location. But now that is meaningless. So, what I need to do instead is store some unique value with the record, maybe the primary key of the record or maybe something else which is added artificially. So, now everybody else has to store this search key, which includes that primary key or something else which makes it unique. So, this is what everybody else should store. So, that is the issue of record relocation in secondary indices; I am going to skip the details here. For lack of time, I am going to skip a few more details. Maybe I should just mention this, because this is something which people are not aware of, which is, if you take a database and create a relation... but let me first take your questions. There are two questions. One is, by creating a single index, can we link it to multiple files, multiple data files? Is it possible? A single index which refers to multiple different relations? That is actually potentially something very useful. Some databases do implement it. So, yes, it is a good idea, which is a very nice question. Now, why is it useful? For example, if you are joining records from those two relations on that key. If you just search once on this index, you will find the pointers to those records and you can fetch them and join them quickly without traversing multiple indices. The second question is, how to choose an attribute on which an index can be created to improve access efficiency? That is another very good question. The question is, what indices should I create for a particular database? So, there are two parts to the answer. One is, every database automatically creates a number of indices. What are these indices which are automatically created? If you declare a set of attributes to be a primary key or a unique key, automatically an index is created. Why? 
Because to enforce that constraint, when you insert a tuple, you have to see if that key already exists, and unless you have an index, you cannot do this efficiently. So, every database automatically creates that index. The second automatically created index is if you have a foreign key reference from something to another relation. Now, if you delete a record from that other relation, you have to check if there is a key value here which references it. If you do not have an index on the foreign key attribute, you cannot check this efficiently; you have to scan the relation. So, these two kinds of indices are created automatically. Now, what other indices do you create? That is a function of the queries which you have. For example, name may not be either a primary or a foreign key attribute, but maybe you have a lot of queries on name. If so, you may want to create an index on name. Now, how do you know you are going to get a lot of queries on name? When you do the schema design, you do not have any idea really. It is only when you build a system and create the queries on that system that you know what queries are there. Even then, you may not know which queries run frequently and which queries run very rarely. You may not want to bother about a rare query. So, what happens typically is, you build the system, load the data, run a typical query workload and see what queries have run. And then, you want to see what indices would have benefited these queries. So, one way is to do it manually. You have a database tuner who sits and looks at these queries and figures out what could be useful. But it is actually quite hard, because you do not know what plan the query processing engine is using. So, it is actually a hard thing for a human to do. So, today, most commercial databases, SQL Server, Oracle, DB2, Sybase, all of them, provide a tool which is called an index tuning wizard, which will let you record a workload of queries which are executed. 
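The automatically created index for a key constraint is easy to observe. As a sketch using SQLite through Python's built-in sqlite3 module (chosen because it is self-contained; the table and names are made up), declaring a primary key causes the system to create an index behind the scenes. One caveat: not every system auto-indexes foreign keys; SQLite, for instance, does not, so that part of the lecture's claim is system-dependent.

```python
import sqlite3

# A non-integer PRIMARY KEY in SQLite forces an automatic index, because
# the uniqueness check on every insert needs one to be efficient.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student(id TEXT PRIMARY KEY, name TEXT)")

# The system catalog lists the index even though we never created one.
auto = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(auto)   # contains an automatic index like 'sqlite_autoindex_student_1'
```

The same experiment on other systems (e.g. querying pg_indexes in PostgreSQL) shows the corresponding automatically created index for primary and unique keys.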
And then, for that workload, they will figure out what indices are most useful. And they will suggest that you create those indices. And if you say, okay, go ahead, they will create the indices. So, for any real deployment, these tools are very useful. Now, in Postgres, such a tool is not available, but it still provides some monitoring features. So, you can find out which queries took a long time to run. So, then you look at that query: it took, you know, two seconds to run. Why did it take two seconds to run? So, you can look at that query and say, this query really wanted only people with a given name; okay, I should create an index on name. So, you can do it manually by this tuning process, and Postgres does support this manual tuning. Use the mic, please. Sir, when we are accessing the database, that B plus tree should be in the main memory? No, the B plus tree is brought as needed into the buffer. The entire tree is on disk; pages which you access are brought as needed into the buffer. And if the records are stored in the B plus tree, as the previous slide on the B plus tree file organization shows, then fetching these things, putting them into the buffer, will take a huge amount of time, because we are bringing the whole record into the buffer. Yes. And in any case, when you access the record, the block containing the record will be brought into the buffer. Now, in this case, that particular block is actually at the leaf level of the B plus tree. But if the leaf contains the record itself, then a huge amount of IO has to be done. So, let me reinterpret that question. The question is, the leaf contains the record. Originally, we assumed it had a search key and a record ID, which is small. Now, if the record is large, then the number of leaves of the B plus tree can increase significantly. So, maybe the height of the B plus tree may increase. That is a trade-off. 
In any case, there was an implicit extra level where you followed the record ID to the page containing the record. So, that is now explicit in a B plus tree file organization. So, really, there is no performance difference between the two. It is not an issue really. The real issue is that records move around, and all other parts of your database have to be written in such a way that this will not cause havoc if records move around. Okay. Now, to get back to this slide, another performance issue is if you do the following. If you create a table, create an index on it and now load a large number of rows. Load a million rows into it. You will find it very slow. So, yesterday we were playing around with PostgreSQL and we loaded 100,000 records into a table. And guess how long it took? It took like 45-50 seconds, which may not sound like much. But if you do a million records, it is 500 seconds. And why is it so slow? Actually, if you take the amount of time to read a million records from disk, it is nothing. If they are 50-byte records, I could read them in one second. Why should this take 500 seconds to build an index? It turns out that on very large relations, if you insert records one at a time into a B plus tree index, you end up with a lot of random IO. Another reason for this is, if you have a foreign key declaration, when you insert a record, it has to check if the foreign key is satisfied or not. So, for each record which you insert, it goes and checks the foreign key one at a time. All of this results in a lot of random IO, which is why 100,000 records took 45-50 seconds. Now, most databases provide a workaround for this problem. What they allow is what is called a bulk loading facility, where you can say, here is a file containing 100,000 records, load this file into the database. Now, what are the tricks that the database will do? 
First of all, for all primary key and foreign key declarations, instead of checking them one at a time, it will actually do something like a join of this relation with whatever it references, and that is much faster than probing one at a time. It will verify the integrity constraints that way. If there are indices on this, it will bulk load the indices. There are algorithms which let you build an index on an existing relation very fast compared to inserting one entry at a time. Those are called bulk loading algorithms. So, all of these are done by the database transparently, and therefore, if you are loading a lot of data, it is a bad idea to insert the records one at a time from a query; instead, create a file and use the bulk load features of the database to load it. And this particular slide here talks of one aspect of this, which is how to build a B plus tree on an existing relation very fast compared to inserting one entry at a time into the B plus tree. Again, for lack of time, I am going to skip it, but this is called bottom-up B plus tree construction. If you are interested, you can read it up later. There is some material on multiple key access which I think is straightforward. You should be aware of this already. I am going to skip that and wrap up with one last thing. If you want to create an index, you can run this statement in SQL. It is actually not part of the SQL standard, but most databases support it, and the form is create index, then the index name, on the relation name with a list of attributes. And optionally, if there are multiple types of indices, you can specify what kind of index to create. So, in this afternoon's lab, we will actually play around with creating indices on PostgreSQL and then see the effect of creating an index on certain queries, on the query plan, on the time it takes and so on. So, that wraps up indexing.