So, a few things to note. The connections between nodes of the tree are done by pointers, and the actual nodes may be scattered around the disk. So, if I go from one node to its child, the child may be somewhere completely different on disk, and I may be required to do a separate disk access for each level. And as already said, a B plus tree contains a relatively small number of levels. In fact, here is a formula. The root has at least two children, so the level below the root contains at least 2 nodes; every other internal node has at least ceiling of n by 2 pointers, so the next level has at least 2 times ceiling of n by 2 nodes, and so on. In general, we can show that if there are K search key values in the tree and a node holds up to n pointers, so the minimum fanout is ceiling of n by 2, then the tree height is no more than ceiling of log to the base (ceiling of n by 2) of K. Now, that is why n being large is important. Log to the base 2 of a million is 20; log to the base 100 of a million is 3. So, that is why a large n matters here. We are going to see how to do insertions and deletions; we just saw how to do searches, which is very efficient. What about searching for a range of values in a tree? So, let us take this example tree. There is pseudocode, which is actually very simple: start with the root and, until you reach a leaf, find the pointer you need to follow to go down. The details are here; I am going to skip them, and you can read them up later, but I already explained the intuition for finding which pointer to go down on. Once I hit a leaf, I check whether the value I am searching for is in the leaf. If so, I follow the pointer to the record and I am done. If not, the value is absent. That is a point lookup. But I also told you that a B plus tree supports another kind of lookup: given a range of values, find all records in that range. How do I answer such a query? Suppose I say, find me all instructors whose names lie between Einstein and Mozart. It is a range of values. How do I answer that kind of query? In fact, it is very straightforward.
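Before walking through the example, here is a quick sanity check on the height bound just mentioned, sketched in Python (the values of n and K are illustrative, not from any particular slide):

```python
import math

def max_height(n, K):
    """Upper bound on B+-tree height: ceil(log_{ceil(n/2)}(K)).
    Computed with integer arithmetic to avoid floating-point surprises."""
    fanout = math.ceil(n / 2)   # minimum number of pointers per internal node
    height, reach = 0, 1
    while reach < K:            # levels needed until we can cover K key values
        reach *= fanout
        height += 1
    return height

print(max_height(4, 10**6))    # tiny nodes: about 20 levels
print(max_height(200, 10**6))  # realistic nodes: only 3 levels
```

This is exactly the log-base-2 versus log-base-100 comparison from the lecture: one million keys need about 20 levels with a minimum fanout of 2, but only 3 levels with a fanout of 100.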
What I will do is take the bottom end of the range, which is Einstein, and search for it. Einstein less than Mozart? Yes, go down here. Einstein less than Einstein? No. Einstein less than Gold? Yes. So, I follow this pointer and I end up at the leaf containing Einstein. Now, the trick is that I want everything from Einstein up to Mozart. So, I start from this node and collect the values: I find Einstein, then El Said, and I hit the end of the node. I follow the next-leaf pointer to Gold, Katz, Kim and finally Mozart. When I go beyond Mozart, I find Singh, which is outside the range, so I am not interested in Singh and I stop. So, I can just scan the leaves consecutively to answer a range query which says: find all instructors with names in this range. It is very easy to answer such queries. This slide gives more detail on the examples I have already been talking about; you can read it offline, and I am going to move forward. How to do insertion? Now, the key idea for insertion in a B plus tree is what I told you earlier, which is to split a node if there is no space. The first step is to search down the tree to find which leaf node would contain the entry I want to insert. Once I find that leaf, it may have space. In the examples I showed you, some of the leaves have extra space and some do not. If I land at a leaf which has some empty space, I use that space to store the value. But if that leaf is full, what do I do? There is no space. At this point, I split the leaf into two nodes. Now, what can I say about how many values there are? The leaf was full and I am trying to insert one more value. If it is full, it must have exactly n minus 1 values, and now I am adding one more, so there are n values and no space in this leaf. So, I am going to break it into two and put half of the values in one node and half in the other. How many go to the first one? n by 2.
How many go to the other one? The other n by 2; or, if n is an odd number, one will have the ceiling and the other the floor. Those are minor details. So, I divide the values up as equally as possible between the two nodes. Now, here is what is interesting. Each of these now has about n by 2 values, which is at least the minimum number of values a leaf node is supposed to have. So, the split has preserved the minimum-occupancy property of leaves. But I cannot just split a leaf; I need to do something about the parent node too, because the newly created leaf node does not have anything pointing to it. You would not find it if you searched down from the root. So, I have to insert a new entry into the parent node; that is coming up in the next slide. By the way, in the leaf node, the entry I am inserting has a key value and a pointer to the newly inserted record. The value is stored in the leaf node, and the pointer sits to the left of the value in the leaf node. In the internal node, the split is a little trickier; this slide, by the way, is for splitting the leaf node. If I insert Adams into the node containing Brandt, Califieri and Crick, then Adams and Brandt end up in one node, and Califieri and Crick in the other. So, one leaf node became two leaf nodes here. Now, how do I update the parent of the leaf? Here is an example; it is the continuation of the same example, where I had Brandt, Califieri and Crick in one node, and it got split into Adams, Brandt and Califieri, Crick. So, coming back here, how do I update the parent node? Well, I need to find the minimum value of the new node, which is Califieri, and insert it into the parent. So, what I am going to do is insert Califieri. Its left child is the old node; its right child is the new node.
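A minimal sketch of this leaf split, assuming keys are kept in sorted Python lists (n is the maximum number of pointers per node, so a full leaf holds n minus 1 keys):

```python
import bisect
import math

def split_leaf(keys, new_key, n):
    """Insert new_key into a full leaf (holding n-1 keys) and split it.
    Returns the two new leaves and the key to be inserted into the parent."""
    bisect.insort(keys, new_key)        # temporarily n keys in the node
    mid = math.ceil(n / 2)              # divide as equally as possible
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]        # right's minimum value goes up to the parent

# the example from the lecture: insert Adams into (Brandt, Califieri, Crick), n = 4
left, right, up = split_leaf(["Brandt", "Califieri", "Crick"], "Adams", 4)
print(left, right, up)   # ['Adams', 'Brandt'] ['Califieri', 'Crick'] Califieri
```

Note how the value pushed to the parent, Califieri, is copied rather than moved: it remains the minimum of the right leaf.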
So, for the value which I put into the parent, one side of it is the original pointer, and the other side is the pointer to the newly created node. This value, which is the minimum value in the second node, is the dividing point between the newly split nodes. So, what I have done is: when I split a leaf, I update the parent level by inserting an entry into the parent. The entry contains a pointer to the newly created leaf and the minimum value in that leaf. This case was easy because the parent actually had space; it had two values and three pointers, and I could easily add one more value and pointer without overflowing the node. What if the parent itself is full? So, here is a case where the parent is full when I do an insertion. What I have is this tree. The font is very small, but I should be able to explain it even if you cannot read the values. I am going to insert the value Lamport into this tree. Lamport less than Mozart? Yes, go to the left. Lamport greater than or equal to Gold? Yes, so I come to this node. So, where do I land? This leaf has Gold, Katz and Kim. If I add Lamport to Gold, Katz and Kim, there is no space, because this leaf is full. So, I am going to split it into Gold, Katz and Kim, Lamport. That is what I have here: Gold, Katz is one node, and Kim, Lamport is the other. Now, I need to insert an entry into the parent with a pointer to the Kim, Lamport node. So, I have the value Kim and a pointer to that node which have to go in above, but there is no space. What do I do? I had three values and four pointers; now I am going to insert another value and another pointer. So, temporarily, let me expand the node beyond its capacity. The temporarily expanded node now has four values and five children.
I am going to split it, with three children pointers going to one node and two children pointers going to the other node. That is what I have done here: three children go to this node, and the other two children go to that node. Now, what about the dividing values? I had four values, but when I do the split, here is something interesting. Here, I need only two values to differentiate between these three pointers, and here, I need only one value to differentiate between these two pointers. So, how many key values do I need now? Only three. What about the fourth? I have a fourth value; what happens to it? Now, something interesting. Because I inserted Kim and a pointer here, and Kim with its pointer went to the newly created node over here, what happened is that three pointers went here and two pointers went to the other side, and there was a value in between those two groups of pointers, which is Gold. That value, which was in between these three pointers and these two pointers, goes up to the parent. So, in the parent I insert an entry for Gold, and the pointer corresponding to it points to the newly created node. So, I have a value and a pointer, where the value is the dividing value between the two split nodes. The dividing value, along with the pointer to the newly created node, goes to the parent node. In this case there was space in the root, so it got accommodated there. Now, if I do another series of insertions, eventually what will happen? The root itself will get full, and eventually one more insert causes one more split, which now tries to insert a new entry into the root. Now, what to do? The root is full; I have to split the root. If I split the root, I need to actually create one more level of the tree, and that is basically what happens. You can imagine this happening over a series of insertions: eventually the root is full.
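The internal-node split just described can be sketched like this, using the same temporarily-expanded-node trick (the node contents are from the lecture's example; the pointer strings are placeholders):

```python
import math

def split_internal(keys, ptrs):
    """Split a temporarily over-full internal node (n keys, n+1 pointers).
    Unlike a leaf split, the middle key MOVES up to the parent; it is not copied."""
    m = math.ceil(len(ptrs) / 2)     # pointers that stay in the left node
    left = (keys[:m - 1], ptrs[:m])  # m pointers need only m-1 dividing values
    up = keys[m - 1]                 # the value in between the two groups goes up
    right = (keys[m:], ptrs[m:])
    return left, up, right

# the expanded node after inserting Kim: four values, five children pointers
keys = ["Califieri", "Einstein", "Gold", "Kim"]
ptrs = ["p0", "p1", "p2", "p3", "p4"]
left, up, right = split_internal(keys, ptrs)
print(up)   # Gold goes up to the parent (the root, in the example)
```

The exact left/right sizes vary slightly by convention, but the key point is the one the lecture makes: three pointers with two values on one side, two pointers with one value on the other, and the fourth value, Gold, moves up.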
If I split the root, I will create one more level above the root, which becomes the new root. What value goes there? It will be the middle value here, between the two sets of pointers. So, when I split any internal node, including the root, some pointers go to one side, some go to the other side, and the value in between them goes up to the new root, and the new root will now have a pointer to this node and to this node. This process is actually very easy to understand, although when you code it, you have to be careful about special cases. So, we have provided pseudocode for insertion and deletion in the book. What is interesting is that in earlier editions of the book, we tried to explain directly how to deal with the case where the node is full and how you split it and so on. It turned out the code was getting fairly complicated. We could actually simplify our code a lot by explaining it like this: first, I expand the space for the node by making it temporarily a little bigger, I insert into that, and then I split it in two. That is how I explained it to you, and in fact, when I write the code, it is a lot easier to write it like this and to make sure it is correct. In the earlier editions we had written it a little differently; I think the fourth edition had one version, and in the fifth edition we used another. But we had a student, from some college in India, I forget which, who sent me an email saying: hey, there is a problem with your B plus tree code. It is very confusing. What is going on here? What happens in this case? What happens in that case? It turned out that the code was indeed very confusing. I do not quite recall whether there was a small problem or not, but the points he raised were correct: there should be a simpler way of explaining this, and it should be easy to show that it will always work.
So, based on his input, we actually went back and revised the description. The algorithm is the same, but the new description is a lot easier to understand. In fact, that particular student joined us for an MTech the following year, did very well, and graduated a few years back. Moral of this story: we welcome your feedback on anything like this. Just one student, a BTech student at that point, who really tried to understand what the code was doing, had very valid questions, and it helped us to improve the book. So, that was for insertion. I am going to talk about how to handle duplicates later, but here is how we handle deletion. I think the examples are in the next part; the talk had to be split in two parts, because Aview could not handle a large file. So, I will come to that in just a little bit. I am going to cover deletion using an example. I have a tree here before deleting Srinivasan, and now I want to delete Srinivasan. How do I do that? First I have to find Srinivasan, using the usual search technique. Srinivasan less than Mozart? No, it is greater than or equal, so follow this side. Srinivasan less than Srinivasan? No, so follow this side. Finally, I find Srinivasan here and delete the entry for Srinivasan from this leaf node. So, that was easy. Deleting it is easy, but there is a problem: this node now has too few values. It has only one value, whereas it should have had a minimum of two. In this case, I look at its sibling node, which is the node to its left here. That node actually has space for one more entry, so I can merge these two and get a single node with Mozart, Singh and Wu. That was an easy case, where I could actually merge into one node. But if I merge into one node, I also need to update the parent node, because now one node is gone and something has to vanish from the parent. In this case, the entry Srinivasan and the pointer to the deleted node vanish from the parent.
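A minimal sketch of this merge step, with nodes represented as plain Python dicts holding only keys (an assumption for illustration; real leaves are disk pages holding record pointers as well):

```python
def merge_into_left(parent_keys, parent_ptrs, i):
    """Merge the under-full leaf at position i into its left sibling,
    then delete the separating entry from the parent."""
    left, right = parent_ptrs[i - 1], parent_ptrs[i]
    left["keys"] += right["keys"]        # the sibling has room, by assumption
    left["next"] = right["next"]         # relink the chain of leaf nodes
    del parent_keys[i - 1]               # the separator (Srinivasan) vanishes
    del parent_ptrs[i]                   # so does the pointer to the old node

# the lecture's example: after deleting Srinivasan, its leaf holds only Wu
leaf2 = {"keys": ["Wu"], "next": None}
leaf1 = {"keys": ["Mozart", "Singh"], "next": leaf2}
parent_keys, parent_ptrs = ["Srinivasan"], [leaf1, leaf2]
merge_into_left(parent_keys, parent_ptrs, 1)
print(leaf1["keys"])   # ['Mozart', 'Singh', 'Wu']
```

After the merge the parent is left with a single pointer and no key value, which is exactly the under-full parent the lecture turns to next.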
But now, there is a problem: this parent has only one child, one pointer and no key value. In general, the parent will be under-full. It would not usually be as empty as this, but it will be under-full, meaning it has fewer than ceiling of n by 2 pointers. In this case, with n equal to four, n by 2 is two, and being under-full here means it has just one pointer; in general it will have more, but still too few. So, then what do I do? I can try to merge with the sibling node here. But unfortunately, in this case, I cannot merge with the sibling: that node is already full. It already has four child pointers, and I have one child pointer here which I cannot add to it. Now, what do I do? I have too few pointers here and I cannot merge. The only solution is to borrow: if I cannot merge, I borrow. So, what do I borrow? At least one pointer from that sibling, so that I am no longer under-full. What got borrowed is the sibling's rightmost pointer. Now, I have two pointers: one to this node, which is now Mozart, Singh, Wu, and the borrowed one, which points to Gold, Katz, Kim. So, I have two pointers, but I need to know the value to go in between them. Where do I get that value from? I could search down and find it, but there is actually a simpler way: I take the value from above. In this case, the dividing value is taken from the parent, and it is Mozart, because I borrowed a pointer from the other side of Mozart; the dividing value between these two subtrees sits exactly one level up. So, I bring Mozart down here, and what I end up with here is Mozart, with a pointer to this node and a pointer to that node. But if I took the value from the root node, what is the new value in the root node, in the parent here? Because I moved an entire subtree, the dividing value in the parent also has to change; Mozart would no longer be a correct dividing value.
The correct dividing value is actually here: it is the value which was to the left of the pointer I borrowed. That is the correct dividing value, so it moves up. If you notice here, Gold has landed in the parent, and Mozart, which was in the parent, came down into the newly reorganized internal node. So, when I borrow, I am actually doing a twist: when I borrow a pointer, I get the dividing value from the parent and push a new dividing value up to the parent. It is a sort of small rotation operation. That is what I end up with over here. Now, if I do further deletions, I may actually end up merging these two nodes. If I merge these two nodes, what happens? I have to delete an entry from the parent. Now, how many pointers does the parent, the root in this case, have? Only one. It does not make sense to have a root with one pointer, so what I will do is delete that root node and make the child the new root node. Do I have an example here? Yes, on this slide: when I delete Gold from this tree, where is Gold? Gold is here. When I delete Gold, I end up merging to get the leaf Katz, Kim, Mozart, so the separator Kim vanishes from the parent. Now, this node has only one pointer, while this one has, at this moment, three pointers, and the minimum is two. So, what happened here? We could have borrowed a pointer from the sibling, and we would have ended up with ceiling of n by 2 pointers in each of the two nodes. But in this case, we decided not to borrow but to merge. As I told you, on deletion there are two alternatives: I can borrow or I can merge, and if I can merge, I will merge. So, after deleting a pointer here, there is only one pointer left, which means I can merge with the previous node, which means the parent node ends up with only one pointer, and then that is removed.
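The borrow step, with its little rotation, can be sketched as follows (again using plain dicts as stand-in nodes; the pointer strings are placeholders for the lecture's leaves):

```python
def borrow_from_left(parent_keys, parent_ptrs, i):
    """The internal node at position i is under-full: borrow the last pointer
    of its left sibling. The separating key rotates DOWN from the parent, and
    the sibling's last key rotates UP to replace it (the 'twist')."""
    left, right = parent_ptrs[i - 1], parent_ptrs[i]
    right["keys"].insert(0, parent_keys[i - 1])   # Mozart comes down
    right["ptrs"].insert(0, left["ptrs"].pop())   # the borrowed rightmost pointer
    parent_keys[i - 1] = left["keys"].pop()       # Gold goes up

# the lecture's example: the full sibling and the under-full node
parent_keys = ["Mozart"]
left = {"keys": ["Califieri", "Einstein", "Gold"], "ptrs": ["p1", "p2", "p3", "p4"]}
right = {"keys": [], "ptrs": ["p5"]}              # one pointer, no key: under-full
parent_ptrs = [left, right]
borrow_from_left(parent_keys, parent_ptrs, 1)
print(parent_keys, right["keys"])   # ['Gold'] ['Mozart']
```

After the call, the under-full node holds two pointers separated by Mozart, and Gold has replaced Mozart in the parent, matching the reorganized tree on the slide.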
So, what I end up with is that this merged node becomes the new root. What does the merged node contain? When I merge, the dividing value, Gold, has to come down into it, and the pointer comes in alongside. So, this is the tree I get after processing this deletion. Now, the next topic here is how to deal with duplicate values. It turns out the pseudocode we gave almost works with duplicate search key values as well: if we just insert multiple entries, a search key value can occur many times, and that is mostly okay, but some very small changes are required. I should point out that we goofed up when we wrote the algorithm in the first printing of the book. There was a bug in that algorithm which did not quite work right for the case of duplicate keys. So, we have provided an erratum which shows how to handle duplicate search keys correctly, and the slide here talks about how to handle it. For lack of time, I am not going to get into it, but I just want to mention that if you want to see how to handle duplicate search keys using the code without major changes, then this is one way of doing it: modify the search and insert procedures a little bit, and this slide tells you how. I will not get into it here, but there are other alternatives, and most databases actually use an alternative which has a very good motivation. So, first of all, what are the alternatives? One is to have a separate bucket of pointers in another block per search key value; that has very high overheads. Another alternative is to keep each search key value once, followed by a list of pointers to all records with that value. This actually works quite nicely as long as the number of duplicate records with a given search key is small, but keeping a list of pointers falls apart if there are a lot of duplicates. Let us say for a particular search key I have 1 million records; it can happen.
If it does, then the list becomes enormous; it will not fit in a node, it is way beyond that, and searching in that list becomes very inefficient. So, what most databases do is a trick which seems like a cheap trick, but is actually a very good idea: they make the search key unique. How do they make it unique? Well, they add a primary key. Supposing the search key was department name, I can tack on the ID of the employee: department name comma ID becomes a composite search key, and now it is unique. So, if I want to search for all employees of a department, that becomes a range query: start with department CS, minimum employee ID, let us say 0, and go up to department CS, maximum employee ID. That is the range of values which corresponds to all employees in the CS department. So, the search key is composite, and a search on a prefix of the search key, which is department equal to CS with any ID, turns into a range query on the composite search key. This is a very important observation: department equal to CS is a range query on the search key department name comma employee ID. So, that is for search. The real use of adding the ID and making the key unique, though, is to handle things like deletion and so forth. We can show that by making search keys unique, I can do deletion very efficiently. If I want to delete an employee, I know their ID, so I can search for department name, ID in this index and delete the entry very fast. On the other hand, if the index did not include the ID, it had only the department, and a particular department had a million employees, deleting that one employee could be very inefficient, because I would have to search through a million entries to find the correct one and delete it. So, this is the reason many databases will always add an attribute, either the primary key or a record ID, to make the search key unique.
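This prefix-as-range trick is easy to see with Python tuples, which compare lexicographically like composite search keys (the entries below are made up for illustration):

```python
import bisect

# sorted leaf-level entries of a hypothetical index on (dept_name, emp_id)
entries = sorted([("CS", 7), ("EE", 1), ("CS", 3), ("Math", 2), ("CS", 9)])

# "dept_name = 'CS'" becomes a range query on the composite key:
lo = bisect.bisect_left(entries, ("CS", 0))              # (CS, minimum id)
hi = bisect.bisect_right(entries, ("CS", float("inf")))  # (CS, maximum id)
print(entries[lo:hi])   # [('CS', 3), ('CS', 7), ('CS', 9)]
```

And the deletion point is visible here too: to delete employee 7 of CS, a lookup on the unique key ("CS", 7) lands directly on the one entry, with no scan through the department's other entries.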
Now, the last couple of topics in indexing before we break. So far, we assumed the records are stored somewhere and the index just contains a pointer to each record. This is how most systems were built for a long time, but there was a problem with this approach. The problem is that as you insert more and more records, the file, which was perhaps sorted to start with because I sorted the relation when I stored it initially, degrades. As I add records, where do the extra records go? I create overflow blocks or use some other hacks, which do not work well. Eventually, the relation is not ordered properly, so the notion of a primary index no longer makes sense, because the relation is no longer ordered. I have to go and re-sort the relation; who sorts it, and when? It was all a pain. So, a better solution was the following, which is illustrated here. Note that there is a slight difference: the search keys here are single letters, C, F, I and so on, but a leaf here is not a list of search key comma pointer pairs. The leaf level contains actual records; in this case, each record has a letter and a number. So, in a B plus tree file organization, the records are actually stored at the leaf level: the leaf does not have a pointer to the record, it actually has the record. So, what is the benefit of this? The benefit is that as I insert and delete records, the index remains clustered on the search key: any leaf will contain all records within a contiguous range of the first attribute, the letters A, B, C, D in this case. So, the leaf-level nodes are clustered on the search key. Now, consecutive leaf nodes may actually be scattered around the disk, and there are some tricks to deal with that too. I will not get into them, but what is important is that consecutive values will be in the same leaf node up to some point and then continue in the next leaf node. The leaf level is clustered in this sense.
So, that is one major benefit of a B plus tree file organization storing actual records. There is also a drawback: what do you do if you have a secondary index? Earlier I said the secondary index points to the record. The problem here is that records now start moving around: when a leaf node splits, half the records have to move to a new node. If records start moving around, what happens to the pointers to them? They have to be updated, which is expensive. I am moving a lot of records on each split, so a lot of updates will happen on the secondary indices. For this reason, many databases did not support this feature until relatively recently. They started supporting it because there was a need in many applications, and there is a very nice trick for secondary indices: the secondary index no longer stores a record pointer, because the record moves around and I do not know where it is at any point in time. Instead, the secondary index stores the unique search key value which is used in the B plus tree file organization. So, first I will search, say, based on department, and I will find an ID; then, since the B plus tree file organization is built on ID, I will search down that second tree to reach the leaf. That is how I can allow splits to happen in the B plus tree file organization and move records around without affecting the secondary indices at all; they do not change when any such reorganization happens. So, that is what many databases do today; in fact, the commercial databases Oracle, DB2 and SQL Server all introduced this in the last 10 years or so. There is one more point here, which says that if the leaf level of a B plus tree file organization is only half full, that is a big overhead, because half of the space is wasted. In a plain index, the index was anyway only a few percent of the data.
So, if I started with an index which was 2 percent of the data and half of it was empty, that is 1 percent wasted, which is not so bad. But if I store actual records, half the space may get wasted. So, there is a bunch of tricks used during insertion and deletion to ensure the minimum occupancy of a node is not half; you can increase it to two thirds, three fourths, four fifths or whatever you want it to be, at some extra cost during insertion and deletion. The last couple of topics I am going to skip over. One is bulk loading: if you do a series of inserts into a B plus tree one at a time, it is often very slow. In contrast, if I give the database a whole relation and tell it to build an index on it, there is a very efficient way to build the index. The trick is to first create all the index entries as an unsorted file, then sort the whole thing, and then build the index bottom-up from the sorted entries; it is very fast to do this. Why is this the case? Basically, if I insert one entry at a time into an index on a large relation, a lot of random IO can happen; if I sort first, I can reduce the random IO. Now, how slow is random IO? At 10 milliseconds per IO, it could be very slow, whereas sorting can be done much faster. So, every database today supports bulk loading: you just drop the index, insert the records, and then create the index again, and it will build the index very efficiently. There are some details on how to do this bottom-up construction and how to do bulk loading in the book; I am going to skip them. One more topic is multiple-key access. I kind of hinted at this earlier: I said that if I have a search tree on department name, ID, I can handle queries which say department name equal to CS by turning them into a range query on department name, ID. This idea is explored a little more in the section on multiple-key access.
Again, for lack of time I am going to skip it, but I will just mention that a composite search key can have multiple parts, such as department name, ID or department name, salary, and a composite key is ordered lexicographically. What does that mean? It is the standard dictionary ordering: (a1, a2) is less than (b1, b2) if either a1 < b1, or a1 = b1 and a2 < b2. I am going really fast here; I am sure I am losing those of you who do not already know this stuff. For those who know it, this is just a very quick refresher. If you are not familiar with it, my goal is to tell you that some of these topics exist; we do not have time here, so please go read them later. Indices on multiple attributes are the same idea. Index definition in SQL gives the syntax for creating indices; we will actually be using this in some of our lab exercises tomorrow. In SQL, it is not part of the standard, but everyone supports it: you say create index, then the index name, on the relation name, and give a list of attributes which form the search key. There is also some syntax for create unique index, which I am going to skip; it basically enforces that a particular search key is unique. And finally, I can drop indices. Before I take the next slide, I should also mention that every database automatically creates certain indices. If you declare an attribute as a primary key, an index will be created on the primary key. Why is that? Because the primary key value has to be unique, so when you insert a new record, the database has to search and see if that key value is already there. To efficiently find out whether the primary key value already exists, it needs an index. So, the moment you declare an attribute as a primary key, the database will create an index on the primary key attribute.
Many databases will also create indices on foreign key attributes, so that they can efficiently enforce referential integrity constraints. So, primary and foreign key attributes are automatically indexed; other indices you can create depending on the needs of your queries. To wrap up, this is back to something I mentioned at the beginning: B plus trees are very good for the most common kinds of queries, but there are a few other kinds of queries for which other indices are useful. Bitmap indices are useful for certain kinds of queries; R-trees are useful for certain other kinds. What are these? You can go read them up later. I will stop at this point and take questions. If you have questions, please indicate so on Aview. Anna University, I am selecting you now. In the case of deletion in a B plus tree, while deleting Gold, we deleted the value from the leaf node, but not from the non-leaf node. Is that okay? Thank you. So, that was a good question. The question is: when we deleted from a B plus tree in our example, we deleted a value from the leaf node, but that value somehow continues to exist in a non-leaf node. Is that okay? It seems a little disturbing that the value is still there; there is a ghost sitting in the internal nodes while it has been removed from the leaf level, but it really does not matter. The point of the values in the internal nodes is to guide you to a leaf node. The fact that a value is present in an internal node does not mean it is actually there at the leaf. What actually matters from the viewpoint of returning answers is the leaf, so there is no problem leaving the value around. It is possible to go back and replace that value by a value which is currently in the leaf level, but it is not required, so why bother? Back to you if you have a follow-up question. Okay.
The question was: with respect to the B plus tree file organization, what did that figure mean? I apologize for going very fast over it; I probably did not explain clearly what it means, so let me answer that question by using this slide again. What we have in a B plus tree file organization is actual records at the leaf level; that is the first thing. Now, specifically in the example I had here, I did not use the original records from our schema because they are too big: if I want to show four such records in a page, a page meaning a B plus tree leaf node, the node is going to look very big in the book diagram. So, what we did is use much smaller records, so that we could show the effect of the B plus tree file organization while keeping the diagram small. The record schema we chose had two attributes: the first attribute is a single letter, and the second is a single digit, to keep the records small. So, what do we have at the leaf level? The leaf level here is the six nodes, three plus three, which are really all at the same level; because the page is not wide enough, they are shown staggered like that. So, all six nodes are at the same level, and all of them contain actual records: A 4 is the first record, B 8 is the second record, C 1 is the third record, and so on. So, the leaf level contains the records. Note also that the records are sorted on the search key. In this B plus tree file organization, the search key is the first attribute, so the records are sorted on the first attribute. That is why we have A, B, C, D, E and so on. That is what this picture was supposed to show. The internal nodes contain search key values; they do not contain whole records. As a result, the internal nodes have entries like C, F, I and so on, which are values of the first attribute, the search key.
And the point was, in contrast to a regular B plus tree index, which does not store records but rather a search key value and a pointer to the record, here the records are physically stored in the leaf itself. There is no separate record to be pointed to; it is right there in the leaf. That is the difference in a B plus tree file organization, and the benefit is that it clusters things together. Now, a related question, which somebody has asked over chat, so let me take that question along with this one. In B plus trees, if we lose sequential order for leaf nodes as we do such insertions, at what point in time does the index file get updated? This is relevant also for a B plus tree file organization, where the sort order is preserved within a node, but as you insert nodes, the different leaf nodes are no longer sequentially arranged on disk; they may be scattered around the disk. So, what do you do about this? Should you periodically go and sort the whole leaf level? The answer is that periodic sorting is exactly what B plus trees were designed to avoid in the first place. A little bit of non-sequential access is okay for regular B plus trees, because for most relations quite a bit of the leaf level will fit in memory, so a little non-sequential access is not the end of the world. But still, we would like it to be ordered, and the way that is done is by allocating many leaf-level nodes together in one block. There are a bunch of tricks which let us move things around within a very big block. So, think of a super block which contains, let us say, 128 leaf-level nodes. Each leaf-level node itself is fairly big, maybe 16 kilobytes, and now 128 of these are in one block, consecutively allocated on disk. When we split and merge nodes, we actually move things around within this block, so the nodes may not be exactly sequential within the block, but they are going to be nearby, in the same block. 
These are used in real implementations, so that when I read the leaf B plus tree nodes, what happens is that when I fetch from disk, the disk does not read just 16 kilobytes; it will actually fetch, say, a megabyte of data in one go, and this super block would be, let us say, a megabyte. That is basically what ensures that even if things are slightly out of order, they would anyway have been read into memory in one go; they are all available right now and will get used without further IO operations. I do not have time to get into further details, but this kind of thing is done in real implementations to ensure that we do not have too much random IO as we go along consecutive leaf nodes. If the main operation is searching from the root to the leaf for individual key values, this really does not matter; we do not care if adjacent leaves are here and there in the file, it is irrelevant. It matters only if we have operations which actually scan the leaf level, and that particularly matters for B plus tree file organization. That is why I took that question here. So, back to you if you have a follow-up question. The question is, if the structure of a leaf and a non-leaf node is the same, how do you distinguish between leaf nodes and non-leaf nodes? How do you know you have reached the leaf level? The answer is, well, the basic structure is the same, but in any actual implementation you would have one extra tag which indicates whether this is a leaf node or not. Therefore, you know whether you have hit the bottom or have to continue searching further. We left it out of the discussion, but it would be required; otherwise, you do not know whether you are at a leaf or not. So, that is a good question, and the answer is you need to maintain this information. Remember that as the tree grows and shrinks, a leaf node stays a leaf node; it never becomes an internal node, and vice versa. 
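In code, that extra tag might look like the following. This is a minimal sketch with hypothetical class and field names, not taken from any particular implementation.

```python
import bisect

class Node:
    """Sketch of a B+ tree node; is_leaf is the extra tag discussed above."""
    def __init__(self, is_leaf, keys, children=None):
        self.is_leaf = is_leaf
        self.keys = keys
        self.children = children or []   # child pointers, internal nodes only

def find_leaf(node, key):
    # Descend until the tag says we have reached the leaf level.
    while not node.is_leaf:
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    return node

# A tiny two-level tree to exercise the descent.
left = Node(True, ["Brandt", "Einstein"])
right = Node(True, ["Mozart", "Singh"])
root = Node(False, ["Mozart"], [left, right])
```

Without the tag, `find_leaf` would have no way to know when to stop descending, which is exactly the point made in the answer.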
Back to you if you have any follow-up question. Thank you very much. Let me take a few questions which came on chat before I open up the video mode again. The first question is, can you differentiate dynamic SQL and embedded SQL in the implementation part? The answer is that embedded SQL is much like dynamic SQL, except that there is a pre-processor which understands the queries and can do some type checking and other work at compile time rather than doing it purely at run time. So, if you use JDBC and make a mistake in the query, the compiler does not catch it; only when you run the query does it fail, and then you get an error message. It would be a lot better to have something which can pre-compile the program and say, hey, you used an attribute XY in this query and there is no such attribute for that relation; check the query. This kind of type checking ahead of time is done by embedded SQL, and in order to allow embedded SQL pre-compilers to do it, you have to have a specific syntax whereby they can find out that this is a query and that is not. Doing this is much harder if you construct strings and then pass them to JDBC. That is the basic difference between embedded and dynamic SQL. Now, JDBC is the dynamic part of Java-to-SQL access. There is something called SQLJ, which is the standard for embedded SQL in Java; unfortunately it is not very widely supported, but a few databases do support it. I think DB2 supports it. Oracle used to support it, then they withdrew it, then people protested and they reintroduced it. So, it is a little messy, but there is a standard for embedding SQL in Java. Embedded SQL in COBOL and C, however, has a very long history, and it is widely used in older, legacy applications. The next question is, what is your opinion about NoSQL databases? Is it something that will be covered? For large-scale data processing, more and more people are using NoSQL databases. 
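Returning for a moment to the embedded versus dynamic SQL question: the failure mode of dynamic SQL can be seen concretely with Python's built-in sqlite3 module standing in for JDBC (a sketch of the dynamic side only, since embedded SQL needs a pre-compiler). The host-language compiler accepts the misspelled attribute without complaint, and the error surfaces only when the query is executed, which is exactly what an embedded SQL pre-compiler would catch ahead of time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (ID INTEGER, name TEXT)")

# To the host language this query is just a string, so nothing notices that
# 'naem' is not an attribute of instructor until the database parses it.
try:
    conn.execute("SELECT naem FROM instructor")
except sqlite3.OperationalError as err:
    print("caught only at run time:", err)
```

An embedded SQL pre-compiler can do the check early precisely because the query is written in a distinguished syntax rather than assembled as an ordinary string.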
This question is part of a trend I see where a lot of people perhaps came to this workshop assuming it is an advanced-topics workshop, and that is what you are looking for. Whereas the goal of the program under which this is organized was to make sure that faculty are totally conversant with the basics and are well trained in what to teach, how to teach, how to conduct labs and so on. So, the way I designed the course was based on this understanding. However, I can fully appreciate that a lot of you want to know the latest stuff that is going on, so I have squeezed in one day, the last day, for these topics. As I already said, I will be covering the new generation of parallel databases, and this NoSQL business is also related to that. A bunch of people have been building data storage systems which can be accessed using an API without SQL, and a lot of people have been saying they are faster and useful. But other people, who have actually used databases for a long time, point out that there are many problems with these systems; they are like object-oriented databases were 20 years ago, and they will fail for the same reasons. So, there are a lot of arguments happening between members of the two camps; it is almost at a religious level. As the CTO of Teradata was saying recently: yes, there are certain applications which can benefit from accessing data without ever using SQL, but others need SQL with relational data. So, he says NoSQL does not stand for "no SQL"; it stands for "not only SQL", that is, SQL but other things as well. We will talk about some of these ways of accessing data which do not use SQL on the last day. And indeed, they are used at really large scale, thousands or tens of thousands of nodes operating in parallel, to supply data for applications such as web mail, Facebook and a variety of other such sites. The next question is, does PostgreSQL support spatial databases? 
Can you cover spatial databases in the workshop? Spatial databases are a fairly big topic. PostgreSQL has some support for spatial databases; it includes an R tree index. I mentioned earlier today that for indexing spatial data you have an index type called the R tree, and PostgreSQL supports it natively: you can say create an R tree index on these attributes. It also has a few other features designed to support spatial databases. In fact, there is a piece of software called PostGIS which adds more spatial features on top of PostgreSQL. However, we have used PostGIS in certain other workshops we conducted, and people found that it is a nice academic system: it helps you learn the concepts, but it is very slow on certain real tasks. In contrast, many database vendors have spatial extensions, to Oracle, to DB2 and so forth, which probably do the same things in a more efficient manner. In addition, there are vendors of GIS systems which deal only with spatial data; their goal is to store spatial data, maps and so forth, and they do not care about any other kind of data. So, there is a whole class of geographical information systems which deal with that. There are different things on the market, and I am not going to cover any of them here, because I would not say that any of the spatial extensions is very widely used; they are still in their infancy in some sense. If you are interested, there is a little bit of technical material on R trees in the book; beyond that, for spatial data processing, you can look up PostGIS to get an idea of some of the issues, even though, like I said, the implementation is not very efficient. The next question is, is it compulsory that the leaf level in a B plus tree is always dense? That is again a good question. If the relation is stored sorted on a particular attribute, then the leaf level need not be dense; it could be sparse. You can always follow the pointer and then go in and find the record. For simplicity, we have assumed it is dense. 
If you look at it another way, what is the B plus tree file organization? Its leaf level is the sorted records themselves. If you look at the next level up in a B plus tree file organization, from there on up it is a B plus tree, except that its bottom level (the level just above the records) is sparse. In fact, a B plus tree file organization can alternatively be thought of as sorted data with a sparse B plus tree on top. That hopefully answers your question. This was from Rajarambapu Institute, Islampur, who also asked the other question: if we delete an entry from the leaf level, does the entry need to be deleted from the non-leaf levels? I think I answered that question, which was asked from Anna University. The next question is from NIT Warangal. In B plus trees, inserts and deletes can be performed on exact strings; what if you want to search for strings matching a pattern? This is an interesting question, and I can address it in two ways. One is, if I am storing strings in a B plus tree, or in a database in general, and I want to find all strings which match a particular pattern, the simplest kind of pattern uses like. Say I ask, find all names like 'ABC%'. How do I answer that? It turns out this particular pattern is very easy to deal with, because 'ABC%' fixes a prefix of the string: all names of the form ABC-something will be consecutive in sorted order. So, I look for the first name which is equal to or greater than ABC; that will be the first one. Then I keep going as long as I keep finding names which start with ABC, and those are the ones I want. So, if I have a percent at the end but the first part is a fixed string, that is very easy. The hard part is if I have a percent at the beginning. For example, if I want to find all names that contain, let us say, Phatak somewhere in the name, then I will say like '%Phatak%'. Now, there is a problem. 
The names which contain Phatak somewhere in the middle are not going to be consecutive. Now, there are two answers to this problem. One is, if I am looking within a full name, I can break up a name into parts, first name, middle names, last name, and build an index on all the words in the name. This is what Google and other search engines do: they take a document, break it into words and build an index on the words, so I can find all documents that contain a given word. What Google and other search engines do not do is let you find all documents that contain Phatak anywhere inside a word. So, if I have a word which reads Phatakxyz, that will not be retrieved by a query for Phatak. The like operator, on the other hand, has to retrieve it: Phatakxyz does match 'Phatak%' and '%Phatak%'. So, people have come up with extensions of B plus trees which support such things; it turns out there are such structures, called suffix trees. They are actually quite expensive, with a lot of space overhead, so there has been a fair amount of research on supporting like queries efficiently. People have built this into specialized applications where they badly need like queries, but most databases do not support it as a built-in feature. So, if you run a simple like query such as like 'Phatak%', they will do a good job. If you say like '%Phatak%', they will abandon the index and just go through the whole relation, take each name and match it against '%Phatak%'; indexes do not get used. I hope that answered the query. In fact, the question goes on to say patterns, i.e., the like operator, so that was exactly the question. I think I will stop there and take maybe one more question from somewhere. Let us see if anybody is ready with a question. Walchand College also has its flag up. 
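Before moving on, the easy case of the like discussion, a fixed prefix followed by percent, can be sketched over a sorted leaf level. This is a toy Python sketch with a hypothetical name list; a real index would seek in the tree rather than in a flat list, but the one-seek-plus-scan idea is the same.

```python
import bisect

# Sketch: the sorted leaf level of the index, flattened into one list.
names = ["Brandt", "Einstein", "El Said", "Gold", "Katz",
         "Kim", "Mozart", "Singh", "Srinivasan", "Wu"]

def like_prefix(prefix):
    """Answer  name LIKE 'prefix%'  with one seek plus a sequential scan."""
    start = bisect.bisect_left(names, prefix)   # first name >= prefix
    out = []
    for n in names[start:]:
        if not n.startswith(prefix):            # past the consecutive matching run
            break
        out.append(n)
    return out

print(like_prefix("K"))   # the matching run: ['Katz', 'Kim']
```

A leading percent breaks this scheme, since matches like '%Phatak%' are scattered throughout the sorted order, which is why such queries fall back to a full scan.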
Are you ready with the question? You are on now. Sir, if you are going to use a sparse index, how do we fix the block size, or what criteria do we use for the size of a block? To answer that question: a sparse index is a concept which can be implemented in different ways. As I explained, sparse indices are particularly useful when you have multi-level indices, and a B plus tree is a multi-level sparse index. So, what we did with B plus trees exactly answers your question about sparse indices: we choose a fixed node size and we have one entry per block. That is a fairly standard way of handling sparse indices; you fix a block size and you have one entry per block. Now, how do you fix that block size? That is another good question. Should the block size of a B plus tree be 4 kilobytes? Should it be 16 kilobytes? Should it be 1 megabyte? There are trade-offs here. One clever way of addressing this trade-off is to have a block size of, say, 16 or 32 kilobytes and then have the concept of a super block which holds multiple blocks together. I do not have the time to get into the details, but there are some clever tricks to get the benefit of a larger block size and the benefit of a smaller block size at the same time. If you are interested, post a note and I will point you to research papers which deal with this. So, if there is any follow-up question, please go ahead. No sir, thank you. Okay, thanks.