Let us do a very brief refresher on indexing. First of all, what is a sparse index? As shown by this example, I have data which is stored sorted on the ID here, and this is an index which helps me quickly find an ID that I want. The index does not have an entry per ID, otherwise it would be very big; it has one entry per some number of IDs. So if I want to look up ID 45565, I will search here: is it greater than 101? Well, this index starts with the lowest value, but I could even not store this value and it would be fine. Is it greater than 32343? If not, follow the previous pointer; otherwise check if it is greater than 76766. The value I am looking for is 45565, so it is greater than this but less than this, therefore I will follow this pointer and then search sequentially here, and I will reach it pretty fast. This is the basic idea of an index, a single-level index. Of course database data is very big, and a single-level index is not sufficient, so what is actually used is a multi-level index. The way it is organized is that data is stored in blocks here, sorted on whatever key the index is built on, and what is stored here I will come to in a little bit; it could be the actual record, or it could be a pointer to the record. But regardless of what is stored here, if I build an index on a particular attribute or set of attributes, you sort the data on that key and put it in blocks, and then with the pointers here, I start with the outer index, find the appropriate pointer, go in there, within that apply the same algorithm to find the data block, and then in the data block I will find the key that I want. This is conceptually what tree indices do in general. There are many, many types of tree indices, but all of them use basically the same concept. Each level of the tree index is a sparse index on the lower level: this level is a sparse index on this, and this level is a sparse index on this. Why sparse?
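The single-level sparse index lookup just described can be sketched in a few lines. This is a toy illustration only: the keys are made up, and in-memory lists stand in for the sorted disk blocks; a real system would of course work with pages on disk.

```python
import bisect

# Hypothetical single-level sparse index: one (smallest key, block number)
# entry per data block, with blocks sorted on the key.
index = [(101, 0), (32343, 1), (76766, 2)]
blocks = [
    [101, 2040, 30000],      # block 0
    [32343, 45565, 60000],   # block 1
    [76766, 80000, 98553],   # block 2
]

def lookup(key):
    # Find the last index entry whose first key is <= the search key,
    # then scan that block sequentially, as in the lecture example.
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None
    for k in blocks[index[pos][1]]:
        if k == key:
            return k
    return None

print(lookup(45565))  # → 45565, found by scanning block 1
```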
If you see, for maybe 100 or 200 entries here I have just one pointer here. Now each of these blocks itself will have several hundred entries, but there is just one pointer here for several hundred entries, so it becomes smaller; each level becomes something like a hundred times smaller than the previous level. So with a few levels I can quickly find what I want. Now, each level here is potentially a new disk seek, because each of these blocks may be at some different location on disk, so if I want to fetch a block I have to do a disk seek. On a hard disk how long will this take? Well, if it is already in the in-memory buffer it may be fast, but if it is not, it might take something like 5 to 10 milliseconds. So if I have a disk which takes say 5 milliseconds per request, how many requests can it satisfy in one second? 200 requests. That is the limit for one disk. Now for each level in this index I may have to do a disk seek. So if I have an index with 10 levels, each lookup will take 50 milliseconds, and instead of 200 lookups per second I can only do 20. So this tree has to be short; an index like this is always short. The second thing is, if you keep inserting and deleting data, how do you maintain this index? What if too many records are deleted from this block and it has very few records, what do you do? You need to maintain this index, and the key idea of B-trees, which were proposed in 1972, a long time back now, 40-plus years, is that they adapt bit by bit: each time you do an insertion or deletion, either no change happens or a small change happens, but you never have to go and reorganize the whole index at once. That is the key idea.
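The back-of-envelope arithmetic above, as a quick check (the 5 ms per random seek is the assumed figure from the lecture, not a measurement):

```python
# Assumed figure from the lecture: one random disk seek costs about 5 ms.
seek_ms = 5
print(1000 // seek_ms)             # → 200 lookups/sec if each lookup needs one seek
levels = 10
print(1000 // (levels * seek_ms))  # → 20 lookups/sec if all 10 levels need a seek
```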
How many of you do not know about B-trees? Very few of you, so since there are very few I am not going to cover the basics of B plus trees here, but I will cover some more advanced topics; all of you know this is a B plus tree. There are some examples of B plus trees in the book; we are going to skip all that, and I just want to say a little bit here. Actually this slide is outdated; I am going to replace it for the main workshop with newer slides. The key point I want to make is that if you have a unique key, that is, there is only one record with a particular key, not more than one, it is very easy to understand how to use a B plus tree. All the standard B plus tree algorithms make this assumption to keep life simple; if you see any textbook description of a B plus tree, typically it will assume the key is unique. In fact keys need not be unique. Why? If it is an index on a primary key, yes, it is unique, no problem, but many times I want to index on something which is not a primary key. If I am a bank or a university and I hear that so-and-so student has done something, I want to find out who this student is; I do not know their roll number, so I want an index on name, so I can look up the student by name. Now there will be many students with one name; let us say I index on first name or last name separately, there are many people with the same first name, many with the same last name. So now the key is no longer unique, there are many records with the same key, and I need to store pointers to all of them. What to do about this? This is an issue which any real system has to address, and by and large textbooks do not show this, and we also did the same for a long time; in the sixth edition we decided to address this issue and say how you handle duplicates, and in the hurry to put it out we goofed up, so our algorithm for handling duplicates actually has bugs. We have corrected it in the errata;
we realized it soon after it was published. So if you read the sixth edition and read the part on B plus trees, please read it in two ways: one, assume there are no duplicates, and then there is no problem, everything works. We do talk about duplicates, but there is a bug in the way we have described it, so for the case with duplicates please look up the errata. Now it is possible that you have a more recent printing where the errata has been incorporated; we did give it to the publishers, but it is even more likely you have an older copy. I do not know whether the current printing has been fixed, but in all the older printings this bug is there, so please look up the errata page. Okay, so that of course leaves the question of how you deal with duplicates. There are several options; oh okay, this slide actually does have the errata note, I thought this was an older slide, but it does have it. I do not have time to get into this now, but I will just mention that you should go read this up to see how to deal with duplicates, if you want. In fact there is another solution which is also very widely used in the industry, which is: get rid of duplicates. How do you get rid of duplicates? I am going to tell you that my B plus tree implementation will not handle any key with duplicates; at the same time, you want to be able to do lookups based on names, which might have duplicates. Can we do something about this? The basic idea is that B plus trees luckily support range queries. So what I will do is, in the B plus tree I will create a key containing the name followed by the roll number. Name has duplicates, roll number does not, and name followed by roll number does not have duplicates. Now my B plus tree code can handle this because the key is unique; I do not have to deal with duplicates in my B plus tree code. But now what about my original problem? I wanted to find students with a given name. How do I do it? I can do a prefix query, or equivalently a range query. I can say
give me all keys in a range: the key is name plus roll number, so from the name followed by the minimum roll number, let us say 000, up to the name followed by the maximum roll number. That is the range I am looking for; find me all things in this range, and that will give me all the matching records. So when I insert, there are no duplicates; when I do a single lookup based on name plus roll number, there are no duplicates; and when I get a range query, that is not a problem, since the standard B plus tree code handles range queries and gives me back a set of records. Is this clear? I do not want duplicates, so I get rid of duplicates by adding something unique, a roll number, or it can be a record ID or anything else, and then I do a range lookup. This is a standard trick which many databases use. Why do they do it? Are they too lazy to go implement non-unique search keys on B plus trees? It turns out there are other good reasons for doing this, and the book talks about some of them: in particular, efficiency of deletes is better if you make the key unique, concurrency control techniques are a little easier to deal with, and even the code for handling B plus tree splits and so on is easier if you make the unique assumption. We have discussed this in the new edition of the book; in the old editions it was not there. Now there are several implementations, like I said: one is to actually keep duplicate keys, or to keep the key once but with that key keep a set of record pointers to the many records, or to make the key unique. So that is where we left off. Then the next topic is: what should the index contain? Should it contain a pointer to a record which is stored somewhere in the database, or should it contain the record itself, the contents of the record?
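The name-plus-roll-number trick above can be sketched as follows. A sorted list stands in for the unique-key B plus tree, the student names and roll numbers are made up, and the `"\uffff"` sentinel is a simplification of "maximum roll number":

```python
import bisect

# Stand-in for a unique-key B+-tree: a sorted list of (name, roll_no) keys.
# Appending the unique roll number makes every key unique, so the "tree"
# never has to handle duplicates.
tree = []

def insert(name, roll_no):
    bisect.insort(tree, (name, roll_no))

def lookup_by_name(name):
    # Range query: from (name, minimum roll number) up to just past
    # (name, maximum roll number); "\uffff" is a toy max-value sentinel.
    lo = bisect.bisect_left(tree, (name, ""))
    hi = bisect.bisect_right(tree, (name, "\uffff"))
    return tree[lo:hi]

insert("Srinivasan", "10101")
insert("Kim", "34562")
insert("Srinivasan", "98765")
print(lookup_by_name("Srinivasan"))
# → [('Srinivasan', '10101'), ('Srinivasan', '98765')]
```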
If you had asked this question say 15 or 20 years back, pretty much all databases would store the record in some place: when the record is created in the system it goes to some page and it lives there for the rest of its life, unless for some reason the record increases in size, there is no space in the page, and it has to be moved. Moving of records was a rare occurrence; they land up at one place and live the rest of their life there. This is like being in a government job: get a government job, be there till you retire. But today's world is a little bit different, people keep moving across jobs, and it is no longer viewed as important that a record sits in one place for its whole life. So why does this matter? If you go to an old database and say that I want the data sorted in some order, I want to fetch all records in a certain key range, I can build a secondary index on some keys, but the records will be scattered all over; if I want all the records in a range they will be scattered all over, and it is very inefficient to jump around on disk and fetch data. There are a lot of examples where you can see a huge performance difference because of this. We had an index which would index records on keywords, and the records themselves were not organized properly on keywords. I won't get into details, but we had a table which had a keyword-to-record mapping, and that table was indexed on keyword, but the table was not sorted on keyword initially. So to fetch all the record IDs, which were scattered all over, would take a very long time. Then we told the system, in this case PostgreSQL, to sort the relation, and once it did that, suddenly if you wanted to find all the records with a keyword they were all together: what took 20 seconds would now finish in half a second. It is that kind of huge difference. So in many cases you do want data to be sorted, so that you can fetch many records which are co-located.
So old database systems would allow you to take a relation and say sort this relation now, and build an index after sorting it, but you could only sort a relation in one way; if I want two different sortings, tough luck. In PostgreSQL, as of at least 9.1 (they may change it in future), I could say cluster the relation: I sort it on something, and if I build an index on that attribute on which it is sorted, it is very efficient to find all records in a certain range of key values. But supposing I want the same relation indexed in this way on two different attributes, what to do? There are actually two possible solutions. One solution was to say: look, I will take the index, which is on, let us say, word, but I will create an index not just on word, but on (word, the extra data which I want). In this case I wanted record ID, which is one of the fields of the relation, so I build an index on (word, record ID). Now when I go through this index, what happens? I will find all the (word, record ID) pairs in the index, and I may escape having to go to the actual relation at all; the index will also have a pointer to the relation, but I need not go there because I already have the data in the index. This is called index-only access. Index-only access is very useful because it can get records which are clustered even though the relation may not be sorted on that attribute. So what about PostgreSQL, does it support this kind of thing? It turns out PostgreSQL has a multi-version concurrency control mechanism which interacts a little badly with this. There may be two versions of a record, and both those versions are there in the index; one of the versions is old, one is new. Looking at the index I cannot tell which is the version that I am looking for; I have to go to the record. So there is a slight catch, and PostgreSQL did not have this earlier.
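The index-only access idea can be sketched like this; a sorted Python list of made-up (word, record ID) entries stands in for the covering index, and the point is simply that the answer comes entirely from index entries, without touching the relation:

```python
# Hypothetical covering index on (word, record_id): the entries themselves
# carry the record IDs, so a lookup on word never has to fetch the
# underlying relation (index-only access).
# Entries are kept sorted on (word, record_id), as a B+-tree would keep them.
index_entries = sorted([
    ("database", 7), ("database", 12), ("index", 3), ("query", 7),
])

def record_ids_for(word):
    # Scan the contiguous run of entries for this word; the matching
    # record IDs are co-located in the index, no heap access needed.
    return [rid for w, rid in index_entries if w == word]

print(record_ids_for("database"))  # → [7, 12]
```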
I think in more recent versions people have been working on some very clever hacks to avoid the lookup for most records, to pay the extra lookup only for some records, not for all. They are working on that, and there are some nice ideas there. But other database systems did not have multiple versions; it was a lot easier for them, and so they could provide indices where you can add extra attributes. These are called covering indices, and they enable index-only access. Many databases supported it, and PostgreSQL has started supporting parts of it now. That is one answer. The other answer is: what about storing the records themselves in the B plus tree? Why store a pointer to a record? Why not store the whole record in the B plus tree? This is called a B plus tree file organization. The records do not live outside of the B plus tree; they live only in the B plus tree. That is where they live. In the earlier systems the record would be put in some page when it is inserted, and it lives there; now it lives in the B plus tree. Now all of you who are familiar with B plus trees know that there is a difference: B plus trees are dynamic, pages get split, records move. Earlier there was a notion of a record ID, and a typical record ID was like a page number: which page in the file, and inside that page which record, and the record never moves, so the record ID never changes for a record. Now when pages split in a B plus tree, the record moves. So what is the record ID? And why do I want a record ID? If I have another index on that same relation, it needs to store some kind of a pointer to the record. That pointer used to be the record ID, the physical record ID, the location in some sense of that record. But now you cannot do that, because every time a page splits, half the records in that page move, and if I have to go and update all the other indices that point to those records, it becomes very, very expensive.
One page split can turn into 50 different page accesses on other indices, each of which has to be updated. So it becomes very slow, and this is one reason people initially did not implement the B plus tree file organization. But there is a simple solution to this problem, which eventually many people implemented. The idea is that among the indices, there is one index which is the primary index. The primary index does not have to be on the primary key, it can be on any key; the relation is clustered on that, and the records are stored in a B plus tree file organization, with the record stored right in the leaves of the B plus tree. And now there are other, secondary indices on other attributes. Those indices no longer store a record ID; instead they store the primary index search key. And that has to be made unique, of course, so some form of unique ID for the record, such as the primary key, also has to be part of it. So if the relation is, let us say, sorted on some key C, you will take the primary key of that relation, or some other unique key, and append it to C. So the relation is sorted on C, but you still have a unique ID for each record. And so when a split happens here, nothing happens to the secondary indices. Why? Because they stored the key value, and when a split happens, the key does not change. And normally you will use something which does not change, typically; that is what you would like to use. Because if that is updated, then you have to change it in all the indices, but that is part of life: if you update a record, you have to go change all indices where the record occurs, so that is okay. But when you split a leaf, you do not want to go and update all the other indices; you would suddenly touch hundreds of records and have to do hundreds of updates. So this takes care of that. Now what is the cost of this? When you use the secondary index, you traverse down the index, and you do not find a pointer to the record. You find a key.
You have to take this key, come back to the primary index, and then search down again. So instead of one B plus tree traversal, you have turned it into two B plus tree traversals. That is the price you pay for it. But it has been very useful for many applications, so one after another, all the commercial databases supported this, about 15 to 20 years back. I mean, the idea is very old, but they had not implemented it for many years. Any questions? This is regarding something from 1995 itself: we started using tools called defragmentation tools. Is there any relationship between these two ideas? In some sense, yes. So disk defragmenters, what do they do? If you have a file which is fragmented, that means pieces of the file reside here on disk, the next piece there, the next piece somewhere else, and when you try to read that file, it takes a long time. This particularly starts happening if your disk is full: any new files you create on a disk which gets over, say, 90% full tend to be fragmented. So the way to improve performance is to reorganize all the files on disk such that all the blocks of a file are sequential on disk; then when you read them, you do not have to move the disk arm around, which is very expensive. So disk defragmentation essentially sorts the blocks of the file, sorts meaning the physical layout is in the order that the blocks appear in the file. Clustering a relation, like I told you, is kind of like this, but at a logical level. At the physical level, how the system manages it is up to the system, but systems are usually good at this: if you sort a file, database systems will make efforts to keep the blocks consecutive. There are some nice tricks which they use. They cannot guarantee the blocks are consecutive on disk, but they do a best effort so that you minimize the amount of disk movement to scan a particular file. One more question.
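The two-traversal scheme described above can be sketched as follows. Plain dicts stand in for the two B plus trees, and all the names and values are hypothetical; the point is only that the secondary index holds primary-index search keys, not physical locations:

```python
# Primary index: clustered on roll_no; records live "in the leaves".
primary = {
    "10101": {"roll_no": "10101", "name": "Srinivasan", "dept": "Comp. Sci."},
    "34562": {"roll_no": "34562", "name": "Kim", "dept": "Physics"},
}

# Secondary index on dept: maps dept -> primary-index search keys,
# NOT physical record locations, so leaf splits in the primary
# B+-tree never require updating this index.
secondary = {"Comp. Sci.": ["10101"], "Physics": ["34562"]}

def lookup_by_dept(dept):
    # Traversal 1: the secondary index yields primary keys.
    # Traversal 2: each primary key is looked up in the primary index.
    return [primary[pk] for pk in secondary.get(dept, [])]

print(lookup_by_dept("Physics"))
```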
This is related to that composite key. We use composite keys in relational database management systems; it means that we treat more than one field's values together as the key. Yes, a composite key is one which has multiple fields put together, and you use the normal sorting order: first sort on the first field, then on the second, and then on the third. They are very useful for many things, so it is common to build indices on a composite key. And then you can actually do an index lookup on a prefix of the key, if you wish. That is what we did, actually: the thing I told you, where first you have some key which may have duplicates, and then the primary key, that is a composite key, because it has two attributes. And you can do a lookup on a prefix, which will return possibly many records. In the B plus tree file organization, would the complete record be copied into the index file, or some partial record, typically the columns which are frequently accessed? Well, if you store only some columns, then you have to store the entire record somewhere else; that is a regular index with extra attributes, and that is called a covering index, since the record itself is stored somewhere else. But when you store the record itself in the leaf of the B plus tree, then it is a B plus tree file organization. Sir, DBMSs have a three-level architecture, and at the physical level certain storage models are defined; how are they related with this indexing part at the physical level? So first of all, the physical level, view level, logical level, whatever: those are all very high-level concepts which don't have a direct bearing on this. That is something which we have already dealt with. All of this is at the physical level; relations are the logical level. We have moved from relational database design to storage and indexing and so on, which are all purely at the physical level.
So the other levels are irrelevant to us right now. When you're doing database design, you need to worry about the other levels, but not so much now. Although in some cases, depending on how things are accessed at higher levels, you may choose appropriate physical structures below. What do I mean by this? Supposing you have a lot of queries which retrieve a record based on some key; then you should build an index on that key. So what you do here depends on what queries are being executed on the system. The queries don't know about the physical layer; they are working at the logical level. But your physical design is generally optimized for the workload which you find on the system. That choice of what indices to create, and, as I mentioned earlier, what materialized views to create and so on, is part of the physical design of the database, and it's an important part of tuning a system. We don't have time to get into it. Sir, from which interface can we set up this type of indexing, B-trees, or change the structure of the organization of the files in the DBMS? Yeah, most databases will provide SQL extensions. The SQL standard tries to stay out of this, but most databases provide a fairly standard syntax for create index. Now the create index command can take a number of extra arguments which are database specific. There you can say you want a B plus tree index, you want a file organization, you want it this way or that way. There are many options which you can specify: PostgreSQL has many such options, Oracle has its own set of options. That part is database specific, but you can do all of this from the SQL interface. In a B plus tree file organization, you said that entire records will be stored. What if records are of variable length? How will node size be defined initially? That's a good question: what if the record is variable length, how is it stored?
So as long as the fields of the record are relatively small, and the whole record is smaller than a page, you can store records of variable length. If you want to know how the records are physically stored, chapter 10 has a description; let me just briefly show that. This is an example of a variable-length record. There are four fields here: the ID, the name, the department, and the salary. If you see here, the first three are variable-length fields; ID can be variable length, let's say. If it were fixed length, it could be stored inline here, but being variable length, it goes there. So what the header stores is: where does the field start in this array, and how long is it? The first field is the ID, so the header says (21, 5), and there is a data dictionary which says that this is a variable-length field, so I interpret these first few bytes as an offset plus a length. Now I know the field starts at 21 and has length 5, and I read those 5 bytes to get the ID. The next field is the name, so I see (26, 10); I go to 26 and read 10 characters, and get Srinivasan. The next one is (36, 10), so I go to position 36, read 10 characters, and get the department, Comp. Sci. The next one is 65,000; it's fixed length, so it is inline here. And then there's one extra thing, which is the null bitmap. If any of these fields is null, you could actually store a zero offset and zero length, but instead, if a field is null, you store a bit in the bitmap here. That's how variable-length records are represented. When you fetch such a record, you have to interpret this structure and then extract the fields that you want. And in turn, this is one single record; now how do you store these records in a page? You have variable-length records in a page. If it's all fixed length, you divide the page into equal-size pieces, and it's very easy. If it is variable size, typically what is done is called a slotted page architecture. You store all the records at, let's say, one end of the page.
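Before going further into pages, the field layout just described can be sketched with a small encoder and decoder. This is a simplified sketch: the header holds one (offset, length) pair per variable-length field and the fixed-length salary inline, and the null bitmap from the description is omitted for brevity. The field names and widths are just illustrative:

```python
import struct

def encode_record(rid, name, dept, salary):
    # Header: three 2-byte (offset, length) pairs for the variable-length
    # fields, then the fixed-length 4-byte salary inline; the variable-length
    # bytes follow the header, as in the layout described above.
    var_fields = [rid.encode(), name.encode(), dept.encode()]
    header_size = 3 * 4 + 4
    parts, offset = [], header_size
    for f in var_fields:
        parts.append(struct.pack("<HH", offset, len(f)))
        offset += len(f)
    parts.append(struct.pack("<i", salary))
    return b"".join(parts) + b"".join(var_fields)

def decode_field(record, field_no):
    # Read the (offset, length) pair for field field_no, then slice the bytes.
    off, length = struct.unpack_from("<HH", record, field_no * 4)
    return record[off:off + length].decode()

rec = encode_record("10101", "Srinivasan", "Comp. Sci.", 65000)
print(decode_field(rec, 1))  # → Srinivasan
```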
And then at the other end of the page, you have an array of entries which point to the records; in between is the free space, unused. So what does this header say? It says there are so many records in this page currently, and some of the entries may be empty, actually: if a record is deleted, you may have a hole here, and the entry will not point to anything. But if the record does exist, the entry has a pointer to the place where the record starts, and a length also can be stored: the size of the record is stored here, and the location is stored here, logically. I mean, the diagram is logical; physically, it's all linear in the page. That's how you store variable-length records in a page. And if a record changes in size, you move records around in the page to make space. For B plus trees, we say that we have to define nodes of size n; if n is 3, then a node will have three pointers, and here we are talking about variable-length records. I'm not understanding the relation between those two things. So, I didn't talk about variable-length keys; it's again there in the book and in the slides, and for today I skipped it. But since you asked the question: the initial coverage of B plus trees assumes a fixed n, that you will have at least n/2 entries for some number n, which is perfectly fine for fixed-length keys (the B plus tree stores keys), or, if you are talking of a file organization, for fixed-length records. But the key itself can be variable length; this can happen when name is a key, since names are variable length. Or, for a file organization, the record is stored there and the record can be variable length, as we just saw. So now, how many things you can fit in a page depends on how big the key or the record is; we can't guarantee a fixed number. But what we can guarantee then is not the number, but the occupancy of the page: we can say that the B plus tree page will be half full with data, at least.
Maybe more than half full, but it will be at least half full of data. Now, how many entries are there? That depends on how long the keys are; we can't say that upfront. But if we say that the average key is so long, say the average name length is 20 bytes, we can say the average entry is, let's say, 32 or 35 bytes, and then we can calculate the average number of pointers out of a B plus tree node. We don't actually have a minimum number of pointers, but we can guarantee that the space will be half occupied. And given a particular level of the B plus tree and the average size of the keys, you can say that on average each node will have at least so many children; some may have fewer if they have longer keys. Sir, excuse me. Sir, here. Yeah. Sir, you have mentioned that for each record, the starting address and the length will be there. So for example, if the department changed, that is, if initially it's Comp. Sci. and it changed, then the addresses of the subsequent fields would change. Absolutely, so you will actually change the size of the record, and what happens then is you have to move some records here to make space for it. But you will keep the page compacted, meaning you won't have holes in this area. And the important thing is that you can move records within a page because externally nobody stores a direct pointer to the record; they store an offset into this header array. So all you do when you move the record is update the pointer in the header. An external index, let's say, which is storing a pointer to the record: the record is moving in the page, but the index does not have to be updated, because it is storing an offset into the header array, not the actual data location. Even though it is in a contiguous location, won't it affect all the data? Yes, it will affect everything in this page: if you change the size of this record, everything here may have to be moved, but only in that one page.
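The slotted-page indirection just discussed can be sketched like this. Python lists stand in for the in-page record area and slot array, and the record contents are made up; the point is that outsiders hold slot numbers, so records can move freely inside the page:

```python
# Minimal slotted-page sketch: records sit in a record area, a slot array
# maps slot numbers to record positions. External pointers store only a
# slot number, so records can be moved/compacted within the page without
# updating any external index.
class SlottedPage:
    def __init__(self):
        self.data = []    # stand-in for the record area of the page
        self.slots = []   # slot i -> index into self.data, or None if deleted

    def insert(self, record):
        self.slots.append(len(self.data))
        self.data.append(record)
        return len(self.slots) - 1  # slot number: the stable "record ID"

    def get(self, slot):
        pos = self.slots[slot]
        return None if pos is None else self.data[pos]

    def compact(self):
        # Records move, but slot numbers (what outsiders hold) stay valid.
        live = [(s, self.data[p]) for s, p in enumerate(self.slots) if p is not None]
        self.data = [r for _, r in live]
        for new_pos, (s, _) in enumerate(live):
            self.slots[s] = new_pos

page = SlottedPage()
a = page.insert(b"rec A")
b = page.insert(b"rec B")
page.slots[a] = None   # delete record A, leaving a hole
page.compact()         # B moves inside the page; its slot number is unchanged
print(page.get(b))     # → b'rec B'
```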
You don't have to touch other pages. By order, do you mean the physical order or the number of children? Sometimes the word is used for the number of children. The physical order: if you create a B plus tree on a particular key, the data is sorted on that key. Now how is the sort order defined? That is dependent on the implementation and on the type. Integer sorting is standard; what about sorting of strings? That may be language specific: sorting in Hindi is different from sorting in English. So whatever it is, it's sorted in that order. Sir, will there be a different number of keys at each level when searching? If you have variable-length keys, the number of children may vary, yes; there can be a different number of keys in a particular node. It can vary. Even in a normal B plus tree with fixed-length keys, depending on what order insertions and deletions happened, a node can have anywhere from n/2 to n children, for some n. It can vary. That is true for fixed-length records, and it is also true for variable-length records; the only thing is that with variable-length records, you can't guarantee a minimum of n/2, since it depends on the record length or the key length. Excuse me, sir. Yeah. Will indexing work only if I give the index option, or by default will it do the indexing for all the relations? Okay, that's a good question. I interpret it as: if you create a relation, will it be indexed automatically? Is that what you mean? Yes. Yeah. Okay. Pretty much all databases will automatically create an index on the primary key. If you define a primary key, they will create an index. Why? Because when you insert a record, they have to check that there is no duplicate; without an index, that would be very expensive. So they will create an index. Most databases will also create an index on a foreign key. Why?
Because if you delete the referenced record, then the referencing record is affected, and unless you have an index, it's very expensive to find out if there is a referencing record. So if you declare a foreign key, usually an index will be created on the foreign key also. These are the minimum set which most databases will automatically create. Now beyond this, what indices do you need? It depends on the queries which are asked on the system. Many databases have a tool called an index tuning wizard or database tuning assistant; there are many different names for it. What that tool can be used for is this: you know what kind of queries run on that database, so you can set up a test system and run the kind of queries which you expect, and the tool can record those things. Then the system will see that you're asking a lot of queries which do a lookup on some field which is not the primary key field, and it will add an index for that. So these tuning assistants, wizards, call them what you want, exist as tools. PostgreSQL, I don't know if it has one at this point, but the commercial databases all have such tools. If you delete all the records for a particular key, will that key also be deleted from the index, or will it still be maintained? At the leaf level, that key will vanish. At internal nodes, the key may still be present; we have an example in the book which shows this, because at the internal levels, the keys are simply used to guide the search. The system could remove the key, but it doesn't make any difference, because the keys in the internal nodes are simply guides; there's no implication that they are actually present in a record. If a B plus tree node becomes underfull, you are going to take action: you might merge it. Actually, many implementations don't do merging of underfull nodes; only if the node becomes empty do they do something. But that's an implementation choice.
The B+ tree algorithms allow you to merge nodes if they become underfull, and whenever such an operation happens, some keys may get removed.

I want to add something. Sir, a question about the logical level and the physical level: where does indexing fit? I think it comes at the logical level. So, the actual index construction is at the physical layer. The interface says: build an index on this attribute of this relation. But the index itself is very much part of the physical design, not the logical design; that is how the layers are defined. The logical design only sees a relation. It is not worried about performance. Indexing is a performance issue, which sits at the physical layer. So it just makes access easier or faster, nothing more than that? Exactly. That's why it is a physical feature, not a logical feature. And it can be changed without changing the functionality? You can add and drop indices, and applications will still run. They may run very slowly, but it won't affect correctness. So it's purely physical.

Anyway, let's move on. I want to cover a bunch of other things, so I'll stop questions on indices here and just mention a couple of buzzwords. One of them is called bulk loading of indices. The idea is: suppose you take a relation and fill it with a lot of records, one at a time, and you have defined some indices on that relation. Initially everything will run fast, while the relation is small. But once the relation becomes very big, you will find that it takes a lot of time to add all the records. So any book on tuning will say: first drop all indices on the relation, including the primary key index, drop everything; then add the records; then create the indices again. Now, why would they recommend that?
And the answer is that if you insert records one at a time into the index, a lot of IO happens. On the other hand, there is a technique called bottom-up building, or bulk loading, of a B+ tree index, which is much more efficient. It sorts the records and then builds the tree with far fewer IO operations than inserting one record at a time. It is much, much faster: something which takes hours can run in minutes. That is the kind of difference people actually see; this is not hypothetical, it is real. In fact, when there is already a relation and you say create index on it, the underlying B+ tree implementation does not do one-at-a-time insertion as we saw; instead, it sorts the data and builds the tree bottom up. How exactly it does this, the details are in the book. That is why the tuning books will always tell you: drop indices, insert records, then rebuild the indices.

Now the question arises: what if you can't do this? I mentioned Google's Bigtable as a key-value store. It is actually a B+ tree index which is distributed across many, many machines. There, if you drop the index and rebuild it on one billion keys, and this is not a joke, it will take a huge amount of time. What happens to the applications which run during that time? So there is a bunch of research on this, including some which we did at IIT Bombay back in 1997, on how to do efficient insertion into B+ trees, and one such technique is used in Bigtable to make inserts into its B+ tree index much more efficient than normal B+ tree insertion. I won't get into the details, but there is a paper on Bigtable which describes all of this; you can go read it. Bigtable is essentially an enormous B+ tree index spanning many, many machines. It's pretty cool. If you're interested, do read that paper.
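The bottom-up build can be sketched in a few lines (a toy in-memory version with a made-up fan-out; real implementations sort externally and write each node sequentially as a disk block):

```python
FANOUT = 4  # max entries per node; real trees hold hundreds per disk block

def min_key(node):
    """Leftmost key under a node (internal entries are (separator, child))."""
    return node[0] if not isinstance(node[0], tuple) else min_key(node[0][1])

def bulk_load(keys):
    """Sort once, pack full leaves sequentially, then build each internal
    level as a sparse index on the level below, until one root remains."""
    keys = sorted(keys)
    level = [keys[i:i + FANOUT] for i in range(0, len(keys), FANOUT)]  # leaves
    height = 1
    while len(level) > 1:
        level = [[(min_key(c), c) for c in level[i:i + FANOUT]]
                 for i in range(0, len(level), FANOUT)]
        height += 1
    return level[0], height

def lookup(root, key):
    node = root
    while node and isinstance(node[0], tuple):   # descend internal levels
        child = node[0][1]                       # default: leftmost child
        for sep, c in node:
            if sep <= key:                       # follow rightmost sep <= key
                child = c
        node = child
    return key in node

root, height = bulk_load(range(100))
print(height, lookup(root, 42), lookup(root, 100))  # 4 True False
```

Note that every leaf except possibly the last comes out completely full, and the tree is built with one sequential pass per level, which is where the hours-to-minutes difference comes from.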
That's the very last slide on indices. I have focused on B+ tree indices, but that is not the only kind of index; there are several others. There is something called a bitmap index, which is primarily used for attributes with very few distinct values, and its goal is a little different. Usually its goal is not to find a few records but to support other operations. What a bitmap index does is create one bitmap per value. Let's say I want department equal to computer science, and the institute has only 10 departments. The bitmap index will create one bitmap, one array, for computer science: for every instructor who is in computer science the bit will be one, and for the others it will be zero. What do I mean by that? The instructors have to be in some fixed order, one, two, three, four; somehow that order is fixed, and it should not change. Then you can create a bit which is one if the instructor is in computer science and zero otherwise. This is the computer science bitmap. There is another bitmap for electrical, another for civil, and so forth. What is the use of these bitmaps? I can do various intersection and union operations very, very efficiently with them. I won't get into the details. They are used for certain OLAP-style operations, data analysis operations. A bitmap index is not very useful for finding the record with a particular roll number; that is not what it is for.

Then there are hash indices, which are used a lot in memory during query processing. They were used at one point for disk indices, but people have given up on that: they don't give any great benefit over B+ trees, and they have the big drawback of not allowing range queries. So they have been more or less abandoned.

And then there are R-trees, which are used to index geographical, multi-dimensional data. A B+ tree assumes a sort order. But how do you sort?
Let's say you have locations on a map. You can't sort on just X or Y; which one do you choose, latitude or longitude? It's not clear. So for such two-dimensional data there are R-trees, and many other indices as well, but R-trees are the ones widely used in databases. PostgreSQL supports R-trees, for example; most databases today support R-trees for such spatial data.

So, that's it for indices. Any last questions before we move on? Oh, I should mention one other point here. When we look at query plans on PostgreSQL, you will see something called a bitmap index scan; I forget the exact name that is used. Now, I just told you about bitmap indices, but PostgreSQL does not actually have bitmap indices. It has something else, which is not really an index: a kind of data structure computed on the fly, which it calls a bitmap. I'll come to that later when we work on query plans in PostgreSQL, but I want to mention it now because PostgreSQL, which we will look at in the lab, uses the term bitmap index scan for a different thing. It is not the regular bitmap index; it is related, but it is a different beast. So don't get confused when you see it. The section on bitmap indices in the book is completely different from PostgreSQL's use of bitmap index scan; it's a different term.
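As a toy sketch of the bitmap-index idea described above (the rows and values are made up, loosely following the book's university schema; a Python int serves as the bitset):

```python
# Fixed row order; it must not change once the bitmaps are built.
instructors = ["Srinivasan", "Wu", "Mozart", "Einstein", "El Said"]
depts       = ["Comp. Sci.", "Finance", "Music", "Physics", "History"]
salaries    = [65000, 90000, 40000, 95000, 60000]

# One bitmap per distinct department value (bit r set => row r has it).
dept_bitmap = {}
for row, d in enumerate(depts):
    dept_bitmap[d] = dept_bitmap.get(d, 0) | (1 << row)

# A bitmap for a predicate on another attribute.
high_salary = 0
for row, s in enumerate(salaries):
    if s >= 80000:
        high_salary |= 1 << row

# Intersection is a single bitwise AND, however many rows there are.
hits = dept_bitmap["Physics"] & high_salary
result = [instructors[r] for r in range(len(instructors)) if hits >> r & 1]
print(result)  # ['Einstein']
```

The same trick gives unions (bitwise OR) and counts (population count of the bitmap), which is what makes it attractive for the OLAP-style operations mentioned above.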