Welcome to today's Postgres Conference webinar, "Power Use of Indexes in PostgreSQL: A User Perspective." We're joined by Jobin Augustine, Senior Support Engineer at Percona. We'll walk through practical examples and demonstrations to guide users toward the right selection and better usage of indexes. My name is Lindsay Hooper. I'm one of the Postgres Conference organizers and I'll be your moderator for this webinar. A little bit about your speaker: Jobin is a PostgreSQL expert and open source advocate with more than 18 years of experience as a consultant, architect, administrator, writer, and trainer in PostgreSQL, Oracle, and other database technologies. He has always been an active participant in the open source communities, and his main focus area is database performance and optimization. Jobin holds a master's in computer applications and joined Percona in 2018. A few logistical notes for our attendees and then we'll get started. First, I've put you all on mute to avoid background noise and to preserve audio quality. Second, yes, we are recording this, and the recording will be live on the Postgres Conference site by early next week. And finally, my favorite: questions are encouraged and will be answered after the session. You can ask questions as they come to you through the chat function, either at the top or the bottom of your screen. So with that, I'm going to hand it off. Take it away, Jobin. Thank you, Lindsay. Today we are going to discuss the power use of indexes in PostgreSQL. I took this topic for a few reasons. We often encounter people using indexes, but not choosing them the right way. PostgreSQL offers many varieties of indexes, and each of those index types can be used in multiple ways. So this presentation is all about the user perspective of indexes and how to use them. Before starting, I want to thank my friend and colleague, Sergey.
He's from Russia, he was very instrumental in preparing this presentation, and much of the content was prepared by Sergey. Thank you, Sergey. Today's topic is the PostgreSQL index types. We have many index types, and today we'll be discussing B-tree, hash, GIN, GiST and SP-GiST, BRIN, and the bloom and RUM indexes. So let's get started. We can't start a discussion about indexes without discussing the B-tree index; that's the basis of all the other indexes. The B-tree index is the most common type of index. When we create an index without specifying what type it is, it will be a B-tree index — it is the default. And this is the only index type supporting the return of ordered output, so if you have a query with an ORDER BY clause, the B-tree is the index to be selected. This is also the only index type supporting unique constraints, such as a primary key. And the B-tree index supports index-only scans and range scans; that is a specialty of the B-tree index. I'm not going into all the details because we have a lot more to discuss. The B-tree index is basically a tree. Each leaf node stores the key and a reference to the tuple, the TID. It offers predictable search performance: every key lookup takes almost the same amount of time, because it's a balanced tree. PostgreSQL actually uses something called a B-link tree behind the scenes — a modified B-tree implementation with its own advantages. From the user perspective, the B-tree index is best for unique constraints, index-only lookups, and sorted access. And it is best if we have high-cardinality data; cardinality means how unique the values are. A B-tree range lookup — a range operation — returns sorted output. The optimizer can choose either to read the index for a sorted result, or to read the heap and then sort. The B-tree index sort order is defined upon creation, so we can create indexes in ascending or descending order. We'll see some examples in the coming slides.
And please remember that not every B-tree index is suitable for a sorted-access operation. Moving forward with a good set of examples: here is an example where a column is searched in ascending order. By default, the B-tree index is created in ascending order, so as we can see, it is using an index scan. But what if we query the same data in descending order? The B-tree index can still be used, because it supports backward scans. So it doesn't matter what the sort order is; it can use the B-tree index. And what if there are multiple columns in the B-tree index? Here is an example where, in the ORDER BY clause, the first column is in ascending sort and the second column is also in ascending sort. So of course it can use the B-tree index. And vice versa: if both columns are in descending sorted order, it still uses a backward scan of the B-tree index. But what if one is descending and the other is ascending? That is where the B-tree index fails; it can't serve the sort directly from the index. In that case the sort operation will be performed after the index scan. As we can see, the sort happens after getting the data from the index: basically it gets the data and then does the sort. That was not the case in the previous examples. Here it still uses the index, but the sort is done outside. Another thing we should remember is that a B-tree index can be really large, because each leaf node stores the key and the reference to the tuple. Sometimes we see a B-tree index at least half the size of the table, or even bigger. There is no sparse B-tree; we have to store every key. We'll discuss the deduplication effort in the coming slides — there is a caveat attached to this. The wider the keys, and the more columns in the index, the larger the index footprint. So what is the solution for this? We can create a partial index. PostgreSQL supports partial indexes.
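The sort-order behavior described above can be sketched like this. The table and column names here are hypothetical, not the speaker's actual demo schema:

```sql
-- Hypothetical table for illustration
CREATE TABLE bookings (id bigint, booked_at timestamptz, amount numeric);

-- A B-tree index is ascending by default
CREATE INDEX idx_bookings_booked ON bookings (booked_at);

-- Both directions can use the same index, thanks to backward scans:
EXPLAIN SELECT * FROM bookings ORDER BY booked_at ASC  LIMIT 10;
EXPLAIN SELECT * FROM bookings ORDER BY booked_at DESC LIMIT 10;

-- Mixed directions need an index defined with matching (or exactly
-- reversed) orders; otherwise a separate Sort node appears in the plan:
CREATE INDEX idx_bookings_mixed ON bookings (booked_at ASC, amount DESC);
EXPLAIN SELECT * FROM bookings ORDER BY booked_at ASC, amount DESC;
```

Comparing the two EXPLAIN plans before and after creating the mixed-order index shows the Sort step disappearing.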
A partial index covers only some key values. For example, say you only ever scan the latest few days: the index can be on the date column, restricted to rows where the date is within the last month or so. And as I mentioned, there is an index deduplication effort from PostgreSQL 13 onwards. The index size is really small after deduplication, especially for those indexes where a lot of duplicate key values are stored, and it also helps for columns where NULL values are included. Coming to the partial B-tree index: this is something users often forget — that something like a partial index can be created. This applies when we have years of data in the tables but are dealing only with the last few months of it. So we can have partial B-tree indexes. It is possible to create a B-tree index on a subset of the key space, which basically allows us to minimize the wastage. If a column has low cardinality and only one value of the key is interesting — for example, a flag saying whether the record is active — then that is a criterion for creating a partial index. Another criterion is a skewed key distribution; there too we can create partial indexes. This significantly reduces the index size and, consequently, the operating cost. And it obviously improves query performance because the index is small: it can be cached in memory better, and even the index scans will be faster. But it will not be usable in queries that do not have matching filters. The partial index is all about having a filter — the criterion for having an entry in the index. We'll see the details. Here is an example: say we have a table with a lot of rows where the outstanding amount is zero — those are old records, basically — and a comparatively small set of newer records where the outstanding amount is not zero, because those are ongoing transactions.
But that is the set of data we frequently act on, and those active rows are of more interest for our frequent queries. So we can create an index on that column. There is the regular way of creating an index on outstanding amount, and there is the way to create a partial index on that column: in the partial case there is a WHERE clause, and because of that the index will contain references only to those rows — only those keys — where the outstanding amount is not zero. As we can see, when we create an index over every record, in this typical case it is 21 MB. But when we have a partial index in place, it consumes considerably less space. Even with all the deduplication improvements in PostgreSQL 13, the full index is still bigger than the partial one. So there is a huge saving. Obviously, from PostgreSQL 13 onwards, deduplication really does work magic in terms of index size, especially when we have a lot of duplicate keys. And a partial index works only when our query's condition tells the optimizer that the excluded data can be skipped. Here is an example: we are querying for those records where the outstanding amount is greater than zero, so obviously it can use this partial index. Or we can look for a specific value which is not zero, say 284; there too it can use the partial index. But if the value could be anything — it could even be zero — the PostgreSQL optimizer won't go for the partial index; instead it goes for a sequential scan. So now we know how effective the partial index is, and where it won't serve us: when we have a condition whose outcome is not predictable, the partial index won't be helpful. Coming back to B-tree deduplication: this is one of the reasons we should upgrade to PostgreSQL 13 or above — many index sizes become considerably smaller. In the real world, key values are not so unique. There are unique indexes, but other than those, the values are not so unique.
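A minimal sketch of the partial-index comparison just described, using hypothetical table, column, and index names:

```sql
-- Full index: every row, including the many zero-amount historical rows
CREATE INDEX idx_outstanding_full ON payments (outstanding_amount);

-- Partial index: only the active rows get an entry
CREATE INDEX idx_outstanding_active ON payments (outstanding_amount)
  WHERE outstanding_amount <> 0;

-- Compare footprints (exact sizes depend on the data)
SELECT pg_size_pretty(pg_relation_size('idx_outstanding_full')),
       pg_size_pretty(pg_relation_size('idx_outstanding_active'));

-- These predicates imply outstanding_amount <> 0, so the partial index applies:
EXPLAIN SELECT * FROM payments WHERE outstanding_amount > 0;
EXPLAIN SELECT * FROM payments WHERE outstanding_amount = 284;
```

A predicate whose value is not known to exclude zero (for example, a runtime parameter) cannot use the partial index, matching the sequential-scan fallback the speaker shows.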
There are many occurrences of a few values and few occurrences of many values — the distribution is completely skewed. And a partial index is not a universal solution, so we need something that works for the other cases as well. That is where deduplication comes into the picture; it is available from PostgreSQL 13 onwards. So what does deduplication do in a leaf page? When the same key value appears many times, the duplicates in a leaf page are merged into a single posting-list entry, and it is a lazy process. Deduplication can occur while creating the index or reindexing, and obviously there is a performance penalty for the deduplication effort. It can be disabled; we have an option for that. In the latest PostgreSQL, PostgreSQL 14, there is a feature to deduplicate even MVCC duplicates — multiple row versions of the same key. This avoids page splits, and so it reduces index bloat. There is also a new concept of bottom-up index deletion. So in PostgreSQL 13 and 14, B-tree indexes are far superior to previous versions because of all these improvements. There are some limitations to deduplication: character types with non-deterministic collations cannot be deduplicated, nor can the numeric type, the JSONB type, float, container types, or indexes with an INCLUDE clause. In all these cases we cannot really use deduplication. Here is an example of deduplication, showing how to enable and disable it. As we already discussed, with deduplication the index size is 7 MB, but with the duplicate values kept, it is 21 MB. This is just a typical example, but yes, there is a huge saving. I'm skipping ahead because of the time limitations. And here is another important thing the end user should remember: the B-tree index has the INCLUDE clause feature. We can have other columns included in the B-tree so that they are present in the leaf nodes.
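The enable/disable example the speaker refers to looks roughly like this; the table and index names are hypothetical:

```sql
-- PostgreSQL 13+: deduplication is on by default and is controlled
-- per index via the deduplicate_items storage parameter
CREATE INDEX idx_status_dedup   ON payments (status);

CREATE INDEX idx_status_nodedup ON payments (status)
  WITH (deduplicate_items = off);

-- On a low-cardinality column the difference can be large (7 MB vs 21 MB
-- in the speaker's demo)
SELECT pg_size_pretty(pg_relation_size('idx_status_dedup')),
       pg_size_pretty(pg_relation_size('idx_status_nodedup'));
```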
This improves index-only scans, because otherwise, for those columns which are not present in the B-tree index, a table lookup is required. With the INCLUDE option, the B-tree can return the value from the leaf node itself. The next topic is expression indexes. This is not specific to B-tree — it is available in other index types as well — but we discuss it with B-tree as it is commonly used there. We can create expression indexes. Here is a case where the lower-cased value of a column can be effectively indexed. Without an index, the query goes for a sequential scan and a text string match; but if you create an index on that expression — basically an expression index — it results in an index scan. This considerably lowers the execution time and effort. Even full-text search or string comparison becomes simple because of expression indexes. Expression indexes allow lookup by function values without having another column. Otherwise, without an expression index, the alternative is to have a generated column — so the expression index acts as a substitute for generated columns. The expression value is stored in the index, so the calculation is done up front and there is a lot of saving in those terms. Moving to the hash index: this is probably the simplest of all indexes, because it is a simple hash table. But historically it was notorious for its problems: prior to PostgreSQL 10, it was highly discouraged because it was not WAL-logged and could not be replicated. Now it can be used heavily wherever it is applicable. But there is a limitation: as we know, a hash will work only with the equality operator. The hash index is basically a hash table structure: it stores the TIDs and the hashed values of the keys in buckets. And here is an example from Wikipedia of how a hash index works: there is a bucket where the hash values of the keys are stored.
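The INCLUDE clause and expression index just described can be sketched together; the schemas here are invented for illustration:

```sql
-- Covering index: the payload column travels in the leaf pages,
-- enabling an index-only scan without a heap lookup
CREATE INDEX idx_payments_cover ON payments (id)
  INCLUDE (outstanding_amount);

EXPLAIN SELECT id, outstanding_amount FROM payments WHERE id = 42;

-- Expression index: index the function's result directly,
-- instead of maintaining a separate generated column
CREATE INDEX idx_users_lower_email ON users (lower(email));

-- The query must use the exact same expression to match the index:
EXPLAIN SELECT * FROM users WHERE lower(email) = 'alice@example.com';
```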
Whenever a key is looked up, its hash is computed and searched for in the buckets. It's very simple: the bucket management is automatic and the hash function is predetermined. This is probably one of the simplest index types. The hash index storage is organized in multiple structures: a meta page, where internal information about the index is kept; the buckets, where the TID and the hash of the key value are kept; and there is a concept called overflow, so overflow pages will also be there. Some important points we need to remember: hash indexes grow non-linearly. It is a step-by-step growth, because the size jumps whenever bucket pages are added. And VACUUM on a hash index does not reclaim space — VACUUM won't help free up the space. To free up the space, we really need a REINDEX, a VACUUM FULL, or something similar like pg_repack. A hash index stores the hash, not the value. Another important point is that the hash value has a uniform size regardless of the key length. So depending on the key length, it can help us minimize the storage: if the key value is larger than the hash value, then storing the hash is cheaper — smaller — so it reduces the storage. However, if the actual key value is smaller, the hash value will be bigger, so it can increase the storage. For the same reason, hash indexes are good for big key values. We'll see some of the implications. The limitations: a hash index can only serve the equality operator; no ordering is possible, and no range lookup is possible. These limitations restrict its usage and purpose. Still, it can be used where a lot of unique or almost-unique key values are available, and it is most useful when the keys are big. And it cannot be the basis for unique constraints and things like that. So what happens if the key values are small? Here is an example.
First we have a hash index: it is 28 MB. But if we create a B-tree index, it is even smaller. So the hash index has a demerit here, because the average length of the column is only 7 bytes. But what happens if we have a large key value? As we can see, if the average size increases to 16 bytes, we have a slight advantage for the hash index over a B-tree index on the same column. And if we have a really large key, we have a bigger advantage. There is one more point: the default fillfactor for a hash index is 75, compared to 90 for a B-tree. We can adjust the fillfactor and see a bigger saving in terms of space — but if there are a lot of updates and so on, reducing the fillfactor may not be the right thing to do. Here is an example where we have a lengthy key, a URL: in a table we have malicious URLs stored, which normally results in lengthy string comparisons and table scans. If you have a hash index, the value will be hashed and compared against the hash index. As we can see, it does an index scan — a hash index scan. We can also have a normal hash index on the ID column, so lookups there become a straightforward hash lookup as well. But as we already discussed, the hash index can't help if there is a non-equality comparison; that's where the hash index fails. In summary, hash indexes are good for long key values, like URLs: the hash value is very small and can be easily looked up. They are helpful for the equality operator and single-key lookups, and because of the small size they limit the write overhead and improve query performance. Hash indexes are not so good for cases where index-only scans are needed — only the hash values are stored, so it can't really do an index-only scan — or for cases where ordering or range operations are important, because the hash index cannot support ordering or range operations. And it cannot be the basis for a unique constraint.
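The URL example above can be sketched like this; the table and values are hypothetical stand-ins for the speaker's demo:

```sql
-- Long-key scenario where a hash index shines
CREATE TABLE malicious_urls (id bigint, url text);

CREATE INDEX idx_url_hash  ON malicious_urls USING hash (url);
CREATE INDEX idx_url_btree ON malicious_urls (url);

-- Equality lookup can use the hash index:
EXPLAIN SELECT * FROM malicious_urls
 WHERE url = 'https://example.com/a/very/long/suspicious/path';

-- Range and pattern predicates cannot; hash supports only "=":
EXPLAIN SELECT * FROM malicious_urls WHERE url > 'https://a';

-- Fillfactor is tunable (default 75 for hash, 90 for B-tree):
CREATE INDEX idx_url_hash_ff ON malicious_urls USING hash (url)
  WITH (fillfactor = 90);
```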
It cannot be used for clustering the table. And for PostgreSQL versions below 10, it is highly discouraged, because it could not be replicated and was not WAL-logged. Coming to the BRIN index: this is a block range index. Here is an illustration of a block range — say a series of values 124, 125, 126, up to 143. It's a series of values, so instead of listing each and every value, we can say that we have values ranging from 124 to 143. This is how we would even say it in English or any human language: starting from this value to that value. That's the same concept we use in a BRIN index: we just need to store the minimum and the maximum value. This was introduced in PostgreSQL 9.5, and it makes use of the natural correlation of the data with its physical location. Remember that BRIN does not actually store the TIDs of the rows. Conceptually, the table is split into ranges of blocks — the default range is 128 blocks — and the index maintains summary information about the ranges, like the min and max values. Many times the BRIN index is not even considered an index, because no TIDs are stored; it doesn't have the usual properties of an index. It is just the min and max values of each range, and it works as an accelerator for sequential scans: the sequential scan knows where to scan, so it's an improvement. It is more a virtual partitioning method than an index. The process of indexing the data in a BRIN index is called summarization, because it doesn't really store the values. It requires a bitmap index scan: it reads the summary tuples and compares them with the query keys — the query keys meaning the WHERE clause, the filtering condition. The summary tuple is compared, and the index can tell that a value may be found in a given range. And in order to improve insertion speed, the blocks at the end of the heap may not be summarized in real time — the BRIN index is not updated in real time. Vacuum and autovacuum scan the newly inserted blocks.
So some of the index maintenance work is offloaded to vacuum and autovacuum. And there is a feature from PostgreSQL 10 onwards called autosummarization — we'll discuss it in the coming slides — which improves this index maintenance for some edge cases. There is also an improvement in PostgreSQL 14. As we discussed, originally only the min and max values were stored; from PostgreSQL 14 onwards we can have multiple ranges, so there can be gaps in between. This allows more efficient handling of data with a poor correlation with its physical location, because when inserts, updates, and deletes happen, gaps will form. There is a sample syntax which can be used for multiple ranges. Another feature in PostgreSQL 14 is that the BRIN index can use bloom filters; we'll discuss bloom filters in the coming slides. The BRIN index is best for those cases where it can result in a really small size, and it is fast for inserts and updates because the work is offloaded to vacuum or a later job. It is best for time-series data, because time-series data is sequential in nature. A BRIN index should be avoided if a lot of row migration happens because of updates; in that case, please consider reducing the fill factor of the table so that HOT updates can happen. And one common mistake when people evaluate the BRIN index is that they compare its query performance alone against the query performance achieved by a B-tree index. The B-tree index has a higher maintenance cost, so obviously it will deliver better query performance — it's a lossless index of bigger size. There is a cost associated with the B-tree index, but of course it can give a comparatively better benefit. The BRIN index is much cheaper to maintain and produce, so it may not be able to match the query performance of a B-tree index.
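A sketch of the BRIN features discussed above, on a hypothetical, naturally ordered events table; the PostgreSQL 14 operator-class names are per data type:

```sql
-- Basic BRIN with tuning: smaller ranges for finer granularity,
-- plus autosummarization (default is 128 pages per range)
CREATE INDEX idx_events_brin ON events USING brin (created_at)
  WITH (pages_per_range = 32, autosummarize = on);

-- Summarize newly inserted block ranges manually if needed:
SELECT brin_summarize_new_values('idx_events_brin');

-- Candidate check: physical/logical correlation close to 1 in pg_stats
SELECT attname, correlation FROM pg_stats WHERE tablename = 'events';

-- PostgreSQL 14+: minmax-multi keeps several ranges per block range
-- (tolerates gaps); the bloom opclass handles equality lookups on
-- poorly correlated columns
CREATE INDEX idx_events_brin_multi ON events
  USING brin (created_at timestamptz_minmax_multi_ops);
CREATE INDEX idx_events_brin_bloom ON events
  USING brin (user_id int8_bloom_ops);
```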
Here is an example. If you create a regular index, which is a B-tree index, in this typical case it results in a size of 114 MB. But if you create a BRIN index, it's just a few kilobytes — 72 kB. So see the difference. The BRIN index allows us to specify the block ranges: by default it is 128 pages per range, but we can reduce it for more granularity and accuracy. Even if we reduce it to a smaller value, it still saves a lot compared to the B-tree index. And once we change the pages_per_range value, please remember to reindex. We already discussed that summarization happens via autovacuum, but we can also trigger it manually by executing brin_summarize_new_values() — a statement like the one shown. Coming to autosummarization, available from PostgreSQL 10 onwards: as we discussed, previously summarization was taken care of by autovacuum or vacuum, which could happen at any time down the line. If the table is not a candidate for autovacuum, it might happen only after quite a long time. But from PostgreSQL 10 onwards, there is an option to explicitly make a request to autovacuum to do the job. Autovacuum is still in charge of processing these requests, but they will be processed before autovacuum hits the table. And remember that this autosummarization is disabled by default, but it can be enabled at the index level. Here is an example where we modify pages_per_range and autosummarize. And this is one frequently asked question: how do I identify the best candidates for a BRIN index? Here is an example where the information from pg_stats shows us that there is a high correlation. Many times we won't get exactly 1, but it will be near 1 — those are the columns which are the right candidates for a BRIN index. Coming to the GIN index: this is an index where the key values are not stored in the leaf nodes. Normally, as we know, the key values are stored in the leaf nodes in B-trees, but here the values are stored in an entry tree.
There is a separate structure for that; we'll see it in the coming slides. The references to the rows are kept in separate data nodes. So, coming to this: there is an entry tree, which looks like a normal B-tree — this is where all the values are stored — and there is a posting tree where the row pointers are stored. There is also a separate pending list maintained for new values as they come in. This is the GIN index structure. Since the values are not stored in the leaf nodes, it is what is considered an inverted index. GIN indexes are good for text searches, and they are good if the values are less frequently updated, because, as we know, the values need to go to the entry tree and the references are kept in the leaf nodes — if updates are frequent, GIN index maintenance becomes heavy. They are best for low-cardinality columns, which is just the opposite of the B-tree index, because low-cardinality values fit in the entry tree more efficiently. Here is an example where the GIN index is used for a text search. Take a lengthy sentence like "Hello, how are you?": to_tsvector can pull out all the important lexemes — the ones relevant for the text search, such as "hello", "name", and "age". Now these can be indexed. Here is an example where an expression index is used, so only these lexemes go to the entry tree; this is very efficient. The GIN index supports some storage parameters: we can set the pending list size, and whether fastupdate is used; the pending list size can be checked globally, but it can be modified at the index level as well. And one beautiful thing about the GIN index is that it is extensible. Here is an example: PostgreSQL provides btree_gin as an extension, available in the contrib module.
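The full-text-search GIN example above looks roughly like this; the docs table is hypothetical:

```sql
-- GIN expression index for full-text search
CREATE TABLE docs (id bigint, body text);

CREATE INDEX idx_docs_fts ON docs
  USING gin (to_tsvector('english', body));

-- The query must use the same expression for the index to apply:
EXPLAIN SELECT * FROM docs
 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'hello & name');

-- The storage parameters mentioned above, set at index level:
CREATE INDEX idx_docs_fts2 ON docs
  USING gin (to_tsvector('english', body))
  WITH (fastupdate = on, gin_pending_list_limit = 4096);  -- limit in kB
```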
This is very useful because PostgreSQL doesn't have an index type corresponding to the bitmap indexes in other database systems, so we can make use of the btree_gin index instead. Many times we see up to 20 times space saving if we use it. This is an example of the extensibility of the GIN index. And of course there are disadvantages: a GIN index may not be a great idea if there are a lot of transactions, because we need to maintain the entry tree; transactions occasionally get delayed, and this is one of the commonly reported problems. Before selecting a GIN index, please make sure you understand how it works. A search must also scan the list of pending entries — all new entries go into the pending list first — so search has a slightly higher overhead when there are a lot of frequent updates. These are some of the common demerits of the GIN index. Coming to the bloom index: basically, it's a bloom filter, and it tells whether a value exists in a set or not. False positives are accepted — "possibly in the set" is good enough — but there must not be any false negatives: the bloom filter must be able to say with certainty that a value is definitely not in the set. Since it's a lossy index, false positives are allowed, and a recheck of the keys is required. It can be considered a space-efficient hash index, because internally it applies multiple hash functions. It is available as an extension. As I mentioned, it uses multiple hash functions; I usually picture the result as a signature. The signature length can be adjusted: a longer signature holds more information, so it can more precisely identify the rows. It is best for multi-column indexes where equality comparison is used. And another important aspect: the column position in the index is not important. We know that column position matters in a B-tree index — the second column of a B-tree index may not be usable on its own. Here is an example.
There is a query on total amount, and total amount is the last column of the index — but it can still use the bloom index. It works. And efficiency increases as more columns are specified: in this case, not only the total amount but also the booking reference is mentioned, so the recheck has to examine far fewer rows compared to the previous case, where the recheck had to do a lot more work. As I mentioned, the signature length can be adjusted; the default is 80 bits. The signature bits for each column can also be adjusted based on preference: we can specify the total length of the signature, and then say that column one can use two bits and column three can use four. So we can specify the importance of each column in the bloom signature. There is a limitation: we don't have operators for all the data types, but we can create operator classes using existing functions. Here is an example where an operator class is prepared using a hash function. And as we discussed, bloom is lossy in nature: it cannot be used for unique keys, it results in a bitmap scan with rechecks, and bloom does not support searches for NULL values, because there must be a signature. Coming to the GiST index: this is the Generalized Search Tree. It is similar to a B-tree, and it is very generic in nature. It is needed because PostgreSQL supports a lot more data types than just the common ones — for example, geometric data types, images, documents. Normal comparison operations cannot be performed on these data types; arbitrary data types require arbitrary operators. For example, we may need to search for a location within an area. GiST acts as an extensible framework for building new access methods, so it can handle overlapping data — we'll see some examples. It also helps with text searches, where it uses a signature tree. As for GiST index disadvantages: a GiST index may degrade over a period of time.
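Returning to the bloom index for a moment, its creation and tuning can be sketched like this; the bookings columns are hypothetical:

```sql
-- The bloom index type ships as a contrib extension
CREATE EXTENSION IF NOT EXISTS bloom;

-- Total signature of 80 bits; per-column bits set via col1..colN
CREATE INDEX idx_bookings_bloom ON bookings
  USING bloom (booking_ref, customer_id, total_amount)
  WITH (length = 80, col1 = 2, col2 = 2, col3 = 4);

-- Any equality subset of the columns can use it, regardless of position:
EXPLAIN SELECT * FROM bookings WHERE total_amount = 284;
EXPLAIN SELECT * FROM bookings
 WHERE total_amount = 284 AND booking_ref = 'AB1234';
```

The second query specifies more columns, so the bitmap-scan recheck has fewer false positives to discard, matching the efficiency point above.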
So we may have to reindex sometimes; updates are always a problem for a GiST index. Here is an example of an R-tree-style index built on GiST. As we can see, there are outer boxes, R1 and R2, and each contains smaller boxes. This can be represented as a GiST tree: R1 and R2 contain the other boxes. This kind of index can support searches like whether a point is inside a box — normally, general-purpose indexes won't support this kind of search. We can even order points by their distance from another point — say, a point P and its distance from a reference point — with an ORDER BY clause. And it supports many other operator types; we can even create our own custom operators. Some examples of operators are: strictly left of — whether a point is to the left of, to the right of, or coincides with another point — or whether it is inside a box, a polygon, or a circle. We can create a lot of operators, and these are available in PostgreSQL by default. And this is one of the frequently asked questions: since both GIN and GiST support text searches, which one is better? GIN is faster in lookups; GiST is comparatively slightly slower in lookups. But GIN is slower in updates, while GiST is faster in updates. GIN stores the actual lexemes, but GiST only stores a signature. And the GIN rebuild time depends on maintenance_work_mem, whereas the GiST rebuild does not. There is an improvement in PostgreSQL 13: for tsvector, the GiST signature length can be adjusted. Once we adjust the signature length, it will be shown when describing the table's indexes. And there are some more specialized indexes: the SP-GiST and RUM indexes. SP-GiST stands for space-partitioned GiST; it is a non-balanced tree.
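The geometric GiST searches just described can be sketched on a hypothetical table of points:

```sql
-- GiST supports geometric operators that B-tree cannot serve
CREATE TABLE places (id bigint, location point);
CREATE INDEX idx_places_gist ON places USING gist (location);

-- Containment: points inside a box
SELECT * FROM places WHERE location <@ box '((0,0),(10,10))';

-- Strictly left of a given point
SELECT * FROM places WHERE location << point '(5,5)';

-- Nearest-neighbour search: ORDER BY distance, served by the GiST index
SELECT * FROM places ORDER BY location <-> point '(5,5)' LIMIT 5;
```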
The concept is all about repeatedly dividing the search space into partitions that need not be of equal size — dividing the space into non-overlapping areas. SP-GiST is suitable for structures where the space can be recursively split into non-intersecting areas, so it is useful for quadtrees or k-dimensional trees. An example of a quadtree is shown here: the space is divided further and further, recursively, down to a point, so searches can be very efficient using a space-partitioned index. And the RUM index: this is an extension — an example of an index delivered as an extension — and it is based on the GIN access method. One advantage over the regular GIN index is that it even stores the positions of lexemes. If you recall the discussion about the GIN index, to_tsvector identified the lexemes and they were indexed, but the position information was missing. The RUM index can capture that position information, and position information is important when we have phrase searches in full-text search. The disadvantages of the RUM index are that it is slower to build — insert time is higher than GIN — and that it uses generic WAL records. Built-in indexes have specialized WAL records; these generic WAL records can result in a lot of WAL generation. Those are some of the disadvantages. So we have seen almost all of the commonly used index types in PostgreSQL, and here are some very good references to index-related discussions. With that, I would like to conclude my presentation. Thank you, everyone, and thank you for attending this conference. Thank you.