So we're very happy to have you, Jimmy Angelacos, please come up. Thank you. Before we start, how many of you have a background in mathematics? A few hands, many hands. And how many of you have a background in linguistics? Oh, okay, nice to see some people here with both kinds of skills. It's going to be boring for both of you, because it doesn't have enough maths and it doesn't have enough linguistics, but it does have some examples. The motivation for this talk is that I was speaking to some colleagues, customers and community members, and I was surprised to find out that these things aren't all that well known. There are people who use Postgres but use external search engines to index their text. And it's all basically the same code base, because it's always Lucene: Solr, which is built on Lucene, and Elasticsearch, which is also built on Lucene. And these things aren't databases. They create indexes for your text, but they're not databases. Your text is in your database. Why have an extra thing when you can index your text in the database? That's the motivation. Let's see how you can search for text in Postgres. We're going to look at what full text search means, at operators and functions, and at what dictionaries are in Postgres. A few examples. How to index text in Postgres. How to deal with text that isn't human language or natural text. What collations are. What other things you can search for that are not strictly text, and how to keep the whole thing running. Now, are you excited for full text search? You shouldn't be. Your trust is misplaced. We do have slides like this appearing in the presentation, and they have been known to induce drowsiness and inappropriate sleep. So if you feel like sleeping, nobody will blame you. Let's look at what text is; let's look at the representation of text in Postgres. We have the well-known char(n), with a character limit that says you can only store this many characters.
This is padded: if you define a char(10) and insert five characters into that field, you will get five blank spaces at the end. These blank spaces do not count for anything; they are ignored by Postgres in comparisons. Then you have varchar(n), which is variable length but still has a fixed limit. And then you have text, and the confusingly named varchar without a length limit, which is the same thing. I believe you should really use text if you want to store an unlimited amount of text, because it makes it clear that it's not a length-limited varchar; it's a different kind of column. This is effectively a character large object; you can store a lot of text in there. For variable-length text and varchar, trailing spaces are significant: if you leave a space at the end of your string, it counts for LIKE expressions and for regular expressions. How are these stored? It depends on the character set. There are multi-byte encodings like UTF-8 that use one byte per character for Latin text, two or more bytes for other languages, and up to four for all the emoji and other characters they keep adding. The storage is one byte plus the length of your string in bytes (not characters) for strings up to 126 bytes; if you go over that limit, the storage changes to four bytes plus the length of your string in bytes. But there is no guarantee that this is the amount of space it will take up on disk, because of compression. And we also have TOAST, The Oversized-Attribute Storage Technique, which moves values out into separate tables when your row would go over the maximum size. So what is text search? We're talking here about information retrieval and, more specifically, text retrieval. This is a well-known domain, and we mainly use it for searches on metadata, which can be descriptive or bibliographic, or tags on objects, and these are used for discovery and identification.
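A minimal sketch of this padding and trailing-space behavior; the table and column names are made up for illustration:

```sql
-- char(10) is blank-padded; the padding is ignored in comparisons
CREATE TABLE type_demo (a char(10), b varchar(10), c text);
INSERT INTO type_demo VALUES ('abc', 'abc ', 'abc');

SELECT a = 'abc' FROM type_demo;        -- true: char padding is insignificant
SELECT b = 'abc' FROM type_demo;        -- false: the trailing space in varchar counts
SELECT b LIKE 'abc' FROM type_demo;     -- false, for the same reason
SELECT octet_length(c) FROM type_demo;  -- 3: length in bytes, not characters
```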
Generally, when you're talking about text search and text matching, you're not trying to identify the web pages or articles you're interested in through their content; you're looking at metadata fields and trying to match words. So: matching, substring search, data extraction. It's useful for cleaning your data, and you can use it for data mining. These are the supported operators in Postgres for this sort of thing. You can use LIKE, or double tilde (~~), and the case-insensitive ILIKE, or double tilde star (~~*), which are basically wildcard expressions: you match something against a pattern with the percent sign for any number of characters or the underscore for exactly one character. That is the SQL way to do it. You also get regular expressions: the tilde (~) and tilde star (~*) are POSIX-compatible regular expressions. So not quite Perl, but very standard, and Postgres implements them out of the box. There's also the function regexp_match, where you give it a string of text and a pattern, and it returns whatever matched the pattern in that original string. But are these things enough for searching text? With regular expressions and LIKE you cannot get ranking of results, because nothing has meaning; it's just characters, it's not language. You can't have relevance. There's no concept of language when you're matching with regular expressions, so if you search for "ship" and you search for "shipping", they are two totally different things that will never be related; they will not match each other. And they also can't be indexed, which is bad if you have a lot of text to search. Okay, you can index these in some limited ways. And there's also SIMILAR TO, and you'd better forget about that one: it's a mishmash of POSIX regular expressions and SQL that everyone is better off not using.
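The operators just listed, as a quick sketch you can run anywhere (the strings are made up):

```sql
-- LIKE (~~) and case-insensitive ILIKE (~~*) with SQL wildcards
SELECT 'Shipping' LIKE 'Ship%';      -- true: % matches any number of characters
SELECT 'Shipping' ILIKE 'ship___g';  -- true: _ matches exactly one character

-- POSIX regular expressions: ~ and case-insensitive ~*
SELECT 'Shipping' ~ '^Ship';         -- true
SELECT 'Shipping' ~* '^ship';        -- true

-- regexp_match returns what the pattern matched, as an array
SELECT regexp_match('shipping and handling', 'ship\w*');  -- {shipping}
```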
So just forget about it. Now, we've inserted the word "full": what do we mean by full text search? We're going a step beyond information retrieval and text retrieval, to document retrieval. Based on our predicates, what we are searching for, we want to return results that are documents: articles, web pages, whatever contains text. So full text search is a search for words, or tokens as we call them, in a database, the sum of all the documents that you have. If you don't have an index for this, it degenerates into a serial search: when you start searching, it won't be faster than grep, because it has to go through every document in its entirety. So you do need an index, because you don't want to scan through every entire document. There are also techniques that make it easier to search and that reduce the amount of text, based on natural language processing algorithms. And we also have to talk about the famous trade-off of precision versus recall. What is precision? It's how accurate your search results are, so how many false positives or false negatives you're getting when you retrieve your result set. And recall is how many results you're getting back from that query: how restrictive your query is. These are influenced by things called stop words and by stemming, and we'll look at those. But first of all, let's look at documents and tokens. A document is a chunk of text; it's a field in a row in a table in your database, of course. Any kind of text is considered a document in today's context. Parsing the documents into classes of tokens is what needs to be done next. Tokens can be strings, alphanumeric strings, numbers, any sort of character group that you define as interesting for your own purposes. Breaking the text up into tokens and classes of tokens is parsing.
And we're fortunate that, as a result of serious effort and time that has been spent, we have a very good parser that comes for free with Postgres. Or, if you have different requirements, you can write your own parser, as long as you can write it in C. Now, in order to do anything useful with our data set, we need to convert the tokens from these documents into lexemes. So we have to perform what is called normalization of our strings: we have to turn them from their original form into a more usable form. And what is a lexeme? It's an abstract lexical unit that represents a set of related words. You could call it the word root, but a word root only applies to human language; a lexeme in the context of natural language is the word root. So the lexeme "search" can stand for the words "searched" and "searcher", because you can reduce those words down to that lexeme. What are the stop words we mentioned? Stop words are very common words that appear in our text but have no value for the search. In English, articles like "the" and "a" carry no meaning for our search most of the time, so we filter them out. And filtering out stop words increases the precision of our search, right? Because we're getting more relevant results. It does. And how do we remove them? We remove stop words based on dictionaries. Some dictionaries have stop lists: words that are checked and eliminated from the document before we do anything else with it. But if you do remove stop words, you have to consider one thing: what happens to phrase search? If you're searching for a phrase that includes articles or other very common words, you have to take that into account. Let's talk about stemming. Stemming is reducing, as we said, words to lexemes. It increases the number of results, so it increases recall.
And if we have something that matches multiple words, we're going to get more results; it goes without saying. The algorithms we can use for stemming are, as we said: normalization using dictionaries, where we look up what the word stems are; stripping prefixes and suffixes that we know exist in a specific language, breaking words down into their constituent parts; automatic production, so maybe you find the word stem and, by attempting to add "ing" to it, you produce a possible new word based on the stem; lemmatization rules that define how words are stemmed in each language; and n-gram models, which are probabilistic models that show how words can be reduced. But we mentioned languages here, and it is hard to do this across multiple languages at the same time. So how is full text search represented in Postgres? We have a data type called tsvector, which represents a document stripped down to its essential form: a pre-processed document stored as a separate data type in Postgres. And then we have tsquery, which is our search query, our predicate, also normalized into lexemes. Before you can use Postgres's full text search features, you have to convert everything to one of these types. And these come with utility functions, like to_tsvector, which turns your document into tsvector form by normalizing it. We also have to_tsquery, which expects specific tsquery syntax that you have to know, or you can use plainto_tsquery if you have some plain text that you want to turn into a tsquery; that ignores capitalization and punctuation and all that. If you want to find out how these things happen under the hood, you can use ts_debug, and that will show you how everything is converted and which token type each piece has been converted into by the dictionary you're using.
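Those utility functions in action — a sketch of the normalization you can try yourself (the sentence is just an example):

```sql
SELECT to_tsvector('english', 'A nice day for a car ride');
-- 'car':6 'day':3 'nice':2 'ride':7   (stop words dropped, positions kept)

SELECT plainto_tsquery('english', 'I am riding!');
-- 'ride'   (punctuation and stop words ignored, the word stemmed)

-- shows each step: token, token type, dictionary used, resulting lexemes
SELECT * FROM ts_debug('english', 'riding');
```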
The operator types we have for full text search are: the double at sign (@@), which means the tsvector matches the tsquery on the other side — and you can turn it around and make it tsquery matches tsvector; it works both ways. We have tsvector concatenation (||), and for tsqueries we have the operators AND (&&), OR (||) and NOT (!!) that help you formulate your query. We also have the followed-by operator (<->), which means tsquery1 must be followed by tsquery2; that's a way to define word order in your search. We also have contains (@>) and contained in (<@). Now, we've mentioned dictionaries multiple times. What are dictionaries in Postgres? They're programs. They accept tokens as input, and their output is the normalized text that we need. They improve search quality, as we said, by eliminating stop words and normalizing into lexemes, and they reduce the size of your tsvector, which is good. You create them with CREATE TEXT SEARCH DICTIONARY and a name, choosing one of the many available templates. The simple template doesn't attempt to do anything smart: it just separates words by whitespace, which is useful if you want to match things like proper names that are not words in any language. You can also define the set of stop words for your text search configuration, and you can chain dictionaries in whatever order is convenient for you. So you can have it first perform normalization using the Ispell dictionary and then the simple one, and you can assign different weights to each. That way you can tell apart the proper names that were kept by simple but would have been eliminated by English Ispell, and weight them differently in your results. And these come with your system, because on top of the ones included in Postgres we can use the open source Ispell, MySpell and Hunspell dictionaries. So let's do some text matching. SELECT to_tsvector('a nice day for a car ride') as our document, and our query is 'I am riding'. Does it match?
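A sketch of these operators and of creating a dictionary. The dictionary example assumes en_us Ispell-format files are installed in Postgres's shared directory, which they are not by default:

```sql
-- tsvector @@ tsquery works in either order
SELECT to_tsvector('english', 'a nice day for a car ride')
       @@ to_tsquery('english', 'day & ride');                       -- true
SELECT to_tsquery('english', 'day & ride')
       @@ to_tsvector('english', 'a nice day for a car ride');       -- true

-- followed-by enforces word order
SELECT to_tsvector('english', 'car ride')
       @@ to_tsquery('english', 'car <-> ride');                     -- true
SELECT to_tsvector('english', 'car ride')
       @@ to_tsquery('english', 'ride <-> car');                     -- false

-- a dictionary built from Ispell-style files, with an English stop list
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = en_us,
    AffFile   = en_us,
    StopWords = english
);
```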
Yes, it matches according to our English dictionary rules. But why? I want to find out what to_tsvector generated that made it match. to_tsvector generated these lexemes: car, day, nice, ride; and to_tsquery reduced the query down to ride. So yes, it matches; it's one of those. Let's see something that doesn't match. If I select to_tsvector of 'a nice day for a car ride' and my query is 'I am riding a bike', does that match? No, because plainto_tsquery stripped my query down to the two essential words, ride and bike, and bike isn't included. So the document isn't relevant to our search, and it's not returned. Another example we can look at is 'starman' versus 'star'. They do not match, because according to our dictionary they don't belong to the same word root. But I can cheat and say I'm interested in all words that begin with star: I can write to_tsquery('star:*'), and that matches, because it accepts everything beginning with star. Now, you'll notice I'm not always writing to_tsvector and to_tsquery here, because Postgres sometimes does it for you: when it's able to, it casts these to the proper types automatically. We also have the wonderful function websearch_to_tsquery, which attempts a Google-style query: you can use AND and OR, you can use double quotation marks to mark phrases, and you can use minus signs to exclude results you don't want, like in a web search engine. So '"stray cat" -"cat shelter"' makes our tsquery become stray followed by cat, and not cat followed by shelter, because I'm interested in the stray cats but not in cat shelters, for the purposes of our search of course. Sorry — the question was whether this means 'followed by' or 'immediately followed by': the <-> operator means immediately followed by.
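The prefix trick and the web-search syntax, sketched out:

```sql
-- prefix matching: accept every word that begins with 'star'
SELECT to_tsvector('english', 'starman')
       @@ to_tsquery('english', 'star:*');                 -- true

-- Google-style syntax: quoted phrases become followed-by,
-- a minus sign becomes a negated sub-query
SELECT websearch_to_tsquery('english', '"stray cat" -"cat shelter"');
-- 'stray' <-> 'cat' & !( 'cat' <-> 'shelter' )
```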
But if, instead of the minus sign in the middle of the operator, you put a two or a three — <2> or <3> — that means two tokens away or three tokens away. So you can define the distance you want. Let's use an example table of the contents of the pgsql-hackers mailing list. The contents have an id, a parent id (which email you replied to), and a sent-time timestamp. What we're interested in here are the subject, the author and the body of the email in plain text. How big is this? It's 478 megabytes. We also mentioned ranking results by how relevant they are. For ranking we have ts_rank and its cover density variant, ts_rank_cd; cover density additionally takes into account how close together your terms are in the text, and bumps up the relevance of results where they are closer together. But anyway, let's use ts_rank for now. From our example table, I select the subject and the ts_rank of our email body converted to a tsvector — and you'll notice I'm putting coalesce in here so that I get meaningful results even if an email has a null body. I'm trying to match this against the word aggregate. I'm looking for the top five subjects, because I'm doing ORDER BY ... DESC LIMIT 5: the top five subjects of emails that contain the word aggregate. And when I run it, it produces a rank and tells me that these were the most relevant subjects — excuse me, these were the emails whose bodies were most relevant to the word aggregate. I know that's not a very useful search, because that word is going to show up a lot. But these are the subjects of the emails. Sorry — 32 is the scaling of the result; it's a bit mask that defines how the weights are applied to your ranking algorithm. These are all documented: as you may know, Postgres has excellent documentation on the web, and there are links to it at the end of the presentation. Let's look at some statistics.
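A sketch of the ranking query described above; the table and column names (mail_messages, subject, body) are assumptions based on the description of the example table:

```sql
SELECT subject,
       ts_rank(to_tsvector('english', coalesce(body, '')),
               to_tsquery('english', 'aggregate')) AS rank
FROM   mail_messages
WHERE  to_tsvector('english', coalesce(body, ''))
       @@ to_tsquery('english', 'aggregate')
ORDER  BY rank DESC
LIMIT  5;
```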
We have the function ts_stat, which can be used to verify that our text search configuration, dictionaries and everything else work well for our purposes. If we select from ts_stat and pass it a query returning tsvectors, because that's what it needs, it will find the number of documents, the number of words, and so on, and return, according to my configuration, the top words found in the documents. The most common words were use, wrote, would, think, patch and so on, which is what you might expect from the pgsql-hackers mailing list. We mentioned indexing, and we have to recognize that B-tree, which is the default index type in Postgres, isn't all that suitable for full text search. You can use it for a limited sort of text matching and searching: if you create the index specifying varchar_pattern_ops, you can match left-anchored text, and if you reverse the string in the index expression, you can search from the end of the string inwards. It's less useful than full text search, but it is one way of doing it. For full text search we have the GIN index, an inverted index, which means each lexeme has one entry in the index. It's quite large and slow to update, so it's better used on less dynamic data, and you apply it to tsvector columns. We also have GiST, the generalized search tree index, which is lossy. It's smaller on the same data set, but it's also slower, because as a lossy index it produces false positives, and to eliminate those false positives Postgres has to go back to the row and determine whether it's actually a useful result or not. So it's a bit slower, and better suited to fewer unique items: if you have a limited number of distinct things you're searching for, GiST might be the better index for you. You can apply it to tsvector or tsquery columns.
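Sketches of both of these: ts_stat over the whole corpus, and the limited B-tree approach. The mail_messages/body/subject names are assumed from the earlier description (text columns take text_pattern_ops rather than varchar_pattern_ops):

```sql
-- top lexemes across all documents, per the current configuration
SELECT word, ndoc, nentry
FROM   ts_stat($$ SELECT to_tsvector('english', coalesce(body, ''))
                  FROM mail_messages $$)
ORDER  BY nentry DESC
LIMIT  5;

-- limited B-tree text matching: left-anchored patterns only
CREATE INDEX ON mail_messages (subject text_pattern_ops);
SELECT subject FROM mail_messages WHERE subject LIKE 'Re: %';

-- reversing the indexed expression lets you anchor from the end instead
CREATE INDEX ON mail_messages ((reverse(subject)) text_pattern_ops);
```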
Now let's see what unindexed full text search looks like. I'll do an EXPLAIN ANALYZE, which shows me how my query is being executed. My query is a SELECT count(*) FROM mail_messages: find me all the emails that contain the word aggregate. When I run it, it does a parallel sequential scan, because recent versions of Postgres can scan tables in parallel, and specifically in version 12 the just-in-time compiler is enabled, which also speeds things up; that's why I've highlighted it here. The execution time is horrible: it took almost 27 seconds to look through these 485 megabytes of emails and determine which ones contain aggregate. Indexed, all we have to do is CREATE INDEX ON mail_messages USING gin — that's the syntax for specifying the GIN index type — over our subject concatenated with our body. You can do that. I haven't done it in the examples I'm actually running, but it's a possibility: you might want to index the subject as well as the body, because people often use words in the subject that they don't repeat in the body of the text. You can index both of them as a tsvector using the English language rules. But in Postgres 12 you can also use a generated column: instead of converting everything to a tsvector on the fly like this, you can alter the table and add a column of type tsvector, and the new syntax is GENERATED ALWAYS AS ... STORED. For now, Postgres only supports stored generated columns; that means they live on your disk, they're not generated on the fly. So: as tsvector, with coalesce to get rid of nulls, concatenating this with a blank space and the body. That will keep your tsvector column updated all the time, even though you're only updating the underlying text columns. And then you just create an index on that generated column.
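The generated-column approach described above, sketched out; the column name fts is made up, and the table and column names are assumed from the example:

```sql
-- Postgres 12+: a stored generated tsvector column, kept in sync automatically
ALTER TABLE mail_messages
  ADD COLUMN fts tsvector
  GENERATED ALWAYS AS (
      to_tsvector('english',
                  coalesce(subject, '') || ' ' || coalesce(body, ''))
  ) STORED;

CREATE INDEX ON mail_messages USING gin (fts);

-- the search can then hit the index directly
SELECT count(*) FROM mail_messages
WHERE  fts @@ to_tsquery('english', 'aggregate');
```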
Now let's look at GiST. Does it help our search? It's the exact same search, for the word aggregate. It did help: we can see a bitmap index scan was used and our search was much faster. But you'll notice we have rows removed by index recheck. That means it had to go back and figure out whether those results were useful or not, so it's a bit slower than it could be; it's not optimal for this sort of search. The execution time was 5.6 seconds, about 4.8 times faster than not using an index. It's still slow, because it's not the suitable index type for this search. Let's use GIN instead. EXPLAIN ANALYZE again: 5.6 milliseconds — milliseconds this time, not seconds. So we're seeing a significant difference here. It uses a bitmap index scan of our indexed column, and it's approximately 4,700 times faster than not using an index. You know you're doing something right when you get results in less than a millisecond. Less than a second, excuse me. So, GIN and GiST index operations. For tsvector, GIN indexes the @@ operator, so it indexes matching, as we saw. For jsonb, binary-stored JSON in our database, it supports the existence operators and also the containment operator, so these are operations that will be very fast if you've indexed your jsonb column with GIN. And GiST again supports @@ for text matching, but it also supports the containment operators for tsquery, whether one contains the other. Excuse me. Some super useful modules that come with Postgres are pg_trgm, which we'll look at in the next few slides, and unaccent, which removes accents and diacritics from your text so that you can match a u with an umlaut to a u without one, if that's something you need to do. For instance, many search engines do that: they remove accents before returning results.
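The unaccent module in a one-line sketch (the sample string is made up):

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT unaccent('Hôtel Crème Brûlée');  -- Hotel Creme Brulee
```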
And fuzzystrmatch gives you useful metrics such as string similarity: it can return things like the Levenshtein distance between two strings, how different they are. It also supports Soundex, Metaphone and Double Metaphone. To be totally honest with you, I haven't used it for that purpose, and there is a warning in the Postgres documentation that Soundex, Metaphone and Double Metaphone may not work well with UTF-8. It works like this: SELECT name FROM users WHERE levenshtein of 'Stephen' against the name column is less than two. So if something is less than two characters different from Stephen, please return it; I'm essentially looking for Stephen spelled with a v. Other index types that exist follow the "drinks" naming theme, and there is a proof-of-concept index called VODKA. I don't know the state of its development right now, but it offers advanced features. And may I mention here that most of this text search functionality was implemented by our friends in Russia, community members who have contributed significant portions of the indexing and full text search code. We also have RUM. RUM is currently maintained, and you can find it on that dodgy-looking Russian website. It stores positional information for lexemes, which means it can perform faster ranking — because our other indexes are not that useful for ranking — and faster phrase search, because you're looking for consecutive things. It also stores distances between timestamps, floats and money, so those operations become very fast with its distance operator. Now, one use case I came across this year was free text that was not natural text: a text column containing arbitrary, generally non-human-readable strings, which contained keywords that needed to match something. Basically, the user was searching for strings they didn't know beforehand, so you couldn't index based on those strings; you had to search through the entire thing. Think keywords in device logs.
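A sketch of the fuzzystrmatch example; the users table is hypothetical, and I've used <= 2 as the threshold so that 'Steven' (which is two edits away) actually qualifies:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT levenshtein('Stephen', 'Steven');  -- 2 (one substitution, one deletion)

-- names at most two edits away from 'Stephen'
SELECT name FROM users
WHERE  levenshtein(name, 'Stephen') <= 2;
```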
Dictionaries are not very helpful here, because we're not talking about language. So I created an arbitrary example with around 10 million rows of roughly 100-character IoT device log entries — fake, of course. Some contain strings that are significant to the user, as we said, but the user doesn't know which strings until the last moment, at runtime. We populate the table with random hex codes and make 1% of the log entries contain a significant keyword; I chose the keywords from a dictionary of common English words. So they look like this: something hidden in the middle of all the non-human-readable stuff. SELECT message FROM log_entries — barely readable, barely useful for full text search. And it's about 1.4 gigabytes. The query we're interested in speeding up is a SELECT * FROM log_entries where the message is LIKE '%source%'. So you don't know anything: you're interested in the word source, but you don't know what precedes it or what follows it. How long does this take? The parallel sequential scan shows it took 9.6 seconds. As we said, too slow. How do trigrams help us? We mentioned that an n-gram is a probabilistic language model based on Markov chains, and three characters make an n-gram a trigram. What it can show you is the similarity of alphanumeric text, for instance by counting the number of shared trigrams between two strings. You use it by creating the extension pg_trgm. By selecting show_trgm you see what it converts your source word into: it splits it into these trigrams, and you'll notice some of them are padded with whitespace. Create an index on log_entries using GIN, and I specify gin_trgm_ops here. Did it help? It's about 37,000 times faster than scanning through the text. So this index type helps you search for non-language text. It also works with regular expressions, so instead of LIKE, you could have used a regular expression here.
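The trigram setup described above, as a sketch; log_entries and message are the names assumed from the example:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- the word is padded with spaces and split into three-character groups
SELECT show_trgm('source');
-- {"  s"," so","ce ",our,rce,sou,urc}

CREATE INDEX ON log_entries USING gin (message gin_trgm_ops);

-- now an unanchored LIKE (or a regular expression) can use the index
SELECT * FROM log_entries WHERE message LIKE '%source%';
```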
But this comes at a cost, and the index is 1.6 megabytes — gigabytes, excuse me. It's as big as the table. But if you want rapid, immediate responses, that's the trade-off. Other tricks pg_trgm gives you: the similarity of two pieces of text as a number, the distance between two pieces of text, and a boolean that tells you whether the text is more similar than the similarity threshold you configure. And these are supported by GIN and GiST indexes as well. Postgres supports, as we mentioned, nearly every character set out there. You can find out what your client is using by querying pg_client_encoding(), and you can convert strings between source and target encodings. You can also have your client do automatic character set conversion by setting your client encoding to whatever you want; if conversion from the database's character set to your client's character set is possible, Postgres will attempt it automatically. Now we have to talk about collation, which is sort order; it differs between languages, even for the same alphabet. You can define the collation order and the character classification for a specific language per column, by specifying it at creation time, or during an operation. So you can select from a table with a specific German-language collation. That means you are not restricted by whatever configuration your database has in lc_collate and lc_ctype. And in Postgres 12 we have nondeterministic collations, which can also ignore accents when sorting text; that's a new feature. We mentioned other types of documents, such as JSON. It supports indexing, and it can be converted into a tsvector: you can use the function jsonb_to_tsvector, and that will treat numerics, key strings and booleans differently if you want it to.
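Per-column and per-operation collation, sketched; the table name is made up, and the "de_DE" locale must actually exist on the system for this to run:

```sql
-- per-column collation, fixed at creation time
CREATE TABLE de_words (w text COLLATE "de_DE");

-- or per-operation, overriding lc_collate for this sort only
SELECT w FROM de_words ORDER BY w COLLATE "de_DE";
```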
Also in Postgres 12, we have the SQL-standard SQL/JSON path language that you can use to perform searches on JSON, and we also have jsquery, which is a different query language with its own support. Finally, make sure you VACUUM ANALYZE, so that your table statistics are up to date and your pending GIN entries are merged into the main index — for update performance reasons GIN keeps a pending list on the side, and only vacuum folds it into the index. You need to keep your statistics accurate: you can set the number of distinct values if you know it, and declare correlated columns that influence the results of your searches. And you do need to run EXPLAIN ANALYZE from time to time, because you don't know whether a query that works now will still work well in a year's time. And maintenance work is very significant here, because we're talking about huge index creation times, especially for GIN indexes. Now, one final thing: the curious case of the type name. In the code I was maintaining, I found something like CREATE TABLE users (id serial, text name). Wait. That's the type name, and it appears to work fine. But it looks like, instead of typing "name text" — a column named name, of type text — they typed a column named text, of type name. That is an internal Postgres type that will store strings, but only up to 63 bytes, so you'll have nasty surprises if you use type name for your text. So don't be sleepy when you're creating tables. Thank you very much. So, the question is whether there's a difference between indexing very large values versus very many values with small content, right? As we said, GiST is better at indexing things that are fewer and unique, whereas GIN is better with large pieces of text that are unstructured or random. Sorry, can you repeat that?
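The name-type pitfall described above, as a sketch (table names are illustrative); you can confirm the accidental column type from the catalogs:

```sql
-- what was meant: a column called name, of type text
CREATE TABLE users (id serial, name text);

-- what was typed: a column called text, of type name (internal, 63-byte limit)
CREATE TABLE users_oops (id serial, text name);

SELECT atttypid::regtype
FROM   pg_attribute
WHERE  attrelid = 'users_oops'::regclass AND attname = 'text';  -- name
```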
The question is how you index documents when you don't know which human language they're in, and the answer is complex: you essentially have to use all the dictionaries you know, so that you can match against each of them, and then produce weights, so that based on the ranking you can decide which language is most probably being used in the text.