Goodness knows, that's a hard task. Thanks for staying for the last presentation of the day; I appreciate the endurance, and I'll do my best to make it worthwhile. What we're talking about here is full-text search, and we're going to show two paths. One: your time elapsed from right now to getting it up and running with all default options is really, really short. Then we're going to show that you have a whole bunch of configuration options, so there's some razzle-dazzle at the end. Just so we're on the same page, a level set: what exactly is full-text search? Right there: it provides natural language search capability, really natural-language-ish, and the ability to sort documents by relevance, not just existence. We'll show why that relevance piece is really, really important. Here's the business case. Why do I care so much about full-text search? What makes it so useful? First, when we write applications, it lets us focus on the semantics, not the syntax. In other words, when I look for something, I don't want to get bogged down in uppercase, lowercase, ILIKE expressions and everything like that. I want to focus on the meaning, and we'll show you how full-text search lets the application developer and the person writing the query focus on what it means. The other thing: full-text search is a crowded space. There's Apache Solr, Lucene, Sphinx, other pieces out there, great products in their own niche. But one thing none of them can do is leave all your data right where it is, which is to say in Postgres. So the preferable solution is independent of your platform: it's always faster to do nothing than to do something. If you can get that capability within Postgres, it's better than having to do it outside.
Then we'll compare it to those dedicated machines; the other thing we're comparing it to is just standard SQL searches, and we're going to show that with almost no effort you get about a 20 times performance improvement. And it's fast enough for nearly everything. That's how to get results. Then let's really focus on the ranking. What makes it so neat is, again, we're used to focusing on existence. Our whole thought process as database developers and administrators is: do I have a record? Change your focus here. We're now focusing on relevance. I assume I got a crap ton of stuff back, sorry, but what of that is most important? We have a full suite of tools to address that side of it. We have parsers, stop word lists, synonyms; don't worry about the definitions, we'll get to them through the presentation. We have a whole suite of ways to customize it, and it's at the statement level. That means that against the same set of data, you can write different applications that make it look and return results differently. And like everything else in this conference, Postgres full-text search is extensible. All right, we have a packed agenda ahead of us. The slides, I'll show you where they're posted. We're gonna have a lot of live action, and there are some SQL statements in there. Please don't focus on the syntax; if you're interested, go back and see it later. What I wanna focus on is the functionality. The idea is you can say: if this is cool for my business application, I'll learn the syntax, but not necessarily the other way around. I did some data wrangling to get the data set we're using here, and that's also available. We're gonna show you how to use triggers, how to manually create the full-text search piece, and compare some things. Also, it is likely that we will not get through all the slides in this presentation. Again, all the materials are up there on the website.
What I wanna focus on, our objective here, is for you to say: holy crap, that's really neat, I didn't know it could do that, I want my app to do it. All right, stuff about me. All this is running on 9.4.1, the one-click installer from EDB, on a CentOS VM. I have two sets of documents. In the world of full-text search, a set of documents is called a corpus. So for the purposes of these 40, 50 minutes, our corpus is two things. From the Library of Congress, there are 18 documents in American history. I sort of didn't get the gist that we're such an international conference, so for those of you outside the States, learn some American history. And 38 articles from a blog post on Data Science Central, from Data Science Girl. An interesting domain to me, anyway. Those are the pieces we'll be searching. pgAdmin for everything, just a few Linux commands. I cut and pasted all this stuff from the web. The data science articles all started out as PDF files, and I used the Linux utility pdftotext. What I wanted to focus on here is that if you're gonna use full-text search, it includes a bit of this data wrangling. There's no two ways about it; fully populated tables don't just arrive on their own. All right, enough of this setup. How the heck does this work? Full-text search takes a text document and busts it up into pieces, tokenizes it; by default it breaks on white space and non-word characters. It takes each of those pieces and derives what they call a lexeme, which you can think of as the root of the word. It takes away tense, it takes away singular and plural, it takes away all those pieces. Smarter people than I said the lexeme is the basic unit of meaning of the word. What this allows you to do, this is that natural language piece: I can focus on what the word means. I don't care what tense it is in my document. As an example, look at all the words that lead to the lexeme jump.
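As a minimal sketch of that stemming behavior (any psql session against a stock install will do), to_tsvector reduces every surface form to the same root:

```sql
-- Several forms of the same word all stem to the single lexeme 'jump'
SELECT to_tsvector('english', 'jump jumps jumped jumping');
-- result: 'jump':1,2,3,4
```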
And they're case-insensitive. We're gonna show some examples here, but keep this in mind: in all these searches, we're looking at the root of the word. Also, all of this will be done in English, but that's just a configuration option. All right, so I have my document, I tokenize it, I take those tokens, strip the endings off of them, and then I put them into a vector. And that vector is where I get my speed. The vector is a sorted list of all the lexemes, with just a number for what position each one occupies in the document. We'll show you some examples of that. In other words, we're not searching the full document; we're searching the tsvector. And against that I apply a query, called a tsquery. That is also these lexemes, these word roots if you will, only in the tsquery I add Boolean operators: and, or, not, all the stuff you would expect. Postgres has the @@ operator that matches a tsquery I built against the tsvector, and that gives me my result set. We'll show you some examples; it'll make more sense when you see it. All right, so we use the built-in function to_tsvector that turns a text document into this tsvector. That's the sequence of events. All the pieces are configurable, but for our purposes I really want to focus on the fact that by default they work pretty darn well. So apply Occam's razor: see if it works by default. If it doesn't, add some razzle-dazzle. All right, one thing I forgot to add. There is the capability in a tsvector to add weights. What that means is, a match in, for instance, the title of a document matters more than a match in the abstract, which in turn matters more than a match in the body. We're gonna build some weighted tsvectors. All right, and this whole thing from the last two slides is, in Postgres, called a text search dictionary.
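That whole pipeline can be sketched in one statement; the @@ operator matches a tsquery against a tsvector:

```sql
-- Document side: tokenize, stem, record positions ('foxes' -> 'fox', 'jumped' -> 'jump')
-- Query side: lexemes joined by a Boolean AND
SELECT to_tsvector('english', 'The quick brown foxes jumped')
       @@ to_tsquery('english', 'fox & jump') AS matches;
-- matches = true
```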
Think of it as a collection of all the rules you're gonna use to set up and use full-text search. All right, and we talked about this: we have the @@ operator. I'm getting into too much stuff; let's look at some examples. We know what time it is, and it never works as planned. All right, I have a database, FTS. And here's my tsvector. Look at this: all those words have the same lexeme underneath. That's your power. Enumerate, enumerated, enumeration. American history students will know the enumeration clause shows up twice, in the Constitution and the Bill of Rights. That's what we're gonna look for. But that's the semantics piece: don't care about the syntax, it's the semantics. And also check this out. In this simple statement, I did a couple things automatically. One, I threw away stop words. Stop words, in this lexicon, are words that show up in all your documents, which means they can't be discriminators. In other words, the word 'the', since it shows up everywhere, will never add value to your search. So full-text search has a list of these stop words, which is configurable, and it tosses them out. You're filtering noise out automatically. You can also see the roots at the bottom there, and it shows what positions they're in. That's a tsvector. It's a much more compact representation of your text document. For fans of Samuel Taylor Coleridge, this is from The Rime of the Ancient Mariner; what we can see is that repeated words only show up once. There's our tsvector at the bottom: the word 'alone' is in the first, second, fifth and sixth positions. And this, in our mind, is what we're building with to_tsvector. Because I'm a fan of Aretha Franklin, and who isn't, that is what the first line of Chain of Fools comes down to. What a wonderfully compact representation of that piece. All right, just wanna show you one quick example. Remember, we're focusing on tokens that are words.
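For the Coleridge line, the compact representation looks roughly like this sketch: stop words such as 'all', 'on', and 'a' are dropped, repeated words collapse into one entry, and the original positions are preserved:

```sql
-- Repeated words appear once, with all their positions attached
SELECT to_tsvector('english',
       'Alone, alone, all, all alone, alone on a wide wide sea');
-- 'alon':1,2,5,6 'sea':11 'wide':9,10
```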
There are 23 different kinds of tokens; most are beyond the scope of this class. What I wanted to show you here, we're parsing a sentence about Microsoft; you can read it. The point is that full-text search understands URLs, understands IP addresses, understands domains. So Microsoft.com got tokenized into: oh, there's a word, Microsoft, that's a lexeme, and a host name, Microsoft.com. Those other pieces are beyond the scope of this talk. But you have the ability, as a teaser example, if you have numbers in there, to say: store numbers to three significant digits. So if I search in my full-text search document and they match to three significant digits, that's a match. It does that by recognizing other data types. Okay, so here is a tsquery. 'Liberty and justice for all': you see 'and' and 'for' are stop words. Don't care, they're not good discriminators. You see the two roots and the Boolean operator between them. But that's a little awkward to write. So here's the same answer; the function is plainto_tsquery. The limitation is that it's all AND: the Boolean operator between all those terms is an AND. The advantage is, holy cow, is that a heck of a lot easier to write than some gosh-forsaken SQL statement. That's all you need to do. And I'll show you that; I'm sure you would trust me, but trust but verify. These two are functional equivalents: 'Four score and seven years ago', they produce the same result. And again, through the whole thread, what we're after are ways to reduce the space between what the querier has in their mind and what they have to type. All right, let's jump back. We did all these, this is part of it. And then we return. All right, so you say: I'm sold, I want them, but I want tsvectors bigger than one line. So how do we create them? I'm a consultant, so I have to say: it depends. The simple way, you can create them at run time. In other words, don't create them ahead of time.
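The plainto_tsquery shortcut versus writing the operators yourself can be sketched like this; both produce the same ANDed lexemes:

```sql
-- Hand-written tsquery: you supply the Boolean operators
SELECT to_tsquery('english', 'liberty & justice');
-- 'liberti' & 'justic'

-- plainto_tsquery: plain text in, stop words dropped, everything ANDed
SELECT plainto_tsquery('english', 'liberty and justice for all');
-- 'liberti' & 'justic'
```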
When you run the query, use the to_tsvector function. It's simple and doesn't require any storage, but it's slower. Or you can create them ahead of time with to_tsvector, meaning you manually decide when to run it. That is fast and flexible, and it doesn't slow ingestion the way a trigger would, but the limitation is that if you update manually, it's possible for the document and the tsvector to get out of sync. Or you can create them ahead of time with a trigger. Fast and painless, but the trigger slows down ingest. This is your decision to make. There are two triggers to maintain tsvectors that come with Postgres: tsvector_update_trigger and tsvector_update_trigger_column. We're gonna show those, and then I'll write a custom one for weighted vectors. But this is really the business case. One thing you'll never hear if you're using Lucene or Solr or something: they never say it's easy and invisible to you. Whatever solution they have requires some effort. But finding documents isn't good enough; I want it wicked fast. I have two types of indexes, as with everything else: GIN and GiST. There are advantages and disadvantages. We're not gonna get into the particulars, but GIN is a much faster index to search; it's also much bigger and takes longer to create, and GiST is just the opposite. I gotta give a shout-out, never met him, but Tomas Vondra has a blog post on Planet PostgreSQL with a lot of details on high-performance full-text search. I'm more into the functionality, but keep in mind there are options. The documentation is quite good on this piece. All right, so let's build some tables. All right, search there. I still struggle; I told my boss I knew I'd struggle with this piece, going up or down. All right, so here's our first table, American history. What I wanted to point out here is tsvector: I'm creating a column, document, with the data type tsvector.
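The trigger-maintained approach can be sketched like this; the table, column, and index names are illustrative, not taken from the talk's actual scripts:

```sql
-- Illustrative table with a tsvector column alongside the raw text
CREATE TABLE american_history (
    id       serial PRIMARY KEY,
    title    text,
    body     text,
    document tsvector
);

-- GIN index: bigger and slower to build than GiST, but faster to search
CREATE INDEX idx_american_history_fts
    ON american_history USING gin (document);

-- Built-in trigger keeps 'document' in sync with 'body' on INSERT/UPDATE
CREATE TRIGGER tsv_update BEFORE INSERT OR UPDATE
    ON american_history FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(document, 'pg_catalog.english', body);
```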
If anyone hasn't seen that other data type, citext, it's awesome: case-insensitive text. It works in certain circumstances. If you don't want to use upper or lower everywhere, just keep it as case-insensitive text. There are some performance implications, but it's generally fine. I comment my tables. We're gonna use a GIN index here, and this is what I wanted to show you. Again, don't worry about the particular syntax; that's all available on the website. What I wanted to show you here is that I am using the built-in tsvector_update_trigger function as my trigger function. Again, everything we're seeing here is right out of the box; there's the reference to the documentation section. Now, let's get a table with a bit more razzle-dazzle. This is the table for the data science articles, and one thing I wanted to tease in for a couple of slides: I have a field for the abstract, a field for the body, and a field for the title. Looking ahead, I want to weight; I want to use that in my weighting. Does that make sense? In other words, when I build my query, I want to say: hey, if my lexeme is found in the title, it means more. Well, to do that, you have to have the data separated. That's on you. Not hard, but it is on you. And then I'll create an index. Here, let me show you. All right, so here's my custom function. The first one was created right out of the box, nothing custom; here's my custom function. What I wanted to show you is I take the first field, title, and I give it a weight of A. The second field, you can't see it, but it gets a weight of B. And the whole body gets the default weight, which is D. I only use the defaults; they work great. Details are in the documentation and in the presentation. But what I wanted to show you is you only have to do this once. From here forward, I have a weighted tsvector to work with.
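A hedged sketch of such a weighting trigger, assuming columns named title, abstract, and body (the exact function from the talk isn't reproduced here):

```sql
-- Custom trigger function: weight A for title, B for abstract, D (default) for body
CREATE FUNCTION articles_tsv_update() RETURNS trigger AS $$
BEGIN
    NEW.document :=
        setweight(to_tsvector('english', coalesce(NEW.title,    '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.abstract, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(NEW.body,     '')), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

-- Illustrative table name; fires on every insert or update
CREATE TRIGGER articles_tsv BEFORE INSERT OR UPDATE
    ON data_science_articles FOR EACH ROW
    EXECUTE PROCEDURE articles_tsv_update();
```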
Let me see if I got anything else. All right. And then here's my trigger. I think I had loading the documents as well. So now we have our documents set up, we have triggers to populate everything automatically, and I'm gonna load my documents. I use pg_read_file for the text files and pg_read_binary_file for the PDFs. I could envision it being useful to want to keep the PDF itself. There are some data-wrangling limitations on that: the files have to be under PGDATA, but symbolic links work, so it's not such an onerous restriction. There's an example of the syntax. We talked about how the weighted document required you to have the sources separated out. And, because we can't help ourselves, I couldn't help myself, I'll show you a little razzle-dazzle on the loading, with a Postgres function that looks at the operating system and does an ls on it. All right. Boy, I wish I had more resolution, but I don't. I will use this SQL statement to generate these SQL statements to load my files. You can see how it works. I wanted to point this out because, really, my objective: it makes it easy. So this is the American history documents; it's only one file for each. I synced everything up with Dropbox. Control-A, delete. Pardon me for this. Let's do this one. Let's do one more for the data science articles: a slightly more complicated SQL statement. But like most DBAs, I feel more comfortable in SQL than anything else. So that was just SQL statements: it goes to the operating system, does an ls, finds the files there, and builds these statements for you. Gotta live where you're comfortable. What we just saw, by the way, was loading files from the operating system, creating tsvectors, in one case weighted tsvectors, and loading the PDF files. And I did that in front of an audience with text that was too big. The point here is it really isn't difficult. And I have one more piece that I have to do.
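That SQL-generating-SQL trick can be sketched like this; the directory path and table name are illustrative, and the path must resolve under PGDATA (symlinks are fine):

```sql
-- Build one INSERT statement per file found in a directory under PGDATA
SELECT format(
         'INSERT INTO american_history (title, body) VALUES (%L, pg_read_file(%L));',
         fname, 'docs/' || fname)
FROM   pg_ls_dir('docs') AS fname;
-- Run the generated statements to load each document;
-- the trigger populates the tsvector as the rows arrive
```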
In here, I load the titles of each document and some ancillary information; I created that separately. So what we're doing here is: I have the source of each document and the title of each document, and it's just based on the file name. If I have a list of files on my operating system, I need to know some information about them. That updates that piece. All right, so let's jump back to the presentation. Again, all the files were in PGDATA. I used a symbolic link, actually to my Dropbox location, which works pretty darn well actually. There's my syntax for pg_read_file; pg_read_binary_file works the same way. And pg_ls_dir, it's a pretty cool function. All right, so we loaded that. Couldn't help myself with the SQL, but again, we're database folks; SQL's our second home. Gotta go with the one you know. All right, holy crap, that was a lot of setup. So let's look at the American history documents with some full-text search. Our corpus, to use the term, is 18 documents. And let's say: how does this thing work? Here's a traditional search. If anyone's not familiar with the ILIKE operator, it's a case-insensitive LIKE; it also works great. So, there are documents with the word welfare in them. Turns out four of the 18 documents have welfare, but that's not what we came here for. This is what we came here for. Oh, you know what I should do first? Sorry about that. Just because we can. So we got our documents back, just for giggles: 4.06 milliseconds. Keep that in the back of your mind. Same four documents: 0.215 milliseconds. Again, no special anything. This is the default configuration of the VM, of the database, of all the pieces. We've just improved our search speed by approximately a ton. That's round numbers.
All right, but there are limitations to ILIKE search, and by ILIKE I really mean any traditional SQL search. Now let's look for the word enumerate. For those of you who are into American history, there's an enumeration clause in the Constitution which talks about our census: we need to enumerate our population. And in the Bill of Rights, there's a clause that says anything not explicitly enumerated ain't in there. But with the SQL search, it didn't show up. And that's all I got. So, same piece: it would be really hard to get those answers in SQL. It looks simple here, but that would be a hard thing to do. Again, I focused on the root of the word. All right, so let's look at two more pieces. 'Liberty and justice for all.' So there we have, trust me, there's eight of them, but we got a bunch of documents there. Very simple. I got the same results with just that. If someone wants to say, hey, how do I get documents with 'liberty and justice for all'? You just write 'liberty and justice for all'. You can't get any simpler than that. So we saw two really powerful things. One, I didn't know what form of the word the document had, and I guessed the wrong one in my SQL statement, so it didn't even come back. Two, that's a lot easier to get a user to write. And again, the whole point is: people who are interested in my corpus of documents, I don't want to teach them anything. I want to take what's in their head and capture it without modification, all right? So anyway, the traditional way would be really hard to explain. The other thing is, it only returned seven. I could never figure out why there was one document missing, so even knowing the answer ahead of time, it's complicated to do it the traditional way. And you don't have to trust me: it's like 20 times slower, there's a whole bunch of stuff. All right, so those were all existence searches.
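The contrast can be sketched in two queries; table and column names are assumed from the earlier setup:

```sql
-- Traditional search: matches only the literal substring, so 'enumeration'
-- is missed if you guess the wrong form of the word
SELECT title FROM american_history WHERE body ILIKE '%enumerated%';

-- Full-text search: 'enumerate', 'enumerated', and 'enumeration' all stem
-- to the same lexeme, so every form matches
SELECT title FROM american_history
WHERE  document @@ to_tsquery('english', 'enumerate');
```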
Now let's look at some Boolean operators, because the default was everything ANDed together. But that ain't good enough for us. So let's look for the things with the word colony. I have six things with the word colony, but what if I want to say colony, but not a foreign colony? Let me highlight the right SQL. All right, see the Boolean logic there? Everything with the word colony, but not the word foreign. Very simple to do. Now let's look at another one. A lot of French, a lot of British in our history; thank you guys. So if I search French and British, I only get one document back. Something doesn't smell right about that, so let's look some more. British could also be English, and I don't really know. So here's a way to say French, and either British or English. What I'm showing is everything you normally do with Boolean operators, except I guess that four-valued Boolean they were talking about at that other session. All the other standard things are in there. Make sense? All right, let me jump back. Did that. And again, ILIKE is a case-insensitive LIKE. It's really nice. All right, that was vanilla. Still cool, but it was vanilla, and I promised you ranked results. What we had before was just the existence of the document. Now we're gonna shift gears and focus on the relevance of the document. The rank is an arbitrary value between zero and one. Zero means it's not found; one is a perfect match. We don't care about the absolute number, we only care about the relative number. One thing to keep in mind with ranked results: it does more work than a typical LIMIT command. In other words, if I wanna rank, say, 100 documents, I have to look through all 100. With a LIMIT 5, I can just stop at five. So it's a more expensive operation than a plain LIMIT. I've got two flavors available, ts_rank and ts_rank_cd. Both of them consider how often a search term appears.
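Those two Boolean examples, sketched against the assumed table:

```sql
-- colony, but not foreign
SELECT title FROM american_history
WHERE  document @@ to_tsquery('english', 'colony & !foreign');

-- french, and either british or english
SELECT title FROM american_history
WHERE  document @@ to_tsquery('english', 'french & (british | english)');
```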
The second one, ts_rank_cd, also considers how close the terms are to each other in the document. We're not gonna use it, but there's another parameter that weights the rank by the log of the document size. Don't worry about all the details. What it says is: finding your words in a small document means more than finding them in a big document. Google does this; it's available here. You're not gonna see it in this presentation, but even without using it, ranking works well. Now, as part of the ranked results, I gotta set up this weighting piece. Remember, we built the weighted piece with just our trigger. So all this happened magically; it's already been done. You might have blinked and missed it, but it's already in there. There is the excerpt from our trigger: the title is an A, the abstract is a B, and the body is a D. One thing to very much keep in mind, that second bullet: the weighting has nothing at all to do with whether a result comes back. Unrelated, nothing at all. It only affects the rank of the result, if that makes sense. So if your objective doesn't include ranked results, I'll wake you up at the end of the weighting piece. Makes sense? I'm gonna show you how, but we're not gonna do it: I could, if I wanted to, build these weights at query time. There it is at the bottom, because tsvector has a concatenation operator that works on them. The reason that matters is, if you think about it, concatenating tsvectors isn't just sticking one onto the end of the other; it has to reorder everything and recompute the position numbers in the resulting concatenated document. Anyway, that would be the logical equivalent of how we did it with the trigger. All right, what does all this get us? I wanna focus on document relevance, not existence. My argument: customers' expectations have increased. They demand ranked results, because everything outside of my custom-built world has ranked results.
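A hedged sketch of a ranked top five query; ts_rank scores by term frequency (and any weights baked into the vector), while ts_rank_cd additionally rewards terms that sit close together:

```sql
SELECT title,
       ts_rank(document, query)    AS rank,     -- frequency (and weights)
       ts_rank_cd(document, query) AS rank_cd   -- also cover density / proximity
FROM   data_science_articles,
       to_tsquery('english', 'correlation & big & data') AS query
WHERE  document @@ query
ORDER  BY rank DESC
LIMIT  5;
```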
And the data and the logic are inside Postgres, which I learned at the keynote is pretty cool. And they're available to whatever application you use. All right, so let's go look at some of these things, shall we? What we're looking at now are the 38 documents related to data science that Data Science Girl said are really important for us in her blog. They're really cool; I was reading them on the train on the way here. All right, so if I don't do anything, this is what I wanted to show you here. This is not weighted, just vanilla, and I'm looking for: correlation, big data. Those are my LIMIT 5 results. Look at the top one, '10 types of regression'. It looks perfectly reasonable, but it isn't, and I'll show you why. There are our answers. Now, let's look at rank. Ranked results return a much different data set. This is, again, my unweighted document, so it doesn't matter where the term is; they're all weighted the same. Now my number one answer is 'The curse of big data'. I don't know if it's a curse, but whatever. So I wanted to show you: it wasn't just reordered results, it was a qualitatively different result set because it's ranked. Again, keep in the back of your mind, this was more expensive than the plain LIMIT 5, but that's it. So let's take it one step further and say: hey, remember, we built a weighted tsvector. And how hard did we work to build it? We wrote a trigger. We didn't do anything else. Now, look at this. The number one weighted result is 'Correlation and R-squared for big data'. Seemingly a pretty relevant title. Also notice that the top weighted result wasn't at the top of either of the others. So again, we don't have our user's attention for very long. I got four kids; I understand short attention spans. So I wanna show them something deliberately relevant. This allows you to do that. Thank you much. All right. Oops. All right. Again, no more work.
As a father of four, I'm all about the lazy. There we go. All right, so let's take a quick pause. This whole setup is called a text search configuration. The default one is pg_catalog.english, and there's how to see yours. We created weighted and unweighted vectors, and we did it either manually or with a trigger. But wait, there's more. Because we've been thinking like DBAs so far, and someone's gotta present this stuff. There's a function called ts_headline that'll take our result set and give it a little razzle. Because if you think about it, when I search for a document, I don't just wanna know that it matches; I wanna know where it matches, what's matching it. This is the one function we're gonna customize fairly extensively. ts_headline returns a text fragment, a piece with HTML tags in it. The defaults are just bold and unbold. There is the result for 'liberty and justice for all': you see, that's what you get, their highlight. And that is fine, in all senses of the word. But we're not here to be fine. So here are some options we'll look at, and I'll show you how this gives you some pretty cool results. How many fragments do you want? Configurable. What comes between my fragments? How many surrounding words go in each fragment? And I need something better than just bold, so let's make it red. Ten fragments, between five and ten words each, and I wanna put a break between fragments. Let's see what this gets us. So here is the question we're gonna answer: I wanna know what American history documents contain liberty, justice, happy. I want the title in my result set, and you can read the rest. I'll show you what it does. That is, because I ran it here. So there's our default. It returns four results. This came right out of Postgres; I did nothing to it. It's a pretty useful result set right there.
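Those ts_headline options, sketched together in one query; the red styling stands in for the talk's "better than bold" tweak:

```sql
SELECT title,
       ts_headline('english', body,
           to_tsquery('english', 'liberty | justice | happy'),
           'MaxFragments=10, MinWords=5, MaxWords=10,
            FragmentDelimiter=" ... ",
            StartSel="<b style=color:red>", StopSel=</b>') AS excerpt
FROM   american_history
WHERE  document @@ to_tsquery('english', 'liberty | justice | happy');
```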
Again, all I did was save it; I pasted it into a file on the operating system and I'm just reading it locally. There you see the pieces, the document title. Here are more fragments: how many times it matched in George Washington's farewell address and the Northwest Ordinance. So again, the emphasis is I always wanna do less. The only thing I customized here was my ts_headline, and the only place I customized it was in the query itself. And these are some pretty useful results, I would argue. Let me, whoops, you're probably not supposed to see that yet. ts_headline: those were the results from this query. Again, it comes right out of Postgres; I didn't do anything to it. That, with no customization except one SQL statement, has gotten us: documents loaded directly from our operating system, automatically indexed, automatically weighted, an index maintained with no effort on our side, support for natural-language-ish queries (it's not true natural language, but it's a lot closer than SQL), a query with ranked results, and well-formatted output. That's a lot. That really is useful. And again, once you have this set up, you don't do it again. All right, but since we're at a conference and everything's extensible, that's still not enough. Up to this point, what I was arguing, or sort of espousing, is: if that's good enough, run with that. If it isn't, we can start customizing the pieces that make up our text search dictionary. So we're gonna add some razzle-dazzle. We're gonna create a custom text search dictionary; that's the term that describes all this stuff. We can say: hey, what if my documents have different, non-default stop words? What if I have synonyms? What if I wanted my query to say that English means British or English? And we won't get to it, but it's in the slide deck: we can even use this to suggest corrections for misspelled words in the query.
So, a little bit of a sprint to the end. We won't cover all the pieces; they're all in the slides, but it's pretty neat. All right. So now, how do we do this? We're gonna focus on stop words, because that's a pretty common and simple one. There's a file that comes with Postgres listing all your stop words in English; they have them in other languages too. We're gonna customize it. It lives in $SHAREDIR/tsearch_data, and the option StopWords = english means use english.stop. So that's what we're gonna edit. One note, and you'll see it in the demo: every time you touch the source file, in other words every time I vi english.stop, I have to touch the text search dictionary. I don't have to change it; I just have to alter it to itself, because that forces it to re-read the file. You'll see what I mean. All right. We have a way to add stop words, but which stop words should I add? How do I know? We have a function called ts_stat, which gives us statistics on our tsvectors. It tells us which words are most common, because if a word is most common, it's not discriminating in my results, and which words show up in the most documents in my corpus. All right, let's do that, because it's pretty darn easy. So what we're gonna do is look at the data science articles. No surprise: what word shows up in data science articles a lot? Data. No surprise to us, but you can imagine in your own corpus, it's gonna be nice to throw out noise. So there's the noise in our data science articles. I wanna get rid of the word data, because it doesn't discriminate among my documents. And again, the term discriminate in this context means it doesn't help me separate documents that should be in my result set from ones that shouldn't. All right, here's a variation on it. I may also ask: hey, how many documents does it show up in?
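Using ts_stat to hunt for stop word candidates can be sketched like this; it takes the query as a string and returns one row per lexeme:

```sql
-- nentry: total occurrences; ndoc: how many documents contain the lexeme.
-- Lexemes at the top of this list (here, 'data') discriminate poorly
SELECT word, ndoc, nentry
FROM   ts_stat('SELECT document FROM data_science_articles')
ORDER  BY nentry DESC
LIMIT  10;
```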
It actually shows up in every single document, so it's absolutely worthless as a search term. So what I'm going to do is create a custom stop-word dictionary. But first, let's look at the file. pwd — $SHAREDIR is actually $PGHOME/share/postgresql, and we're in the tsearch_data directory today; this exists in your installation too. You'll see I have an english.stop, and I created a backup of it. So: vi english.stop, go to the end, put my name in so I know where my additions are, add "data", write-quit. Sorry about that noise. All right, one quick thing here: connect to that database. Here's another function, ts_lexize. It's like to_tsvector, only it checks one word at a time against one dictionary. So what I'm doing there is saying, give me the lexeme of "data" — and it says, that's a stop word. We know it wasn't one before, because it showed up in every single document. So watch the sequence of events: I identify which terms aren't useful and just add them to the end of the file — it doesn't matter where. If you add another word, you rerun the ALTER TEXT SEARCH DICTIONARY and then rerun this. The alter doesn't change anything; it's the example of "touching" something that I mentioned. All right, we're five minutes from the end, so we're going to stop here — I told you there's more material. This section focused on changing the tsvector, in other words, changing what's stored in the database. I could also flip that around and change my tsquery instead, which is to say: if someone submits a query with the term "data", don't let it through. Let's try that. There's a function called ts_rewrite — we'll get to this and then we're going to stop. It checks a table that we're going to create, fts_alias, and says: instead of this term, use the alias.
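The touch-and-check sequence looks like this — I'm using the built-in english_stem dictionary here rather than the talk's custom one, so treat the dictionary name as an assumption:

```sql
-- After appending 'data' to $SHAREDIR/tsearch_data/english.stop, force
-- the dictionary to re-read its stop word file by altering it to
-- itself — no actual change needed:
ALTER TEXT SEARCH DICTIONARY english_stem (StopWords = english);

-- ts_lexize runs a single token through a single dictionary.
-- A stop word comes back as an empty array:
SELECT ts_lexize('english_stem', 'data');     -- {} once 'data' is a stop word
SELECT ts_lexize('english_stem', 'running');  -- {run}, the stemmed lexeme
```

The alter is cheap and idempotent, which is why "touch it every time you edit the file" is a workable habit.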
If you put an empty query in the second column, the term effectively becomes a stop word. So let's create this table, fts_alias — boom — and insert. And what I wanted to show you: there's my cast. I'm casting directly to tsquery instead of using the to_tsquery function. You'll see the notice down at the bottom: it says the text-search query doesn't contain lexemes. It's a notice, not an error. Now let's look at this. I have ts_rewrite. Remember, in the previous section, by adding the word "data" to english.stop, I changed my tsvector — that changes what's stored in the database. Here's the equivalent functionality on the query side: I created a table, and I tell ts_rewrite to look in this table to see what substitute to use for "data". The good news is that table has a lot of other capability: you can put anything on the right side. What it says is, if the query has the term on the left, substitute the term on the right. In our case we used it as a stop word. So, what we did: a little focus on how to get the files in there, because you can't do anything like this without a bit of data wrangling — you saw it wasn't difficult, and that's the piece I wanted to get across. Maintain your full text search tsvectors three different ways: create them on the fly in your queries, create them manually with to_tsvector, or create them with a trigger function when you load the data — you get two trigger functions built in, and it's not hard to create a custom one. We compared full text search and plain SQL: remember the term "enumerate" — when we searched for "enumerate" with SQL, it didn't show up, so you get functional differences even on existence. And then we showed you can rank your results: remember, in our top-five query, the number one unranked answer didn't even appear in the weighted results.
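A sketch of the alias-table approach, per the talk's description (table and column names are my guesses at the demo's; the behavior — an empty substitute making the term vanish — is what the speaker demonstrates):

```sql
-- If the user's query contains the tsquery on the left, ts_rewrite
-- substitutes the tsquery on the right. An empty substitute removes
-- the term: a stop word applied at query time.
CREATE TABLE fts_alias (t tsquery, s tsquery);

-- The empty-string cast raises a NOTICE ("doesn't contain lexemes"),
-- not an error, exactly as seen in the demo:
INSERT INTO fts_alias VALUES ('data'::tsquery, ''::tsquery);

SELECT ts_rewrite(to_tsquery('english', 'data & science'),
                  'SELECT t, s FROM fts_alias');
-- the 'data' branch is rewritten away, leaving only the other term
```

Putting a real tsquery on the right side instead gives you runtime synonyms with the same table.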
So in other words, if you assume your user is a squirrel looking at a shiny ball — no attention span — you want the number one result to be relevant, and you got that with no work. You can also generate good-looking output, or at least decent output suitable for presentation: the title, how it's presented, very simple to customize. You can also add your own stop words. In that same directory, $SHAREDIR/tsearch_data, you also have a thesaurus and a synonym dictionary, so you have a whole lot of power. The mechanics for those files are roughly equivalent: you go in and edit them. Look at the presentation afterwards and you'll see how to rewrite queries at runtime and suggest spelling corrections. Go ahead. One: the problem is hard, so there isn't going to be a "just do this" answer. If storage space is not a problem — so it really depends on the bounds of your corpus — I would put them all in one place; I would separate the tsvector piece from the document store itself. Absent that, I'd create a view that presents them all as one, because what you really have are the same Postgres tools you'd use any time data lives in a bunch of different tables and you want it to look like it's in one place. But the killer is going to be: can you afford the storage to duplicate your records? That's really what drives it. Unfortunately, there isn't a magic answer. And as bad as that thread gets — I've got data from all over the place, what am I going to do with it? — the end of that road is Lucene or Solr or one of those other tools, because the way they deal with that problem is: take your data from everywhere and hand it to them. Does that make sense? So it's not a great answer, but it's no worse than the external-tool answer. Are we done? Any other questions? Oh — the presentation is posted there, along with the data files and everything.
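The synonym dictionary mentioned above follows the same edit-a-file mechanics as stop words. A hedged sketch — the dictionary and file names here (british_syn, british.syn) are hypothetical, not from the talk:

```sql
-- Assumes a file $SHAREDIR/tsearch_data/british.syn containing
-- whitespace-separated word/synonym pairs, one pair per line, e.g.:
--   british english
CREATE TEXT SEARCH DICTIONARY british_syn (
    TEMPLATE = synonym,
    SYNONYMS = british      -- basename of the .syn file
);

SELECT ts_lexize('british_syn', 'british');  -- {english}
```

As with english.stop, editing the .syn file means altering the dictionary to itself afterward so Postgres re-reads it.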
Go ahead. No, I haven't. One: no, because those are beyond the scope of what this covers. There are limitations — I'm not arguing that Postgres full text search has the full capability of the dedicated tools. What we're seeing is that it's a 90% solution for 10% of the work. And, applying Occam's razor, we run into a lot of people buying dedicated solutions who, like me, don't even know what all the pieces are. But you're right: those are dedicated tools, and these limitations are inherent. Yeah? No — I wish there were, but there isn't. I'm not embarrassed to say I don't know everything here, but I know word proximity isn't there. The ranking functions have some of it built into their magic, and the documentation says you can create your own, but I think that defeats the purpose — if I have to build that, I'd buy a tool that already does it for me. So word proximity — explicitly saying "I want this word within five words of this other term" — is not there. Anything else? Go ahead. Pardon? Oh, absolutely, absolutely. Let me give you an example: we have a help desk system, and the title of a ticket is a small document versus the whole body of a ticket. The beauty of it is, you can use full text search on a small piece and get all the functionality with almost no overhead. For a small document, you don't have to store the tsvector: converting it at runtime is roughly as fast as using a LIKE operator. In other words — no extra storage, no extra work — you can use full text search on small fragments and it doesn't break anything. Very useful for that, and not much work. Go ahead. Yes. No, there's no reason other than clarity for this piece. You're absolutely right: the whole point was to have an indexed tsvector, and I did it in two stages. There's no magic in the two stages other than to show you what each one is. Anything else?
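The no-stored-tsvector pattern for small fields looks like this — table and column names (tickets, title) are assumptions matching the help desk example:

```sql
-- For a short field, build the tsvector at query time instead of
-- storing one. No extra column, no trigger, no index maintenance;
-- for small documents this is roughly as cheap as a LIKE scan,
-- but with stemming and boolean operators.
SELECT id, title
FROM   tickets
WHERE  to_tsvector('english', title)
       @@ to_tsquery('english', 'printer & jam');
```

The trade-off is the sequential conversion at query time; once the field grows or the table gets large, the stored, indexed tsvector from the main demo is the better fit.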
Rock on, use full text search. Thank you.