In this video, I'm going to give a broad conceptual overview of the Lucene text search library, and I'll introduce basic use of its API. Lucene is a kind of information retrieval system, or IR system for short. An IR system offers search capability over a body of information, most commonly textual information, such as in a web search service like Google Search. An IR system collects a corpus of information, usually in units called documents, and processes these documents to construct an index, some kind of data structure that allows the corpus of documents to be searched relatively quickly. Without indexing, an IR system would have to perform searches by brute force, by reading through potentially every single piece of information in its corpus, and that obviously would be really slow when the corpus is more than trivial in size. Web search, for example, would be totally impractical if it had to rely upon the brute-force approach.

The core building block of text search in most information retrieval systems is what's called an inverted index, or sometimes a postings file. An inverted index is some kind of collection that allows us to quickly look up a specific term and its associated postings list, a list of the numeric IDs of documents which contain that term. So, for example, if the term pizza is present in some of the documents in our corpus, the inverted index allows us to quickly access a list of IDs for those documents. Note that the documents themselves are not stored in the inverted index, and so they must be stored elsewhere if we need to access them. How documents are stored and retrieved by their ID numbers is a separate problem. So, the inverted index is basically just a big map in which terms are associated with lists of document IDs.

The important question, then, is what exactly is a term? You might naively answer that each word in a document is a term, but text is generally more complicated than that. What about punctuation? What about numbers? What about variations in spelling? Do we care about capitalization? Should we omit very common words like the? Should words with common roots really be separate terms? Should the plural cats match the singular cat? Depending upon exactly what search behavior we desire and what performance and storage trade-offs we're willing to make, we'll have different answers to these questions. Ideally, users would be able to perform any kind of search query on an index very quickly, but this isn't always feasible. For example, sometimes when doing a web search, I want to find documents matching an exact sequence of characters, punctuation included, and only that sequence of characters. But for the sake of performance and limiting storage use, web search services like Google omit most punctuation marks when they index web pages, so such searches simply can't be done on their indexes.

The process of selecting terms from a document is called analysis, and the core part of analysis is tokenization, in which we split our document text into substrings. Each substring, along with its position, character offset, and length, makes up an individual token. For example, if we tokenize the text Queen Victoria by simply splitting on whitespace, the first token is Queen, at position 0 because it's the first token, offset 0 because its first character is the first character in the whole text, and with length 5 because Queen has 5 characters. The second token would be Victoria, at position 1 because it's the second token, offset 6 because its first character is the seventh character in the whole text, and with length 8 because Victoria has 8 characters.
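To make the analysis step concrete, here's a small sketch using Lucene's analysis API. It's an illustration under assumptions rather than code from the video: it runs the Queen Victoria text through a whitespace-based analyzer and prints each token's term, position, offset, and length. The field name body is just a placeholder, and the package locations of these classes vary somewhat between Lucene versions.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            // Split on whitespace only, as in the Queen Victoria example.
            Analyzer analyzer = new WhitespaceAnalyzer();
            try (TokenStream stream = analyzer.tokenStream("body", "Queen Victoria")) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
                PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
                stream.reset();
                int position = -1;
                while (stream.incrementToken()) {
                    position += posIncr.getPositionIncrement();
                    // Prints: term=Queen position=0 offset=0 length=5, then term=Victoria position=1 offset=6 length=8
                    System.out.printf("term=%s position=%d offset=%d length=%d%n",
                            term.toString(), position,
                            offset.startOffset(), offset.endOffset() - offset.startOffset());
                }
                stream.end();
            }
        }
    }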
Once we have a document broken down into a bunch of tokens, we can index it such that each unique token string becomes a term. Now, if we wish to perform queries which depend not just upon which documents contain which terms, but which also depend upon where the terms are located in those documents and how frequently they occur, then we must also retain each token's position and offset information in the index. The postings list would then look something like this: for each occurrence of the term, we record the position and character offset. Here we have two occurrences, the fourth token in the text, which starts at the 21st character, and the 26th token in the text, which starts at the 198th character.

You may wonder why we didn't include the token length. First, few types of queries need this length information, and second, the token length can generally just be inferred from the matching term. When we look up the postings list for the term pizza, the matching tokens should all be of length 5. However, this is not always the case, because analysis might convert text snippets that don't exactly match pizza into pizza. For example, an analyzer that omits punctuation marks would convert both pizz%a and pi$z@z!a to pizza. So, depending upon our search needs and our analysis, we might include token lengths or we might not.

That actually covers the essence of building an index. Obviously, though, the actual implementation of indexing requires great care with the precise index data structures, because the details dictate how efficiently we can look up terms and their associated postings lists, which is a very good reason why you'll probably want to use an existing solution like Lucene. Getting all those details right can be very tricky and take a lot of work.

In any case, once we have our index, we can perform queries on it of several different kinds. The most obvious kind is a term query, in which we look up all documents containing an exactly matching term. For example, if we perform a term query for cat, we get back all document IDs in the postings list for the term cat and just cat. Our results should not include the postings of any other term, whether cats, catalog, catamaran, or dog: just cat, c-a-t, the full term, character for character. Be clear, however, that again, depending upon our analysis, the term cat in the index might represent tokens in the documents that don't exactly match cat. The idea of a term query, however, is that whatever terms end up in our index, a term query does a precise match for one of those terms.

A wildcard query allows us to look up partial term matches by denoting fill-in-the-blank sections of our terms. By convention, these fill-in-the-blank sections are denoted with an asterisk character. For example, a wildcard search for f*t*er will match any term that begins with f, includes a t in the middle, and ends in er, such as father, fatter, fighter, or freighter. The query cat* will match cat, catalog, and catamaran. The query *cat will match cat, fatcat, scat, and copycat. Unfortunately, the algorithms that find wildcard matches are relatively costly, and so they are disfavored or disallowed in many search systems. However, finding all terms beginning with a sequence of characters, a prefix query like cat*, can generally be handled as a special case with a much more efficient algorithm. The same cannot be said of finding all terms ending with a sequence of characters, a suffix query like *cat. To look up all the terms with a particular suffix, you're stuck using the same inefficient algorithm as for the general case of a wildcard query.
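To show what indexing and a term query look like against Lucene's actual API, here's a minimal sketch. It assumes a reasonably recent Lucene version, and the index directory demo-index and field name body are arbitrary placeholders, not anything from the video. It indexes two tiny documents and then runs a term query for cat, which should match only the first one.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexAndSearchDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("demo-index"));  // placeholder path
            // Analysis happens at indexing time: StandardAnalyzer lowercases text and strips most punctuation.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document d1 = new Document();
                d1.add(new TextField("body", "My cat ordered a pizza.", Field.Store.YES));
                writer.addDocument(d1);
                Document d2 = new Document();
                d2.add(new TextField("body", "The catalog lists catamarans.", Field.Store.YES));
                writer.addDocument(d2);
            }
            // A term query matches the exact term "cat", not catalog or catamaran.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("body", "cat")), 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println("doc id " + hit.doc + ", score " + hit.score);
                }
            }
        }
    }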
The idea of a fuzzy query is that we want to find inexact matches of a term by specifying the degree of inexactitude to allow. This inexactitude, this level of mismatch, is measured by the number of character edits. For example, the difference between pizza and piza is one character edit, the removal of a z. So if we perform a fuzzy query for the term cat and allow for two character edits, the query will match terms like cat, pat, path, bat, bath, act, catty, cape, cod, zap, and so forth, because all these terms are within two character edits of cat. The great thing about fuzzy searches is that they accommodate typos and misspellings. The downside is that, like wildcard searches, fuzzy searches are generally much more costly to perform than searches for exact term matches.

If we record term positions in our index, we can also perform phrase queries. A phrase query is a query in which we match on multiple terms occurring in a certain order or within a certain proximity of each other. Most commonly, phrase queries are used to find a specific sequence of adjacent terms. For example, we could perform a phrase query for the terms George and Washington adjacent to each other and in that order, such that the query will return the documents that contain the whole phrase George Washington, not just George and Washington separately, or in the opposite order.

A range query matches all terms that fall within a numeric or alphabetic range. For example, we can search for all numbers from 100 to 200, which would match 130, 174, 120, etc. If we search for the range apple to orange, it would match all terms that fall alphabetically in between, as they would in a dictionary, such as banana, fire truck, or obtuse. Optionally, we can make a range query inclusive, such that, say, a range search on apple to orange would also match the terms apple and orange themselves. Whether a range query can be performed with reasonable efficiency depends upon both the breadth of the queried range and how exactly the index data is structured. In Lucene, while range queries are certainly less efficient than simple term queries, they are still fairly efficient.

Lastly, a Boolean query composites multiple other queries together, combining their results into one result set. The reason we call it a Boolean query, instead of just a composite query, is that we can use Boolean logic to include or exclude the matching documents of each subquery. For example, a Boolean query could match all documents from one subquery but exclude all documents from another subquery, so we could find, say, all documents which have the term cat but which also do not have the term dog.
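As a rough sketch of how these query types are expressed with Lucene's query classes, here's one way the examples above might be constructed in code. It assumes a field named body and a reasonably recent Lucene version; exact constructors have shifted a little across releases.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TermRangeQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class QueryTypesDemo {
        public static void main(String[] args) {
            // Wildcard: f*t*er matches father, fatter, fighter, freighter, ...
            Query wildcard = new WildcardQuery(new Term("body", "f*t*er"));
            // Prefix: the efficient special case of a trailing wildcard (cat*).
            Query prefix = new PrefixQuery(new Term("body", "cat"));
            // Fuzzy: terms within two character edits of cat.
            Query fuzzy = new FuzzyQuery(new Term("body", "cat"), 2);
            // Phrase: george followed immediately by washington, in that order.
            Query phrase = new PhraseQuery("body", "george", "washington");
            // Range: terms alphabetically between apple and orange, endpoints included.
            Query range = TermRangeQuery.newStringRange("body", "apple", "orange", true, true);
            // Boolean: documents that contain cat but do not contain dog.
            Query catNotDog = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "cat")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("body", "dog")), BooleanClause.Occur.MUST_NOT)
                    .build();
            System.out.println(wildcard + "\n" + prefix + "\n" + fuzzy + "\n"
                    + phrase + "\n" + range + "\n" + catNotDog);
        }
    }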
So that about covers the possibilities for how we might query an index to find a set of documents, but we're missing a vital aspect of most search applications. Finding a set of matching documents is only one half of the problem. The other half is to sort those matching documents, generally with the best matches sorted to the top. A web search query, for example, typically matches thousands if not millions of web pages, and so a web search would be virtually worthless without sorting the best, most relevant matches to the first page or two of results. This sorting process is often called scoring, because each matching document in the query result set is evaluated and given a score, with the best matches given the highest scores. In a typical application that offers search, results are presented in descending order of score, with only the top 20 or 30 matches shown unless the user requests to see more.

There are many ways to go about scoring documents, as different search applications call for different approaches. Perhaps the most common approach, though, and the default offered in Lucene, is the vector space model using TF-IDF weights. When described formally in terms of vectors and cosines, the model sounds complicated, but the TF-IDF part is actually easy to understand. TF stands for term frequency, as in how many times a particular term appears in a document. DF stands for document frequency, meaning the percentage of documents which contain that particular term. The I stands for inverse, so IDF is the total number of documents divided by the number of documents containing the term.

The idea of TF-IDF weighting is that, intuitively, higher term frequency should result in a higher score, because a document with many occurrences of a query term is a better match than documents with fewer occurrences. Likewise, higher document frequency of a term should result in a lower score, because the more commonly the query term is found in other documents, the less special any particular document with that term is. By lowering the scores for common terms, we give stronger weight to uncommon terms and appropriately return low scores for documents which match only common terms. If I perform, say, a web search for a common English word like car, even the top results should have low scores to reflect their likely low relevance.

So to get a document's score with TF-IDF weighting, we simply divide TF by DF. As term frequency goes up, the score goes up proportionally, but as document frequency goes up, the score goes down proportionally. Usually, however, this formula is expressed as term frequency multiplied by the inverse document frequency, which of course is mathematically equivalent. I'm not certain why the convention is to multiply by the inverse document frequency rather than divide by the document frequency, but if I had to guess at the reason, I'd say it's because multiplication is generally a faster operation for computers, and therefore this makes queries faster to score. Again, TF-IDF certainly isn't the only basis for scoring, but it's the most commonly useful. Depending on your application, you may wish to incorporate different factors. Google, for example, owes most of its early success to PageRank, a system that scores web pages based on how commonly a page is linked to by other web pages. Google's scoring system has evolved greatly since its early days, but PageRank is still an important factor.
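To make the arithmetic concrete, here's a toy sketch of the TF-IDF idea. The numbers are made up, and this is not Lucene's actual scoring formula, which layers additional dampening and normalization factors on top of the same intuition.

    public class TfIdfSketch {
        // Toy tf-idf weight: term frequency times inverse document frequency.
        static double tfIdf(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
            double tf = termFreqInDoc;
            double idf = (double) totalDocs / docsContainingTerm;  // inverse document frequency
            return tf * idf;  // equivalent to dividing tf by df, where df = docsContainingTerm / totalDocs
        }

        public static void main(String[] args) {
            // pizza appears 3 times in a document and in 50 of 10,000 documents overall: a strong signal.
            System.out.println(tfIdf(3, 50, 10_000));    // 600.0
            // car also appears 3 times, but in 8,000 of 10,000 documents: a weak signal.
            System.out.println(tfIdf(3, 8_000, 10_000)); // 3.75
        }
    }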