Before getting into the details of how Lucene works and how to use it, we should describe exactly what Lucene is. In short, Lucene is a Java library for creating and querying textual indexes. Lucene is developed and maintained by the Apache Software Foundation, and it's likely the most widely used text search library today. If for whatever reason you want to use Lucene but don't want to use Java, you have a few options. A port of Lucene to C# is available, called Lucene.Net, though it generally lags behind Java Lucene by a couple of years. PyLucene is not actually a port of Lucene to Python, but instead makes Lucene available to your Python code by embedding the JVM in the Python interpreter. Solr, another Apache project, provides Lucene's text search capability in the form of a server. By providing text search as a network service, Solr allows you to interface with Lucene text indexes from any programming language. Solr also effectively allows you to easily offload text search to separate machines, which is very advantageous for many applications. If you don't mind configuring and managing a server, Solr may be your best option even if you're using Java, because compared to Lucene, it provides some extra capabilities and simplifies some common use cases. In this video, however, we'll only cover the direct use of Lucene itself. When using Lucene, there are a couple of other tools you'll likely need. The Tika library is another closely associated Apache project. Tika extracts the text content and metadata from a wide variety of file formats, such as PDFs and other document formats. So, if you want to index files in a Lucene index, you first use Tika to get the text from the files. The other very useful tool is Luke, a program that provides a user interface for inspecting and modifying Lucene indexes, which comes in very handy as you develop and fine-tune your application's search functionality.

The logical units that Lucene indexes are not actually individual strings, but rather what Lucene calls documents. Each document is made up of any number of named fields. So, for example, a Lucene document representing an email message might look like this. Here we have four fields, presented in no particular order: sender, recipient, date, and message. The values of the fields don't actually have to be strings; they can be any kind of binary data. When this document is created and indexed, Lucene gives it a unique document ID number, and each field is indexed in a separate postings list. For example, the string "I need that report ASAP" is analyzed, and for each resulting term, the ID of this document is added to the message-field postings list associated with that term. So, for example, assuming report is one of these tokens, the term report in the inverted index will have the ID of this document added to its associated message-field postings list. Lucene is often called schema-less because we don't have to specify what our data will look like ahead of time. At any time, we can add a document with fields of any name, and the postings list for each field simply gets created as needed. So if, in our example, the term report already existed in the index but had no associated message-field postings list, Lucene will simply create one as needed. Again, a key thing to understand when indexing is that portions of the data may get discarded in analysis, and so the inverted index generally won't contain the full original data.
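To preview the API we'll cover later, here's a rough sketch of building and indexing such an email document. The field names and values are just made-up examples, and writer stands in for an IndexWriter, which we'll see how to create shortly:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    // A hypothetical email document with four named fields.
    Document email = new Document();
    email.add(new StringField("sender", "carol@example.com", Field.Store.YES));
    email.add(new StringField("recipient", "dave@example.com", Field.Store.YES));
    email.add(new StringField("date", "2013-11-22", Field.Store.YES));
    email.add(new TextField("message", "I need that report ASAP", Field.Store.YES));
    writer.addDocument(email); // Lucene assigns the document a unique ID number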
For example, if we feed Lucene's English analyzer the text "it was the best of times, it was the worst of times", it will extract just three terms: best, worst, and time. That's because the analyzer discards small common words and punctuation marks and drops the plural s at the end of words. Even when analysis happens to retain the full original data of a field in the inverted index, actually retrieving that data is impractical, because it would require piecing together every occurrence of every term, and that would require scanning every postings list. After all, an inverted index is structured for fast lookups of documents by term, not fast lookups of terms by document. So in Lucene, actual field data is stored separately from the inverted index, in separate files ending in the extension .fdt, short for field data table. The fields of a single document are stored contiguously in this file, so it is generally more efficient to retrieve multiple fields of a single document than to retrieve the same number of fields from different documents. For the most common uses of Lucene, this is the appropriate trade-off, because usually the point of a search is to identify and retrieve a small number of documents. However, there are legitimate use cases where you might need to quickly retrieve fields from many documents. For such cases, Lucene has two mechanisms: one called the field cache, and a newer alternative, a special kind of field called doc values. I won't give the details here, but in short, the idea behind both of these features is to store values together by field instead of together by document. So, for example, all values of the date field would be kept in their own collection, apart from the values of other fields, making it fast to retrieve the date values of many different documents. Keep in mind, though, that for cases where we need to retrieve field values from only a small number of documents, the conventional stored-field mechanism will likely be more efficient, as it consumes less memory and requires fewer lookups.

Now, storing each field of each document is actually optional because, somewhat surprisingly, it's fairly common to need a field indexed but not stored. For example, if we're creating an index of books, maybe we only want to index the content but not store it, because we're storing the books elsewhere in PDF format, and so storing their text in our Lucene index would just be redundant. In other cases, we may wish to have document fields stored but not indexed. For example, in a book search, even if we don't index the titles because we don't ever want to search by title, we'll probably want to store the titles. Be clear that a query on our indexed fields only returns a list of document IDs. To display a document's information in our search results, we have to be able to somehow look up that information from the document's ID. Thus, we would probably store book titles even if we don't index them, because generally we would want to show the titles in our search results.

Most commonly, a Lucene index is stored as a collection of files in a directory. I won't go into the details of these various files except to mention that Lucene's developers decided early on that Lucene index files should never be modified once created. The upside of this decision is that it greatly reduces the possibility of index corruption. The downside is that it makes updating an index less efficient.
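To make the earlier book example concrete, here's a minimal sketch using the field classes we'll meet later. The field names, the title, and the bookText variable (holding text extracted with, say, Tika) are all hypothetical:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    Document book = new Document();
    // Indexed but not stored: searchable, but not retrievable from the index.
    book.add(new TextField("content", bookText, Field.Store.NO));
    // Stored but not indexed: retrievable for display, but not searchable.
    book.add(new StoredField("title", "Heart of Darkness"));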
When documents are added to an existing Lucene index, the new documents are stored in a new segment, a logically separate set of index files. Each segment in truth has all the information of an independent index, complete with its own inverted index of terms, along with everything else. So these segments can actually be queried separately, which in fact is just what Lucene does when we query an index: Lucene performs the same query on each segment but collects all the hits into one result set. Thus, the separate segments appear to queries as one logical unit. Obviously, having to query multiple indexes makes a search slower, and so Lucene endeavors to periodically merge segments together. For example, Lucene combines the data of segments A and B to create segment C, and once segment C is fully written to disk, Lucene deletes segments A and B. Unless you have a very large index and concerns about overly large files, the ideal number of segments at any moment is usually one. As you might imagine, though, the merging process is quite costly, especially for large segments, so we generally don't want merging performed every time our index is updated. So when exactly do segments get merged? Well, you can explicitly merge segments yourself, but merges may be triggered automatically as well. When writing to an index, an instance of the MergePolicy class is specified, and when new segments are created, Lucene invokes the merge policy's findMerges method, which returns a list of segments to merge. The actual merging of these segments is performed by a specified instance of the MergeScheduler class. The default merge policy is an instance of TieredMergePolicy, which prefers to merge segments of approximately equal size when it can. The default merge scheduler is an instance of ConcurrentMergeScheduler, which performs merges concurrently in separate threads. These, however, are just the defaults. It's very simple to use one of the alternate merge policies and merge schedulers, and it's really not all that difficult to create your own should you have custom needs.

Now, you must be wondering: if we can only modify our index by writing new segments, how do we delete documents from existing segments? Well, when we delete a document, the document and its data are not actually removed from its segment. Rather, the document's deletion is merely recorded in a separate file of the segment, and only once that segment is merged does the document's data actually get removed. So documents are in effect marked for deletion before they actually get deleted. In any case, once marked for deletion, a document will no longer show up in query results. You may also be wondering how you can update individual fields of a document. Well, the simple answer is that you can't. Once added, a document and all of its fields cannot be modified. What we can do, though, is delete a document and then create a new document to replace it. This, of course, can be quite expensive, as it requires re-indexing every field. It also requires us to somehow retain the data of every field of any document we might want to modify, because we'll need that data to re-index the document. So Lucene is not an efficient solution for storing data that requires fast, frequent updates. Recently, some Lucene developers have proposed changes to allow updating individual fields, but that functionality is probably still a few years away.
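As a sketch of deletion and the delete-then-replace pattern: the unique id field here is hypothetical (your documents need some such field to identify what to delete), and writer is an IndexWriter like the one we'll create below. The updateDocument method is Lucene's packaged form of delete-then-add:

    import org.apache.lucene.index.Term;

    // Mark for deletion every document whose "id" field contains "42".
    writer.deleteDocuments(new Term("id", "42"));

    // Delete and replace in one step: updateDocument deletes all documents
    // matching the term, then adds the replacement document.
    writer.updateDocument(new Term("id", "42"), replacementDoc);

    // Explicit merging, mentioned above, is also available:
    writer.forceMerge(1); // merge the index down to one segment (expensive)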
Now that we've covered the basic capabilities and structure of a Lucene index, let's look at how to use the Lucene API. To create and modify a Lucene index, we use the IndexWriter class, and to query a Lucene index, we use the IndexReader class. Here's a minimal example of using IndexWriter. Just like when working with a file, IndexWriter operations may throw IOException. When an IOException actually occurs, it might be because of something simple, like not having permission to read or write the index directory, or it might be because of some more serious underlying issue in the system beyond your control. In general, there's not much you can do in your code to correct such errors except report the issue. In any case, to create an IndexWriter, we need two objects: a Lucene Directory and an IndexWriterConfig. The Lucene Directory class is actually an abstract class. Subtypes include FSDirectory, as in file system directory, and RAMDirectory. The FSDirectory implementation uses an actual directory of storage on disk, while the RAMDirectory implementation stores an index entirely in RAM, meaning, of course, that its data will be lost once the RAMDirectory is closed. While RAMDirectory is sometimes useful for temporary indexes, obviously FSDirectory is the more common choice, because we usually want our indexes to persist on disk. So here we're creating an FSDirectory using the static FSDirectory method open, passing in a java.io.File that specifies the directory of the index. If the specified directory doesn't already exist, FSDirectory will create it. In truth, FSDirectory is itself an abstract class with three concrete implementations, each of which has different performance characteristics. The SimpleFSDirectory class uses java.io.RandomAccessFile, while NIOFSDirectory uses the newer java.nio classes. In theory, NIOFSDirectory should perform best, but due to a bug in the Java runtime environment, it performs worse on Windows. The FSDirectory open method knows this and will automatically choose the best implementation for your system. The third implementation, MMapDirectory, uses memory-mapped files, which can improve performance in some cases. Unless you know for sure which implementation will work best for you, you're best off just using the FSDirectory.open method, as we will in our examples. An IndexWriterConfig object, as the name implies, specifies options for an IndexWriter. Most options have a default value, but in the IndexWriterConfig constructor, you must specify a Lucene version and an analyzer. Here we specify version 4.6, the most recent version at the time I'm recording this. It is a bit odd to have to explicitly specify the version of a library in your code, but the idea is that this requirement helps maintain compatibility between different versions and prevents data corruption. For the analyzer, here we create an instance of StandardAnalyzer, the most commonly used Lucene analyzer. Notice that the StandardAnalyzer constructor requires a Lucene version as well. Because we pass the analyzer to the IndexWriterConfig constructor, the fields of any documents we add with this IndexWriter will by default be analyzed by this instance of StandardAnalyzer. So once we have an IndexWriter, we can add documents to the underlying index by creating a Document object and passing it to the IndexWriter addDocument method.
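Here's what that minimal example might look like in full. The index path myindex is arbitrary:

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MinimalIndexWriter {
        public static void main(String[] args) throws IOException {
            // Open (creating if necessary) an index directory on disk.
            Directory dir = FSDirectory.open(new File("myindex"));
            // The config requires a Lucene version and an analyzer.
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
            IndexWriter writer = new IndexWriter(dir, config);
            // ... create Document objects and pass them to writer.addDocument ...
            writer.close(); // commits changes and releases the index lock
        }
    }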
Before adding the Document object to the index, we add fields with the Document add method, which expects an instance of the Lucene Field class. Here we create a document with two fields: one called content, with the text "rubber baby buggy bumper", and the other called author, with the text "Joseph Conrad". In both cases, we elect not to store the text. If we want full control over the options for a field, we can use the Field class itself, but for the most common sets of options, we use one of the several subclasses of the Field class. I won't go into the details, but here we use the subclass TextField, which tokenizes the text, and the subclass StringField, which does not. So the text "rubber baby buggy bumper" will get analyzed, while "Joseph Conrad" will not. After making changes to our index with an IndexWriter, we must commit the changes to the index before they will show up in search queries. The commit operation ensures that all of the new data actually gets written to disk before returning, so a commit may be fairly expensive. To commit our changes, we invoke the IndexWriter commit method. If, though, for whatever reason, we wish to discard the changes we've made since the last commit, we can discard those changes with the IndexWriter rollback method. Once we're done with our IndexWriter, we should close it. The IndexWriter close method commits all changes to the index, then possibly waits for some segment merging, and lastly closes all files the IndexWriter had open. Recall that the merge policy decides what segments to merge, if any, every time new segments are created. So if, by creating new segments, the IndexWriter happens to trigger some merging, the close method may take a while to return. Because of this, and because close also first performs a commit, it is often best to avoid closing IndexWriter instances and instead reuse the same IndexWriter instance as much as possible.

Now, for a single Lucene index, we can have only one IndexWriter open at any one time. When created, an IndexWriter acquires a lock on the index to prevent other IndexWriters from using that index. In contrast, we can have any number of IndexReaders open on a single Lucene index. Understand that this rule applies across the whole system, not just within a single Java process: we can have multiple processes simultaneously reading from the same Lucene index, but the IndexWriter lock allows only one writer across all processes to access a particular index at any one time. So let's look now at how to create and use an IndexReader. The IndexReader class itself is actually abstract. We'll only cover the concrete subclass DirectoryReader, which serves the common case of reading an index from a file system directory. Rather than directly construct a DirectoryReader object, we obtain one from the static method open, to which we pass a Lucene Directory object. Now, while an IndexReader has low-level methods for reading the index, such as for retrieving individual documents by their ID number, to run queries we need an IndexSearcher to wrap our IndexReader. IndexSearcher has several search methods, the simplest of which returns a specified number of the top-scoring results for the given query. The results are returned as a TopDocs object, which contains a collection of ScoreDoc objects, each of which holds a document ID and the score that document was given in the query. Here we're performing a search for the 20 top documents with the term doorknob in the field content and then printing their IDs and scores.
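A sketch of the two-field document and the commit, continuing from the writer created above:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    Document doc = new Document();
    doc.add(new TextField("content", "rubber baby buggy bumper", Field.Store.NO)); // tokenized
    doc.add(new StringField("author", "Joseph Conrad", Field.Store.NO));           // not tokenized
    writer.addDocument(doc);
    writer.commit(); // make the changes visible to new readers

And the doorknob search might be sketched like this, with dir being the same Directory the index was written to:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(new TermQuery(new Term("content", "doorknob")), 20);
    for (ScoreDoc hit : results.scoreDocs) {
        System.out.println("doc " + hit.doc + "  score " + hit.score);
    }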
The query itself is specified as an instance of the abstract class Query. Here we use the concrete subclass TermQuery, which requires a Term object specifying the field along with the term text. Other query types include WildcardQuery, PrefixQuery, FuzzyQuery, PhraseQuery, TermRangeQuery for matching all terms in an alphabetic range, NumericRangeQuery for matching all terms in a numeric range, and BooleanQuery for combining multiple queries. Because we discussed the essence of these different queries earlier, I won't belabor how exactly to use them in Lucene. Be very clear that the document collection returned by a query includes only the IDs of the documents, not any of the stored fields of those documents. To look up the stored fields of a document, we can use the document method of our IndexReader, passing in the ID, to get back a Document object. The get method of a Document returns the named stored field as a string. Here we print the stored value of the author field of the top 20 documents matching our query. Understand that accessing stored fields is relatively costly, though as mentioned briefly before, Lucene offers two mechanisms for faster lookups, the field cache and doc values, which can speed up certain use cases.

So that's the basic usage of IndexWriters and IndexReaders. The last thing we'll cover is a very important point about the relationship between the two. Somewhat surprisingly, each IndexReader always sees a snapshot of the state of the index at the time of the reader's creation, no matter what gets changed afterwards. In other words, each IndexReader instance only sees the documents that were committed before its own creation, and even when documents get deleted by a writer, those documents will still show up in queries performed by any reader that was opened before their deletion. So any time we want to query the up-to-date state of the index, we have to open a new reader. Because committing changes and opening and closing writers and readers takes a fair amount of processing time and IO work, Lucene performance is most optimized for cases where we don't need to update the index frequently, or at least don't need our queries to always reflect the updated state of the index. In recent years, however, Lucene has attempted to rectify this deficiency with a feature called near-real-time search, a mechanism to quickly and cheaply create new index readers that reflect the latest index changes. The trick is that index writers that write to disk now also temporarily store changes in an extra in-memory index, such that a reader can read this in-memory index to see all the latest changes, even those which haven't yet been committed, without the normal overhead of opening a new disk-based reader. In effect, then, the index writer can cheaply produce index readers that reflect all of the latest changes. To get one of these cheap readers from an IndexWriter, we use the same DirectoryReader open method, except we pass in the IndexWriter and a boolean value. The boolean value specifies whether we need the returned reader to be up to date with the deletes made since the last commit. It turns out that making the reader up to date with deletes may require some IO work and so may incur some overhead. If you can't tolerate this overhead but can tolerate results that may include recently deleted documents, then you should pass the argument false.
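A sketch of that author-field lookup, continuing from the search above (note that get returns null for a field that wasn't stored, as with the author field in our earlier two-field example):

    for (ScoreDoc hit : results.scoreDocs) {
        Document d = reader.document(hit.doc); // fetch the document's stored fields
        System.out.println(d.get("author"));   // the stored value, or null if not stored
    }

And getting one of the cheap near-real-time readers from a writer might look like this:

    // true: apply deletes since the last commit, at some extra IO cost;
    // false: tolerate recently deleted documents in exchange for a cheaper reader.
    DirectoryReader nrtReader = DirectoryReader.open(writer, true);
    IndexSearcher nrtSearcher = new IndexSearcher(nrtReader);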
Well, that covers the basic usage of Lucene, but of course there's plenty more to know. As I might cover in future videos, Lucene is very flexible, allowing you to customize virtually every key element of text search, including scoring, analysis, querying, and even how the data gets indexed.