Hello everybody, my name is Alessandro Benedetti, and today I'm going to talk about the neural search improvements coming with Apache Solr 9.1, specifically approximate nearest neighbors and pre-filtering. Welcome, and thank you for attending.

A quick introduction about myself and the company I work for. I'm Italian, from a town close to Rome, in Italy, and I work as a research and development software engineer. I'm also the director of my company, so that's my main occupation. I have a master's degree in computer science, and I'm a program committee member in a number of international conferences in information retrieval, specifically the European Conference on Information Retrieval and the Special Interest Group on Information Retrieval (SIGIR). I'm an Apache Lucene and Solr committer and PMC member, and I also work with Elasticsearch quite a lot. So my passion is around information retrieval in general, integrating search engines with artificial intelligence and machine learning. And in my spare time I play beach volleyball and I go snowboarding.

My company, Sease, is headquartered in London. I live in London, actually; I've been there for the last 10 years. We work with open source software, so we are open source enthusiasts. We don't just work with open source software, we actually contribute back a lot: we support the community, we contribute code, and we help people using mostly Apache Lucene and Solr, but also Elasticsearch and, recently, OpenSearch. We also love research, so we are active researchers: we try to keep up with the latest papers and implement them where possible in Apache Lucene and Solr. Specifically, the areas of interest for my company are neural search, so the integration of deep learning technologies with search (it's called neural search because of the neural networks involved); natural language processing integration with search, so the branch of information technology that deals with natural language understanding; learning to rank, which is the integration of machine learning to improve the ranking of search results in a search engine; and search quality evaluation, so evaluating the quality of your search engine in a scientific way, not just by trying a few queries and eyeballing the results.

This is an overview of what we are going to talk about today. We are going to start with lexical search problems; by lexical search I mean traditional search, where you just match terms from the query to the documents. Then we are going to talk a little bit about BERT and, in general, how large language models impacted search and how this led to neural, vector-based search. Then we are going to talk about the Apache Solr implementation, the 9.1 release specifically, and some future work.

So let's start with lexical search problems. An example of a lexical search problem, so a traditional search problem, is the vocabulary mismatch problem. This happens when the query terms don't match the terms in the document, so you have a difference between the vocabulary used at query time and the vocabulary used at index time. An example can be the query "how many people live in Rome?". You could have two documents. One states "Rome's population is 4.3 million", and as you can see, lexically that document has just "Rome" in common with the query. Another document could be "hundreds of people queuing for live music in Rome". This is a false positive.
It has nothing to do with the query "how many people live in Rome", but it shares three terms with the query: "people", "live" and "Rome". So we can see that the first document is actually much more relevant to the query than the second. This is an example of vocabulary mismatch.

Another example can be the query "how big is a tiger?". You may have a document with a term like "biggest" that matches "big"; this is doable with a traditional search engine using stemming. But you may completely miss another relevant document that says "Panthera tigris can reach 390 centimeters, nose to tail". This is a false negative, because Panthera tigris is the Latin scientific name of the tiger, and a lexical search engine may not know that Panthera tigris and tiger are the same animal. So this is another example of vocabulary mismatch: in the previous case you got a false positive, and in this case you get a false negative.

In general, lexical search engines have problems with semantic similarity, because we are just matching terms. Sometimes you may have queries that are semantically completely different, such as "how are you" and "how old are you", that share mostly the same query terms but have completely different meanings. Or, on the opposite side of the spectrum, you may have semantically similar queries that don't share any query term: "how old are you", "what is your age". You may also have problems with disambiguation in traditional search: "what's the price of an Apple share" and "what's the price of an apple" clearly have two completely different contexts. One is more likely to refer to Apple the company, and the other is more likely to refer to the fruit.

Another thing (this has nothing strictly to do with traditional versus neural search, but over time we have realized it) is that sometimes the information is not just in one document: you may have part of the information in one document and other parts in other documents. Neural search can help you aggregate documents into a response. Effectively, when you have an information need, you just want a response, you don't really care about documents. Most of the time, anyway; it depends on the domain, of course. Sometimes you care about specific documents, but sometimes you just care about a piece of information.

Over the years, lexical solutions tried to solve these problems: for example, with manually curated synonyms, or hypernyms and hyponyms, so specializations or generalizations of terms; stemming and lemmatization, which are algorithmic attempts to deal with morphological changes in terms, like the tense of a verb, for example, or gender/number variations; and also knowledge-based disambiguation, so linking your documents and queries to knowledge bases such as Wikipedia or DBpedia to disambiguate the query terms or the documents. But in the end, it's always term matching.

So what happened with large language models then? First of all, a very short introduction to the different ways to represent queries and documents. In a traditional search engine, you represent queries and documents using a sparse vector representation. So you are encoding your query and your documents into a numerical vector. In the sparse representation, the cardinality of the vector is the number of terms in your dictionary.
So for a document, you may have a vector that is pretty much all zeros, except at the positions of the dictionary terms that appear in the document, and the same for the query. The reason it's called sparse is that only a few terms appear in a document compared to the entire dictionary, so your vector is going to be zero, zero, zero, and then some ones. Potentially, you may encode in the vector the term frequency in the document, so the number of occurrences of the term in the document: if "apple" appears ten times in your document, the vector value at the position of the "apple" term will be ten, and all the other terms that don't appear in the document will be zeros.

On the other hand, the dense representation has a fixed number of dimensions, so your vector will be shorter, and pretty much all the values are going to be different from zero; there may be some zeros, but most values will be non-zero, and they will normally be float values. This is the current approach used by neural search, and large language models are the way to produce those vectors.

So what is a large language model? First of all, it's a neural network pre-trained on a very large corpus, or very large corpora, so a group of corpora. This may be Wikipedia, maybe the entire web, or a portion of the web; anyway, a lot of data. Normally large language models are pre-trained on very big corpora, and then you fine-tune them to satisfy a specific task. With the pre-training, what you get is, let's say, pattern recognition of a language: an understanding, to the extent that machines currently understand language, of the way terms interact in the language and of how the grammar works, for example. So you capture meaning to a certain extent, but not related to a specific task. Large language models can satisfy many different tasks: they can generate text that may be used for essay generation, for text summarization, for translation, for many things. And one of the tasks, specifically the task we are going to talk about today, is dense retrieval, so searching and retrieving documents from a corpus. What we are going to do is learn a representation for our queries and for our documents. And this can be done offline: currently, even if it's quite expensive to encode documents as vectors, you can do that offline at indexing time, so it doesn't affect your query time performance that much. So it's decently usable nowadays.

Very shortly, how do you pre-train a large language model? You may approach this problem, for example, as a masked language model: you hide part of the text in input and the model learns how to predict the missing word. And this happens in an unsupervised way on a large number of sentences. An example of a pre-trained large language model is BERT, and the whole family of BERT models; there are many others, like RoBERTa and DistilBERT, and many, many variations. They are called transformers, sometimes bi-encoders in this context; BERT was just probably one of the first, so it's the most popular, but you can find many of them. As I said, pre-training is normally something you don't do yourself: you just take a large language model that has been pre-trained by a large organization on a large corpus.
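To make the two representations a bit more concrete, here is a minimal sketch in Python. The tiny vocabulary, the example sentence, and the encoder model name ("all-MiniLM-L6-v2" from the sentence-transformers library) are illustrative assumptions on my part, not anything specific to Solr; any bi-encoder that produces a fixed-size float vector would play the same role.

```python
# Sparse (lexical) representation: one dimension per dictionary term,
# value = term frequency in the document, mostly zeros.
from collections import Counter

dictionary = ["apple", "live", "music", "people", "population", "rome"]
doc = "hundreds of people queuing for live music in rome"
tf = Counter(doc.split())
sparse_vector = [tf.get(term, 0) for term in dictionary]
print(sparse_vector)  # [0, 1, 1, 1, 0, 1]

# Dense representation: a fixed, much smaller number of dimensions,
# almost all values non-zero, produced by a pre-trained language model.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a 384-dimensional encoder
dense_vector = model.encode(doc)
print(dense_vector.shape, dense_vector[:5])      # (384,) and a few floats
```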
Potentially you can also just use a model that has already been fine-tuned, but fine-tuning is something that you could do in your company, on your problem. Fine-tuning for dense retrieval works in such a way that you want to maximize the difference in score between a positive sample, so a document that is relevant to the query, and a negative sample, which is a document that is not relevant to the query. So you encode the vectors, you calculate the cross entropy between the scores, and then you try to make the difference as large as possible: you refine, over various iterations, the weights in the neural network to make sure that you get the biggest possible distance between a positive and a negative document. Anyway, this is just to give you an idea of how you fine-tune. Fine-tuning happens with a set of samples that you must design and build for your specific problem. There are many models already available, especially in English, so you may not even need to fine-tune for dense retrieval, but if you have a very niche domain, a very specific problem, you may want to do that. And if you're curious about the state of the art, there are various leaderboards around for various languages, and you can also directly download models on your own.

The current integration with Apache Solr happens effectively after you encode the query and the documents into vectors. So you encode the documents outside Apache Solr and then you index the vectors in Solr. At the moment (there is still development in progress, of course) there is no model management for inference in Apache Solr, so at query time you also need to encode your query outside Solr, get the vector, and then send the vector to Solr. There is no update request processor for this at the moment either; what we are working on is a way to hide the neural side completely from the user, so that you just send text to Solr, your admin specifies at configuration time the model to use, and Solr does all the rest of the work.

So what is neural search, then? You get input documents, text documents, and the input query text; you use the large language models to encode the vectors; you index the vectors in specific data structures that are easy to search afterwards; and then you look for the nearest neighbors of your query vector. Effectively, the similarity between the query and the documents is translated into a distance in the vector space. You may have studied at school different kinds of distances between vectors; there are various, and in information retrieval the most used is the cosine distance, but the Euclidean distance is also supported by Solr. You are effectively just calculating the distance between points in a multidimensional vector space, and the closer a vector is to the query, the higher the semantic similarity.

The initial approaches used just exact nearest neighbor search: you took the query vector and a document vector, calculated the distance, got a score, and you did that for your query and all the documents in the corpus. This is actually quite expensive, and nowadays researchers have moved to approximate nearest neighbor techniques: you don't want to calculate the distance between the query and all the documents in your corpus, so you use data structures built at index time to find the closest neighbors to your query.
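Just to make the cost of the exact approach concrete, here is a toy sketch: the query is scored against every document vector and the best K are kept. The vectors and document IDs below are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.1, 0.7, 0.2, 0.0])
documents = {
    "doc1": np.array([0.1, 0.6, 0.3, 0.1]),
    "doc2": np.array([0.9, 0.0, 0.1, 0.4]),
    "doc3": np.array([0.2, 0.8, 0.1, 0.0]),
}

# One distance computation per document: cost grows linearly with the
# corpus size for every query, which is exactly why approximate
# techniques are used instead on large collections.
top_k = sorted(
    ((cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()),
    reverse=True,
)[:2]
print(top_k)
```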
There are various families of such solutions; the most important ones historically are tree-based techniques, hash-based techniques, and graph-based techniques. In Solr we use graph-based techniques. So, how do we model vectors in specific data structures to get quick retrieval at query time? We use a Hierarchical Navigable Small World graph, normally abbreviated as HNSW, and there are also a couple of nice papers about it, so if you are curious about how it works in depth, you can take a look at the papers. I will give you just a quick understanding of it.

A Hierarchical Navigable Small World graph is a proximity graph. First of all, it models the vectors as vertices in the graph, and the links, the edges between the nodes of the graph, express a proximity relation: two points are going to be linked only if they are close to each other. And it's called hierarchical because this data structure works on many different layers: you have a first layer, a second layer, a third layer, and so on. The reason it's hierarchical is that it follows an approach similar to skip lists. A skip list is a data structure that gives you nice insertion and retrieval times on sorted lists of integers. The way it works here is that a node has a certain probability of appearing on a higher layer, and a probability equal to one of appearing at layer zero. So you will have all the vectors at layer zero, the bottom one, and a progressively lower probability of the same node appearing in the upper layers. This means that at the top layer you will have a number of nodes that is much lower than the number of nodes at layer zero. And the reason it's constructed this way is to get quick and fast retrieval: you start from the top, you quickly explore the graph, and then you go down to refine the search for nearest neighbors. Effectively, the longer edges at the top are for fast retrieval, and the shorter edges at the bottom layer are for accuracy in approximating the proximity of your nodes.

In Solr we use Apache Lucene internally; Apache Lucene and Apache Solr used to be the same project, so many of you may be familiar with Apache Lucene. Raise your hands if you have ever heard of Apache Lucene. OK, maybe one. So Apache Lucene is a library, a Java library for search engines. It's an open source library by the Apache Software Foundation, and it's the core internal of Apache Solr. And how many of you have ever heard of Apache Solr? Never used it? So Apache Solr is a search engine built on top of Apache Lucene by the Apache Software Foundation; it's open source and written in Java. Lucene gives you the internal library implementation, so, for example, the vector-based search is implemented internally in Lucene, and Solr is the server that exposes this functionality. You can just get Solr, spin it up, and then interact with it through HTTP requests, for example. I don't know if any of you have heard of Elasticsearch. Elasticsearch is a competitor of Apache Solr, another open source project. Elasticsearch uses Lucene the same way Apache Solr does; they are very similar as solutions. The main difference is that Elasticsearch is from the Elastic company, so it's open source, but nowadays it's a little bit more controversial. Apache Solr is fully open source.
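Coming back for a moment to the hierarchical layers described above, here is a minimal sketch of the skip-list-like level assignment that gives HNSW its shape: every node lives in layer zero, and exponentially fewer nodes make it to each layer above. The normalization constant used below is an assumption for illustration; real implementations typically derive it from the graph's max-connections parameter, as suggested in the HNSW paper.

```python
import math
import random
from collections import Counter

def random_level(m_l: float = 1.0 / math.log(16)) -> int:
    # Geometric level assignment: every node gets level 0, and the chance
    # of also reaching each higher layer drops exponentially (here roughly
    # by a factor of 16 per layer). 1 - random() is in (0, 1], so the log
    # is always defined.
    return int(-math.log(1.0 - random.random()) * m_l)

top_levels = Counter(random_level() for _ in range(100_000))
for level in sorted(top_levels):
    # top_levels[l] counts nodes whose highest layer is exactly l; a node
    # appears in every layer from 0 up to its highest one, so layer 0
    # contains all 100,000 nodes and each layer above is far smaller.
    print(f"highest layer {level}: {top_levels[level]} nodes")
```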
The neural search capabilities in Lucene were implemented starting in November 2020, with dedicated data structures, and the latest version, Apache Lucene 9.4, also offers the possibility of using smaller vectors from a memory footprint point of view: instead of each vector value taking 32 bits, you now have the possibility of 8-bit values. So if you don't have a language model that requires very precise values in the vector, you can avoid wasting memory on that. You can specify, first of all, the vector encoding, so whether you are going to use a single byte or four bytes per value, and the distance: you specify the distance function you want to use, so Euclidean, dot product, or cosine distance, and you do that at indexing time because you want to build the related data structures.

Solr implements neural search starting from Solr 9.0, and with the latest release, 9.1, we added some improvements. Of course, being open source, you are absolutely free to take a look at the code and to contribute back, and we have a JIRA project from the Apache Software Foundation, open to everyone, for tracking the issues to work on and the issues that we, the committers, are working on.

The way you use Apache Solr 9.1 to implement neural search starts, first of all, from defining a field type for your dense vector field. The way Apache Solr works is that you use a JSON structure for your documents, where you have a field name and a value, or multiple values for that field, and your unit of information is a document, which is effectively a map of keys to values (or to multiple values, for fields that support that). Solr uses the schema to understand the type of data you are going to index in each field: it can be a textual field, it can be a number, it can be a geographical point, and, specifically for what we are talking about today, it can be a multidimensional vector. In your schema you specify the dense vector field type, the dimension, so the cardinality of your vector, and the similarity function, like cosine similarity, for example. You also have more advanced parameters, such as the way the data structures are built at indexing time. To fully understand them I would recommend reading the original paper, or at least a couple of blog posts, about the way the Hierarchical Navigable Small World graph works. Anyway, they are exposed in Solr, so you can control them to affect the accuracy, the indexing time, and also the memory actually used for building the graph on disk.

At indexing time, what do you do? You just push your JSON document, with an array of integers or float numbers (depending on the kind of precision you need for your algorithm), to Apache Solr. So you have your JSON with the ID of the document and your vector field, or vector fields: potentially you can have multiple vector fields in the index, and you can do the same in Solr. You interact with the search engine using a JSON payload; nowadays that's the most used, but you can also use XML if you want, or potentially the SolrJ APIs.

Then, at query time, you just run a REST HTTP call specifying the KNN query parser. KNN stands for K nearest neighbors. You specify the field where you have your vectors and the top K you want to retrieve: this is going to return you the top K closest neighbors to your query, so the documents that are most likely related to your query. And then you pass the query vector.
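Putting these steps together, here is a minimal sketch against a local Solr 9.1 instance using Python's requests library. The URL, the collection name ("ann-demo"), the field names, and the 4-dimensional toy vectors are all assumptions for illustration; a real model would produce vectors with hundreds of dimensions, and the dimension declared in the schema must match the vectors you index.

```python
import requests

SOLR = "http://localhost:8983/solr/ann-demo"

# 1. Schema: declare the dense vector field type and a field using it.
requests.post(f"{SOLR}/schema", json={
    "add-field-type": {
        "name": "knn_vector",
        "class": "solr.DenseVectorField",
        "vectorDimension": 4,            # cardinality of the vectors
        "similarityFunction": "cosine",  # euclidean and dot_product also exist
        # More advanced HNSW build parameters can be set here too;
        # see the Solr Reference Guide for their names and defaults.
    },
    "add-field": {"name": "vector", "type": "knn_vector",
                  "indexed": True, "stored": True},
})

# 2. Indexing: push JSON documents whose vector field is an array of floats.
requests.post(f"{SOLR}/update?commit=true", json=[
    {"id": "1", "vector": [1.0, 2.5, 3.7, 4.1]},
    {"id": "2", "vector": [1.5, 5.5, 6.7, 65.1]},
])

# 3. Query time: the knn query parser returns the topK nearest neighbours
#    of the query vector passed after the local parameters.
response = requests.get(f"{SOLR}/select", params={
    "q": "{!knn f=vector topK=3}[1.0, 2.0, 3.0, 4.0]",
    "fl": "id,score",
})
print(response.json()["response"]["docs"])
```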
So it's pretty much a standard way to interact, very simple; you don't need to add much information there. For your own curiosity, if you want to take a look at the code (anyway, I'm going to share the slides after the talk, so you can also refer to them), being open source, you can take a look at the dense vector field implementation and the KNN query parser implementation; you have the references to the code. In the slides you also find the references to the Lucene code. That part, of course, is a little bit more complicated, because you have the full implementation of the algorithm, but if you are willing to, for example, understand it better or contribute, you can take a look.

Another thing that is quite important is the ability to combine dense vector search, so dense retrieval, with sparse retrieval. You may want, for example, to look for your top K nearest neighbors among documents containing a certain set of query terms, or potentially you want to pre-filter your corpus first with lexical search and then run the top K on top of the reduced result set. And you will get in response documents with a score that is a combination of the lexical score and the neural score, so the distance between the vectors plus the score calculated by Apache Solr, which is pretty much standard BM25. It's a bit more complex than that, but in information retrieval BM25 is pretty much the standard approach, based on the term frequencies of the query terms and on document frequencies, so how often certain terms appear across the corpus of documents.

Now, what was happening before, in Solr 9.0, with filter queries, was that you were getting a bit set of the results, so all the documents matching the filter query. Filter queries in Apache Solr are something you use, for example, after the selection of a facet or an aggregation: you are searching for hotels, for example, and then you filter by four-star hotels, so you are reducing the result set using some additional condition. A filter query will return you a set of documents, your query will return you another set of documents, potentially you have other filter queries that return yet other sets of documents, and what was happening in Solr 9.0 was just an intersection of these different sets. So you were getting the documents matching the query, intersected with, for example, the hotels with a star rating of 4.0, with the hotels in Tokyo, with the hotels available on those days. And this intersection could potentially contain zero items. With KNN search, so with K nearest neighbors search, this doesn't work that well, because you are looking for the top K closest vectors: if you first run the top K and then post-filter with, for example, a specific condition, you may get back zero results, because maybe none of the top 10 documents from a vector perspective are four-star hotels. On the other hand, what is actually much more likely is that you want to pre-filter first, so you filter for four-star hotels, and then you get the top K among those hotels. This is called pre-filtering, and this is available in 9.1.
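Here is how the pre-filtering case just described might look from a client; again a hedged sketch, with an invented "hotels" collection and invented field names. In 9.1 the filter queries restrict the candidate set before the nearest-neighbour search, so the top K is computed only over the documents that survive the filters.

```python
import requests

SOLR = "http://localhost:8983/solr/hotels"
query_vector = "[0.12, 0.43, 0.76, 0.18]"

response = requests.get(f"{SOLR}/select", params={
    "q": "{!knn f=vector topK=10}" + query_vector,
    "fq": ["stars:4", "city:Tokyo"],   # applied as pre-filters in Solr 9.1
    "fl": "id,name,score",
})
print(response.json()["response"]["docs"])

# In 9.0 the same request behaved as a post-filter: the 10 nearest
# neighbours were found over the whole corpus first and then intersected
# with the filters, so you could easily end up with zero results.
```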
Another functionality that we are currently working on is re-ranking: you first retrieve documents using lexical search, so query terms matched lexically as traditional search does, and then you re-rank, so you recalculate the score of the top K documents using the vector distance function you want.

We did some initial benchmarks. It's usable for small corpora; it's actually usable in general. Of course, you can tune it a bit: if you have billions of documents, you may need to do a little bit of fine-tuning of the parameters used to build the graphs, in a way that is friendly with your memory and disk space, but it's usable at scale. It's usable in production, so it's production ready. And the latest changes we made in 9.1 also make the configuration a little bit easier, even when you specify the more advanced parameters: you can now effectively specify the algorithm you are using, which by default (and currently the only option) is the Hierarchical Navigable Small World graph, and its specific parameters. This has actually been released a couple of weeks ago, on the 21st of November, and it's available for download and use. And pre-filtering is available as well in 9.1, so you are able to first pre-filter your documents using lexical search and then just get the top K.

Some of the things we are working on and are going to release with future versions of Apache Solr: first of all, the encoding of vector values. Currently only 32-bit vector values are supported, and a maximum cardinality of 1,024 for a vector, so the longest vector you can have in Solr is currently 1,024 elements, each element 32 bits. As I said, maybe you don't need that precision, so you may want to use less memory and less disk space, and we are going to support 8-bit elements for vectors. Another thing that is coming, potentially in 9.2, is a way to hide the neural side of things: your administrators will be able to push large language models to Solr, and then, automatically, when you push text to Solr, the text is going to be converted to a vector, for the documents at indexing time through an update request processor, for example, and at query time by a query parser. So the final user, and also the applications that use Apache Solr, won't need to do the encoding; they can just push text to Solr, and Solr will internally convert it to vectors.

I also link here many additional resources. On the blog of my company we try to stay up to date as much as possible with the latest research we do and the latest contributions we make, and we also tend to write blog posts to help people use these things. Of course, you also have the official Apache Solr documentation, which I personally curate among other committers, and there are also various blog posts. So if you are curious to know more about this, or if you want to contribute, for example to help in the additional developments of neural search in Apache Solr, you are very welcome to do so. This, of course, has not been the work of just myself, so a huge thanks goes to the Lucene community, which I am part of, and to the other people working on this topic: a colleague of mine, Elia Porciani, who helped me on these developments, and Christine, Cassandra, and Michael for helping me with the review and the final merge of the code.
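Before closing, and coming back to the re-ranking functionality mentioned above: since the native integration is still work in progress, here is a client-side sketch of the same idea, with an assumed collection, assumed field names, an assumed sentence-transformers encoder, and the assumption that the documents were indexed with their vectors stored. You retrieve candidates lexically with BM25, then re-order the top results by vector similarity to a query embedding computed outside Solr.

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

SOLR = "http://localhost:8983/solr/articles"
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_text = "how many people live in Rome"
query_vector = model.encode(query_text)

# 1. Lexical retrieval (BM25) of the top candidates, returning the stored vectors.
docs = requests.get(f"{SOLR}/select", params={
    "q": query_text, "df": "text", "rows": 100, "fl": "id,text,vector",
}).json()["response"]["docs"]

# 2. Re-rank the candidates by cosine similarity to the query vector.
reranked = sorted(
    docs,
    key=lambda d: cosine(query_vector, np.array(d["vector"])),
    reverse=True,
)
print([d["id"] for d in reranked[:10]])
```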
So thank you very much, and we have time now for some questions from the audience. So, thank you. It's open to questions now, so if you have any questions, please feel free to raise your hand.

Okay, thank you very much for the great presentation. One question from my side: this was about different languages, so is there any different perspective, for example, for the Japanese language, or Japanese ideographic characters like kanji? Can we apply this logic there as well?

Yes. As I mentioned here, there are, for example, various models already available on Hugging Face, which is a sort of repository for large language models. The way you train a language model effectively starts from text tokenization, which may depend on the language; specifically for Chinese, Japanese, and Korean there are very specific rules, because you move from alphabets to logographic writing systems. So you have some differences, but then, once you do that tokenization and the text analysis, the pre-training is pretty much similar. What happens is that you may find many more models for the English language, of course, but there are large language models already trained on Japanese corpora and fine-tuned on specific tasks. So there is support for that. Let's say that the language-specific part is what happens at the beginning of the pipeline; after that, you get a similar behavior across languages once the text has been split into tokens and the model trained.

Thank you very much. You're welcome. Do we have any other questions? Thank you. Arigato.