Hi everyone. Welcome to my talk on using Apache OpenNLP with OpenSearch k-NN vector search at the Linux Foundation Open Source Summit North America 2023. I wish I could be there in person to present this, but I am thrilled to be able to share it with you virtually. My name is Jeff Zemerick. I am a cloud, big data, and NLP consultant, and the current chair of the Apache OpenNLP project. You can get in touch with me in a variety of ways; the easiest is probably LinkedIn. If you have any questions or would like to discuss anything that you see here, or anything about OpenNLP in general, please feel free to reach out to me. I will be happy to help.

Apache OpenNLP is a Java library for many common natural language processing tasks. On the left side are some of OpenNLP's capabilities, like tokenization, document classification, named entity recognition, and language detection. It's a lightweight library: it has no dependencies, and it does not require a GPU to train or use models. OpenNLP was first made available on SourceForge a little over 20 years ago, in 2002. In 2010, OpenNLP joined the Apache Software Foundation incubator. In 2017, 2018, and 2019, a lot of exciting things happened in NLP, and then in 2022, OpenNLP 2.0 was released with support for the ONNX Runtime, which makes it possible to use some of the newer types of NLP models. So OpenNLP has a lengthy history, but it is a solid and trusted NLP framework that is being connected to some of the newer NLP technologies.

Sentence Transformers is a Python framework for state-of-the-art sentence, text, and image embeddings. For this talk, we want to get vectors for sentences and ultimately search those sentences, and one way we can do that is with Sentence Transformers. There is more information about it at sbert.net, and the paper that describes it is linked there.
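As background for what follows: the search in this talk works by comparing sentence vectors, and that comparison is cosine similarity, which can be computed in a few lines of pure Python. This is a minimal sketch; the example vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0 (up to rounding)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

Two sentences whose vectors score close to 1.0 are considered similar; scores near 0.0 mean unrelated.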
There are pre-trained models available on the Hugging Face Hub, so you don't have to train your own; you can just get started with one of the pre-trained models that are already out there. With Sentence Transformers, we can take an input sentence such as "I ate a hamburger" and generate a vector for that sentence. With those vectors we can do a lot of neat things, one of which is finding similar sentences through search, which we'll walk through in this presentation.

Cosine similarity is a measure of the similarity between two vectors, so if we can represent text as vectors, we can use cosine similarity to find similar text. In the example here, "I am hungry" is one sentence and "His name is John" is the other; cosine similarity lets us measure the similarity between those two sentences. We can then use OpenSearch and its k-NN plugin as a way to facilitate this.

So we have Sentence Transformers, which lives in the Python ecosystem, and we have Apache OpenNLP, which is a Java library for NLP. How do we use the Sentence Transformers models, which are PyTorch and TensorFlow models, from Java, from OpenNLP? One way is to use ONNX and the ONNX Runtime. We can take a pre-trained Sentence Transformers model, export it to ONNX, and then use the ONNX Runtime to call that model from Java. ONNX is a model format that allows models to be used across frameworks, with some optimizations. In the image here, you can take a model created in one deep learning framework, export it to ONNX, and then use it from another framework. In this example, we're taking pre-trained Sentence Transformers models, exporting them to ONNX, and then using them from Apache OpenNLP in Java. And exporting a Sentence Transformers model to ONNX is pretty simple; with the Python code here, it's just a couple of lines.
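The couple of lines in question look roughly like this. This is a sketch using the Hugging Face Optimum library's `ORTModelForFeatureExtraction`; the model name and output path are just examples, and newer Optimum releases use `export=True` in place of `from_transformers=True`.

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Name of the model as it appears on the Hugging Face Hub (example model).
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Download the model and export it to ONNX in one step.
model = ORTModelForFeatureExtraction.from_pretrained(model_name, from_transformers=True)

# Write the exported ONNX model to a local directory.
model.save_pretrained("./onnx-model")
```

Running this requires the `optimum[onnxruntime]` package and a network connection to download the model.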
Here, model_name is the name of the model as it is shown on the Hugging Face Hub. We call from_pretrained with the model name and from_transformers=True, and then we call save_pretrained and give it a local path on our file system. When this runs, it will get the model from the Hugging Face Hub, export it to ONNX, and save it on our local file system in the directory given by the ONNX path. We'll then have an ONNX model that we can start using.

So once we have our model exported to ONNX, the question is: how do we call it from Java? We know we're going to use the ONNX Runtime, but what are the inputs and outputs of this model? How do we make use of it? There's a really useful app called Netron, available at netron.app. With it, you can open the exported ONNX model and visually see the model's inputs and outputs. Looking at a Sentence Transformers model that was exported to ONNX, we can see it has three inputs: input IDs, attention mask, and token type IDs. Over here on the right in the visual, we can see the input IDs and the token type IDs; the attention mask is there too, but the image is just too big, and drawing a red arrow didn't quite fit on the screen. Those three are our inputs to the model. Then we have our output down here: a three-dimensional array of floats whose last dimension is 384, and 384 is the size of the vectors that Sentence Transformers generates for us. So with this information, we know that we have three inputs, the input IDs, attention mask, and token type IDs, and we know the data type for each: they are integers. We also know the output: a three-dimensional array of floats. With that information, we can go to our Java app and use the ONNX Runtime to run this model. So let's take a closer look at the values of these inputs.
So for the input IDs, they're the integers that we saw. Each value in this array is an integer that corresponds to a token in the input. We get the input, we tokenize it, and then for each token we look in the model's vocabulary file to determine the ID for that token, and we put the ID in the array. So we are essentially just converting the tokens to integers based on the vocabulary file.

The attention mask is a binary mask over the tokens we want to consider. Since we are creating a vector for a sentence, we want to consider every token in the sentence, so every value is one. In some cases you may not: you would put a zero in the positions of the array for tokens you don't want considered, and for a different NLP task there may be tokens you want to ignore. But for this example, we want to consider them all.

Token type IDs denote where one sequence ends and another starts. Since we are only doing one sentence in this example, we only have one sequence, so each value in this array is zero. In other NLP tasks, or even here if we had multiple sentences, multiple sequences, we could use this input to denote where one sequence ends and another begins.

The output from the model, as we saw, is a 384-length array of floats. This is our sentence vector; this is what we are after. We saw from Netron that the values are floats, and each position in this 384-dimension array holds a float value. That is our sentence vector.

So now that we know what these inputs are, how do we supply them from Java using the ONNX Runtime? It's not very complicated. We have a Map of String to OnnxTensor for our inputs, so we'll have three entries: the input IDs, the attention mask, and the token type IDs. For each of those, we create a tensor and put the values in it. For the input IDs, we just get the IDs for each token.
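As a sketch of that token-to-ID conversion, here is what building the three input arrays might look like in plain Python. The tiny vocabulary and the whitespace tokenizer are made up for illustration; real models ship a vocab file and use a subword (WordPiece/BPE) tokenizer.

```python
# Toy vocabulary mapping tokens to integer IDs; real models ship this as a file.
vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "ate": 8823, "a": 1037, "hamburger": 20497}

def build_inputs(sentence):
    # Naive whitespace tokenizer; real models use a subword tokenizer.
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]   # token -> ID via the vocabulary
    attention_mask = [1] * len(input_ids)    # 1 = consider this token
    token_type_ids = [0] * len(input_ids)    # single sequence -> all zeros
    return input_ids, attention_mask, token_type_ids

ids, mask, types = build_inputs("I ate a hamburger")
print(ids)    # [101, 1045, 8823, 1037, 20497, 102]
print(mask)   # [1, 1, 1, 1, 1, 1]
print(types)  # [0, 0, 0, 0, 0, 0]
```

These three arrays are exactly what gets wrapped in tensors and handed to the ONNX Runtime session.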
So we will have tokenized the text and used the vocab file to get those IDs. For the attention mask, we saw those values are all one, so that one is easy: we just put a one in each position of the array. The token type IDs are all zero, since we have a single sequence. For the output, we know we're expecting back a three-dimensional float array, which we call v. So we call session.run, pass it the inputs, and we know what the output is: our sentence vector, the thing we're after, is this v. The link at the bottom goes to the OpenNLP source, where you can see how this is implemented along with all of the supporting code, but these lines here are the most important part.

So now we know how to generate the vectors for our sentences. We want to be able to search: to take these vectors and find similar sentences, similar documents, for each one. One way we can do that is with OpenSearch. OpenSearch's webpage describes it as a scalable, flexible, and extensible open-source software suite for search. OpenSearch offers traditional BM25 search, but it also has capabilities for vector search through its k-NN plugin. So let's take a look at using that.

All of the source code and commands from here on out are included in the repository for this presentation on GitHub. Please take a look at that repository; all of the code and the steps are in there, so you can clone it and walk through these same commands. If you have any issues or questions about it, please don't hesitate to reach out to me.

To start, there is a Docker Compose file in that repository. If we run docker compose up, it will stand up OpenSearch for us on our local computer. Once that is finished, we can create an index in OpenSearch. Using the command here, we are going to create an index called vectors.
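The index-creation request looks roughly like the following. This is a sketch showing the request body; in the repository it is issued with curl, and the settings block enabling k-NN is an assumption based on the OpenSearch k-NN plugin's requirements.

```json
PUT /vectors
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 384
      }
    }
  }
}
```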
And we are going to create our vector field. In the mappings, under properties, we're creating a field called my_vector, and we tell OpenSearch that its type is knn_vector with dimension 384. This is the field into which we will index our vectors.

So now we have our index and our vector field; next we need to generate our vectors. In that repository is a small Java app that includes the code I showed earlier for generating the vectors. We just need to build it and run it: cd into that directory and run mvn install to build it, and when it finishes you'll have a jar file that you can run. If you run java -jar with the jar you just built and give it the path to the ONNX model, it will load that model, generate vectors for the sentences included in that small app, and write them to a file called vectors.

If you take a look at the source code for that app, there are three sentences in there: "George Washington was president," "Abraham Lincoln was president," and "John likes ice cream." Those are the three sentences for which vectors are generated when you run the command above. You'll note that two of these sentences are very similar. That was done on purpose, just for illustration, so that when we search we can see two sentences come back with scores more similar than the third.

So now that we have run that, we have a file called vectors containing our vectors, and we can index them into OpenSearch. That is done using the command here: we make a POST request to the vectors index's _bulk endpoint and pass it the vectors file. So what does this vectors file we made look like? It looks a little bit like this. This is a shortened version to give you an idea.
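A shortened sketch of that bulk file, with the 384-dimension vectors truncated to a few illustrative values:

```json
{ "index": { "_index": "vectors", "_id": "1" } }
{ "my_vector": [0.0123, -0.0456, 0.0789, ...] }
{ "index": { "_index": "vectors", "_id": "2" } }
{ "my_vector": [0.0234, -0.0345, 0.0678, ...] }
{ "index": { "_index": "vectors", "_id": "3" } }
{ "my_vector": [0.0912, 0.0567, -0.0234, ...] }
```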
On the first line, we say that we want to index into the vectors index a document with ID one, and the my_vector field contains the vector for that sentence. So this file contains each of the three vectors in three documents: doc ID one with a vector, doc ID two with a vector, doc ID three with a vector, all going into the my_vector field.

So now that we have our vectors indexed, we can do a search. Given some sentence, we can ask: what are the most similar sentences in this index? We can do that using this command here. We are making a GET request to the vectors index's _search endpoint, and here is our query: a k-NN query against the my_vector field, the field that we set up in the index, and here is our vector. This vector is for a sentence; it can be any sentence. We could use that Java app, which calls OpenNLP, to generate a vector for any other sentence that we want to search with to find similar sentences. When this command is run, it reaches out to OpenSearch, does the search, and returns the IDs of the documents that are similar.

And here are the search results. We get back three hits, which was expected because we only indexed three documents. For each document, we get a score: document one has a score of 1.0, document two has a score of 0.5, and document three has a score of 0.41. The reason the first document has a score of 1.0 is that the vector I used in the search was identical to one of the vectors that we indexed, so that's expected. The second document has a score of 0.5 and the third 0.41, so you can see which two sentences are which: we have our two sentences about presidents and the third sentence about ice cream.
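For reference, the k-NN search request walked through a moment ago looks roughly like this. The query vector is truncated here; in practice it is the full 384-dimension vector generated for the search sentence.

```json
GET /vectors/_search
{
  "size": 3,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.0123, -0.0456, 0.0789, ...],
        "k": 3
      }
    }
  }
}
```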
And so the first two sentences are more similar to each other than either is to the third.

To recap what we did: we saw that Apache OpenNLP is a Java machine learning library for natural language processing. It includes capabilities for a lot of common NLP tasks, and in the past year or so it has been updated to use some of the newer transformer-architecture models via the ONNX Runtime, by converting those models to ONNX and then using them from OpenNLP. If you're a Java developer and you live in the Java ecosystem, you may in the past have used microservices for NLP tasks, because it's hard to bridge Java and the Python NLP ecosystem; this is one method by which you can call those models directly from your Java code. OpenNLP also supports named entity recognition and document classification via the ONNX Runtime, so those types of models can also be taken from the Hugging Face Hub, exported to ONNX, and used from OpenNLP in a similar fashion to what we saw here. We then used OpenNLP to generate sentence vectors, indexed those vectors into OpenSearch, and finally did a search to find the nearest documents for a given vector.

So again, thank you. I hope this presentation is helpful and informative. If you have any questions, please feel free to reach out to me; connect with me on LinkedIn and send me a message, and I'll be happy to help. If you have an interest in NLP, if this is exciting to you, we would love to have your contributions to the OpenNLP project. Whether or not you're new to it doesn't matter; we'll be happy to help you get started. Take a look at the Get Involved page, reach out to us, or reach out to me personally and I can help you get started with the project. Everyone is welcome, and no contribution is too small; all are valued, and the project appreciates everything. So again, thank you. Thank you for coming to the presentation, and thank you to the organizers for the invitation.