Hi, everyone. Thanks for coming to my talk. We'll talk about a recent journey I went on and its goal of bringing some cool NLP stuff to Apache OpenNLP, and then an example of using it from Apache Solr. My name is Jeff Zemerick. I'm a search relevance engineer at Open Source Connections. We do work around improving search relevance for organizations, so if your search results aren't good and your customers aren't happy, get in touch with us. I'm also the current chair of the Apache OpenNLP project, so you can see where my motivation for this comes from: improving Apache OpenNLP, but also trying to improve search relevance through Apache Solr as well. Please get in touch if you're interested in this stuff. I would love to talk to you and collaborate, on Twitter or LinkedIn. I'm happy to meet new people.

Open Source Connections is where I work, so a quick shout-out about what we do. We empower the world's search teams, and we host Relevance Slack, with about 2,600 people in there right now talking about everything search. If you like search, it's a really good resource. We just wrapped up our Haystack conference not too long ago, in April. It was a lot of fun in Charlottesville, Virginia. Everything search, check it out. And we're hiring, if you're interested.

So first, Apache OpenNLP, in case you're not real familiar with it. It's been around for 12, 13 years now; it joined the Apache ecosystem around 2010 or 2011. It has a lot of capabilities: NLP stuff like tokenization, document classification, and named entity recognition. A lot of the things you hear about in the more modern Python tooling have been in Apache OpenNLP for a while. And in case you're new to it, Apache OpenNLP is Java, so you can use it outside the Python world. It's lightweight and has no dependencies, like literally no dependencies: the Maven artifact depends on nothing. Even with the recent Log4j stuff, we weren't affected, because there's no dependency on Log4j. So you can bring it in without a lot of worries around dependencies. It uses perceptron learning, which is just a single-layer neural network, and it's CPU only: it trains and runs fast on CPUs, so you don't need GPUs or some of the other things that the more modern approaches you hear about require.

I have a lot of stickers. If anybody would like a sticker, they look really good on your laptop, so please come get one afterward. You don't have to talk to me; you can take a sticker without talking to me. But I do have stickers.

So the relationship between OpenNLP, Apache Lucene, and Apache Solr is like this: Solr depends on Lucene for pretty much everything. If you're not new to search, you're probably already well aware of that. What you might not know is that Apache Lucene actually has a dependency on OpenNLP: some of the NLP stuff in the Lucene analyzers module depends on OpenNLP to perform entity extraction and some other operations. So that's the hierarchy of what we're working with.

Some history about how far back these things go. OpenNLP started on SourceForge in 2002, so much longer than I said, about 20 years then. In 2010 it joined the Apache Incubator, and in 2011 a ticket, LUCENE-2899, was opened to bring an OpenNLP integration into Lucene. If we skip forward six years to 2017, we can see that the ticket finally got resolved. Imagine working someplace and having a ticket sit there for six years.
You just kept saying, next sprint, next sprint, next sprint. But that's open source for you. Things change; we're all volunteers. But in six years that ticket got done, and OpenNLP was integrated into Lucene. With that done, things could start rolling a little bit faster. In 2018, Solr 7.3 came out, and it included the functionality that Lucene now exposed. So there was a new OpenNLP update processor in Solr, and we could do some of that named entity recognition stuff inside our Solr ingest. We had a lot of new capabilities. Really exciting.

And then I put a green box: all kinds of NLP fun stuff. Between 2018 and today, the NLP world just kind of went insane. Everything changed. There was so much new stuff. Java kind of got pushed to the side, the shiny new Python things came rolling in, and the NLP world turned on its head. It went from being something very niche that you didn't use a lot to something that's used every day, and you can't avoid it even if you tried.

So today, in 2022, earlier this year, the project released OpenNLP 2.0, the first major release of Apache OpenNLP in about 11 years. After 11 years, we finally released version 2.0, our second major release, and in this release we brought in some really cool stuff.

So what changed? Everything changed in NLP, and for a person with a Java background like me, all the new NLP stuff went to Python. You kind of felt left behind. Bringing the two things together was kind of a mess, but PyTorch and all those cool things over there, we kind of had to use them, we kind of had to learn them. There are still a lot of applications for Java in the NLP world, though. So in Apache OpenNLP 2.0, the goal was to modernize Apache OpenNLP. I don't believe that Java developers and the Java ecosystem should have to go out to Python. Nothing against Python: love it, it's great. But I don't think we should have to jump through hurdles to get things working. The goal for the initial 2.0 release was to bring OpenNLP into the future so that it's just easier to use.

So OpenNLP on the left, with Java and its perceptron algorithm, I like to think of as a very dependable late-80s, early-90s pickup truck. It's the kind of truck that's probably going to start unless the battery is dead or something crazy. You know exactly how it's going to run. It's your workhorse, right? And over on the right side we have the Hugging Face Transformers library in Python, with neural networks that are super deep and complicated, and I like to think of it as a rocket ship. Everything's moving quickly, and there's a lot more to learn about putting a rocket ship into space. We might not all be able to go out and just use a rocket ship, but we all know how to use an old 1990s truck. That's just how I think of it in my head. So I want to bring those things together a little bit. I want OpenNLP to be able to use the Hugging Face Transformers ecosystem, those models and their deep neural networks, and maybe not turn OpenNLP into a rocket ship, but give it the rocket ship power without the rocket ship complexity.

So if you've worked with NLP in the past couple of years, you've probably used Hugging Face Transformers and the models up there; perhaps you've trained your own. It's a lot of fun.
If you haven't, I encourage you to give it a spin, try it out, see how things go. So we have all of these wonderful models that the community has made available to us up on the Hugging Face Hub, and here we are over in OpenNLP. If you're like me, you look at these models and you kind of salivate: all these models are up there, and you can't use them, and you want to use them. So let's make it so that we can use these models from Java and Apache OpenNLP.

How do we do this? We need some bridge between the Python ecosystem and the Java-based Apache OpenNLP. This was a research project I got interested in last year: what is out there that can build this bridge? It turns out there are a few things that can help. Some required more work than others; some were promising but not quite there yet. What I found really promising, and what works today, is the ONNX Runtime. ONNX is an open standard that defines models: how they can be used, stored, and transmitted. Using the ONNX Runtime, we're able to build that bridge between the Python ecosystem's models and Apache OpenNLP's Java land.

So here is a command on this slide, a Python command, using the Hugging Face Transformers library to take a model, nlptown/bert-base-multilingual-uncased-sentiment (we like long names, I guess), straight off the Hugging Face Hub that we saw on the last slide. We tell it what kind of model it is, sequence classification, we give it a name, and we hit enter. The command goes out, downloads that model for us, and converts it from PyTorch, or whatever it was trained in, into an ONNX model. So now we have an ONNX model, and we want to bring it over and make it usable from Apache OpenNLP. So we take that ONNX model, and now we have our deep, crazy, complex neural networks. We've turned our 1980s truck into the fastest land vehicle on Earth, whatever that thing is. We have a truck, but it's powered by rockets, right? So that's what we're going to do.

So we have our ONNX model. Great. How do we use it? What do we do with it? I can't just say, "Hey, ONNX model, do stuff." It doesn't quite work like that. Luckily, there's a really cool tool out there called Netron. If you're familiar with it, you probably know how awesome it is. If you're not familiar with it and you work in this space, I highly encourage you to go check out Netron. It lets you open up your ONNX model, and it will show you the inputs and outputs and what the model looks like. So in this screenshot, when I opened the model that I converted, you'll see that it has inputs and outputs, and that is the very, very important information we're looking for here. The inputs: how do we call this model, what do we need to give it in order to get stuff back? And the outputs: how do they come back to us? On the inputs, there are three things listed: input_ids, attention_mask, and token_type_ids, and Netron tells you their data structures; they're integer arrays. And the output is the logits, a float array. So that right there is the golden information.
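As an aside, you can pull that same golden information out programmatically. Here's a minimal sketch using the ONNX Runtime Java API (the model.onnx path is just a placeholder), which prints the same input and output names Netron shows:

    import ai.onnxruntime.NodeInfo;
    import ai.onnxruntime.OrtEnvironment;
    import ai.onnxruntime.OrtException;
    import ai.onnxruntime.OrtSession;
    import java.util.Map;

    public class InspectModel {
        public static void main(String[] args) throws OrtException {
            OrtEnvironment env = OrtEnvironment.getEnvironment();
            try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
                // The same names and shapes that Netron displays in its UI.
                for (Map.Entry<String, NodeInfo> input : session.getInputInfo().entrySet()) {
                    System.out.println("input: " + input.getKey() + " -> " + input.getValue());
                }
                for (Map.Entry<String, NodeInfo> output : session.getOutputInfo().entrySet()) {
                    System.out.println("output: " + output.getKey() + " -> " + output.getValue());
                }
            }
        }
    }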
With this, we can now load up our ONNX model in Java, and we can make calls straight to it. Notice at the bottom, on the outputs, I circled it in red: the Netron app is telling us that the output is a float array of length nine. Think about where that nine comes from; if you open up a different model, it's probably going to be a different number. Nine corresponds to the labels that were in the model when it was trained. This one is a named entity recognition model, and when you train a named entity recognition model, you tell it what things are. Is it a person? Is it a place? Is it an organization? So the labels are in the chart at the bottom. The first one, O, basically means outside: it's nothing, it's not an entity, it's not what we're looking for. And then you have B-MISC, beginning of miscellaneous; I-MISC, inside miscellaneous; B-PER, beginning of person; I-PER, inside person; and so on. Using those labels, you can say what an entity is in a piece of text. So if you saw my name, Jeff Zemerick, you would say Jeff is B-PER, beginning of person, and my last name, Zemerick, is I-PER, inside person. That right there is your label for the entity. So the nine in the logits, the output, corresponds to each of those labels.

You're probably ahead of me here, already thinking: oh, sure, so each of the values in that array that comes out corresponds to the probability that that token is that label. And you're right. That's exactly how it works. So to know what those labels are, you have to know how the model was trained. Luckily, on the Hugging Face repo, if you pull up that model, there will be a file with the labels. Sometimes people are nice and put it right in the README for you, so you don't have to dig for it. You just have to know what the labels were and how the model was trained, and that tells you what the probabilities that come out in the outputs mean. What I just said about the labels being in the model config: for this one, it's in the config.json file that was used in training the model. Over on the left, in the JSON, you can see the nine labels. That's how you know: as a model output, I expect to see a float array of length nine, with each element corresponding to the index given in the JSON. This format is referred to as the IOB, or BIO, format: inside, outside, beginning, for the label tags. You'll see it in NLP stuff everywhere.

So now let's look at the inputs to the model; we need to wire up our code. We know what the model is giving us back: an array of some length that corresponds to our labels, with probabilities. So how do we call the model? We saw earlier that Netron was telling us it expects three inputs: input_ids, attention_mask, and token_type_ids. So what are these things? What's it looking for? The input_ids is an array where each element is the value of that token from the model's vocabulary. Whenever a model is trained, up on Hugging Face there will be a vocab.txt (it might be named something different) that contains the model's vocabulary, and in that vocabulary each token has a value with it. The numbers here I just made up for illustration, but the model expects that type of input. The model doesn't understand words; it understands IDs from the vocab file. So if the sentence were "I'm having a great day", then for each word in that sentence we'd say: okay, our first word is "I", let's go to the vocab.txt file, find the value for "I", and put it in our array to be passed. We'd do that for each of the tokens in there. So there's just a little bit of work to fill up that input array.
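As a rough sketch of that lookup, assuming the typical BERT-style vocab.txt format, one token per line, where the line number is the token's ID (the [UNK] fallback ID is illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Vocab {
        private final Map<String, Long> ids = new HashMap<>();

        public Vocab(String path) throws IOException {
            // One token per line; the line number is the token's ID.
            List<String> lines = Files.readAllLines(Paths.get(path));
            for (int i = 0; i < lines.size(); i++) {
                ids.put(lines.get(i), (long) i);
            }
        }

        // Look up each token's ID; real code would map unknown tokens to [UNK].
        public long[] toIds(String[] tokens) {
            long[] result = new long[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                result[i] = ids.getOrDefault(tokens[i], 100L); // 100 is [UNK] in many BERT vocabs
            }
            return result;
        }
    }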
The attention mask is luckily easier. The attention mask is just a mask of ones and zeros for the tokens we want the model to consider. In this example, we're giving the model a sentence and we want to find the named entities in it, so we give every token a one, which just means: consider every token in there. If there are tokens you don't want considered, those values are just zero, and the model won't consider them. For all practical purposes, your attention mask will probably always be all ones, unless you have a reason otherwise.

The token_type_ids are used to separate sentences in the input. If you're doing multiple sentences, or a document, there's got to be some way to tell the model where a sentence starts and stops. For this example we're doing a single sentence, so every token in there just gets the same value; with a second sentence, the values would flip for its tokens, something like 0 0 0 0 1 1 1 and so forth.

And then our outputs from the model. So we wired up those three arrays, and we have them ready to be passed to the model. Just to summarize: the model gets the input IDs for each token, the model does its black-box work, right, like models do best, and then the output is our array with our float values in it. These are completely made-up numbers, but for each token in the input we'll have this array, and we can look at that array and figure out what the model is telling us.

So in this example, the sentence is "George Washington was president." The inputs corresponded to George, 14851, the ID of that token in the vocab file; Washington, 10749; and so on. The numbers will vary between models; they're not important, so don't worry about the specific numbers, I kind of just made these up for this example. The attention mask and token type IDs are all ones. So there are our inputs.

And now let's look at our outputs. For each token, George, Washington, was, president, we get back that float array of length nine. For our first token, George: 1.459, 0.484, and so forth across the row. Washington gets its own array of length nine; "was" and "president" the same. And down at the bottom we have our labels. So if we find the highest value for each token in those arrays, it will tell us what entities are in that sentence. In this example, 3.447 is the highest value, the highest probability, for George, and it corresponds to the B-PER tag. So the model is telling us that George is the start of a named entity. And you're like, oh, cool, all right, makes sense, right? Then we look at the next token, Washington: 3.424 is the highest value, and that corresponds to the I-PER label, inside person. Excellent. So what's the model telling us? It's telling us that George Washington is a named entity in the sentence. Great, exactly what we would expect. For the next token, "was", 3.497 is the highest value, and that corresponds to the label O, for outside. So the model is saying it doesn't think this token is a person, place, or thing, or whatever the labels might be. And the same for "president": 4.197 is the highest value, also corresponding to O, so it's not an entity either.
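That "find the highest value" step is simple to sketch in Java. The label order here is illustrative; the real order comes from the model's config.json, as we saw earlier:

    public class LabelDecoder {
        // Illustrative BIO label set; read the real order from the model's config.json.
        private static final String[] LABELS = {
            "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"
        };

        // Given one token's length-nine array of logits, return the winning label.
        public static String decode(float[] logits) {
            int best = 0;
            for (int i = 1; i < logits.length; i++) {
                if (logits[i] > logits[best]) {
                    best = i;
                }
            }
            return LABELS[best];
        }
    }

Run each token's row of logits through that and you get the B-PER, I-PER, O, O sequence from the slide.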
So in this example, the model is saying that George Washington is our named person entity, and the other two tokens, "was" and "president", are not named entities.

So what we have done, what I have described, is: we have taken Apache OpenNLP from the Java ecosystem, we took a model that was trained in PyTorch through the Hugging Face Transformers library, we converted that model for the ONNX Runtime, and then we used the super cool, awesome Netron app to figure out our inputs and outputs to that model, and we let the model do its fancy neural network stuff to figure out that George Washington is a named person.

So now let's look at some Java code, the code for what we just ran through with those fun arrays and things. It's remarkably simple, which is fantastic for people like me writing the code, right? The inputs to the model are exactly what we saw in the Netron app. We're building a Map<String, OnnxTensor>; we'll call it "inputs", because I'm a very non-creative person. Into this map we put our input_ids, attention_mask, and token_type_ids, and it's just a matter of wrapping them up into what the ONNX Runtime expects. You'll see on the input_ids line, over on the far right, there's a call to tokens.getIds(), and that's just asking the vocabulary: hey, what are the IDs for these tokens? The attention mask is the same thing: tokens.getMask() returns an array of ones. And token_type_ids, same thing: you can see it's making a new long array, the length of the tokens, where each value is one. So now we can send those inputs over to the model and get our outputs. We just call run on our ONNX Runtime session, give it the inputs, and get the output value. The value comes back as a three-dimensional float array, and the third dimension is the one we're interested in, because that's the array we saw that's got the values in it. So we just go through those values like we did on that last slide and pick out the greatest one for each token's labels.
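Put together, that proof-of-concept call looks roughly like this; a minimal sketch, with the array shapes and variable names assumed:

    import ai.onnxruntime.OnnxTensor;
    import ai.onnxruntime.OrtEnvironment;
    import ai.onnxruntime.OrtException;
    import ai.onnxruntime.OrtSession;
    import java.util.Map;

    public class NerPoc {
        // ids, mask, and types are shaped [1][numTokens], built as described above.
        public static float[][][] infer(String modelPath, long[][] ids, long[][] mask,
                long[][] types) throws OrtException {
            OrtEnvironment env = OrtEnvironment.getEnvironment();
            try (OrtSession session = env.createSession(modelPath, new OrtSession.SessionOptions());
                 OnnxTensor inputIds = OnnxTensor.createTensor(env, ids);
                 OnnxTensor attentionMask = OnnxTensor.createTensor(env, mask);
                 OnnxTensor tokenTypeIds = OnnxTensor.createTensor(env, types)) {
                // The input names are the ones Netron reported for this model.
                Map<String, OnnxTensor> inputs = Map.of(
                    "input_ids", inputIds,
                    "attention_mask", attentionMask,
                    "token_type_ids", tokenTypeIds);
                try (OrtSession.Result result = session.run(inputs)) {
                    // [batch][token][label]: the innermost arrays are the
                    // length-nine logits rows we just walked through.
                    return (float[][][]) result.get(0).getValue();
                }
            }
        }
    }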
So that was fantastic, right? But that was done in a sandbox: terribly written, proof-of-concept type code. Now let's do the fun part: let's integrate it into Apache OpenNLP so the community can use it. OpenNLP has interfaces for these things; TokenNameFinder is the interface, here from the Javadocs, for named entity recognition. Wouldn't it be fantastic if we could just integrate that ONNX Runtime code right there inside the OpenNLP interfaces? That way, places that already use them, like Lucene or other applications out in the world that use OpenNLP, won't really need big changes, right? Coding to the interface makes everybody's life easier. So let's just create new implementations of those interfaces that use the ONNX Runtime.

One really, really cool thing that I want to stress: yes, of course, I'm a big Java fan, just because I'm a Java engineer, that's all, and there's some really cool stuff over in the Python world; I have absolutely nothing against it. But one thing you might have noticed here is that the model training does not happen inside OpenNLP. We still have to, I shouldn't say have to, we still get to do the model training over in our favorite PyTorch and TensorFlow worlds. So this just gives you another avenue for using that model outside of the Python ecosystem. We still get to train our models in PyTorch or TensorFlow using Transformers, convert them to ONNX, and then use them from Java. There are other types of models in OpenNLP, like the perceptron ones I mentioned earlier, that you can train from within OpenNLP, but I want to highlight that all the model training here still happens over in the Python ecosystem, in the PyTorch and TensorFlow world. So I think it's really good to bring these two ecosystems together, to merge the Java NLP ecosystem with the Python NLP ecosystem instead of duplicating all of that work.

So now we have it wired up in OpenNLP, and we want to make it available to our downstream applications as well. So let's go look at Lucene. Maybe it won't take six years this time, right? We look and find our integration point: in Lucene, in the analyzers, we have an opennlp module. Excellent. That's our integration point in Lucene. This work is still actively ongoing, so if you're interested in Lucene and would like to help me get this nice and tidied up for a pull request, that would be super awesome, really appreciated, and we can beat that six-year record from last time. Just get in touch with me. But that's where this will be. And once it's in there, anything that uses Lucene, like Apache Solr, Elasticsearch, or Amazon OpenSearch, gets this capability available downstream as well. Inside Lucene, in that module, there's a class called NLPNERTaggerOp. Great name, right? That might be why it took six years, everyone got confused. So that's where we'll do the implementation.

So now let's assume it's in Lucene, ready to go, and use it from Solr. My background, my current job, is helping people tune their search and get better search relevance, right? One thing in that toolbox is named entities. If you're searching for people, if you're searching for things, having some awareness in Apache Solr of those people and things can do a great deal for your search relevance. That's part of my motivation: let's get this in here so we can use it from Solr.

So let's take a look at how we can potentially use that. In this example, this is an update request processor factory in Solr. We just call the chain onnx-opennlp, terrible at names, but hey, it works, right? And we're referencing a processor class in Solr that I added called the OpenNLPDoccatUpdateProcessorFactory; we like long names again. What this update processor factory does is perform document classification on some text. When I say document classification here, you can think of sentiment analysis: the example is taking in movie reviews and predicting whether they're positive or negative. That's what this document classification model is doing. We could have also done named entity recognition the exact same way; those are the two pieces of OpenNLP functionality currently implemented for ONNX, named entity recognition and document classification, with the goal being to expand that and keep going.

So the inputs to this update processor factory: the model file, the path to our ONNX model; the vocab file, remember, it was important, right, so the path to that too; and then source and destination. Source says: in the input document that I'm indexing in Solr, what field do I want to do my classification on? Let's do it on the overview field. So we're taking movies, and we're taking the overview of each movie, think of going on IMDb and reading the overview, and we're going to predict whether that movie is good or bad. Document classification. And in the destination, we're going to put the predicted value into a new field in the Solr document called onnx_classification. And that's it. You wire this thing up into Solr and you're ready to go. It would have helped if I'd just had that all on one slide as well: the model, the vocab file, the schema field we want to classify on, and the destination schema field, which will hold our predicted rating.
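From memory, the wiring looks roughly like this in solrconfig.xml and the schema; treat the exact parameter names as illustrative rather than gospel:

    <updateRequestProcessorChain name="onnx-opennlp">
      <processor class="solr.OpenNLPDoccatUpdateProcessorFactory">
        <!-- paths to the exported ONNX model and its vocabulary -->
        <str name="model">/path/to/model.onnx</str>
        <str name="vocab">/path/to/vocab.txt</str>
        <!-- classify the overview field; write the label to onnx_classification -->
        <str name="source">overview</str>
        <str name="dest">onnx_classification</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <!-- and in the schema, the destination field -->
    <field name="onnx_classification" type="string" indexed="true" stored="true"/>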
So here it is wrapped up. At the top is the update processor chain we just saw, and in the middle is the Solr field definition: a field named onnx_classification, type string, indexed true, typical Solr field stuff. And what happens is, after we start indexing some documents into Solr, if we look at our documents, if we do a search, we'll see this new onnx_classification field, and we'll see that it has some sentiment in it. "Very bad", for example, for this one. Interestingly enough, horror movies came back bad and very bad, and pretty much all other movies were good. So apparently the sentiment of a horror movie's description is not great. Just a fun fact there. So there's our classification of "very bad" for this movie.

So, thinking about wrapping up what we did here: we exposed a Hugging Face model trained in PyTorch to Java, through OpenNLP, through Lucene, and now we can use it through Solr. We're able to execute a state-of-the-art sentiment model in our Apache Solr pipeline. And like I said before, you can also do named entity recognition, so we could easily have changed our processor factory up there to named entities, pulled out the entities, and put them into a field to make them searchable as well.

How to help the projects: OpenNLP, Lucene, and Solr are all Apache projects, of course. I'm the chair of OpenNLP, and I'd love to have your help, everybody's help. No matter how trivial the task or how much of a beginner you are, I'll be glad to help you get started and contribute. It's a lot of fun, and honestly there's a lot of work to do, a lot of good work. I personally feel like there's a huge gap in some of the NLP stuff that's in Solr, and I think this is a great way to address that gap. And not just Solr: looking forward to things like OpenSearch, we can certainly tackle that next and bring this capability out there. So the next steps are to finalize these integrations with Lucene and Solr, implement the other interfaces, language detection, parts of speech, all kinds of good stuff we can use our ONNX models for, and lastly, of course, the stuff that's not fun: documentation and making examples available, so people can get started and use it.

Thank you all for coming. I'm really passionate about this, so it's great to talk about it. The code for this is up in a repo on my GitHub. It works, it does work: you can run it, it will use that model, you'll get the field there, and you can kind of see where I'm going with it moving forward. The README in that repo has links to all the other repositories and stuff, so just follow the README.
You'll have this up and running; it runs in containers, so you shouldn't have any trouble getting started. The Lucene ticket, LUCENE-10621, is the one I created to integrate the new OpenNLP 2.0 stuff, so if you want to take a look at that or keep your eye on it, hopefully, you know, you won't have to check back in six years; we'll have it done a lot sooner. But yeah, thanks a lot, I appreciate you coming. If there are questions, I'm happy to chat.

Yeah. So the question was: is there any performance hit from converting the model from Python to ONNX? I'm not an expert in the ONNX Runtime, but I believe they say that the ONNX Runtime actually adds improvements, and in some of the related technologies, like Apache TVM and others, the model gets optimized in that process. So I don't think there's a performance hit there. On the other side, the cost of calling that model from Java, I really couldn't say; you're going exactly where I've been going with making benchmarks. I have another project repo up on my GitHub, you'll see it up there, an Apache OpenNLP benchmarks repo, where I'm doing pretty much what you're describing: take models that were trained in PyTorch, evaluate the performance there, and then do the same thing for the analogous OpenNLP path, in terms of both time and system resources, as well as information retrieval metrics like precision and recall. So to answer your question, I don't think there's any performance hit from the ONNX Runtime, but on the Java side, I'm not entirely sure yet.

Yes. So the vocab file here, and I think I took a slide out earlier that covered this: it uses the WordPiece tokenizer, so in that vocab file there will be pieces of words and not just whole words. When OpenNLP does the tokenization, when it takes a token and asks the vocab file for its identifier, it will first tokenize it using WordPiece, so it might get back just a piece of a word, prefixed with "##", instead. Yes, the WordPiece tokenizer is there; it was introduced in 2.0. But if you need a different tokenizer, yes, you would have to implement it.

All right, I have stickers, come grab one. So the question was about the Go programming language: not really. I've personally just been focused on the Java side of it. It would be great to explore, if there's interest there; nothing says that Apache OpenNLP has to stay Java. If there's community interest in bringing other languages in, that'd be fantastic. Yeah. Sure. All right. Thank you.