Yeah, as just mentioned, I'm here to talk about natural language processing, or NLP for short, with JRuby and OpenNLP. My name is Konstantin Tennhard, and I'm currently working as a Ruby developer for flinc. You can find me on Twitter; my handle is @t6d. I'm based in Darmstadt, Germany, about a five-hour drive from here in Ghent.

Before we dive into the more technical aspects of this talk, I want to briefly talk about language as we use it to communicate with each other. Language has been around for centuries now, and it is excellent at encoding information: even the most complex contexts can easily be encoded and shared with other people. If we think about how language is usually used, the most prominent context is probably just talking to each other. That, of course, only works if you happen to be in the same room, and the information flows from the speaker to those who are listening. If we happen to be in different places, we need the help of modern technology: we could, for instance, write text messages or emails, or talk on the phone. The thing about the technology used in these scenarios is that the device does not understand at all what kind of information is relayed.
For the device, language is just a series of sounds or a series of characters. This changes with NLP: NLP enables machines to understand language at a semantic level. I want to give you a little example of one possible application. Imagine two people trying to communicate who speak different languages. With the help of NLP, we could think about a system that does real-time translation between the two languages, enabling those two people, who couldn't communicate before, to now communicate in their natural languages. And of course, NLP also enables a new way to interact with machines: we can simply talk to a machine and expect it to understand the information we try to convey. You can think of Siri as an example.

There are several applications for natural language processing. I already mentioned machine translation; another prominent one is text summarization. And then there is opinion mining, where you try to extract the author's opinion regarding a certain subject from a given text. These are all examples that are currently being actively researched, and they are also examples we won't talk about today, because they are highly complicated, and I wanted to show you something that you can actually try out yourself if you want. So the two examples I want to talk about today are named entity recognition, that is, extracting references to locations or names of people from a given text, and keyword extraction, that is, extracting the most relevant keywords or key phrases from a text.

Before we dive into these examples, I want to quickly talk about what natural language processing actually is from a computer science perspective. Natural language processing is a combination of many things. For instance, there's a software engineering part; then, of course, there is a lot of machine learning and statistics involved; and finally, since we're talking about natural language, there is a lot of linguistics involved. When you take classes at university about
natural language processing, you usually focus mostly on the linguistics and the machine learning parts. As most of you are probably application developers, I want to turn that around today and focus on how you can use NLP in your own applications. In that sense, we will focus on the software engineering aspect. That said, we still need to cover some linguistic basics, and I actually only want to introduce two concepts to you today.

The first concept is part of speech, and you probably all know it from school. Part of speech, or word class, simply denotes the syntactic function of a word in a given sentence. The thing you need to be careful about is that a word can have multiple word classes depending on its position in the sentence. Just think of the word "fly": it can be used as a noun or as a verb, and its position in the sentence determines the word class.

The other concept I want to introduce is the word stem. A word stem is an artificial construct; you can think of it as a base form of a word. Many word stems are valid English words themselves, but there are also word stems that are simply a reduced form of a word that you couldn't use in a sentence. Think of the words "negotiation" and "to negotiate": they both have the same word stem, "negotiat", but "negotiat", without the "e" at the end, is not a valid English word. They have the same word stem because they are highly semantically correlated.

Now let's talk about the technological aspects, starting with why to choose Ruby for NLP in the first place, and why I decided to use JRuby. Ruby, I think, is undervalued when it comes to natural language processing; many people turn to Python or to Java. But I think Ruby's capabilities regarding string processing are excellent for natural language processing. Especially the pre-processing that you always need to do can be done quite efficiently with Ruby. As for JRuby, the JVM simply is a high-performance platform
that allows me to use true multi-threading, and as many NLP tasks are quite compute-intensive, that comes in handy as well. Finally, probably the most important reason for me is that I can interface with existing Java libraries, and that is essentially what I did. One of the most prominent natural language processing libraries written in Java is OpenNLP, and it uses maximum entropy classification. As I said, I don't want to dive into the statistics or the machine learning aspects, so the only thing you need to know about this kind of machine learning algorithm is that it is supervised: it needs to be trained before you can actually apply it to your input data. Finding input data you can use for training is quite hard in the field of linguistics, as it takes a lot of work to annotate texts, for instance with part-of-speech tags, so it is usually quite complicated. But in the case of OpenNLP we are lucky again, because they provide pre-trained models, which you can simply download from sourceforge.net.

I made it even a little easier for you and packaged everything you need into three little gems. There's a wrapper around the original OpenNLP library, and all it does is basically some housekeeping: it provides automatic type conversion and unifies the interface to make working with OpenNLP a little easier. Then there are two separate gems that simply package the model data you can load into your classifiers, for the English and the German language.

If there's one thing you take home from this talk, it should be these three steps, because they are the essential building blocks you need when building an NLP application with OpenNLP. No matter which problem you try to solve, you usually start with loading the appropriate training data, or training a classifier yourself. Once you have this model, you can initialize the classifier that is used to solve a certain NLP problem, and
finally, you can actually perform the classification task. I will give you a couple of examples now.

Let's start with a basic NLP task you always need to do when working with textual data: the problem of segmentation. There are actually two instances of this problem, as we will see. Segmentation simply deals with splitting a text into a sequence of logical units. The first instance is sentence detection. Think of reading a large document into memory: now you want to split this document into individual sentences. You might think that this is quite an easy task, because you simply have to look for punctuation marks. While this would work for the first example, "Ruby's easy. Ruby's fun.", it wouldn't work for the second example, which contains direct speech and acronyms. So splitting a text into sentences is actually much harder than it initially looks. Luckily for us, OpenNLP provides the appropriate mechanisms to deal with this kind of problem, and as I said, there are essentially three steps. In the first line, we simply load the sentence detection model, in this case for the English language. Then we initialize the sentence detector that we will use to process our data. And in the third line,
we simply perform the actual processing: we provide the process method with a string, and as a result we get back a Ruby array where each item is a single sentence.

The second instance of this problem is tokenization, which deals with detecting word boundaries. You might think that this is even easier than sentence splitting, but I can assure you, even in English there are some less obvious cases. Think of contracted forms like "won't", "I'm", or "can't": these are actually two words each, but they would be tokenized into one word if you simply split whenever you find whitespace. Then there are languages out there that have multiple valid word boundaries, and there are languages that don't even have a visual representation of a word boundary. So again, using a machine learning algorithm might be a better choice, and once again OpenNLP makes this quite trivial. Just like with sentence detection, we load the tokenization model, again for the English language, we initialize a tokenizer, and we simply call the process method. In this case we get back an array, but the array won't contain entire sentences; it contains tokens instead.

Another important pre-processing task is part-of-speech tagging. I introduced the concept of a word class earlier; part-of-speech tagging is concerned with automatically detecting the correct word class of a given word. In many cases, the result is encoded in the so-called Penn Treebank tagset format. I'll give you a short example.
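Before the part-of-speech example, a quick plain-Ruby illustration of why the segmentation steps above benefit from statistical models. This is a toy sketch of the naive approaches described above, not how OpenNLP works internally:

```ruby
# Naive segmentation by punctuation and whitespace. Both methods are
# deliberately simplistic, to show where rule-free splitting breaks down.

def naive_sentences(text)
  # Split after sentence-final punctuation followed by whitespace.
  text.split(/(?<=[.!?])\s+/)
end

def naive_tokens(sentence)
  # Split on whitespace only.
  sentence.split(/\s+/)
end

# Works for simple prose:
naive_sentences("Ruby's easy. Ruby's fun.")
# => ["Ruby's easy.", "Ruby's fun."]

# Breaks on abbreviations: the period after "Mr." is not a sentence
# end, but the naive splitter treats it as one, yielding three pieces.
naive_sentences("Mr. Smith arrived. He left.")
# => ["Mr.", "Smith arrived.", "He left."]

# "won't" is arguably two words ("will" + "not"), but whitespace
# splitting yields a single token.
naive_tokens("I won't go")
# => ["I", "won't", "go"]
```

This is exactly the kind of failure the pre-trained OpenNLP models avoid.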
I'll give you a short example Again the usual three steps and what we get back this time is An array where each item corresponds to one token in our input data So in this case Ruby is tagged as NNP, which stands for proper noun The verb is is tagged as VBC which stands for a verb in present tense and the last word is simply tagged as an adjective We have to cover one last basic NLP task before we can move on to the more complex tasks and that is stemming I introduced the concept of a word stem to you stemming is the process of Automatically deriving a word stems from a given word and you could basically and the algorithms basically work by removing The ending of a word and in linguistics. We often refer to this at the morphological suffix so in negotiation the Ion at the end is basically the Morphological suffix that is added to the words stem to form an actual word and for the English language The most prominent algorithm is by far Porter stemmer It isn't part of open NLP But somebody wrote a little Ruby gem for that and it actually Simply acts like a core extension for the Ruby string class Providing it with one more additional method called stem and if you call it on a string It will give you back the word stem So let's move on to more advanced examples so far we have only done basic pre-processing task To extract actual semantic information and in this case named entities We have to go a bit further all the pre-processing has still to be done, but Now we can actually try and extract information That we can then further use for instance. 
You might be interested in parsing the dates and locations of events out of a given text, and that is what named entity recognition can do for you. With the help of OpenNLP, it is actually quite simple. The example on this slide is from when I first gave this talk at EuRuKo in Athens, and what we try to extract here is the location of EuRuKo. In order to do so, we load an appropriate model, in this case a location model, because we are interested in the location. We could also be interested in the date, or in both, but we would have to do individual runs for each type of named entity. Again, it is just the same three-step process, but this time the process method returns an array of ranges, and these ranges correspond to named entities in our input data. In this case, "Athens" is a named entity that consists of just one word, but of course there are references to names of persons or to locations that consist of multiple words, and this is why you actually need a range that denotes the beginning and the end of a named entity.

I now want to move on to the actual software engineering. So far we have only seen bits and pieces of how you can do certain natural language processing tasks, but now we want to bring it all together and build an NLP-enabled application. The thing about natural language processing tasks is that they can often be split into several steps which are then executed linearly. There's a concept in computer science called a processing pipeline: a processing pipeline is simply a set of software components connected in series, where the output of one component acts as the input for the next component. I wrote a little gem called composable_operations that is a very flexible implementation of this kind of data processing pipeline. To follow along with the examples I will show you in a minute, you have to know two things about the gem. It consists mainly of two classes, the first one being
Operation: you need to inherit from this class if you want to build a software component that is part of a processing pipeline. ComposedOperation is then used to assemble an entire processing pipeline.

So let's look at a typical NLP pipeline. Let's say you have been running a scraper fetching data from the internet. One of the first tasks you'll probably do is clean-up. Once you are done removing all the unnecessary content, you will usually start with sentence detection, followed by tokenization, then POS tagging, and finally stemming and lemmatization. Once you are at this point, you can perform more advanced tasks, but in general these are the basic pre-processing steps everybody has to do when dealing with textual data.

So how could such a pipeline look in Ruby? Well, composable_operations makes it very easy; this is actually working code. All we need to do is build a class that inherits from ComposedOperation, and then we use the macro method use to tell the pipeline which components it consists of. As you can probably guess, the order does matter.
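The series-connection idea behind such a pipeline can be sketched in a few lines of plain Ruby. This toy is not the composable_operations API (which adds properties, the processes and use macros, and error handling); it only shows the core idea of reducing an input over a list of steps:

```ruby
# Minimal processing pipeline: components connected in series, where
# the output of one step becomes the input of the next.
class ToyPipeline
  def initialize(*steps)
    @steps = steps
  end

  def process(input)
    @steps.reduce(input) { |data, step| step.call(data) }
  end
end

# Stand-ins for the real OpenNLP-backed components:
sentence_detection = ->(text) { text.split(/(?<=[.!?])\s+/) }
tokenization       = ->(sentences) { sentences.map { |s| s.split(/\s+/) } }

pipeline = ToyPipeline.new(sentence_detection, tokenization)
result = pipeline.process("Ruby is easy. Ruby is fun.")
# => [["Ruby", "is", "easy."], ["Ruby", "is", "fun."]]
```

Swapping the order of the two steps would obviously break this, which is the point being made about use declarations above.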
So in this case, we first perform sentence detection, then tokenization, and then POS tagging.

The individual components are slightly more complicated, but I will walk you through the first one. As I said, an individual component inside a processing pipeline inherits from Operation. The call to processes :text — you can think of it as an alias for attr_accessor, because it doesn't do much more than generate a pair of methods. Every operation can also have certain properties; in this case, we can tell the sentence detection operation whether it is detecting German or English sentences. The algorithmic core of the operation goes into the execute method, and it's truly the only thing you actually need to implement. So when the processing pipeline invokes our operation, it will simply forward the text into it; we then construct our sentence detector and finally use it to split the text into an array of sentences.

That said, our next operation is now invoked not with a single string but with an array of strings, so we just have to make sure that our operation is robust enough to handle both cases. Here we always assume that the input data is an array, and if it isn't, we convert it into one. Then, for every element in the input array,
we simply tokenize the string. After that, we can move on to the POS tagging. Again, we iterate over the sentences, and for each sentence we determine the individual tokens and then return a pair where the first element is the actual word and the second element is the POS tag. As this is slightly more complicated, I thought it would be a good idea to provide you with the actual output of the entire pipeline. Providing it with the two sentences "Ruby's easy. Ruby's fun.", you will get back an array of arrays of arrays: the innermost arrays are just the pairs of words and tags, the next-outermost arrays each correspond to one sentence in your text, and the outermost array is just the collection of sentences.

Let's move on to the last example I've got and talk about keyword extraction. The first thing you need to know about keyword extraction is that the algorithm I want to show you today is TextRank, which is highly influenced by Google's original PageRank algorithm. What PageRank does is basically construct a graph, connecting related pages with links; in this graph you can then calculate a rank and determine how important a certain page is in the World Wide Web. This idea is what powers the TextRank algorithm, but instead of pages we look at individual words or word phrases.

There is one more concept from linguistics we need to talk about first, and that is the concept of a co-occurrence. It is actually quite a simple thing. In the example "Ruby is amazing", we say that "Ruby" and "amazing" co-occur in our sentence with a word window size of one, which means that between the two words "Ruby" and "amazing" there is simply one word. What you consider a co-occurrence depends highly on the language you are processing, and of course on your own definition: you can choose the word window arbitrarily large, but usually you choose it between two and five. Once
you have determined all these co-occurrences, you build a so-called co-occurrence graph. In this graph, every word that appears in your text is a node, and words that co-occur in the text are linked by an edge. Once you have your co-occurrence graph, you can run the actual algorithm.

What we did before building the co-occurrence graph, of course, was the usual pre-processing. As the co-occurrence graph is constructed on tokens, we first need to do sentence detection, tokenization, and POS tagging. The POS tagging is rather important in this example, as you are usually interested in nouns and adjectives when you extract keywords, because they actually carry the meaning. Once you've done the pre-processing, you move on to calculating the co-occurrences, as I just explained, and then construct the graph. Finally, you run a slightly modified form of TextRank, and once the ranking is done, you simply extract the nodes with the highest rank. Let's say you want to extract the ten most important key phrases; then you would simply extract the ten highest-ranking nodes from the graph.

Well, let's see how this looks in Ruby. I have a simple keyword ranking operation, and there's one nice thing about composable_operations I want to mention here: an operation pipeline is allowed to be composed of other operation pipelines, so you can build arbitrarily complex data processing pipelines. In this case, we simply reuse the pre-processing pipeline I showed earlier and then move on to the four steps I just mentioned: calculating the co-occurrences, constructing the graph, ranking, and finally extracting the keywords. While the internals of this code would be too complicated to show on slides, I encourage you to check out the repository.
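The co-occurrence graph construction described above can be sketched in plain Ruby. This is a toy, not the repository's implementation, and the window here counts the maximum token distance, which is only one of the possible definitions mentioned earlier; the degree-based ranking at the end is a crude stand-in for the iterative PageRank-style computation that real TextRank performs:

```ruby
require "set"

# Build a co-occurrence graph: every token is a node, and two tokens
# are connected by an (undirected) edge when they appear within
# `window` positions of each other.
def cooccurrence_graph(tokens, window: 2)
  graph = Hash.new { |h, k| h[k] = Set.new }
  tokens.each_with_index do |word, i|
    tokens[(i + 1)..(i + window)].to_a.each do |neighbor|
      next if neighbor == word
      graph[word] << neighbor
      graph[neighbor] << word
    end
  end
  graph
end

tokens = %w[ruby makes natural language processing fun language processing rocks]
graph  = cooccurrence_graph(tokens)

# Crude stand-in for TextRank: order words by how many distinct
# co-occurring neighbours they have.
top = graph.sort_by { |_, neighbors| -neighbors.size }.map(&:first)
# top.first => "language"
```

In a real implementation you would first filter the tokens down to nouns and adjectives using the POS tags, exactly as described above.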
I uploaded it to GitHub under the name keyword_extractor, and there's actually a working implementation there. But just so you know, it is a prototypical implementation, so it's only meant to be played with, not used in production.

That essentially sums it up. So thank you for listening to my talk about natural language processing. You can find me on GitHub and Twitter, and this is the list of gems I was showing today. I'm happy to take questions now. Thank you very much.

[Answering an audience question:] The main restriction is, of course, that you usually try to avoid implementing your own NLP libraries, as they are inherently complex. What OpenNLP does is hide all the complex machine learning operations away from you and provide you with a neat interface. You could, of course, do natural language processing entirely in Ruby and use standard Ruby, but then you would have to come up with your own implementations. The other restriction I was talking about is that most of these algorithms are rather compute-intensive, and the JVM manages to truly use the full capacity of your underlying hardware. Does that answer your question?