code in Python myself. I've done programming in other languages, but not in Python. I do have a lot of experience with NLP, though. This is going to be very introductory, so I hope you'll learn something and see whether you get excited about this field.

Just a couple of words about myself. I've done both a master's and a PhD in this area, but I got into the whole field through linguistics. Very early on I decided that languages were what interested me, so I decided to become a translator and enrolled at a university in Ukraine, which is where I'm from. I looked around the room, saw hundreds of mostly women wanting to become translators, and thought: there's no way we need that many. Then, a year later, I was in a course called Mathematics for Applied Linguists, and the professor said, you know, there are people who create translation tools, computer programs that translate text automatically. And I thought to myself: that means I'm trying to become a professional in a field that's going to be done by computers. I'd better join those other people. So I did, and I haven't regretted it since; it has been a wonderful journey. I'm the author of an open source algorithm called Maui, which I started working on when I came to New Zealand about 10 years ago, and working with open source has really made my career. I recommend everybody give it a try; it's not as scary as it looks. After graduating I worked for a company called Pingar here in Auckland, and two years ago I started my own company doing consulting and software development in this area.

So this is what we're going to be talking about. First, a quick recap of what we know about NLP. You said you don't know anything, but I bet you do if you just think about all the science fiction books and movies you've seen. We'll discuss how advanced we actually are now and how important this is today. Then we'll talk about the difficulties of working with language, work through different Python libraries and how to do different NLP tasks with them, and I'll give you some hints about what to do next.

So, fiction versus reality. Many of you have watched Knight Rider, with the car called KITT. He's very humorous, he has a dry sense of humor, he can act offended; he has a personality. So how far are we, actually, from a talking vehicle? Here is a picture from Google, something I saw at the Google I/O conference, not this year but last year. Very soon pretty much every car will have this technology inside it, and you'll be able to talk to your car and ask "When is the Auckland Museum open?" and then say "Take me there", and it will know from the context that you meant the Auckland Museum and give you directions. I already use this: when I'm driving and I search for an address, I use Google voice search to help me navigate there. And it doesn't always understand my accent, but we're going to try something for sure.

Translation. Those of you who have read The Hitchhiker's Guide to the Galaxy may remember the Babel fish, a very odd creature that you put into your ear, and after that you can understand everything people around the universe are saying, in all their languages. Pretty cool, though I imagine uncomfortable. The closest thing in the real world is an application called Word Lens.
And I think this is now part of Google Translate: you just hover your phone over a headline and it automatically translates it into a different language. It even matches the font used, which is what makes it feel a bit futuristic. They call it augmented reality translation, and it does work really well; it's quite a party trick. So, not as far along as the Babel fish. But then there's this video, and I'll let you Google it and watch it, or maybe there's time and we'll watch it at the end. Basically, two ladies use Google Translate to order food from an Indian restaurant in California, in the restaurant's own language. They prepare all the things they want to say, and the food arrives. They order it extra, extra spicy, and then they check whether it's actually really spicy.

When we talk about search, there's the library computer in Star Trek, which uses sophisticated AI to answer people's commands. Obviously the real-world answer to this would be Google, and I just want to do a small test. "What is the weather today in Auckland?" Today's forecast for Auckland is 14 degrees with rain. "What about Wellington?" Today's forecast for Wellington is 14 degrees with rain. "Are there any good restaurants around here?" I did this from home last night, just to double-check. I live on the Shore, and there's a McDonald's close to my house, but it still suggested I should go to Quay Street. I thought that was suspicious, so then I tried: "Closest McDonald's." Here are the listings for the closest McDonald's. I think it now knows that I'm at home, so it suggests the one closest to home, because it knows where home is. [Audience: It assumes you're supposed to be at home.] Yeah, exactly; on Google Maps that's set as home, and it's pretty close. So it handled that really well.

Then there's the movie Her, where this guy falls in love with an operating system, and the operating system is called Samantha. She speaks in Scarlett Johansson's voice, to start with, and she's also super available and always interested. In reality, a guy called Joshua tried to do the same thing with Siri, and she said: my end user licensing agreement does not cover marriage. Doesn't that remind you of KITT, with the dry sense of humor the Apple programmers have put into Siri? So, no luck in love so far.

So let's talk about our language and why it is complex to build such systems. Why have we not yet achieved the perfect science fiction? Well, language consists of many different layers, and when you talk you don't realize how complex it actually is. It starts with phonetics and speech sounds: any system designed to understand human speech first needs to recognize the speech sounds and differentiate them from other sounds. There is no meaning at this point, there are just sounds. The meaning comes when you look at phonemes, which is the area of phonology: units of sound that distinguish one meaning from another, and together they build words. Morphology is the study of how words are created; in English you have endings like "-ing", so if you add "-ing" to "run" you get "running" and the meaning changes. Syntax is how you build sentences and phrases, and semantics is all about meaning: the meaning of words, the meaning of sentences. And pragmatics is a very cool one. Does anybody know what pragmatics is? Imagine you're out, and since most of you here are guys, I'll give an example.
You're out with your girlfriend and she says, "Oh, it's cold." What does she actually mean? She wants your jacket. Or, if you're inside and it's drafty, she wants you to close the door, but what she says is "it's cold". So what she says is not literally what she means. Now imagine a computer trying to figure out something that complex. All of this needs to be encoded, and it's super difficult.

Some people here said they're pretty good with Python, so let's try something, and please, those of you who already know the answer, don't answer. Let's look at a sentence and a seemingly simple task: finding the words in a sentence. Any suggestions how you would do it in Python? What command would you use? [Audience: Split.] Right, you split at a space. But what if your sentence is in Chinese? Are there any Chinese speakers? Would it work? It's hard; it would require NLP, because there are no spaces. And there are many languages like this, where even something as simple as figuring out where the words are is not straightforward. You basically need a segmentation algorithm to identify the word boundaries; you can use a dictionary, but it's not going to be accurate, so you need a more complex algorithm. And even in English, if we look at a sentence like this, is it really a good idea to split at white spaces? Can anybody tell me why it's not a good idea? [Audience: "Hot dog" gets split at the space.] Exactly. "Hot" and "dog" separately mean something completely different, and if you split them you just lose the meaning. The same thing with "Charles Feldman" and "Coney Island": these are named entities that you ideally want to keep together as one unit. So even in English there are complexities.

Now that we have the words, assuming we know the words, let's think about how a computer looks at words and how it can figure out their meaning. If we look at something like C-A-N, can anybody name the different meanings? [Audience: The tin can, and the verb.] So how would the computer figure it out? We'll find out in a moment; for now we'll keep it a secret. "Plane" sounds like a simple, pretty obvious word, but it could mean the airplane, the geometric plane, or the plane, the tool. In order to find out which one is meant, the computer needs to look at the context. "Flying", for example, is an indicator that the plane is the airplane. [Audience: Well, if you throw the tool plane around, they're quite dangerous too.] Yeah, good point. And the whole sentence, "Flying planes can be dangerous", could mean what you just said, or it could mean two different things: the activity of flying planes is dangerous for the person flying them, or the objects themselves are dangerous.

Hello, you're just in time for the good part. So let's get into NLP with Python. First of all, what kind of things can we do with text? Here are some examples. We can extract keywords from text, which is what you mentioned you do in your work. We can categorize text in a more generic way, where the categories are predefined and we decide which ones a text belongs to. We can extract entities: in that earlier example, Coney Island is a location, and in other domains you can have biochemical entities such as protein names. We can then redact them, producing text with black spots: if the text is sensitive, you can automatically redact the names of people who should not be disclosed. We can identify sentiment, which is the mood the author expresses in the text, and the genre as well: it could be a romance novel, a news article, or a legal document. You can do all of these things automatically with Python.
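Before getting into the libraries, here is a minimal sketch of the word-splitting problem discussed above. The example sentences are illustrative, and the mention of the jieba library for Chinese segmentation is an assumption added here, not something from the talk:

```python
# Whitespace splitting is one line of Python...
sentence = "He ate a hot dog on Coney Island"
print(sentence.split())
# ['He', 'ate', 'a', 'hot', 'dog', 'on', 'Coney', 'Island']
# "hot dog" and "Coney Island" are torn apart, losing their meaning.

# ...and it does nothing useful for languages written without spaces:
chinese = "我喜欢自然语言处理"   # "I like natural language processing"
print(chinese.split())
# ['我喜欢自然语言处理'] -- one unbroken string, so a proper segmentation
# algorithm (e.g. the third-party jieba library) is needed instead.
```

This is exactly the gap that a proper tokenizer, like the one in NLTK below, is meant to fill.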
So, NLTK, as you mentioned: it's a toolkit called the Natural Language Toolkit, and I highly recommend it to anybody who hasn't done any NLP, because it's really simple to follow. There's a book, and there's also an online tutorial. It comes with the tool you can see in this screenshot, basically a downloader that lets you fetch resources that are important for working on NLP. These resources include corpora (corpora is the plural of corpus, and a corpus is basically a dataset): when you work with text you need some data, and NLTK already ships with some. Some of it is annotated, which means humans have looked at it and marked up, say, where each sentence starts and ends, or the part-of-speech tags, or the entities, so that you can test how accurate an algorithm actually is. Then you need models, which are basically tools created from annotated datasets to help you solve a task. It may sound complicated, but if you just try using it, you will understand.

A very common task when you work with language is to identify the core words, or keywords. You start by looking at words that can be removed, and these are called stop words. Stop words are words that don't change the meaning of a sentence if you swap them around. Take "even the acting in Transcendence is solid, with the dreamy Depp turning in a typically strong performance". The same thing can be expressed using the same core words but very different stop words: "I think that Transcendence has pretty solid acting, while the dreamy Depp turns in a strong performance, as he usually does." So all the blue words on the slide can be ignored. How do you do this in Python? In just a couple of lines. In a Python shell, you import the stop words, which is a word list that comes with NLTK, so you don't need to think about which words are stop words. You initialize the stop words for English (it works for other languages too), represent your sentence as a list of words, and then print all of the words that are not in the stop word list; it prints just the four core words. Pretty simple.

And here's our "can" example. The way for a computer to differentiate "can" the verb from the tin can is to use parts of speech. A part of speech is the category of a word: "flying" is an adjective, "planes" is a noun, "can" is a verb, "be" is a verb, "dangerous" is an adjective. (Okay, I said "flying" is an adjective, but it's really a special kind of cross between a verb and an adjective.) Here is how you figure it out with NLTK: you import nltk, tokenize your sentence into words using a command called word_tokenize, and then tag it. Each word is printed out along with a part-of-speech tag, and for "can" the tag is MD, which stands for modal verb, a special kind of verb like "should", "would", "can". That basically says there is no way this can be the tin can. The way the algorithm behind it figures this out is by looking at the context: if "can" precedes the verb "be", it is more likely to be a modal verb, because "can be" is a very common combination in English. Again, very simple. I love Python for working with language; it's just ideal. It would be great for homework exercises.
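A small, runnable version of the two NLTK steps just described might look like the following sketch. The download calls and example sentences are mine, and the exact tags can vary between NLTK versions:

```python
import nltk
from nltk.corpus import stopwords

# One-off downloads the first time you run this:
# nltk.download('stopwords'); nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

stop = set(stopwords.words('english'))
review = ("even the acting in transcendence is solid with the dreamy depp "
          "turning in a typically strong performance")
print([w for w in review.split() if w not in stop])
# -> the core words, with the stop words filtered out

# Part-of-speech tagging disambiguates "can":
tokens = nltk.word_tokenize("Flying planes can be dangerous")
print(nltk.pos_tag(tokens))
# e.g. [('Flying', 'VBG'), ('planes', 'NNS'), ('can', 'MD'),
#       ('be', 'VB'), ('dangerous', 'JJ')]
# MD = modal verb, so this "can" cannot be the tin can.
```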
[Audience: So there's a statistical model behind that as well?] Yes, mostly statistics, plus grammar rules; there are different approaches. And they're all somewhere in the 90-something percent accuracy range, 95, 98, all competing to reach 100. It's actually not that complicated to write one. [Audience: So you don't get a confidence score?] I think, because NLTK is open source, if you dig deeper you can get the algorithm to print out a score. I've done that before with other tasks where I needed the score in order to modify the code, but it doesn't do it by default; it just returns the tags for you to use.

TF-IDF is a statistical technique; statistics is used a lot in natural language processing, and this one is used a lot in information retrieval as well, so in search engines. It helps you figure out which words are particularly important in a document, because a stop word list cannot capture all of the words that are unimportant: it depends on your dataset. For example, if you're looking at news articles, the words that can be ignored will be very different from the ones you'd ignore in a medical dataset. TF-IDF captures that by combining word frequencies in this formula, which multiplies the term frequency, how frequent a word is in the current document, by the inverse document frequency, which discounts words that appear in many documents across the collection. You don't need to be able to implement this yourself, because this is how simple it is to do in Python. Here I used a different library called Gensim, an excellent library for statistical NLP-related tasks, but I did use NLTK to get some movie reviews, because I just wanted a dataset with multiple documents. So we have a variable texts, into which we put all the words of each movie review, as a list of lists of words, and we use it to initialize a Gensim object called Dictionary. A couple of commands later we have the TF-IDF model. This is an example of the kind of model you can download using NLTK for other tasks, but you can also create one yourself, from a dataset like the movie reviews or any other dataset. It doesn't even have to be annotated in this case, because it's just statistics over words. Now that we have the model, we can ask: what IDF score, inverse document frequency, do these words have? You can see that words like "film" and "movie", because they're used a lot by people writing movie reviews, have a very low score, and something like "Jolene" has a very high score. So if you have a movie review with a lot of mentions of "film" and a lot of mentions of "Jolene", you can use this value to offset the term frequency and bring "Jolene" to the top.
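A rough reconstruction of those Gensim steps is sketched below; the slide's exact code is not reproduced, and the words looked up at the end are just illustrative. The underlying idea is tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) is based on how few documents in the collection contain the word t:

```python
import nltk
from nltk.corpus import movie_reviews
from gensim import corpora, models

# nltk.download('movie_reviews')  # one-off download

# Each review becomes a list of words -> a list of lists of words.
texts = [list(movie_reviews.words(fid)) for fid in movie_reviews.fileids()]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts
tfidf = models.TfidfModel(bow_corpus)                 # the TF-IDF model

# Compare the IDF of ubiquitous words with that of a rarer one.
for word in ('film', 'movie', 'jolie'):
    if word in dictionary.token2id:
        print(word, tfidf.idfs[dictionary.token2id[word]])
```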
The next task is categorization, and again this is where statistics is used. TF-IDF is useful here too, to weight each word that appears in a document. Once you have these weights for all the words, you can create a multi-dimensional space where each dimension is a word and each document is a vector in that space. For simplicity, only two dimensions are shown here: two words, two documents, and a third document, which could be a search query, for example. A text categorization algorithm (in NLTK you can use the ones they provide, such as the Naive Bayes classifier) basically figures out, given a new document, which category it most likely belongs to. So if you have news articles, and you know beforehand that these are all the articles in the entertainment category and these are in the politics category, and then a new article comes in about Obama and another leader trading insults in interviews, a few lines of code could help you figure out that it sits somewhere in between entertainment and politics, because it's similar to both of them.

How can this be used in practice for a very useful application? Has anybody heard of Fifty Shades of Grey, or read it? This is actually a chart from The Economist, and it shows that only about 13% of the pages talk about sex, the most interesting part, right? So there was a researcher who decided: I actually want to find those pages automatically. She used, I think, NLTK, and automatically categorized all the pages with an accuracy of 93%. And somebody else posted on Twitter: this is the only way to read this crap.

Sentiment analysis. Let's say we have movies and people write movie reviews. They give them different numbers of stars, and this can translate into sentiment: a low number of stars means the person did not like the movie, and the review is going to contain lots of really negative comments; if it's 10 stars, they're going to be all excited about it. For sentiment I used a different Python library called TextBlob. TextBlob is a wrapper around NLTK and other libraries that just makes it simpler to do NLP tasks. Again, it's only about three lines to find the sentiment of a sentence like "I love this library". You get a polarity value and a subjectivity value: polarity is the actual positive/negative score, and it ranges from minus one to plus one, while subjectivity measures whether the person is expressing their own opinion; in this case it is quite subjective. So I went to IMDb, took reviews of this movie, and put them into text files. [Audience: How would this deal with something like sarcasm? How would a line like that come out in the sentiment result?] The approach TextBlob uses is dictionary based, so it does not consider sarcasm at all: it just says "love" is a positive word and "library" is a neutral word, and then it averages them. It's pretty simple; there are more complex algorithms. You know those "that's what she said" jokes? Toni loves them. There is a PhD researcher who actually wrote a paper about a computer algorithm that can automatically attach that punchline to a sentence, as a form of sarcasm or humor. For any interesting thing you can do with language, there's going to be a PhD researcher working on it. But what amazes me is that even TextBlob, a very simple approach that I did not train on anything, worked out of the box so accurately: the scores it returned came out in the same order, from most negative to most positive, as the star ratings, and it knew nothing about how positive each movie review was. It's surprisingly accurate.
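The TextBlob call really is about three lines; here is a minimal sketch with made-up example sentences (the exact numbers depend on TextBlob's built-in lexicon):

```python
from textblob import TextBlob

print(TextBlob("I love this library").sentiment)
# e.g. Sentiment(polarity=0.5, subjectivity=0.6)  -- positive, fairly subjective

print(TextBlob("This is the worst film I have ever seen").sentiment.polarity)
# a clearly negative value, down towards -1
```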
Now I want to talk about more complex applications, where you take these tasks and build them into something more specific that solves a particular problem. Let's say you are, what are these people called, a movie producer, and every day you get stacks and stacks of movie scripts, and you need to decide very quickly: will people love this movie or not? You just don't have time to read them all. Something you could do is have your interns scan and OCR them, and then extract keywords from each script to figure out whether it matches your interests. I've done a similar thing with movie reviews, and you can download the tutorial that works through the steps, which involve getting rid of stop words, using TF-IDF, and a few more things. Basically, if you take the movie Four Rooms, these are the keywords you could extract with Python, and it's not much code at all; it's pretty straightforward.

So, a quiz for you: what makes a successful movie? One of these columns was extracted from negative movie reviews and the other from positive movie reviews. So if you have a script and it matches the keywords in this column, or in this column, that kind of tells you how people are going to take it. Is this one positive or negative? [Audience: Positive.] Okay, let's vote: who thinks it's positive? And who thinks it's negative? The majority wins. Who doesn't like Batman? I think people love Batman, but it always depends on the interpretation; there are so many versions, right? People probably keep saying "this is nothing like Batman". So here we had a problem, sorting through movie scripts, and a solution for how you can do it using Python.

Another important task. This is somebody else's project: this guy says that he just loves craft beer, but it's so hard to decide. You come to a shop, you see all these bottles, and you end up choosing based on the label design. Wouldn't it be nice if, instead, you had a menu where each of these beers was described in terms of its characteristics? It turns out you can get a lot of reviews of craft beers; even the most obscure beers have hundreds of reviews on a site called something like beerreview.com. So he took all these reviews and downloaded them. It's all text, people writing things like "poured a hazy dark golden beer with a huge yellowish head; the aroma has lots of fruits, lots of spices and caramel". That's just one person's comment; he really wants an aggregate of what most people think the beer is like, to decide whether or not to try it. So what he did was remove the stop words and then count the term frequencies (I'm not sure he even used TF-IDF), and these were the frequent words from the reviews of one particular beer. But things repeat: you have "coffee" and "espresso", which are kind of the same, and you don't want both in the same summary, because people don't have time to read that much. Or "dark", "black" and "brown" are the same thing, "cocoa" and "chocolate". You really want to make it more succinct. So he used Gensim, the other library I mentioned. What Gensim can do is group words into what are called topics, and each topic is basically a dimension that can be extracted from the text using pure statistics. As a result, he ended up with the name of the beer and its four most important topics. He argued in his talk that it didn't work 100%, but in many cases each of these four roughly corresponded to the ingredients of a beer: you make it out of hops, malt, yeast and water. So, another cool application.
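The talk doesn't name the exact Gensim model the beer project used, but LDA is the usual choice for this kind of topic extraction, so the following is a hedged sketch with toy stand-in reviews rather than a reproduction of his code:

```python
from gensim import corpora, models

# Toy stand-ins for the downloaded beer reviews (the real project
# aggregated hundreds of reviews per beer).
reviews = [
    "hazy dark golden pour huge yellowish head fruits spices caramel".split(),
    "dark black brown roasted coffee espresso cocoa chocolate bitter".split(),
    "citrus hops pine resin bitter grapefruit floral aroma".split(),
    "sweet malt caramel toffee bready yeast banana clove".split(),
]

dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(r) for r in reviews]

# Ask for four topics, echoing the "four most important topics per beer" idea.
lda = models.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
# On a real corpus, related words like coffee/espresso/cocoa tend to
# land in the same topic, which is how the summary gets more succinct.
```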
I promised a story about neuro-linguistic programming. I was fortunate to go to a conference in California where many really cool people were invited, including Andrew Ng, the machine learning and NLP guy from Google who now works for Baidu, and one of the people I follow and look up to, and the creator of the Russian search engine Yandex. I thought, what a great audience. Something you could do at that conference was run your own session, so I decided to run a session on NLP, to get all of these people in one room and talk with them, and I called my talk "NLP: are we there yet?", question mark. So I'm in the room by myself, wondering whether anyone will show up. The first person who comes in is a guy who runs a music festival; everyone was introduced at the beginning of the conference, so I knew he's not a computer scientist, not a linguist, nothing related to this. And I said, so you know this is about natural language processing? And he goes: really, not neuro-linguistic programming? And I thought, what have I done? So I said, well, you know, this is also very interesting, but he said, no, no, there was another session he was torn between; I'll see you later. So he left, and the next person comes in and asks, is this natural language processing? I said yes. Oh, thank God. And then people started pouring in, including Andrew Ng and all the others, and they were all really relieved that it was natural language processing we were talking about. This was, I think, four years ago. And the question was: is this technology actually viable enough, even though it's so futuristic and even though there are so many complexities in language, to solve real-world applications? The conclusion was: yes, it is, particularly when you use a lot of statistical, data-driven techniques. This is what Google does: they have so much data that they can create perfect translation tools, perfect search, or nearly perfect. And something they mentioned was deep learning. Four years ago at that conference, deep learning was already the thing to do in NLP, and only about two years ago did it start showing up in the media as a new trend. The good news is you can do it in Python: there's a library called Theano, which lets you do deep learning.

So these are the libraries I recommend trying out: NLTK for really basic stuff, Gensim for statistical operations on large document collections, TextBlob for sentiment and other things, and scikit-learn, which is a very good machine learning library. And now I remember that the Fifty Shades of Grey project was actually using scikit-learn; it's basically like NLTK, but for machine learning. In just a couple of lines you can do complex stuff, and you don't need to know exactly how machine learning works. If you want to know more, follow me on Twitter. You can check out the video from last year, and there's also the tutorial link I mentioned before, where you can go through the code and try things out. Any questions?

[Audience question about detecting trolls.] That's kind of like spam detection: you would build a classifier to decide, troll or not troll. The key with classifiers is deciding what features the algorithm should base its decision on. With spam or not spam, it's often about which words are used, and then you can include other things in your feature set. I don't know what trolls normally look like, but I think they create those fake accounts with zero followers, so that could be another signal. So some of the information could come from what they're actually saying, and other information could come from other, non-textual signals.

[Audience: What do universities use to process essays and make sure they're not copied?] Plagiarism detection, yes. Is it a natural language tool they use? I think they look at similarities in how phrases are structured, because if you're smart, you don't copy the exact sentences; you mix them up a little, maybe change the stop words. So you remove the stop words and then compute the similarity, basically, between two documents. It's similar to duplicate detection, and similar to search, in fact: it's like searching for something, except your search query is not a couple of words but a whole document.
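One common recipe for that document-similarity idea (my own sketch, using the scikit-learn library mentioned above, not necessarily what commercial plagiarism checkers do) is to turn both documents into TF-IDF vectors and compare them with cosine similarity, the same vector-space view used earlier for categorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

original = "The acting in this film is solid and the plot moves quickly."
suspect = "In this film the acting is solid, and the plot moves very quickly."

# Build TF-IDF vectors for both documents, dropping English stop words.
vectors = TfidfVectorizer(stop_words='english').fit_transform([original, suspect])
score = cosine_similarity(vectors[0], vectors[1])[0][0]
print(score)  # values close to 1.0 suggest near-duplicate documents
```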
[Audience: When you're running over large datasets, can these Python libraries handle running over several computers, or is it a single process? How do they handle large datasets?] I think Gensim specifically is designed to scale well over multiple cores. NLTK is really more of a toy library; for example, that movie review dataset has only 2,000 texts, and they're quite short.