Good morning everyone. Am I audible at the back, and not too loud, not too low? Cool. Okay, so my name is Ashutosh Trivedi, and today's talk is about solving logical puzzles with natural language processing. But before that I want to share a story. Back in 2013, when I was a grad student at IIIT Bangalore, I didn't know about PyCon at all, and I actually didn't know about open source at all. A few of my friends said there's a conference called PyCon, and I said, why is there a conference for a language? Is it really required? When I came here and saw how a community drives the language for the betterment of everyone, I realised this was something I wanted to be part of. I got involved with a few open source projects, and it was a great learning experience. So I just want to give a round of applause to all those developers out there, and to an awesome conference like this.

Before I start the talk, I want to give a brief introduction about what this talk is about and what it is not about. Natural language processing is a very research-heavy topic, and there is new research coming out every day. All the things I'll be talking about were in research papers either just one year back or this year, so things are pretty fresh, people are still experimenting with them, and very little of it has gone into production yet. There is an overlap between machine learning, deep learning, and natural language processing, which is where the current research sits, so I won't go into mathematical details. The talk should motivate you to start working on this, start exploring these things, and do some cool stuff around them.

Before I start, let me get an idea of the crowd. How many people are actually working with natural language processing? Machine learning? Deep learning? Okay, cool. So I'll be going through the basics; if you have any problem, we'll take questions at the end, and if you want to do the math, I'll be outside — we'll sit and do the math.

Okay, so I'll start with what natural language processing actually is and how it got started. Language was always there with us, and as soon as we got computers, we wanted to understand language programmatically. Why? It's a very obvious answer: a computer needs an interface, and what better interface than our own language? So moving forward, the question became: how can we understand language programmatically, so that we can do automation and have computers do stuff around natural language? These are the things people were doing, and which are there right now.

So what can we do with natural language? Since our grammar has rules, we can actually follow those rules. For example, we can compute part-of-speech tags for each word. I'm taking English as the base language and this whole talk will be focused on that; there will be different rules and different sets of things for other languages, like German or Sanskrit, which are heavily grammar-oriented. So, we know our parts of speech.
We also know our vocabulary. Our vocabulary is growing every day, but we can still store all of it. N-grams are another thing we can apply to any language: an n-gram can be a bigram, a trigram, or longer. It doesn't treat a word as a single unit; with character bigrams, for example, every word gets broken into two-character pieces across the sentence. Word shape is things like capitalization, no caps, all caps. And noun detection — this is a complicated problem, but we can still detect nouns. These are the things we were doing to figure out how we can do something with natural language: given a sentence, can you process that sentence, and how can you process it so that you get some information out of it?

And if I'm talking about information, what computers can really do is crunch numbers. All the unstructured data out there is text, image, video; since this talk is focused on text, the question is what you can do with text and how you can convert it to numbers so that you can process it.

So let's take a sentence and go word by word. What was state of the art until about two years ago, what researchers were actually doing, was this: going word by word, you have the current word, the previous word, the next word, the current word's n-grams, the previous word's n-grams, the POS tags, the surrounding POS tags, the word shape, the surrounding word shapes, and anything else around it. There are lots of things you can compute, and you can convert everything into numbers. By numbers I mean indexes: the current word is an index into your vocabulary, similarly the next word, and even a POS tag can be given an index. All of this generates some information, and that information is used to understand the sentence. So if a sentence has ten words, for each word we calculate this much information, create a feature vector out of it, and do something with it. (There is a small sketch of this kind of feature extraction just after this part.)

So say there is a sentence and we have calculated all these features, and they are numbers. What can we do with that? Let's go forward with a machine learning model. Say I just want to know the sentiment of a sentence. For that I need to tag a few sentences beforehand, so that the model can learn. For sentence one we calculate this vector; if I have n features it will be an n-dimensional vector, and that n-dimensional vector has a tag. We have some weights associated with each feature, so every feature has some importance in identifying the tag. As the model gets exposed to more and more training data, these weights get adjusted, and these weights are actually the weights of a hyperplane separating the two classes. With each training example that hyperplane adjusts and tries to fit, so that the negative examples end up on one side of the hyperplane and the positive ones on the other side. Once you finish training your model and you get a new sentence, you again calculate the features, plug in those weights, and your hyperplane tells you whether it falls into the positive category or the negative category. That's a very simplistic way of explaining a machine learning algorithm: how it works, and how it works on unstructured data.
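To make the feature idea concrete, here is a minimal sketch of the kind of per-word features described above, using NLTK's tokenizer and POS tagger. The exact feature set, function names, and example sentence are illustrative assumptions, not the speaker's code; the resulting dictionaries could then be indexed (for example with scikit-learn's DictVectorizer) and fed to a linear classifier, which is the hyperplane picture just described.

```python
# Illustrative per-word feature extraction (current word, neighbours, POS tag,
# word shape). May need nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') the first time.
import nltk

def word_shape(word):
    # crude "word shape": all caps, capitalized, lowercase, or other
    if word.isupper():
        return "ALLCAPS"
    if word[:1].isupper():
        return "Capitalized"
    if word.islower():
        return "lowercase"
    return "other"

def token_features(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)              # part-of-speech tags
    features = []
    for i, (word, tag) in enumerate(tagged):
        features.append({
            "word": word.lower(),
            "pos": tag,
            "shape": word_shape(word),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
            "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        })
    return features

print(token_features("PyCon talks are really nice"))
```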
So what are we actually doing? Through natural language processing and our set of rules, with lots of processing, we calculate features from the data, and through machine learning we just optimize — we learn that hyperplane. If these two fields combine, you can understand your data better. But there is a problem: is it scalable? It's not scalable at all. There will be new languages, our vocabularies will keep growing, and we are talking about AI here; machine learning is one of the fields leading AI right now, it's one school of thought. So what's wrong with this approach? The wrong thing is that we are doing lots of processing, and those processes are actually dumb processes. If you're calculating part-of-speech tags for a sentence, it's a process, a standard program that runs and that will fail on anything non-standard; similarly for other features. Suppose I'm calculating 50 features, and every feature is a process: we have 50 processes running, generating some data, feeding it to an optimization algorithm which learns your hyperplane. It's not at all scalable. Plus, it's not iterative: as soon as you have new data, you have to do the same thing all over again. So something is wrong with this, and researchers also figured out that something was going wrong and that we were not scaling up in NLP. So the research of the last ten years was essentially superseded by the last two years of research, which was the introduction of deep learning into natural language processing.

Let's hit the basics first: how do we represent a word? I said we'll process a sentence word by word, and there is an inherent problem with that: our word representations are actually very weak. How do we represent a word? Suppose we build indexes. If we have a vocabulary, and there is a word 'nice' that appears at the 30th index, we create a one-hot vector. This vector is called one-hot: it's zero everywhere and one at the place of the word's index, so the 30th position is one and the rest are all zero, and the dimension of the vector is the size of your vocabulary. That is how we represent, say, the word 'nice' at the 30th place; and what about 'good', which is at the 45th place? With this representation, can we say these two are actually similar? We cannot. (There is a small sketch of this one-hot representation below.) So the main problem is how we represent a word: our word representation does not convey any information about that word. We compute information about a word from the surrounding, neighbouring words, so if we improve our representation of the word, we'll save a lot of computation and we'll be able to do more analysis on top of it.

There was an earlier answer for this: people actually thought about it and created WordNets. A WordNet is a tree, or a graph you could say, and it captures the hierarchical information about all these words. So if you have 'awesome', 'good', 'nice', okay — there's a hierarchy, they are connected, and their distance tells you how closely they are related.
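As a tiny illustration of the one-hot representation just described — the vocabulary and words here are made up — this sketch also shows why such vectors say nothing about similarity:

```python
# A toy one-hot representation: every word is a vector of zeros with a single
# 1 at its index in the vocabulary.
import numpy as np

vocabulary = ["awesome", "good", "nice", "bad"]          # toy vocabulary
word_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    vec = np.zeros(len(vocabulary))
    vec[word_index[word]] = 1.0
    return vec

nice, good = one_hot("nice"), one_hot("good")
print(nice, good)
print(np.dot(nice, good))   # 0.0 -- one-hot vectors carry no notion of similarity
```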
But there is a problem with WordNets too: they are a lot of manual labour by linguists. They were created over a long period of time, and Princeton maintains the best-known one right now; you can access it, and I'll show how we can do that from Python. So WordNet's problems: it's a lot of manual labour, the scalability issues are there again, and as soon as you have new words, how do you associate those new words with the existing ones? Again you have to do a lot of computation, there has to be a lot of validation from linguists, and they may not even agree. If 'lol' has to go into WordNet, how do you connect it, and a few people won't even agree that it's a word. So there are lots of language-specific issues that go into it. And the biggest problem is nouns: we are generating new nouns every day, so how do you incorporate all of them, and how do you say that this noun is actually similar to that noun? All these problems are there.

I'll give a quick demo of WordNet through NLTK, the natural language processing library in Python. Through NLTK you can see that they have WordNet saved as a corpus, and you can access it. WordNet has lots of information and lots of relations in it: you can look up 'is-a' relationships, which are called hypernyms. This demo shows that for 'panda'... okay, that one is not rendering. So let's say we want to know how many synonyms there are for 'nice'; it will show you the hierarchy of all those synonyms, and similarly for nouns. It also tells you the region and domain, so there is a lot of information you can get out of WordNet: what the etymology of a word is, and which region it is connected to. 'Pukka', for example, is a word of Hindi origin, so its region is India. These are the kinds of information you can take from WordNet. I'll switch to these demos at the end so that it stays straightforward.
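Here is a small sketch of the kind of WordNet queries the demo goes through, using NLTK. The particular words queried are illustrative, not necessarily the ones on the slides, and you may need to download the WordNet corpus first:

```python
# Querying WordNet from Python via NLTK. You may need nltk.download('wordnet')
# the first time.
from nltk.corpus import wordnet as wn

# synsets (groups of synonyms) for "nice", with their definitions
for syn in wn.synsets("nice"):
    print(syn.name(), "-", syn.definition())

# hypernyms ("is-a" relationships) for "panda"
panda = wn.synsets("panda")[0]
print(panda.hypernyms())

# a crude similarity between two nouns based on their distance in the hierarchy
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.path_similarity(cat))
```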
So again, let's get back to where we were: WordNet. We said we have some problems with WordNet, so let's think about how we actually associate meaning with a word. If I say a word, there are lots of things in our human mind that attach to it: if it's a noun, we associate it with a person, and we create a context around it, right? And that context could be a taste, a smell, a time, a visual, or a feeling. There are lots of hooks on a word which tell us, okay, this word represents this. So what can we do programmatically? There is a quote which says you are the average of the five people around you, and there is a similar quote in natural language processing which says: you shall know a word by the company it keeps. That is the basis for creating context.

So how easy is it to create context? We can actually create co-occurrence matrices, and it's one of the most successful ideas in NLP. This is how we compute a co-occurrence matrix. Suppose we have a corpus, and that corpus is these three sentences: 'I like deep learning', 'I like NLP', and 'I enjoy flying'. We create a matrix which has every single word from the corpus, so if you have n words the dimension of the matrix is n by n. If we build it as a one-neighbour co-occurrence matrix, that means every word that is immediately adjacent to another word increases that pair's count by one. You can do the same with a larger context as well. Once that matrix is there, we have represented our complete corpus in it.

Again, there is a problem with that. Yes, we can do some analysis on the matrix; we have converted our corpus into numbers which actually capture the context of each word. But it is very high-dimensional: if you have millions of words, it will be a million-by-million matrix, and you won't be able to do much processing on it. And this matrix will be sparse; by sparse I mean there will be lots of zeros, because not every word co-occurs with every other word. With lots of zeros, if you run a classification algorithm, there will be a problem. I'll show a demo of a co-occurrence matrix at the end, because I have to keep switching back and forth.

What we can do with co-occurrence matrices is reduce the dimension of the matrix. If we reduce the dimension, the matrix becomes smaller, and if you have high-dimensional data and you reduce its dimension, there will be some latent information in the reduced dimensions. A dimension can be, say, a dimension of similarity; it's quite abstract and we can't always tell which dimension is which, but we can reduce the dimension using SVD, singular value decomposition. How many people here know SVD? Okay, great, I won't go into the math. SVD is a method by which you can reduce your matrix into a smaller one; it will have a lower dimension, and those dimensions will represent the most important features of the matrix. Any matrix can be decomposed into three matrices, and from there you can reduce the dimension. I won't go into the math; in the demo I'll show later, you'll see how we can calculate SVD with Python and how we can represent words in a low-dimensional space.
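Here is a small sketch of building the one-neighbour co-occurrence matrix for this toy corpus and reducing it with SVD via numpy; the variable names are my own, not the speaker's notebook:

```python
# One-neighbour co-occurrence matrix for the toy corpus, reduced with SVD.
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
words = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(words)}

# count how often each pair of words appears as immediate neighbours
X = np.zeros((len(words), len(words)))
for sent in corpus:
    tokens = sent.split()
    for a, b in zip(tokens, tokens[1:]):
        X[index[a], index[b]] += 1
        X[index[b], index[a]] += 1

U, s, Vt = np.linalg.svd(X)        # singular value decomposition: X = U S V'
# keep the first two dimensions as a tiny "embedding" of each word
embedding = U[:, :2]
for w in words:
    print(w, embedding[index[w]])
```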
So again, we have another problem, this time with SVD: calculating the SVD of that high-dimensional matrix. If you have a billion words, bringing it down to a million words doesn't really solve the problem — a billion-by-billion matrix converted to a million-by-million matrix is still a million-by-million matrix, and you can't get away from that. So the inherent problems with SVD are the computation, and the fact that it's not iterative: if I get a new sentence in my corpus, I have to calculate the SVD of the whole corpus again and then redo everything.

This is where the last two years of research have contributed. There is an algorithm called word2vec, from Google, by Tomas Mikolov and colleagues, and there is related research from Stanford called GloVe, Global Vectors, by Jeffrey Pennington and Richard Socher — these, along with people like Yoshua Bengio, are very famous names in NLP. What they did is: instead of us computing the lower-dimensional matrix, why don't we learn the lower-dimensional representation? We don't go to the high-dimensional matrix at all; we learn the low-dimensional vectors directly. So they devised these algorithms, word2vec and GloVe, which are actually very fast, and you can do a lot of things with them.

What do we do in word2vec? We said, right, we can understand a word by the company it keeps, so instead of going to singular value decomposition and calculating the co-occurrence matrix, why don't we just predict the context? There are two schools of thought in word2vec, say two algorithms, and one of them is called the skip-gram model. What the skip-gram model does is: you take a sentence and you look at it with a context of, say, five — if I have a centre word, I look at three words on its left and three words on its right, and I start predicting the words in my context. The more often those words appear together with that centre word, the more their probability increases. So what we do is maximize the log probability of the context of the word. To put it simply: suppose this is the sentence, and the sentence is very long, say it has lots of words, w1 to wL, but I'm only interested in a context window of three. I take this centre word wc, and I run an optimization algorithm which increases the probability of those context words. It gets highly mathematical, and it's difficult to give a demo of how it's calculated, but I'll be uploading all the IPython notebooks which do this, and I'll show a demo too — it won't be very detailed in this talk, but I'll be outside and we can go through all of it.

What's good about word2vec? First of all, it is unsupervised: we don't have to tag our data, it's just plain text. You give it the text, it goes word by word and predicts the probabilities of which words come with which. And what happens with word2vec is that it actually creates a low-dimensional space; how many dimensions is up to you, maybe 1,000 dimensions or 300 dimensions. That will be our vector, instead of a complete one-hot vector the size of the vocabulary.
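For reference, the skip-gram objective described a moment ago — maximizing the average log probability of the context words around each centre word — is usually written as (following Mikolov et al., 2013; the symbols here are the standard ones, not from the slides):

$$ J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_w}^{\top} v_{w_I}\right)} $$

where T is the number of words in the corpus, c is the context window size, W is the vocabulary size, and v and v' are the "input" and "output" vectors of each word.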
These lower-dimensional spaces can be anything: one dimension can be a dimension of similarity, which only reflects how similar words are; it can be a dimension of sentiment, where words are aligned with their sentiment; or it can be an all-caps dimension, where every word along it is capitalized. It's up to you to figure out and explore what these lower dimensions are actually talking about; they are abstract and quite hard to pin down. What is known is that word2vec actually captures dimensions of similarity, and a dimension of similarity is pretty broad, so words can be similar in anything. These algorithms capture syntactic information as well as semantic information — syntactic information being things like singular versus plural.

After running these algorithms, we have a vector for every word, a low-dimensional vector, say 300 or 1,000 dimensions. In word2vec you can get a vector out of every word. Earlier we said our representation was wrong, that the vector representation of a word did not carry any information; now the vector representation of each word carries a lot of information. If I subtract the vector of 'apple' from the vector of 'apples', the result is almost exactly the same as the vector of 'car' minus the vector of 'cars', and it's also similar to 'family' minus 'families'. Subtracting one vector from another gives you a vector again, a direction, and this particular direction is actually capturing the singular–plural relationship. There are lots of interesting things you can do with that: vector of 'shirt' minus vector of 'clothing' is roughly equal to vector of 'chair' minus vector of 'furniture' — shirt is a type of clothing and chair is a type of furniture. Similarly, this is a very famous example (sorry, there is a typo on the slide): vector of 'king' minus vector of 'man' is roughly equal to vector of 'queen' minus vector of 'woman'. So it is capturing all this semantic information, which is not present in the syntactic information at all.

So how do we use this dimension of similarity? The topic of this talk is solving logical puzzles through natural language processing, and the logical puzzles you can solve this way are analogies and odd-man-out problems. How do we solve analogies with our own brain? If we have not seen a word, we cannot solve an analogy involving it, and we cannot solve an odd-man-out problem involving it either; the same limitation applies to the algorithm as well. So: if man is to woman, what is the analogous word for king? The answer is queen. In word2vec you can simply compute king minus man plus woman, and that gives you queen. You can see that the man-to-woman direction is almost exactly the same as the king-to-queen direction, so if we have those three vectors we can jump straight to queen. In this way you can solve your analogy problems, and in a similar way you can also solve your odd-man-out problems.
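As a sketch of the vector arithmetic behind this, assume you already have word vectors in a plain Python dict called `vectors` (a hypothetical name here; in practice they would come from a trained word2vec model). The analogy is answered by finding the nearest word to king - man + woman under cosine similarity:

```python
# Solving "a is to b as c is to ?" by vector arithmetic. `vectors` is assumed
# to map words to numpy arrays, e.g. taken from a trained word2vec model.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# analogy("man", "king", "woman", vectors)   # expected answer: "queen"
```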
So how do you solve the odd-man-out problem? When we were training our word2vec model, calculating all these probabilities about which words come together, we can also feed in some bad data: word combinations that don't actually occur together. That is called negative sampling. We give these wrong examples to the algorithm and tell it they are wrong: the genuine pairs, the ones that really occur together, are tagged one, and the wrong ones, which don't fit the structure of the language, are tagged zero. The algorithm then learns which words should not come together. So when we do odd-man-out, it can tell us that this word has not come together with the others in any dimension, so this word is the odd one out.

So I'll go through all the demos now so that we can see this in action. I was talking about SVD — is the font clear? Okay, so these are the three sentences: 'I like deep learning', 'I like NLP', and 'I enjoy flying'. Using numpy's linear algebra library we can directly calculate the SVD. These are our words — 'I', 'like', 'NLP', 'deep', 'learning', and so on — and here I've created the context matrix manually; you can always create it programmatically. If you calculate the SVD of it, this matrix X, which is the context matrix, is broken into three matrices: U, S, and V-transpose. U holds the singular vectors, and if we plot the first two dimensions on a plane, we can see the words starting to fall into some pattern. This example is very small, so it won't capture much of a pattern, but if you do this on large data it captures a lot of patterns. Those are the kinds of patterns you can see here: this is a t-SNE visualization, which is used for visualizing high-dimensional data, and you can see that 'parliament' and 'elections' come together, and words like 'banks', 'reporters', 'ministry', and 'union' cluster as well. So that is how we calculate SVD; now let's move on to the demo of word2vec.

Python has an awesome library called gensim, and you should go through it, it's very simple; Radim Řehůřek is its author. Word2vec was originally written in C; he has ported it with Cython, so it's very fast, and you can also do distributed processing, because word2vec is meant for high-volume data. Google has released a trained word2vec model, trained on about a hundred billion words, with roughly three million unique words and phrases in its vocabulary. It contains almost everything we speak today, even Indian words, so you can check your own name and see whose name you are most similar to. You just load that model, which comes in binary form, and if you give 'woman' and 'king' as positive and 'man' as negative, you get 'queen' with the highest score. That is how you solve your analogy problems. It also has another method called doesn't-match, which is the odd-man-out: if I say 'breakfast, cereal, dinner, lunch', which of these does not fall into the same category? It's 'cereal'. There is a similarity method as well, and the model behaves like a dictionary, so if you want to see the raw vector of, say, 'computer', this is it — a 300-dimensional vector.
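A sketch of the gensim demo being described follows. The gensim calls are the library's real API; the file name for Google's pre-trained News vectors is the commonly distributed one and the path is an assumption — point it at wherever you downloaded the model (it needs several gigabytes of RAM):

```python
# Querying Google's pre-trained word2vec vectors with gensim.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# analogy: king - man + woman ~ queen
print(model.most_similar(positive=["woman", "king"], negative=["man"], topn=3))

# odd man out
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

# pairwise similarity, and the raw 300-dimensional vector of a word
print(model.similarity("woman", "man"))
print(model["computer"])
```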
Once you have the vector, you can do everything you want with your machine learning libraries on top of it: you can learn sentiments, you can do named entity recognition, and so on. And it is completely semantic, so 'sushi', 'shop', 'Japanese', and 'restaurant' come out as similar. And yes, this one I just cooked up, because we were talking about Modi a lot: if I ask which words are most similar to 'Modi', these are the results word2vec gives. It all depends on your data; this particular data is roughly the last ten years of the Google News dataset, they trained on it, and it's a very big model — it takes around four or five gigabytes of RAM on your computer.

Okay, I'll quickly go through one more thing about word2vec — or maybe I should just summarise it: if you want to train your own model, you have to turn your text into sentences, then you can train your own model and run queries on it. I have other things which I can talk about offline, and I'm open for questions right now.
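A minimal sketch of training your own model with gensim on your own sentences: the parameter names follow gensim 4.x (vector_size was called size in older versions), and the toy corpus here is just for illustration.

```python
# Training a small word2vec (skip-gram) model on your own tokenized sentences.
from gensim.models import Word2Vec

sentences = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

model = Word2Vec(sentences,
                 vector_size=50,   # dimensionality of the word vectors
                 window=3,         # context window size
                 min_count=1,      # keep even rare words in this tiny corpus
                 sg=1)             # 1 = skip-gram, 0 = CBOW

print(model.wv.most_similar("nlp"))
print(model.wv["learning"])
```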
Audience: This is more of an announcement rather than a question. My name is Amit Rao and I'm a freelance technology consultant with a background in research and development in natural language processing. I have created an open space slot at 12:30 for people who are interested in natural language processing and Python, such as the NLTK toolkit, and I think there's an opportunity to create a community of people interested in NLTK. Those of you who are interested, please come to the first floor at 12:30 and we'll brainstorm on how we can contribute to this.

Speaker: Sure, sure. Thank you.

Audience: Hello. Thank you for the great talk. I have two questions. The first question: when we talk about word2vec, you said it is based on context-aware prediction, maximizing the likelihood of a word nearby, right? And the idea that motivated word2vec was that the original co-occurrence model was very expensive to store — it was an n-squared implementation. So can you tell me how exactly they store the context of a word in word2vec?

Speaker: Word2vec is a neural network; the algorithm is a neural network. Initially you give the one-hot encoding as your first layer; the second layer is whatever dimension you want — suppose you want a 300-dimensional vector, you train it down to 300 dimensions — and after that you just output the word again. While you are calculating gradient descent, or any other optimization, on the weights, you maximize the likelihood of your context words, and you save those weights as numpy arrays.

Audience: Aren't we actually constraining the model by limiting the context of a word? There's a possibility a word is associated with another word that we're not allowing it to train on, because we're putting in a constraint.

Speaker: That depends on your computational power. The context is free: you can take a hundred words on the left and a hundred words on the right as context, and you can increase your dimensions to, say, two thousand or five thousand. That is tunable; it's a parameter.

Audience: One last thing. When you were talking about the vectors and the n-squared co-occurrences, you gave the example of the word 'nice' at the 30th index, so there will be a one at that point, and that generates an n-squared matrix, right? I was thinking: what if we take a single vector where words are indexed by their index number only? So 'nice' is connected directly to position 30 in a linear index, and to associate a word we can add its index to it. Suppose there are a hundred words and 'nice' is at the 30th position, so its index is 0.30, and we want to associate it with 'weather', which is at the 70th index, so 0.70; we can add them, and 30.7, the number itself, in some way signifies that 'nice weather' is a context. I mean, I don't know, I just came up with it.

Speaker: That is a very nice view of linearizing an n-squared problem. People are actually doing things like that, but it's very limited to your own context: if you train a model that way, it is only limited to your own context, it's not generalized like word2vec where you can do a lot of things. And in your scheme, for 'nice' and 'weather', you are giving the weights yourself; you're not learning the weights automatically out of the corpora. It's your view, so I can always disagree with your view and say: don't associate 'nice' with 'weather', I associate 'awesome' with 'weather' and I give it 0.7. So those kinds of problems are there, okay?

Audience: Hello. One of the things you talked about right now is using these vectors: you're associating a word with the words in its neighbourhood and predicting the likelihood that similar words will appear next to it, right? So how do you handle multiple meanings? 'Nice' could have three or four meanings: it could be nice as in pleasant, it could be a nice level, like the CPU nice level, or you could be talking about the Nice biscuit. When you create these vectors, how do you account for multiple meanings of words, which is fairly common — 'apple' is also a very common example. And secondly, what about computational performance versus accuracy, compared to something like a posting list or a WordNet-style graph with links — what would be your view?

Speaker: These problems are there; that is disambiguation of the word. But what you can do is, once you have these word vectors, which actually carry some information about the word, you can train on top of them again. The word vectors are not the final trained model, they are representations of the words; so whatever words you were saving as one-hot encodings or in WordNets, you can represent with word2vec and then train your disambiguation algorithms on top of that.

Session chair: Sorry to interrupt — we're really short on time, so you can take these questions afterwards. Okay, so thank you for such a nice talk, and let's have a round of applause. Thank you.