So hello everyone, and thank you for attending this talk. I know the title may seem a little buzzwordy, but I hope to present a more down-to-earth view. First, a few words about me: I am a back-end engineer in a machine learning team, so besides the problems of a regular engineer in a distributed environment, I also have to face and solve machine learning problems in cooperation with our data scientists. If you want to talk with me, I will be really happy to discuss machine learning engineering, but also my hobbies, like 3D printing and home brewing.

So what even is NLP, natural language processing? We may first think of processing human voice, and of voice recognition, but I will not touch on that today; I will focus on processing textual data that we already have.

Why is natural language processing hard? First, language is ambiguous. If we say "hey, I had a sandwich with Bacon", it's hard to tell whether we met Kevin Bacon for lunch or had a sandwich with pork meat. Second, texts are compositional: characters compose words, words compose sentences, and those compose paragraphs and whole books. The problem is that two words can carry related meanings while sharing no characters: "burger" and "pizza" share none of their letters, yet they both carry the meaning of junk food, so we cannot compare them based on their characters alone. Those are a few reasons why this is hard. If you want to learn more about the traditional NLP approaches and get more of an overview of the field itself, I highly recommend listening to last year's lecture, an introduction to sentiment analysis with spaCy. The slides will be shared with you later, and they contain a lot of links for further reading.

Then, what common problems do we have in NLP? Because if somebody has already solved the issue we are having, and we can generalize to it, we can either use the ready solution or at least be inspired by it. The common problems are, for example, document classification. This includes document sentiment: whether, say, the reviews of our business on a website are positive or negative, because if we have thousands or millions of them, it would be hard to classify them by hand. It also includes author attribution: who wrote the document? This is exciting because authors are identified not only by the words they use but also by the way they construct sentences. And there is the very practical use case of deciding whether the email we just received is spam or not, or whether it's important or not.

Another common problem is sequence to sequence. This includes, but is not limited to, translation, like Google Translate; summarization, where we have a whole article and we just want to create an abstract of it, just to know whether we would be interested in it; and response generation, like the feature in Gmail where, when we receive an email, we get a few possible responses and can just tap one to reply to our sender.

Another common problem is information extraction. Take the sentences "Jimmy bought Apple shares" and "Jimmy bought an apple": we want to know whether "Apple" refers to the company or to a fruit. This is really useful in search engines, or I would say in ads, because when we show an ad we want to know what it should be relevant to: should it be relevant to fruit, or should it be relevant to iPhones?
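To make that last example concrete, here is a minimal sketch of entity extraction with spaCy. The model name en_core_web_sm is my assumption, not something from the talk, and whether the small model actually disambiguates both sentences correctly is not guaranteed:

```python
import spacy

# Assumes the model was downloaded first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for text in ("Jimmy bought Apple shares.", "Jimmy bought an apple."):
    doc = nlp(text)
    # doc.ents holds the entities the model recognized; ideally "Apple" comes
    # back labeled ORG in the first sentence and is not tagged in the second.
    print(text, [(ent.text, ent.label_) for ent in doc.ents])
```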
So these are only a few problems, the most common ones I would say, but you will find, if you want, that there are a lot more of them.

Why are neural networks good for NLP? Text carries a lot of features, and we can extract and label them by hand. In sentiment analysis, for example, we can find and hand-pick the words that indicate whether the sentiment is positive or negative: if somebody mentions "terrible", they probably mean our business is bad. The idea is that neural networks will learn those features on their own, and as practice shows, they usually do.

I will focus on, and show examples from, this quote-unquote real-life problem: IMDB sentiment analysis. Why do I put "real" in quotes? Because in real life we do not have such beautiful datasets. This one is 25,000 highly polar movie reviews; our real data will usually not be so polar and not so clean, we will usually have a lot more noise, and we will have to live with that. But as an exercise, I think it's really good.

As the metric I will use accuracy, and as the cost I will use training time. Why training time and not, say, the number of parameters? Because right now, when we are paying for servers or working on our own computers, time is the thing we are most concerned about, and with complex networks the parameter count doesn't translate directly into it. The downside is that in the future, with better ways of parallelization or better algorithms, we may find that networks that are really expensive to run right now become cheaper. So, like with everything, it depends on what you choose.

Our task definition: we have a movie review, we want to put it into our neural network, and we want to decide whether the movie was good or bad. It looks simple enough, but there's a catch in the review itself. Since neural networks are basically matrix multiplications and additions, plus activation functions, we cannot throw text directly at them. We need some numerical representation, and to obtain that numerical representation we first need to have some features. Here is an example of the text: we can see "a big disappointment", "incredibly bad", "very pretentious". It's highly polar, like I mentioned. But to use this text as an input, we first need to translate it into features and then into a vector.

I will focus on a simple sentence and what we can possibly extract from it: "a quick brown fox". It looks like we only have a few words, but let's see what we can do with them. First, we can tokenize the sentence, and by tokenize I mean split it into chunks. For English, those chunks will most often be words, so we get the tokens "a", "quick", "brown", and "fox". If you work with other languages, you may find that it's not so easy, especially for German, because German glues words together. I do not know German, but I know that is the case. You can use a library like SentencePiece from Google to try and deal with that, or try some other subword tokenizer.
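A minimal sketch of that tokenization step; I'm using spaCy here since the talk already references it, but the model name is my assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A quick brown fox jumps over a lazy dog")

# Each token is one chunk of the sentence; for English these are mostly words.
print([token.text for token in doc])
# ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog']
```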
Let's get back to the word "fox". What do we know about it? If we use classical, statistical NLP models, we can extract the information that it is a noun; we could also label that by hand, but we can automate it. Also, if we use the WordNet database, we can extract that the word "fox" belongs to a synset whose wider meaning is "canine"; a canine is a dog-ish animal, let's say, to simplify. If there are biologists in the room, they will probably be angry about that.

Okay, what else can we know about this word? We can extract its stem, which is the core of the word, and we can extract its lemma, which is the basic form of the word; for this simple case, both will be "fox". If we have access to the whole corpus of data, we can also calculate the term frequency-inverse document frequency (TF-IDF). It tells us, to simplify, how important this word is in the given sentence, and it can be another feature. So now, for this specific token, we have one, two, three, four, five, six possible features. I will focus on the word itself from now on, but remember that the others are there and can prove useful. You can also create syntax parse trees with classical NLP models, and they can also boost your accuracy.

When you want to represent a word, or a bunch of words, in a way that our neural network will understand, you can use, for example, bag of words. It is the simplest possible representation that I know. First we construct a dictionary; here it is constructed from the sentence "a quick brown fox jumps over a lazy dog". For each dictionary word that occurs in our sentence we put a one, and for each dictionary word that does not occur in our sentence we put a zero. We also usually reserve one of the tokens for unknown words.

Now that we already have a representation, we can build a network. The most basic architecture is the fully connected neural network: we have multiple inputs, then a hidden layer or not, and we pass our values through it. The important part is that everything is connected with everything, so we will have a lot of operations. And this is an example network constructed in Keras: we have only one input, one hidden layer, and one output. So it is very simple.

What happens with the first layer after we train that network? It is as big as our dictionary; here it was 1,000 words on this IMDB dataset. Each row of it will now contain a really dense representation of a given word, which is very often called an embedding. So our first layer constructs embeddings for the words in our dictionary. And what can we do with that? We can use the embeddings as features in different models, but we can also visualize them. I reduced the dimensionality from 64 to 2 with t-SNE, and we get this beautiful scatterplot.

What information can we extract from it? As was also stated in the first keynote, we can look at similarity. Here I looked at the similarity to the word "ridiculous", and the closest words to it are "waste", "boring", and the like. So our network, without even knowing the meaning of the words, learned that they basically mean the review was bad. On the other hand, if we choose the word "fantastic", the nearest neighbours are "excellent", "7" (probably from ratings on the numeric scale from 0 to 10), "8", "amazing", and so on. And if we compare where these two clusters, I would say, are located, they occur at totally opposite points in space: on one side we have the representation of positive sentiment, on the other the negative.
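To make the whole pipeline concrete, here is a minimal sketch of the bag-of-words plus fully connected setup in Keras. The 1,000-word dictionary and the single 64-unit hidden layer follow the talk; the epoch count, optimizer, and the `to_bag_of_words` helper are my assumptions:

```python
import numpy as np
from tensorflow import keras

NUM_WORDS = 1000  # dictionary size used in the talk

# Each review in the built-in dataset is already a list of word ids.
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

def to_bag_of_words(reviews, num_words=NUM_WORDS):
    # One 0/1 vector per review: 1 if the dictionary word occurs anywhere in it.
    bags = np.zeros((len(reviews), num_words), dtype="float32")
    for i, review in enumerate(reviews):
        bags[i, review] = 1.0
    return bags

model = keras.Sequential([
    keras.Input(shape=(NUM_WORDS,)),
    keras.layers.Dense(64, activation="relu"),    # the single hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # good / bad
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(to_bag_of_words(x_train), y_train, epochs=5,
          validation_data=(to_bag_of_words(x_test), y_test))
```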
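And a sketch of pulling the first layer's weights out as per-word embeddings and inspecting them, roughly how the nearest-neighbour lists and the scatterplot above could be produced. It continues the previous sketch (it reuses `model` and `NUM_WORDS`), and the `nearest` helper is mine, not from the talk:

```python
import numpy as np
from sklearn.manifold import TSNE
from tensorflow import keras

embeddings = model.layers[0].get_weights()[0]   # shape (NUM_WORDS, 64), one row per word

# Keras reserves ids 0-2 for padding / start / unknown, hence the +3 offset.
word_index = keras.datasets.imdb.get_word_index()
index_word = {i + 3: w for w, i in word_index.items()}

def nearest(word_id, k=5):
    # Cosine similarity of one embedding row against all the others.
    v = embeddings[word_id]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-9)
    return [index_word.get(i, "?") for i in np.argsort(-sims)[1:k + 1]]

# t-SNE from 64 down to 2 dimensions, for the scatterplot.
points_2d = TSNE(n_components=2).fit_transform(embeddings)
```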
This is also a nice tool to show around your company, because it looks awesome. If you use TensorBoard for visualization, the t-SNE is already there, you do not have to do it by hand, and you can move around this interactive, even 3D, graph.

Pros and cons of a fully connected network with bag of words as the representation: it's simple, so it's cheap and fast to train; one epoch took about a second or two in Colab, so you can iterate on it fast, and that's a really good upside, because you can conduct experiments really, really quickly. We always look at the whole text, at every word. It's also kind of interpretable: it's so simple that we can usually explain why a given result was chosen. The downsides are that we cannot get close to the state of the art (my best result was about 89 percent; the current best result is 96 percent), and we do not keep the order of the words, because a bag is just a set; we lose that information.

So how can we fix the order of words? Consider these two reviews: "I loved the movie, but the cinema was terrible" and "I loved the cinema, but the movie was terrible". If we put them into a bag, these two sentences become the same: the representation of the two will be exactly identical. If we had both sentences like this in our training set, it would only be half as bad, because we would produce, I would say, undefined results for that kind of sentence. But if we had the first one in training and then the second one in production, we would conclude that it is definitely a positive review. So we have to watch out for that.

One way of addressing it is to use a sequence of one-hot vectors: instead of smashing all the words into one vector, we just concatenate the sequence. One practical word: if you plan to do it that way, use sparse matrices, for example via scikit-learn, because sparse representations will be much more memory efficient here.

And if we have another sentence, like "a quick brown vixen" (a vixen is a female fox), and "vixen" is not in our dictionary, we will need to assign the unknown token to it. But since every single unknown word goes into this one unknown bucket, that may not be so good for our performance. Instead, we can assign, for example, the synset of that word if it is in a dictionary like WordNet; you can try that. But we can also try assigning the specific part of speech the given word represents, and this actually improved my results, especially when I had a very small dictionary of 1,000 words: it boosted accuracy, I think, from 86 to 89 percent. So it's really good if you want to work with a small dictionary.

I also played a little and created a model trained only on the parts of speech. I know it doesn't make sense, but it actually had better results than 50 percent; it had around 60, I think, so it wasn't totally random. I think the takeaway is that people who are angry and do not like something write in a different style than people who are happy with something. And this is the example network: we have our input, two layers, and an output. Simple enough.
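Going back to the sparse-matrix tip above: a minimal sketch of a sequence of one-hot vectors stored sparsely. The talk says scikit-learn; scipy.sparse is what scikit-learn builds on, so I use it directly, and the sizes and word ids here are hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix  # scikit-learn models accept these directly

NUM_WORDS, MAX_LEN = 1000, 500   # hypothetical dictionary size and review length

def one_hot_sequence(word_ids):
    # One row per position in the review, one column per dictionary word;
    # stored sparsely, only the handful of 1s actually take memory.
    rows = np.arange(len(word_ids))
    ones = np.ones(len(word_ids), dtype="float32")
    return csr_matrix((ones, (rows, word_ids)), shape=(MAX_LEN, NUM_WORDS))

x = one_hot_sequence([14, 22, 9, 130])   # hypothetical word ids
print(x.shape, x.nnz)                    # (500, 1000), but only 4 stored values
```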
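And a sketch of the unknown-word trick, assuming spaCy for the part-of-speech tags; the toy dictionary and the `normalize` helper are mine, for illustration only:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
dictionary = {"a", "quick", "brown", "fox"}   # toy dictionary for the example

def normalize(text):
    # Known words stay themselves; out-of-dictionary words are replaced by
    # their coarse part-of-speech tag instead of a single UNK token.
    return [tok.text.lower() if tok.text.lower() in dictionary else tok.pos_
            for tok in nlp(text)]

print(normalize("A quick brown vixen"))       # ['a', 'quick', 'brown', 'NOUN']
```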
Another way to approach this: since the part-of-speech tags for unknown words worked, maybe we could assign them to every word, and maybe it would improve something. Then we have an even bigger dictionary, because we also want to include the parts of speech in it. For me it didn't help at all, but remember, that was only my case on this IMDB data; for you it may improve something.

Another way of representing these features: first we had this sparse matrix representation, and here we can have a dense one. To each word we assign a much smaller vector, with not just one-hot values but a whole range, usually from minus one to one. Also, this vector will usually have a length of one, because then we can calculate cosine similarity more easily. Here I also created another model, with embeddings for the words and for the parts of speech. Why 5,060 inputs? Because we have 5,000 words plus about 60 possible parts of speech from spaCy, and then a sequence length of a thousand, because that was how long my reviews could be. The catch is that I had to pad those sequences to work with this network: either extend them with a pad token or cut them if they were too long. So we lose some information.

Pros and cons of fully connected networks with a sequence: it's still simple, so cheap and fast to learn, still under two seconds per epoch; the order of words matters now; and they are still kind of interpretable. But we can't get close to the state of the art (0.96). Also, words at a given position matter more than they should. What do I mean by that? Because of how the neural network works, if the word "bad" occurred at the first position and then it occurs at the second or third, it will be treated, not completely, but somewhat differently, and that may be a problem. Negations are, because of that, hard to catch; not impossible, but hard. If you want to learn more about the basics, I think Andrew Ng's deep learning course is the best way to start.

So: if we have the review "this movie was not good", we have a negation, and like I mentioned, it's hard to catch. We can use a few things to deal with that. We can use a tool called word2phrase, from the word2vec repository. There is also a similar one in gensim, but the upside of the one from word2vec is that it's written in C and it's blazing fast: on a dataset of a few million sentences it finishes within seconds, while in gensim I wasn't patient enough to wait for the result. What word2phrase or gensim produce is a new word in our dictionary: instead of having "not" and "good" separately, we will have them together. It works by simply looking at how often the words occur together and how often they occur separately.
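A minimal sketch of the gensim side of this, using its Phrases model; the toy corpus is mine and far too small for the co-occurrence statistics to mean anything, it only shows the API:

```python
from gensim.models.phrases import Phrases, Phraser

# Toy corpus; in practice this needs millions of sentences.
sentences = [
    ["the", "movie", "was", "not", "good"],
    ["not", "good", "at", "all"],
    ["a", "not", "good", "start"],
]

phrases = Phrases(sentences, min_count=1, threshold=0.1)
bigram = Phraser(phrases)   # frozen, faster version of the model

print(bigram[["the", "movie", "was", "not", "good"]])
# e.g. ['the', 'movie', 'was', 'not_good'] once "not good" co-occurs often enough
```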
Okay, another way of handling that issue is, for example, to use convolutional neural networks, CNNs. I know they are usually used on images, and today I learned they can also be used on audio in general, but we can also apply them to text. Let's think about this review: "this movie was not good". What a CNN does: it has a sliding window, let's say of one neuron, that first builds a representation of the words inside that window. For this example I chose a window of two words: we have a matrix that multiplies our representations of "this" and "movie" and creates a representation of "this movie" together; then for "movie was" we get a representation of those two words, and so on and so on until we have covered everything. This part of the operation is called convolution.

Then we do an operation called pooling, where we reduce dimensionality. Pooling does not go in a sliding window; it goes over multiple representations at once and reduces dimensionality. I think that's better shown than told. Basically, in the end we have a final representation of triplets of words, and on top of that we use a standard fully connected network. You can use either max pooling, where in each dimension we pick the maximum value, or average pooling, where in each dimension we calculate the average of the values in the vectors. I heard that for text it's usually better to use max pooling, but I think it's always best to try both and find out what works for you; it's just another hyperparameter.

For the convolution you have the window size, how big it is, and also the stride size, how many words you jump over. Here the stride size was one and the window size was two, so we moved our window by one word at a time. If we had a stride of size two, we would only have representations for "this movie", then "was not", and so on. It may be worth working with bigger strides, especially when you have long texts, so if you want to work on whole paragraphs, please consider that. And that's the simple architecture, and that's another architecture I came up with.

Pros and cons of CNNs: they parallelize nicely and have much fewer parameters than fully connected neural networks; usually, of course, it depends on how you build things. The order of words matters, and finally the position of words also matters. And if we want, we can create a network that looks at the whole sentence; it's not so easy in practice, but in theory we can. If you want to learn more about this stuff, there is another further-reading link, "Understanding Convolutional Neural Networks for NLP"; I also recommend reading that.
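A minimal sketch of such a text CNN in Keras: the two-word window and stride of one follow the talk, while the global max pooling is a simplification of the windowed pooling described above, and the filter counts and layer sizes are my assumptions:

```python
from tensorflow import keras

NUM_WORDS, MAX_LEN = 5000, 1000   # dictionary size and padded review length from the talk

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    keras.layers.Embedding(NUM_WORDS, 64),             # dense word representations
    keras.layers.Conv1D(64, kernel_size=2, strides=1,  # window of two words, stride one
                        activation="relu"),
    keras.layers.GlobalMaxPooling1D(),                 # max in each dimension, over all positions
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),       # good / bad
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```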
So now, recurrent neural networks. Remember how in CNNs we had a representation constructed from two words together? Now we will have a representation for the word "this"; then we move to the next word and create a representation for the word "movie", but we also take the previous word into the context; then we create a representation for the word "was", again taking the previous words into account; and we always use that same matrix of weights. When we get to the end, we have a representation of the whole sentence. This is really, really nice, because in theory we capture the whole sentence: we know what is there and we can work on that. We put a fully connected network on top, and now we can create a prediction. We can also stack those layers, like in a regular network.

The problem is that we can have trouble with vanishing gradients. When we have a review like "Terrible. I loved her previous movies.", the word "terrible", which indicates the sentiment, is at the beginning, and by the time we get to the end, its representation has had to travel through a very deep network. We have to deal with problems like the vanishing gradient, where it totally vanishes and we lose the meaning of the word "terrible", or the exploding gradient, because we are always multiplying by the same matrices of weights.

To counter that, we can use bidirectional RNNs: first we go front to back, then back to front, merge the results in some way, either concatenate or sum, whatever, and put a fully connected network on top. It should work. Pros and cons: they can give better results, and we look at the whole sentence, but they are hard to train, because, as you may see, the network will be as deep as your sentence, or your whole review, is long. So we have to deal with training very deep networks, and that is really, really slow. If you want to learn more about RNNs, I put a link to the Stanford lecture about them.

What is another way to counter the forgetting? We can use LSTMs or GRUs. Unfortunately, I won't get into the details here, because the architecture of the neurons is quite complex. This is an LSTM: we not only pass the representation along, we also pass, let's say, the state of the cell. And these are GRUs, which are a little simpler. The important thing is that we do not only carry the representation of what happened in the past in the vector; we also carry a state. Since it's more complex, and we have more operations, and gates that remember or forget things, we won't necessarily be forgetting stuff. Since it's not always the simple matrix multiplication it was in plain RNNs, we will retain the information about the words at the beginning. But again, as you may see, the design is pretty complex, so when you are training the network, you have to do a lot of operations and a lot of backpropagation. It will take time, but they can give the best results; until the Transformer came along, they were most of the time the state of the art. And even in Keras we can create an architecture that will look at whole sentences; it's hard, but it's possible. Here is another lecture from Stanford and a link to a blog post about understanding LSTM networks.

Now, going back to the results of my experiments: they weren't really what you'd expect. The fully connected network with bag of words achieved 0.89 accuracy, while LSTMs came really close, at 88 percent, but the training time was sixty times higher. So it's not always worth throwing yourself into the most complex architecture at the beginning. I think it's always best to start with something simple and then iterate and compare against that, because with the simple architectures you also gain quick inference time, which can be really useful.

I have barely scratched the surface here, and if you are interested in machine learning in the context of NLP, I highly recommend that book. It's, I think, not only one to read but also one to have if you want to work in this field, because I find myself often going back to specific parts of it; for example, if you want to know how the window size in word2vec interacts with the produced embeddings, you can find it there. So that will be all, and thank you.