Hello. Am I audible at the back? Yeah. So I'll start. I actually did not expect this many people, because even I don't reach the office at this time. But yeah, I have a crowd. So I wanted to start with a quick show of hands. How many of you know about natural language processing? How many of you have actually worked on it or used it? And how many of you know what embeddings are? Okay, very few hands. And deep learning, how many of you know the basics of deep learning? So I'll demystify these terms, deep learning and natural language processing, and I'll try to give an intuitive understanding of why deep learning works and how you use it in the context of NLP. So let's get started.

The internet nowadays is a huge source of data. Every question that you have, an answer lies somewhere on the internet, and the reason for that is the sheer number of users who are continuously generating data. When users browse websites, that's clickstream data. When they buy something, it's transactional data. But undoubtedly the largest amount of data on the internet is in the form of natural language, and NLP is about that. It's about computers being able to process that data, computers being able to understand it. And if you can do that effectively, the number of applications is practically limitless. I've listed some of the applications here. Search, when you think about it, is all about NLP: an ideal search engine will try to understand the query of the user and then map it to the relevant documents. Customer support: today, thousands of people send emails to companies, and a team of hundreds of people reads each email, understands it and tries to solve the problem. Even that has a lot of scope for efficiency. Question answering, like I said, answers to most of our questions are out there on the internet. Sentiment analysis has also become a big deal nowadays, because companies want to understand: are users liking my feature? If a customer is dropping off, why is he dropping off? People are tweeting about it, they're blogging about it, they're literally shouting on the internet, and it is being captured across various databases, but we need good enough techniques to analyze that sentiment. And social data has so much knowledge in it. Suppose for some reason Flipkart has a technical issue; people on Twitter will tweet about it, and we should be able to identify it from that data. We should be able to know about cases like net neutrality, that people are not happy with something, by automatically analyzing this data. I can go on and on with such examples of NLP, but you can also use it in personalization, information extraction and so on.

So NLP is basically the science of deriving meaning from natural language. Initially, when computer science started, people thought NLP wouldn't be that big a deal: I'll parse this sentence into some sort of semantic data structure and I'll be able to run queries on it to understand language. But soon they realized that it wasn't as simple as that. People moved on to rule-based systems, so they ended up writing a rule for each sort of case, and there they also realized that you cannot pre-think every possible rule. And what happens is, when you put in too many rules, the rules start to conflict with each other. Still, some good systems came out of that approach, like the ELIZA chatbot.
But the biggest breakthrough came in the late 1980s, when people started to apply statistical, machine learning-based methods to this. Now you go to your Google inbox and all your spam is automatically cleared out; that's the power of statistical machine learning at work. You go to Google News and all the stories talking about the same thing are clustered together. Even Google Translate, till a while back, worked on the principles of statistical machine learning. What it basically does is you give the algorithm a bunch of positive and negative examples and it will find statistical patterns in them.

So let me take a few examples, for people who are new to NLP, to appreciate why it is hard. Sentiment analysis is a problem that I've been working on for a long while now, so I'll give some examples from there. This is an easy statement: "Flipkart is a good website." You can just count the number of positive words and say that this is a positive statement. This one is slightly harder: "I didn't receive the product on time." Now, "receive" is not a positive word in SentiWordNet or anywhere else, and "didn't receive" is actually negating it. But a good statistical classifier will still catch this, because it will see that the bigram "didn't receive" occurred many times in the negative tweets, and it will classify this as a negative statement. Then there is the problem of rare words. Suppose in your training data you never had the word "shoddy". The classifier wouldn't know that shoddy is actually a negative word, and it will say this is probably a neutral statement. There are all sorts of things in social media: there's SMS lingo, there are misspellings and so on. If it is not in the training data, it will be a problem for the classifier. And then of course the hardest problem of all is sarcasm: "Well played Flipkart." Even humans who do not know the context, say what IRCTC is, would have a problem classifying statements like this as negative.

So the bottom line is that NLP is not that easy, and the kind of systems that I talked about in my first slide are not really there yet. The reason is that accuracy is hard. We understand that NLP is hard, but the end user doesn't, and if you make mistakes there, the end user feels there are bugs on your website and so on. None of the e-commerce websites has a good review summarization system, for instance. So how do we go about increasing the accuracy of these methods? This problem led me to try this exciting technique that has been coming up a lot lately: deep learning.

A bit of an overview about deep learning: it has shown some amazing results and has dominated pattern recognition in the past few years. If you look at ImageNet, a database with around 5,000 classes, deep learning has an error rate of around 6%, which is second only to humans, who are at around 5.6%. Facial recognition systems also have near-human accuracy. Those of you who use Google phones would have noticed that speech recognition accuracy has increased tremendously in the past few years; deep neural networks are at work there as well. I have some examples from the hard ImageNet problem here. Images, humans find very easy to classify, but if you look at them as raw pixels it is actually a very hard problem, and these networks are doing it nearly perfectly across so many classes of images. Even when they make errors, they are sensible errors: even humans might classify the first example as a snake, or the second example as a dog. So I come to this question of deep learning for NLP.
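As a rough illustration of the statistical baseline described above, here is a minimal sketch of a bag-of-words classifier with unigram and bigram features, written with scikit-learn on a made-up toy data set (purely illustrative, not the actual classifier discussed in this talk). The point is simply that bigram counts let the model pick up patterns like "didn't receive" as a negative signal.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, just to show the mechanics.
train_texts = [
    "flipkart is a good website",           # positive
    "great service and fast delivery",      # positive
    "didn't receive the product on time",   # negative
    "didn't receive my order at all",       # negative
]
train_labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),    # unigram and bigram features
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# The bigram pattern around "didn't receive" pushes this to the negative class.
print(model.predict(["didn't receive the package"]))
```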
NLP has its own specific problems, and when I talk about applying deep learning, some people say it is hype, some people say it is overly complicated, and so on, but I think the numbers speak best for themselves. What we observed on our internal data set was an accuracy increase from 85% to 96% in two-class sentiment classification. I did not believe these results when I first got them; I thought there might be a bug in my code, or that I might not have split the data sets properly, but when I actually looked into it, the improvement really was this significant. That is a 73% error reduction, which is a lot. Generally we see incremental gains, a 2% increase, a 3% increase, but this was substantial. The most amazing thing about it was that there were no sentiment-specific features: I did not handle negation specifically, I did not give it a sentiment-specific lexicon or anything like that, and it still learned all of that. Obviously I applied it to many other tasks; I applied it to topic identification, and I applied it on various data sets like tweets, reviews and emails, and it worked very well in all these cases. Most of my talk will be about text classification, but at the end I will give some hints on how to solve other problems using neural networks as well. Many of the problems that we face right now can actually be reduced to text classification: say you have to prioritize email, or you have to do sentiment analysis. A lot of NLP work can be solved by training a simple classifier.

So the burning question is: why does deep learning outperform statistical models? Let's say you have data like this, a bunch of positive and negative examples, laid out like this. What a statistical classifier essentially does is learn a decision boundary here. This could be anything; let's say you are classifying users as male and female. The way a statistical classifier works is, when a new data point comes in, you classify it as male or female based on which side of the decision boundary it lies on. The problem is that most raw data is not like this: you have images, you have text, you have sound waves and so on. So how do you actually get from here to here? The answer to that is features. I'm sorry, the slide has some issues. Features transform the input data into a space where a classifier can learn a decision boundary. Most of you who have worked in machine learning will know the pain of feature engineering. First of all, we spend most of our time collecting and cleaning the data, and then we spend most of the remaining time on feature engineering; after that it's very simple, you just learn a classifier. Let me give an example of features. Say you have some data and you want to classify gender as male or female. You can plot the browse counts in the men's store and the women's store; that will be a discriminative feature for this data, and you will be able to learn a decision boundary on it. Features like this have to be manually identified. But the power of deep neural networks is that they identify these features automatically, and sometimes, like in the case of images, they will obviously do a better job at it than humans, because they're looking at the data.
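To make the feature idea concrete, here is a minimal sketch of that gender example with two hand-engineered features, browse counts in the men's store and the women's store, and a plain linear classifier. The counts, labels and use of scikit-learn are all made up for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Each user is reduced to two hand-chosen features:
#   [browse count in men's store, browse count in women's store]
X = [[25, 2], [30, 5], [18, 1],     # users labelled male
     [3, 28], [1, 35], [6, 22]]     # users labelled female
y = ["male", "male", "male", "female", "female", "female"]

clf = LogisticRegression().fit(X, y)   # learns a linear decision boundary

# A new user is classified by which side of that boundary they fall on.
print(clf.predict([[20, 4]]))
```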
So what is a neural network? This is an example of a simple three-layer neural network. You pass the data to the input layer, and if you're learning a classifier, the output layer corresponds to the labels that you've given your data. The hidden layer holds a hidden representation of the data. I don't have enough time to explain this fully, but essentially, the lines that you see between layers are weights, and each of the neurons is doing what a simple classifier does. When you train the network, you pass it the input data, you set the labels at the output, and there's an algorithm called backpropagation which learns all these weights inside it.

The power of deep neural networks is that higher layers form higher levels of abstraction. Images have always been a good example to explain this. Let's say you're training a classifier which can recognize objects and animals and so on. What the first layer learns is very basic things about the images, things like edges; what the second layer learns is probably more complicated shapes, like circles or other simple geometric shapes; the layer after that learns progressively higher-level features, maybe fur or feathers; and the one beyond that learns more complicated structures, like eyes and ears. This is actually how even the human visual cortex works on images: we have layers called V1, V2, V4 and IT, each learning higher and higher level features. So this is why deep neural networks work so well. And with images you cannot do that good a job of feature engineering, because you'd need new features for every new class: different features for cats, different features for something like an ambulance, and so on. That's why in image recognition deep learning has now replaced the other techniques; people directly start with it.

Another reason why deep learning works so well is the concept of unsupervised pre-training. Of all the data that we have, labeled data is only a small part. Let's say I'm doing sentiment analysis; I'll have, say, 2,000, 5,000, 10,000 labeled examples. But there is such a huge chunk of unlabeled data on the web, in the form of Wikipedia and so on, and deep neural networks have a way to learn and generalize from this unsupervised data. They can catch grammatical patterns and meanings of words from it. Here is a graph: on the y-axis you see classification error, and the x-axis shows how long the network has trained. The blue line is from the network that wasn't pre-trained in an unsupervised way, and the black line is from the network that had unsupervised pre-training. You can see that the latter generalizes very well. This is not there in normal statistical classifiers, because you simply ignore the unlabeled data; you only use the labeled data in most cases.

So we saw why deep learning works. I gave you two basic reasons: higher layers form higher levels of abstraction, and unsupervised pre-training. But there are specific problems when you apply deep learning to natural language. First of all, natural language can be of any length: you can have a sentence with five words or 50 words. How do you pass that data to a neural network, which will always have a fixed number of neurons in the input layer? Another problem is data sparsity, and so on. But let us first go back to statistical models and understand how they work in the case of NLP. Let's say you have a tweet like "Flipkart is better than Amazon". You first convert it into a bag-of-words representation, which will have unigrams and bigrams of these words. Then you apply something called a one-hot encoding to encode this data. What the one-hot encoding does is, suppose "Flipkart" is present in the sentence, then the index for Flipkart becomes one and all the rest remain zeros. I have an example vector over there. This is the data that you actually pass; this is your feature vector, and you learn a decision boundary on data like this.
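In code, the bag-of-words / one-hot representation is only a few lines. This is a toy sketch with an eight-word vocabulary (a real vocabulary runs to tens of thousands of words, so the vector is almost entirely zeros); notice that two sentences with opposite meanings end up with exactly the same vector, which is the word-ordering problem discussed next:

```python
# Toy vocabulary; in practice this is built from the training corpus.
vocabulary = ["amazon", "bad", "better", "flipkart", "good", "is", "than", "website"]

def one_hot_bag_of_words(sentence):
    tokens = set(sentence.lower().split())
    # 1 if the vocabulary word appears in the sentence, else 0.
    return [1 if word in tokens else 0 for word in vocabulary]

print(one_hot_bag_of_words("Flipkart is better than Amazon"))
# [1, 0, 1, 1, 0, 1, 1, 0]
print(one_hot_bag_of_words("Amazon is better than Flipkart"))
# [1, 0, 1, 1, 0, 1, 1, 0]   <- identical vector: word order is lost
```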
Obviously this is a very simple approach, and it has a lot of problems. First of all, word-ordering information is lost. "Flipkart is better than Amazon" and "Amazon is better than Flipkart" have completely different meanings, but they'll have the same representation if you're only using unigrams. Data sparsity is another problem: now you have so many zeros and so few ones, around 20,000 zeros and a handful of ones. How will you learn a decision boundary on a thing like that? Another thing is that words are atomic symbols, so cat and dog are the same distance apart as cat and ambulance, although cat and dog have so many features in common: both are animals, both are pets. They are semantically similar, but this representation treats each of them as just an index in a vector. So it becomes very hard to find high-level features. To solve these problems people use other techniques; they use features beyond bag of words, like SentiWordNet, Brown clusters, POS tags and so on, and these give some incremental performance gains.

So we come to this question: how do we encode the meaning of a word? A word is more than just an index in a vector, so how do we represent that? One idea is to represent a word by its synonyms in WordNet, but obviously if you treat the words adept, expert, good, practiced and skillful as exactly the same, that's not right; they are quite different, and there is no way to get a distance between them in WordNet.

The biggest breakthrough in applying neural networks to NLP was word embeddings. Here is an example vector for cat and one for dog. Because they are semantically somewhat similar, the cosine distance between the vectors for cat and dog will be small, whereas in one-hot encoding the distance is the same for every pair of words. So these word vectors somehow represent the meaning of the word; I'll substantiate that statement. How these word embeddings are learned is this: Bengio, back in 2001, was training a language model using neural networks. A language model, you can think of it as something like auto-suggest on your keyboard: you type a sentence, "the cat is a", and it predicts the next word. That is a basic language model. They were training that with a neural network, and they couldn't pass words in directly, so they created some random embeddings for the words. Those embeddings changed during backpropagation, and that was actually the first example of word embeddings. When they opened up those hidden layers and looked at the word embeddings, they discovered that they had really beautiful properties. This is an example of the closest words found in the word-embedding space: if you look at the word France, its closest neighbours are European countries; if you look at Jesus, it's closest to God and other such words, and so on. So it has learned from the data that these words are semantically similar, that they are interchangeable. They capture relationships: if you subtract the vector for man from the vector for woman, the result is roughly the same as aunt minus uncle, or queen minus king.
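You can poke at these properties yourself with the publicly released vectors. Here is a minimal sketch using gensim and the pretrained Google News word2vec file (the file name and path are an assumption; any word2vec-format file works):

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional word2vec vectors (a large download).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.similarity("cat", "dog"))        # relatively high cosine similarity
print(vectors.similarity("cat", "ambulance"))  # noticeably lower

# Vector arithmetic: king - man + woman lands close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```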
So it has captured the information of gender somehow, and it has captured many other relationships: France to Paris, that's a country-to-capital relationship; Microsoft to Windows, that's a company to its operating system. And the amazing thing is that we never trained the network to learn this; it picked it up automatically in the process of learning a better language model. What the network was doing was simply predicting the next word, but to predict the next word effectively it had to learn all these things about human language: the meanings, the relationships, gender and other such semantic attributes. Here is another example, superlatives: they have a similar sort of angle in that space. And if you compute king minus man plus woman, it will actually give you a vector close to queen, so you can even answer analogy questions.

The bottom line here is, first of all, that word embeddings are trained in a completely unsupervised way; you don't need labeled data. The data sparsity problem is essentially solved, because now you're not dealing with a huge number of zeros and ones, you have dense floating-point values. They've done some sort of semantic hashing; they represent information about the meanings of the words. Semantic hashing basically means that in that space, semantically similar words are closer together. And they're freely available for out-of-the-box usage. You do not need a GPU cluster to start using them, because Google has actually released word embeddings trained on Google News, and there are others trained on Wikipedia and so on. You can just look up the word vector for cat, for dog, and so on.

Now, word embeddings are good, but they're not directly useful, are they? We're not interested in the meaning of a word; we're interested in the meaning of a sentence. How do we represent that? This is still not a solved problem; there have been a number of approaches and research is still going on. A simple approach is word vector averaging, where you just average the word vectors of all the words (a small sketch of it follows below). There's a huge loss of information there, so it's not a great approach, but even that sometimes gives performance equal to statistical classifiers. There are better approaches like weighted word-vector averaging, but even they don't work that well. Here is an image by Socher, who has done a lot of work on how to compose word embeddings. Usage of language has some sort of recursive structure, so what you do is take the word embeddings for each word and compose them with a recursive neural network, which, by the way, is different from a recurrent neural network. I won't be talking much about this, because most of the work that I have done is on convolutional neural networks.
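Here is that word-vector-averaging baseline as a minimal sketch, reusing the same pretrained vectors as above (again, the file path is an assumption):

```python
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence):
    # Average the vectors of the in-vocabulary tokens. Word order and emphasis
    # are lost, but the result is a fixed-size vector that any ordinary
    # classifier (logistic regression, SVM, ...) can consume.
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[w] for w in words], axis=0)

print(sentence_vector("flipkart is better than amazon").shape)   # (300,)
```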
Convolutional neural networks are excellent feature extractors for images, because features are detected regardless of their position in the image. Collobert, back in 2011, had the first idea of applying these to text. Let's see how a CNN basically works; this is a GIF explaining it. The network basically scans across the image, and what this gives you is insensitivity to position: if a cat is present in the top right corner or the bottom right corner, it will still be detected, because you are convolving the same weights across the image, and these yellow weights that you see act like a simple neural network. And this is the model that I use, a CNN for text. This is the overall architecture of the network: I have laid out the word embeddings over here, and this is the convolution layer, which is doing what the GIF just showed. You can think of the laid-out word embeddings as 1D images, so it convolves over them, and above that it is like a normal deep neural network. I'll explain this in detail. You pass the word embeddings in here and the label over here, and all of these are weights which will be learned through backpropagation.

So let's say I have the sentence "cat sat on a mat". I lay down the word vectors for each of these words. Now I have to do some sort of composition, so let's say I have a window size of 3, and I convolve over the sentence with this window. The convolution operation we're using starts with simple concatenation: the vectors for "cat", "sat", "on" are concatenated into a single vector, and then we multiply it by a weight matrix. That weight matrix holds the weights of the convolution layer, and even these are learned while the network is training. This multiplication, a 3-by-9 weight matrix times the 9-by-1 concatenated vector, gives us a 3-by-1 vector. It works the same way even if you increase the window size. So you slide over the windows "cat sat on", "sat on a", "on a mat", and then you perform something called max pooling over the results. The first vector here represents the trigram "cat sat on", the second represents "sat on a" and the third represents "on a mat"; on these you perform max pooling. By the way, this is a toy example, because in real life you'll have word embeddings of 200 or 300 dimensions; this is just for simplicity's sake. What max pooling does is take, element-wise, the maximum across these vectors: here it's 0.46, here 0.81 and here 0.40, and the result can be considered the most relevant feature vector for the sentence. After this you pass it to a multi-layer perceptron, which is a normal neural network, to finally get the output, which will be neutral in this case. This is how the basic convolutional neural network works; there is a toy code sketch of this step a little further below.

Now, honestly, I and many other people lose the intuition at this max-pooling layer. You understand what each trigram vector represents, but what exactly happened over here? So we created a debug console to understand that. Let's take a sentence: "website experience was okay but you guys did a really shoddy job at delivering the products on time." This is a long sentence, but its sentiment essentially depends on the words here: "really shoddy job". If you just quickly read over it, you'll catch these words and say, well, this is a negative statement. This is exactly what the neural network is doing. The feature count here is the number of pooled features that came from each n-gram. By the way, we are using 3-, 4- and 5-grams in this model, not 1- and 2-grams. "Really shoddy job", "did really shoddy job at": these are the n-grams it took most of its features from. It even picked up "experience was okay", because that carries sentiment too, but the other n-grams overwhelmed it. We're still working on this console; each of these will eventually be colored green or red, because right now it doesn't show which polarity each n-gram contributes. So what I showed in the animation was a simplified version, but this is how it works in real life: you have window sizes of 3, 4 and 5, and I just showed a window size of 3. I'm sorry about the presentation.
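To make the convolution and max-pooling step concrete, here is a toy sketch in plain NumPy: made-up 3-dimensional embeddings, a window of 3 words, random untrained filter weights, and a tanh nonlinearity (the nonlinearity is an assumption; the talk doesn't name one). Real models use 200-300-dimensional embeddings, windows of 3, 4 and 5, many filters per window size, and weights learned by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "cat sat on a mat".split()

# Toy 3-dimensional "word embeddings" (random here; normally pretrained).
embedding = {w: rng.standard_normal(3) for w in sentence}

# Convolution weights for a window of 3 words: 3 filters x (3 words * 3 dims).
W = rng.standard_normal((3, 9))

feature_maps = []
for i in range(len(sentence) - 2):                    # slide the 3-word window
    window = np.concatenate([embedding[w] for w in sentence[i:i + 3]])  # 9-dim
    feature_maps.append(np.tanh(W @ window))          # one 3-dim vector per window

# Max-over-time pooling: element-wise max across all window positions,
# giving one fixed-size vector regardless of sentence length.
pooled = np.max(np.stack(feature_maps), axis=0)
print(pooled)   # this fixed-size vector is what the final MLP classifier sees
```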
There's something called static mode, in which we do not change the word vectors. In non-static mode, backpropagation runs on the word vectors as well, so the word vectors change their values as you train. There's a multi-channel mode in which you use both static and non-static channels, and we even adapted the model to do multi-class classification.

These are some of the results that we obtained. The first three rows are on Flipkart's internal data sets, showing how the accuracy increased compared to normal statistical models, and the other two are from external data sets that we tried. You can see we have applied it to a lot of problems, sentiment analysis and topic identification, and on a variety of sources: Twitter, emails, movie reviews. It works better than statistical classifiers, sometimes significantly better. If you look here, this is a statistical classifier which used a great many features, including lexicons; this is a paper by NRC Canada, and we outperformed it just by using a simple neural network which knew nothing about restaurant reviews. No feature engineering, nothing like that, and it still outperformed a heavily feature-engineered system. SST-2 is a standard sentiment data set available on the net. A statistical classifier gives 79.4% there; some deep neural networks give around 88%, and we got 87.5%, which is nearly that. And these are other data sets at Flipkart where we likewise saw performance increases.

Here are some examples that you might like. This was the earlier sentiment classifier, based on Naive Bayes, and this is the convolutional neural network. In the first example, the Naive Bayes classifier overfit on the words "the best", ignored "sold out" and "out of stock", and marked this as a positive example; here, what survived max pooling was "sold out" and "out of stock". The second example is the rare-word case: "shoddy" is now caught because the word embedding for shoddy is close to the vectors of other negative words, so the network learns that shoddy is a negative word even without being explicitly trained on it. The third one is overfitting from last year: we had a data set from Big Billion Day, and a lot of people tweeted negatively about Big Billion Day, so the Naive Bayes classifier actually learned "billion" as a negative word. The CNN is not prone to such problems; what survived max pooling was "good sign".

So, drawbacks and learnings. The main problem with neural networks right now is that they are computationally expensive, and there are no real ways around that; you have to use GPUs, as most companies are doing. We could train on our data sets on CPUs, but for larger data sets and for experimentation we used GPUs; that's how you scale training. There are other problems too, like multi-CPU parallelization of gradient descent not being there in Theano. All these problems make it quite hard right now; it's at a very nascent stage. There are no really good libraries for this yet, but what people use are Theano, Pylearn2 and Torch. Torch is increasingly gaining popularity, but my current work was in Theano. How do you scale prediction? Prediction is slightly easier to scale because it's not that computation-heavy, and we're actually just using a simple Theano server for that. This is a cartoon; like I said, if you did not understand everything in the presentation but you're interested in it, you can go and read about it; I did not have time to cover it fully. Another thing is that I have completely open-sourced my code. It is a generic library; you can go, put in your own data sets, and try it out.
You'll have a state-of-the-art sentiment classifier if you have a good amount of data. The link is in my slides as well, and you can reproduce all these results. It's heavily derived from other open-source work; on the GitHub page I describe exactly what improvements I've made over the other code, but it's basically an easy-to-use utility: just give it your data sets and it will work.

I talked a lot about text classification, but since my talk is on deep learning for NLP, there are a lot of problems beyond text classification as well. To solve the composition problem there are basically three approaches right now: convolutional neural networks are one, recursive neural networks are another, and the third is recurrent neural networks, or LSTM-type cells. How they work is something like this: you pass the words in one by one, in a recurrent fashion. This is actually the same neuron, and this edge loops back into it, but we've unfolded it over time. These models are also giving some very good performance on sequence-to-sequence sorts of problems, like machine translation, where you don't have a single label but have to translate the sentence into another language; and if you want to build a chatbot, or do classification, these models work well too. So this is also an exciting area to try, but again, good libraries are not there yet. So yeah, that's it. Any questions?

Hello. Yeah. Did you ever come across semantic training, where we train based on grammars and find the semantic relations between words, and did you compare that with neural network training? Okay, you're talking about the linguistic approach, where you first parse the sentence grammatically, according to POS tags. There are also libraries like SEMPRE which Stanford has come up with, Stanford NLP. Yeah. These gave incremental performance; like I said, features other than bag of words, you pass these things in as features. Actually, one of my colleagues tried it, and from 85% the accuracy increased by around 3-4%, but these neural methods gave much bigger increases. And the biggest problem is that in Indian data you won't find clean grammatical structure; people don't write that well, so those grammatical approaches don't perform well, specifically on Indian data.

So, it's a representation... I mean, it's hard to say. If you were to represent the word cat, can you think of a vector? Cat has many different properties, right? It has actually learned them while training, so the vector represents many of those things, but you cannot give each dimension a label saying this dimension means this and that dimension means that, because it has learned all of it automatically. But you can run experiments, like the king and queen example, to see what it has captured. Gender is not a single dimension, because you're subtracting whole vectors; it has represented gender somewhere in that vector space, and we can run experiments to figure these things out, but it's not obvious which part encodes what. You'll have to run experiments to get at that.

Yeah. Hello. Yeah. We even tried training our own word embeddings on a Flipkart data set, but they didn't work that well. The ones that Google has trained, I just reuse them directly. You can train on your own data set as well, but these perform better because they're trained on huge data. Yeah.
There is a recent paper by Coakley and others which has something based on RNNs; they have some very exciting results.

Hey, good morning. So the inter-human disagreement rate on sentiment analysis is 16 percent, right? So how did you achieve 96 percent accuracy? That doesn't seem very clear to me. So actually, do you have references for that 16 percent inter-human figure? That would be specific to a data set; you cannot say it in general. On some particular data set there might be 16 percent disagreement. Even here we observed things like that, but we basically removed all the ambiguous examples from training. In testing also we had redundancy, so we took what most annotators thought in order to properly label the data. Even the examples in the remaining 4 percent are actually ambiguous, where some people might think of them as positive and some as negative. And the 96 percent is two-class, positive/negative; in the three-class case, positive/negative/neutral, we have around 87 percent, not 96.

In general you need a large data set for this to work, right? So what is the guideline on that, how much data is required for you to use deep learning? Actually, you do not. I initially thought you'd require a lot, but that's not the case, because the word embeddings have already been trained on such huge data. What we had in our sentiment data set was just 1,500 examples of positive, negative and neutral; it wasn't a huge data set. The word embeddings Google provides are generic. I also trained embeddings on Flipkart data, but maybe because that corpus wasn't so huge, those embeddings did not perform as well on Flipkart data as the Google word2vec ones did, because those capture generic patterns across everything. Training word embeddings requires huge data, but using word embeddings you can work with small data. Yeah, there are other implementations of word2vec also.

Hey Shankar, one second. This is Vignesh here. So my question is with respect to the corpus. For example, if you consider Sachin Tendulkar, he is regarded as the master blaster as well as the god. When we parse each token, or rather annotate a sentence, what corpus did you use such that "Sachin Tendulkar" and "the master blaster" would come out as the same thing? How do you cater to such words, since they are proper nouns? So that is again a very linguistic sort of approach, but word embeddings work on the basis of co-occurrence. When you say Sachin Tendulkar is a cricketer as well as a god, those phrases will be used in similar contexts: sometimes people will write "Sachin Tendulkar is god", sometimes "Sachin Tendulkar is the master blaster". So while training word embeddings, since they are based on co-occurrence of words, they will catch that "master blaster" and "god", in the context of Sachin, are similar words. That's how word embeddings capture that: the cosine similarity between those two words would be high, they would be placed close together. Okay. Thank you.

Okay. So is this target-dependent sentiment analysis, and if it is target... Sorry, I couldn't catch that, can you repeat it? Yeah, the sentiment results that you presented, the first and second row, are they target-dependent? Yeah, the first one is on the Flipkart Twitter data set, so it will catch words like delivery delay and so on. And we trained on SST-2, which is a movie reviews data set.
So that one is more like a generic sentiment classifier. Okay, and how do you figure out that the target of the sentiment is indeed Flipkart? There could be multiple entities in the same tweet or the same sentence, right? So how do you know that the target is actually not something else? That's a good question. We are thinking of solving that by... yeah, actually that will require a special formulation of the problem. Right now the network catches a general sentiment, so even if the positive sentiment is about Amazon, it will mark the tweet as positive, even though from Flipkart's context it might be negative. But that's an interesting research area as well, how to solve that. Yeah.

How do you get embeddings for words which are not in the corpus? Which are not in the training corpus, or not in the... I mean, let's put everything together. So if a word is not there in the word embeddings, we'll randomly initialize a vector for it. Even though a pretrained word vector is not there, there will be a random vector, and all the weights corresponding to that word will still be learned. There might be many words whose word vectors are not there, but it will still learn. In fact, you can even start without pretrained word vectors at all: give a random vector to everything and it will still give around 85% accuracy. Word embeddings just increase the performance further. Yeah.

I have two questions. In large sentences where there are multiple entities, for example, I don't know, "Flipkart has a very poor returns process but Amazon has a very poor catalog depth", how good has your experience been? And the second question is, what about sarcasm? I read a tweet, no political affiliations here, which said PM Modi will make a great business development manager for the Adanis. Would your experiments capture the sentiment of something like that? Yeah. For the first question, like I said, we haven't solved that problem yet. We can think of approaches; there are again linguistic approaches where you divide the sentence into two parts and do separate sentiment analysis on each. That is one way to solve those things. And sarcasm, these networks still won't catch it unless you explicitly give them a lot of sarcasm examples. Sarcasm also kind of has patterns: the statement will be highly positive, and there are sometimes telltale things, like IRCTC generally appearing in a negative context but here being coupled with something positive. That's how sarcasm might be caught. Okay. Yeah.

I had two questions. Question number one: when you decide the sentiments, do you also scale them on a rank between one and five, one being the least and five being the best, or is it only positive or negative? It's only positive or negative; right now there are three classes, positive, negative and neutral. The accuracy will decrease on fine-grained sentiment. Humans, like they said, also have ambiguity between positive and negative, and when you increase the number of classes that ambiguity grows further: someone might say this is very positive, someone might say just positive. So you'll see accuracy decreasing there. Okay. Question number two: when you extract features from a given document, from a given comment, how do you take care of phrases, two words which actually mean one thing? Phrases right now should be handled by the shingling that the CNN is doing; we don't do anything special beyond that.
Actually, for some phrases, during the preprocessing stage we sometimes join them: an example of a phrase would be "well done", which we would turn into well_done, but that is not being done in this network. Okay. Thank you.

That will be the last question. Hello. Yeah. So I have two questions. First, how big was the training corpus and how was the accuracy calculated? And second, how big was the cluster on which you trained, and how much time did it take to train on that cluster? Okay. We have performed experiments on a lot of data sets; I'm assuming you're asking about the Flipkart sentiment data set. That was 1,500 training and 500 test examples; that is all we could label with all the redundancy. It was small, and for three-class classification. On this small corpus of around 2,000 tweets we trained it in about two hours for a single configuration, and typically you try a bunch of configurations, so it takes even longer to reach the optimum configuration. And there was one more question? Okay, yeah: I used a personal GPU laptop for that, and with the GPU we got a 30x performance increase.