Which was supposed to start at 1:40, but since it was clashing with too many talks, we thought it would be better to move it to 3:25. So the Birds of a Feather session for Product and AI will happen on the first floor at 3:25 p.m. Thank you. The general notices apply as well: please don't bring refreshments or cups into the hall, make sure your Wi-Fi hotspots are off, and please put your phones on silent. That's all the announcements for now. Over to Anuj Gupta.

So, I am Anuj, and today I'll be talking about sarcasm detection in the context of sentiment. It's an important subset of sentiment analysis, and the question is how one can build a system that understands sarcasm. The agenda for the day: I'll introduce the problem and the motivation, why it makes sense to build such a system in an organization, and why it is a difficult problem. Then we will look at certain approaches, results, and the drawbacks of the current literature.

So what is sarcasm? It is essentially a way of communicating something caustic, a veiled way to communicate something with a negative connotation. If you look at this quote of Warren Buffett talking about a livable wage, it is sarcasm, because Warren Buffett is almost a symbol of capitalism, and him talking about a livable wage is an example of it. To add another example: I have been given a slot right after lunch for a technical talk, and this is what every speaker wants. The word sarcasm comes from the Greek "sarkazein", which basically means to tear into something. It's a nuanced form of language in which there is a deliberate attempt by the speaker to state the opposite of what is actually implied. So "this looks like great fun" is a sarcastic way of communicating that I'm not liking what is happening right now.
From a business point of view, why is sarcasm an interesting problem? Organizations continuously tap into social media to understand customer opinions around products and services and to give real-time customer support, and there are a lot of CRMs available in the market to support organizations in this endeavor. Take any CRM that is available, and a key offering you will come across is sentiment analysis. Now, most sentiment analysis systems are extremely bad at understanding sarcasm. Here I have a sentence, "I love the pain of breakup", and what I'm showing you are the demos from Stanford and AllenNLP, which are almost two of the best available out there. Both of them believe this is a positive statement; both are getting confused by the word "love". The moment they see "love", they believe it's a positive way of saying things. Because most sentiment analysis systems go down when it comes to dealing with sarcasm, and we are talking about Twitter as the domain, this often leads to scenarios where teams are not able to understand and respond in an appropriate manner, which often leads to PR disasters, and that is something no organization would love.

From a research perspective, if you look at the various problems in NLP, some problems are easy and some are hard. It is widely believed that things like question answering, summarization, and machine translation belong to the category of hard problems, and sarcasm detection is one more addition to this list. Any progress that we as a community make in building something that can deal with it would actually be progress for NLP in general. It's only recently that people have solved sentiment to a reasonably good degree, so now people have started to look at more nuanced aspects of sentiment, like aspect-based sentiment analysis and sarcasm
detection.

So, be it from a business side or from a research side, it is a problem worth spending time and energy on, and this is what we'll be looking at.

What are some of the key characteristics of sarcasm? It's a sharp, bitter, cutting expression; it is caustic in nature, and by definition the sentiment of sarcasm is always negative. You would never hear anyone say "I gave a compliment sarcastically"; it never has a positive connotation. What makes it difficult? It is a deliberate play on the part of the person: the person plays with the language and its nuances to communicate something sarcastically. So it is subtle in nature. It is not that you will have very clear indicators in the text saying, hey, this is sarcasm; it will be just a play of language, where maybe a word is here or there, or a punctuation mark, or a phrase. Even for humans, understanding sarcasm is not easy. As a matter of fact, there are studies comparing human cognition with sarcasm understanding which suggest that people who are good at understanding and creating sarcasm are cognitively better than others. When it comes to Twitter, there are additional challenges. As we all know, it is a short form of language, at most 280 characters, with spelling mistakes, slang, acronyms, and an ever-evolving vocabulary. So sarcasm detection, with Twitter as the source, only gets harder from there.

So what is the business problem statement that you typically look at? Build a sentiment analysis system which is capable of handling sarcasm. In this diagram, this is a typical way you handle it: you have a module up front which determines whether the inbound text is sarcasm or not. If it is, we know by definition it's negative sentiment; if it is not, then we go ahead and do sentiment analysis on it.
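The sarcasm-first routing just described can be sketched as follows. This is a minimal illustration, not the talk's actual system: `is_sarcastic` and `sentiment_of` are hypothetical placeholder functions standing in for the trained models.

```python
# Two-stage pipeline sketch: a sarcasm detector runs first, and only
# non-sarcastic text is passed on to the sentiment analyzer.
# Both models below are toy placeholders for illustration only.

def is_sarcastic(text: str) -> bool:
    # Placeholder: a real system would invoke the trained sarcasm model here.
    return "#sarcasm" in text.lower()

def sentiment_of(text: str) -> str:
    # Placeholder: a real system would invoke the trained sentiment model here.
    return "positive" if "love" in text.lower() else "neutral"

def analyze(text: str) -> str:
    # By definition, sarcasm implies negative sentiment, so we can
    # short-circuit before the sentiment model ever sees the text.
    if is_sarcastic(text):
        return "negative"
    return sentiment_of(text)
```

The point of the structure is the short-circuit: sarcastic text never reaches the sentiment model, since its label is already known to be negative.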
For the rest of this talk, I will focus only on the part: given an inbound text, how do you detect whether it is sarcasm or not?

Scope is an important aspect when dealing with this. If you look at this sentence, it is very difficult to decipher that it is sarcasm. You need to understand that "she" refers to Hillary Clinton. You need to understand that there are facts hidden in this statement: that she is running as a presidential candidate, and if she wins, she'll be able to enter the Oval Office. There is a history to it, and that history does not bring positive memories for her. This requires a lot of complex work, and it is beyond the scope of this talk to address sentences of this complexity. So we will deal only with those sentences where all the information needed to detect whether it is sarcasm or not lies within the sentence, and it is for this reason we chose Twitter as the source of data.

How did we go about building the data set? It is primarily based on the source. We looked at certain hashtags that are used extensively, and we looked at certain handles that are known to tweet out a lot of text that is sarcastic in nature. So collecting the positive class, what belongs to sarcasm, was easy. What is not sarcasm proved to be much harder, because everything else is not sarcastic, and that domain of everything else is very, very broad. So this proved to be a much more difficult task. What we did was also take certain public data sets and filter out data from there to add to the class of what is not sarcasm, over and above whatever data we collected. Post cleaning, we got close to 100k data points, 50k per class, and we built a test set of size close to 20k, but from a different timeline. We actually wanted to understand how the vocabulary and the underlying distribution play out on Twitter.
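The hashtag-based collection of the positive class can be illustrated with a small labeling sketch. This is an assumption-laden toy: the specific hashtag list and the rule of stripping the label-bearing hashtag before training are my own illustrative choices, not necessarily what was done in this project.

```python
# Distant-supervision labeling sketch: tweets carrying sarcasm-indicative
# hashtags form the positive class (with those hashtags stripped, so the
# model cannot simply memorize the label), everything else is a candidate
# negative. The hashtag set below is a hypothetical example.

SARCASM_TAGS = {"#sarcasm", "#sarcastic", "#not"}

def label_tweet(text: str):
    tokens = text.split()
    tags = {t.lower() for t in tokens if t.startswith("#")}
    if tags & SARCASM_TAGS:
        # Remove the label-leaking hashtags before the text is used for training.
        cleaned = " ".join(t for t in tokens if t.lower() not in SARCASM_TAGS)
        return cleaned, 1
    return text, 0
```

As the talk notes, the hard part is not this positive side but curating negatives that are broad and genuinely non-sarcastic.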
How does that affect the whole thing? So the training and the test data were collected from different timelines.

What is available in the literature on this? This problem has been discussed; as a matter of fact, Professor Pushpak Bhattacharyya spent a good amount of time in the morning talking about sarcasm detection. A lot of the work in the literature talks about hand-coded features: things like unigrams, bigrams, emoticons, capitalization, and a positive sentiment in proximity of a negative situation. It talks about features based on frequency: people have argued that a sarcastic sentence will have certain words that occur very often and some words that occur very rarely, so people use things like the count of the gap between very frequent and less frequent words. Another important aspect is what is called incongruity. Incongruity means something which is in stark contrast with its background; a person with a plump face and a very thin body is an example of incongruity. So people look at features like the number of words carrying one sentiment immediately followed by another word carrying the completely opposite sentiment. While these are great features, and they work beautifully, from a product point of view you want to build a more generic system, something that can sustain a wider range of inbound traffic.

So, as I said, most of the features I have spoken about tend to be specific to the data set you are working with. We wanted to avoid this, and what we started to look at was whether deep learning could help in some way, the reason being that you do away with feature engineering. We all understand that you give data from a certain input space and your model is able to transform the feature space such that the separation is much easier.
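Before moving on, the hand-coded incongruity feature from the literature mentioned above can be sketched concretely. The tiny word lists here are stand-ins for a real sentiment lexicon; this is an illustration of the idea, not the exact feature definition from any particular paper.

```python
# Incongruity feature sketch: count sign changes across the sequence of
# sentiment-bearing words, i.e. a positive word followed (among sentiment
# words) by a negative one, or vice versa. The lexicon is a toy stand-in.

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"pain", "terrible", "breakup"}

def polarity(word: str) -> int:
    if word in POSITIVE:
        return 1
    if word in NEGATIVE:
        return -1
    return 0

def incongruity_count(text: str) -> int:
    flips = 0
    prev = 0  # polarity of the last sentiment-bearing word seen
    for word in text.lower().split():
        p = polarity(word)
        if p != 0:
            if prev != 0 and p != prev:
                flips += 1
            prev = p
    return flips
```

On "I love the pain of breakup" this counts one flip, from the positive "love" to the negative "pain", which is exactly the love/pain contradiction the talk points to as a hallmark of sarcasm.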
So for this, we first set up a baseline. Dealing with language, the first thing we looked at was language models like LSTMs, but we never had enough data to train them well. At the end of the day we only had 100k data points, on which the RNNs were overfitting very quickly, and all we could get was close to 68 percent. Instead, because we wanted to reduce the number of parameters, we went ahead and looked at CNNs for sentence classification, and we actually got a far better number from there.

Once we had the baseline in place, the idea was: how do we push from there? If you look at the literature, people have strongly talked about three clues: one is sentiment, one is emotion, and one is personality. So what do I mean by sentiment? Most sarcastic sentences will show a shift in sentiment. Take "I love the pain present in breakups": it starts with a very strong positive connotation, "I love something", but it very quickly goes to the negative side, and this contradiction between "love" and "pain of breakup" is a hallmark of sarcasm. So the natural thing that comes to mind is that it makes sense to pick up sentiment clues and use them as input to the sarcasm model. People have already done this kind of approach, but they typically use things like a sentiment lexicon, which is essentially nothing but a dictionary of words and phrases where for each entry you have the sentiment, and then you derive certain numbers that go in as features. We instead went about it slightly differently: we used a deep network, a CNN, to extract sentiment features, and I'll talk about how we did that.

The second part is emotion. Emotion is all about feelings: happiness, anger, jealousy, grief, and one can have many emotions simultaneously. Sentiment, on the other hand, is a more objective outcome of emotions.
It is either negative, neutral, or positive. Sarcastic sentences are rich in emotions. "My stellar programming life: job offer, Ctrl-C, Ctrl-V, resignation, repeat." We can all relate to it in some sense, but it carries a lot of emotions: there is pain, there is sadness, there is anger, there is disgust. The idea was: can we use these clues to help in understanding what is sarcasm? So, similar to sentiment, we trained another network which tries to understand only the emotion part and gives input from there.

The third piece was personality. The central idea is that people have done studies showing that certain people are far better at understanding and expressing sarcasm than others, so who you are will greatly influence the likelihood of you being sarcastic. Now, how do you model this behavioral aspect? People have shown that if you look at a person's communication history, you can actually pick up this kind of signal, and the signal helps very strongly in understanding sarcasm. But how do you do that for Twitter? At runtime you have a sentence and you have the user from whom this tweet has come. Either you pull in the history of the user at that point in time and then pick out the behavioral clues, or you have a database up front holding the entire history of the user and pick the clues from there. Either of them will have huge implications on throughput and storage, and most engineering teams would hate to do anything like this, because it would mean a lot of overhead both at runtime and at build time. We knew up front that this was going to be an engineering challenge, so we took a conscious call of not going this way. We dealt only with sentiment and emotion.

So what is the larger solution that we had? You have a text that comes in, and you have a model for sentiment.
This is what I have shown in the green box. You have a model for only the emotion, which is the blue box, and we have our baseline model, which is what we trained on the sarcasm data. Now, the sentiment model was trained on a sentiment data set, not the sarcasm data set. So we built another data set, primarily from public domains, which was all about sentiment: negative, positive, and neutral. We had another model trained on a data set for emotion, covering six emotions, and we have the baseline model. What we do is take the features from each of these, all but the last layer: whatever is the output of that, we take as a feature, and we combine the three. We did experiments both with and without the baseline, which is why it is in a dotted box. That then becomes the feature vector we work with, and this goes into another classifier.

Coming to the details now. The larger picture is clear: we have a sentiment model, an emotion model, and a baseline. The sentiment model is what I'm going to describe in detail. We built a CNN for this; it's a pretty simple two-layer architecture. We cleaned and organized the data, and the words were mapped to embeddings. I hope everybody is aware of what an embedding is: an embedding is basically a high-dimensional vector which pretty much represents the meaning of the word. These word vectors were used to initialize the embedding layer. We took three classes, negative, positive, and neutral; we had a public data set, and we added some custom data to match our needs. The CNN that was trained was all about 1D convolutions. How many of you know about CNNs for text classification? Around fifty-fifty, so maybe I'll just spend a minute on that.
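The fusion step just described, taking all but the last layer of each model and concatenating, can be sketched as follows before we get into the convolution details. The feature extractors here are stubs returning fixed-size vectors; the real system would use the penultimate-layer activations of the trained CNNs, and the 100-dimensional sizes are illustrative assumptions.

```python
import numpy as np

# Feature-fusion sketch: each pretrained network contributes its
# penultimate-layer activations; sentiment, emotion, and (optionally)
# baseline features are concatenated and fed to a final classifier
# (CNN, logistic regression, or SVM in the talk). Stubs stand in for
# the trained CNN feature extractors.

rng = np.random.default_rng(0)

def sentiment_features(text):   # stand-in for the sentiment CNN, all but last layer
    return rng.standard_normal(100)

def emotion_features(text):     # stand-in for the emotion CNN
    return rng.standard_normal(100)

def baseline_features(text):    # stand-in for the CNN trained on sarcasm data
    return rng.standard_normal(100)

def fused_vector(text, use_baseline=True):
    parts = [sentiment_features(text), emotion_features(text)]
    if use_baseline:  # the dotted box: experiments ran with and without it
        parts.append(baseline_features(text))
    return np.concatenate(parts)  # goes into the downstream classifier
```

The `use_baseline` flag mirrors the dotted box in the diagram: the same fusion was evaluated with and without the baseline model's features.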
The idea is pretty simple. CNNs on images do 2D convolutions, where you move across both the width and the height. In this case you move only along the height of the input. What is the input? You have a sentence; you tokenize it and map each token to its embedding. Here I have "I like this movie very much!", and I have taken a five-dimensional embedding space, so I have seven tokens by five dimensions, a 7x5 input. Any kernel that operates on it will always have width equal to the dimensionality of the embedding. So here I have two filters of size 4x5, two of size 3x5, and two of size 2x5. You do the same convolution operation as in images, you take the resulting vector, and you apply max pooling, so from each filter we keep the maximum value of its vector. We concatenate all the maxima, and that becomes our feature vector. This is the central idea of CNNs for text, and then you use that for your classification, one versus zero. So all we did was take sentiment data, build a CNN in this fashion, and train it to predict negative, positive, or neutral on our data.

Likewise, we built the emotion model; we use all but the last layer of it to extract the features. It's exactly the same flow. For this work we used GloVe embeddings for Twitter. GloVe has multiple embeddings available; there is one trained on close to two billion tweets, and we used it directly to initialize the embedding layer. Then it's all about predicting the six classes: anger, disgust, surprise, sadness, joy, and fear.
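The shapes in that walkthrough can be checked with a from-scratch sketch. This is a toy with random values, not a trained model: the point is that every kernel spans the full embedding width, so it slides only along the token axis, and max pooling over positions turns each filter into a single feature.

```python
import numpy as np

# 1D convolution over text, as described above: a sentence of 7 tokens
# with 5-dimensional embeddings, and kernels whose width always equals
# the embedding dimension. Values are random; only the shapes matter.

rng = np.random.default_rng(42)
sentence = rng.standard_normal((7, 5))      # 7 tokens x 5-dim embeddings

def conv1d_max(sent, kernel):
    h = kernel.shape[0]                     # kernel height in tokens
    # One activation per window of h consecutive tokens.
    acts = [np.sum(sent[i:i + h] * kernel) for i in range(len(sent) - h + 1)]
    return max(acts)                        # max pooling over positions

# Two filters each of heights 4, 3, and 2 -> a 6-dimensional feature vector.
features = [conv1d_max(sentence, rng.standard_normal((h, 5)))
            for h in (4, 4, 3, 3, 2, 2)]
```

This mirrors the CNN-for-sentence-classification setup the talk builds on, minus the embedding lookup and training; the concatenated `features` vector is what would feed the final dense layer.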
There are a bunch of public data sets available for this. We augmented these data sets with some of our own data, and then, again, very similar to how we built the sentiment model, the network was pretty simple. It's not the kind of CNN you see in computer vision; we never had data at that scale. So it's just two layers of convolution and max pooling, then fully connected, and then the softmax. These are the various kernel sizes we used, typically three, four, five, and six. Then the output sizes differ: the baseline model trained on the sarcasm data had two outputs, the sentiment model had three, and the emotion model had six, and the softmax is sized accordingly. Throughout this work we never fine-tuned the embedding layer; the embedding layer was frozen. This was our first attempt at the problem, so we tried to keep things as simple as possible and see where it took us.

Again, just reiterating the key pieces: the green box is a CNN model for sentiment, the blue box is a CNN model for emotion, there is the baseline, and we take the features from these. The feature vector was fed to a bunch of classifiers: in one we tried end-to-end, where it was a CNN; the second was logistic regression; and the third was an SVM. The baseline gave an okay result, but clearly, when you augment it with sentiment and emotion features, we saw a substantial increase. The best we got was close to 88 percent, which was fairly decent.

As for future work, you can go ahead and try a lot of ideas. Train your own embeddings: because it's Twitter, you are likely to do better, but you have to deal with the problem of out-of-vocabulary words very often, so it's better to go towards character n-gram embeddings. Try to collect more data, which actually proved to be one of the more challenging problems in this project. And then try to get back to RNNs and see how well they
work. In principle, we never had enough data to say with any certainty whether RNNs will work here or not. Collecting negative data is still a very tricky part of it, because it's a huge domain and you would like to sample your data with fairly uniform consistency across it, but that is a difficult thing. What we saw was that adding public data sets of sentiment, and there are a lot of data sets of tweets labeled as negative, positive, and neutral, increased the performance dramatically. Because it's not just about sentiment clues, right? It is also about how those relate to sarcasm. So we needed a lot of data which is strong in sentiment but is not sarcasm; otherwise the model was doing pretty badly. How do you bring in user behavior? I think that's an experiment we would only do sometime much later in the future, because, as I said, it has engineering challenges with serious implications from a product point of view, so we never went that way.

What are the key learnings from this? One, it's an important problem from a product point of view. As we mature from a data science perspective, having just a sentiment module is not going to be enough, especially if you're dealing with public domains like Twitter and Facebook; at some point you will have to go and address this. On a lot of the data that comes in, you are likely to do very badly, because there are long-range dependencies, there are subtle variations, there are references to historical context. As of today, we do not have techniques which can address those in even a reasonable fashion, forget about being great. Most sentiment analysis systems are extremely bad at understanding sarcasm.
It's a very simple experiment: go build a data set on sarcasm, then try running it through a couple of sentiment analysis tools, and you'll see what I'm talking about. What we worked with was deep learning, so we tried to automate feature engineering as much as possible, but there is still domain understanding in it: that any sarcasm will have a sentiment and an emotion component to it, and that actually helped us a lot. The most important part was that both sets of features we were taking came from pretrained convolutional networks. Computer vision is full of such examples, but there are very limited examples of how pretrained CNNs can be used in language understanding; here we found that it helped us big time. And how you collect your data is a very important aspect of this particular problem, because for what is not sarcasm you need a very comprehensive set. With that I'll end my talk, happy to take any questions.

Hi, this is Deepak. You spoke about convolutional neural networks for text. I just wanted to know: you said you only have a window which is vertical in nature, like a filter. Why only vertical, and why not a filter like in images?

Okay, so in images the data itself is organized in space. If you look at an embedding vector, say a 300-dimensional one, you can't say which part of the vector is more critical; the vector as a whole represents the word and its meaning. And if you look at language, the central idea is that the units of language, words or characters, compose to give meaning. So it is better to look at vectors which are adjacent to each other in their full view. That is why you have what are called 1D convolutions for text.

Okay, so I just wanted to understand, are all the kernels... You'll have to be a little louder. Sorry, can you hear me now?
Yeah, yeah. So I just wanted to understand: are all the kernels 2D, or, after the first convolution layer, what do you have? So, everything is a 1D convolution here. As I said, the idea essentially is that the meaning comes from words that are in proximity of each other, which is why everything is 1D.

Hi. Just coming back to the comments you made on negative data points: I was wondering if you could elaborate a little more on how you went about collecting that data, because, as I understand it, it's important to have tweets with negative sentiment but not sarcasm. So one strategy, as I said, was that we took public data sets for which we already had sentiment, and that became directly a part of what is not sarcasm; we actually went through those sets to check that none of them were sarcasm, and we added that. The final data set came through a lot of iterations, whereby we would train something, evaluate how it was doing, look at where it was failing, derive from there what kinds of things it was getting confused with, and then try to pull in more data of that type and add it. But if you think carefully, the ideal way of solving this is not a binary classification of sarcasm versus not sarcasm; it should actually be one-class classification: I know what sarcasm is, and the model is only about that, so when something comes in it only tells you whether it belongs to the class sarcasm or not. But one-class classification techniques, as of today, never actually took us anywhere. I think we have one more.

So I have one question: did you try POS tagging? Like, did you append the input with POS tags? I'm sorry, I'm not able to hear you. I'm saying, did you try appending the input with part-of-speech tags? That could have helped. So, did we try part-of-speech tagging?
We did try initially, but we did not experiment with it thoroughly enough to see its effects. POS tagging for short text like tweets, with all kinds of language thrown in, was itself not working well, so using a signal from there... we did not see gains from it.

Another thing: you mentioned that in typical sarcasm detection the sentence has a change of nature in sentiment, like from positive to negative, right? So, just an intuition, something like a bidirectional RNN: in the seminal sequence-to-sequence translation paper, they tried inputting the sentence in reverse. What if you tried inputting the sentence in reverse here and provided it as a second input? But for those kinds of architectures we would need much more data than what we had. A lot of our time went into building this data set itself, which is why we did not go that way. When we experimented with RNNs, they tended to overfit very quickly, and we understood that we would have to spend a lot of time on data collection rather than just being able to build a solution, at least to begin with. But yeah, I completely agree that a bidirectional LSTM should, at least intuitively, do a good job at this. Thank you.

Any more questions? I think I'll have to cut it there. Thank you so much. Thank you so much, Anuj. So, next on stage