Thank you. Hello, and welcome to my talk, an introduction to sentiment analysis. Thank you for already saving me four or five slides, because that was the perfect definition of sentiment analysis, so we can skip a little bit. That's nice. I should talk into the microphone and stay inside the circle. Thank you very much.

Yeah, sentiment analysis. Who was here yesterday when a linguist talked about sentiment analysis? I think these two talks make a great combination, because I'm going to talk about the code, so this is the developer's view on sentiment analysis.

So what are the topics I'm talking about? A little background: what kind of data we were analyzing and what I had done before I jumped into this topic. A little bit about what sentiment analysis is, which we can cut short now. Then we look into extracting parts from a text, like sentences and separate words. Then we look at how we can assign an opinion to these words using a lexicon-based approach, and that already gives some results when you want to find out how somebody feels about something. But there's a little bit more to it when you have to combine certain words, and we are going to look at that as well. And if there's still time, we'll elaborate on a few special topics, like how to deal with slang terms, but we can easily skip that. It's very simple, it's only one slide per topic and you can easily look it up, but I found it useful to have it there because it might save you time if you have to do a sentiment analysis on your own.

All right, so let's jump into it. Some background about me: I'm a software developer and do basically everything that's related to it. So I also do a little bit of requirements engineering, software architecture, testing and, if it has to be, a little bit of project work. I have worked in various areas, in hospitals and in banking, and currently I'm working in e-commerce. I have a master's degree in information processing science.
I maintain a few open source projects, so if you ever feel the need to read EBCDIC data from the mainframe, I did a package for that. Yeah, so basically nobody wants that, I guess. I'm also a co-organizer of a local Python user group in Graz, the city where I come from, and there's a home page where you can find out more.

So, banking, hospitals, e-commerce: why am I talking about sentiment analysis? Well, I needed a topic for my master's thesis, and a friend of mine had just founded a startup and he said: "Hey, we have this kind of data, feedback for innkeepers from their guests, and it's text and it's unstructured. We want to analyze it. Hey, you could do a master's thesis about it." So I had a quick glance. Yeah, there are people who write about it. There are books. There's even Python, so we can have regular expressions if there's nothing else, and there seem to be a couple of libraries like spaCy and NLTK. Yeah, good enough, let's do it. So that's how you jump into sentiment analysis. You don't need more.

The company is basically a startup. As I said already, you have guests that go into an inn and they can give feedback. But different from public platforms like TripAdvisor, the feedback only goes to the innkeeper. So they have a mobile website where the URL is posted in the inn, and if the guests type in feedback, the innkeeper can see it in an application. It uses Django in the back end and the front end is in Angular, and if you want to find out more, there's a home page about it.

The interesting part of the feedback is unstructured text that answers questions like: How did you enjoy the visit? What did you like about your last visit? What can we improve? What else do you want to tell us? So this is very broad and you don't really know what they are going to write about. Here's a screenshot, that's how it looks when you enter such a feedback. It's an Austrian company.
So this is German, and well, the food is not really German: [inaudible] and Knödel, those are Austrian terms. Anyway, technically it's language independent. If the innkeeper gets lots of feedback, he wants to find out: What are really the pain points in my business? What are people complaining about most? And also, what are people most happy about? These are my USPs, I have to preserve them. And you don't want to wade through all the feedback, so he just wants to get a quick glance and then maybe manually drill down into specific feedback that complains about certain issues.

Sentiment analysis also has a couple of related applications: if you have service ticket systems, if you want to preprocess customer email and automatically assign it to the appropriate place, if you want to extract information from product reviews or, as we heard yesterday, from social media.

So what is sentiment analysis? Yeah, we already heard it. It collects opinions from text written in natural languages and stores them in a structured way. That's basically it. Sometimes it's also called sentiment detection, sometimes it's called opinion mining. There are slight differences, but generally that's the same thing.

You can do it on three levels, essentially. There's the document level: you have a big customer review and you reduce it to three stars. Then you know, okay, the customer is not very happy with it. So there is this product and he's not very happy with it. That's the sentiment, but it's not really useful, because you want to know what he is unhappy about. For example, when a phone does not have enough power and quickly runs out of battery, then you have to improve something about that, and that's different from a phone that's just too slow. You have to take other steps for that.
So you want to know where you have to improve. To get that, you can start on the sentence level: you have a document with many sentences, then you split it into sentences and extract an opinion for each sentence. For example: "The schnitzel is too small for a hungry student." What's a schnitzel? That's a schnitzel. So if you ever come to Austria, you should taste something like that. It's not very healthy, but very tasty if it's done well.

And you can go even further and go down to the aspect level. So you say, okay, when I get a feedback like "The schnitzel tastes very good, but it's too small", then he's talking about the schnitzel, but there is the aspect of taste, which gets a rating of good, and then there is the aspect of size, which gets a rating of small, or too small, or bad. So the taste is good, the size is bad, and it's the same schnitzel.

And what's an opinion? There is a deluxe definition. If you're a developer, you can already see the fields in the table here. "The schnitzel is too small for a hungry student." What does it tell us? There's a target entity, the schnitzel. There's an aspect of it, the size, because it's too small. The sentiment about it is bad. The opinion holder was a guest called Hans Meyer, and he gave the sentiment at the end of April. There's a reason for the sentiment: why is the schnitzel bad? Because it's too small. And there's also a qualifier: it's not generally too small, it's just too small for a hungry student. It might be big enough for an office worker who hardly burns any calories the whole day. So it's just for students.

That's from a book by a guy called Liu, and he also has another definition, because the first one is really unwieldy, you rarely need that much information, and it's also very hard to extract. Basically, an opinion has a topic. Okay, we talk about food. It doesn't matter if it's schnitzel or spaghetti, it's just the food. The sentiment is bad.
There's an opinion holder and the date and time, and this is enough to figure out: Where are the pain points? Where are my USPs? What makes my business special, and where do I need to improve?

So, the basic workflow: how do you do a sentiment analysis? If you want to know more about this, check the presentation from yesterday. You collect the data. You preprocess the data, so you remove data you can't process or clean up data that is in a state that's difficult to analyze. Then you analyze it, and then you interpret the results and act on them.

How does this look in our example? Well, we already have the data. It's in a database, so there's not much to do. Preprocessing, yes, there was a little bit, but that comes only towards the end of the presentation and it's nothing exciting. The main focus is the analysis, this is the main part of this presentation. And when you want to interpret the results, the expert in our case is the innkeeper, so we just give him the data to make the correct decisions.

All right, enough of the pleasantries. I'm a developer and I said we're going to see code. Drum roll.

So we have sentences and tokens. If you want to split the document into sentences and tokens, yeah, that's easy in Python: you just look for a dot, split on it with the string function, and you have the sentences. You have one large string and then you split it into three sentences. And if you want to have words, then you split on the blanks. That's easy, right? That's how everyone would do it.

Well, not quite. Because what about a sentence like this: The waiter was very rude, e.g. when I accidentally opened the wrong door, he screamed "Private!". So there's direct speech and there are abbreviations in there, and if you use this very simple split function, then you end up with a mess of letters, and it's not really a sentence anymore. So, spaCy to the rescue: what does spaCy offer?
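To make the problem with the naive approach concrete, here is a minimal sketch using only Python's string functions. It is not the code from the slides; the helper name `naive_sentences` and the second example sentence are assumptions made for illustration.

```python
# The awkward sentence from the talk, followed by a second, harmless one.
text = (
    'The waiter was very rude, e.g. when I accidentally opened '
    'the wrong door, he screamed "Private!". The food was fine.'
)


def naive_sentences(text):
    """Split on every dot, the way the 'easy' approach suggests."""
    return [part.strip() for part in text.split(".") if part.strip()]


fragments = naive_sentences(text)
for fragment in fragments:
    # The abbreviation "e.g." is torn apart into 'e' and 'g'.
    print(repr(fragment))
```

The two sentences come out as four fragments, because every dot inside the abbreviation is treated as the end of a sentence.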
Well, if you want to use spaCy, you first have to load a language, in this case English. And then it's pretty easy to take a text and split it into sentences. Basically, with these three lines of code you make a document out of your text, which is a spaCy-internal structure, and then you just iterate over the sentences, and that's it. As you can see, these are very nice sentences now, and it even preserved the dot. If you want to do the same with words, then you can iterate over a sentence and you get the words from it.

Here's the example of the nasty sentence. spaCy can even deal with the nasty sentence; for example, the "e.g." is still kept apart correctly. If you want to split it into words, yeah, just iterate over the sentence.

Actually, "word" is a term where you shouldn't ask a linguist what it means, as far as I know, because they start to get aggressive. I think they have 20 different definitions for it. At least I couldn't figure out one. But spaCy doesn't have words, it has tokens.

So what's a token? It's basically a container class with attributes. You have the word itself, the way it was found in the text, like "tastes". But spaCy can also compute further information. It can find the base word, which is called the lemma, and the lemma of "tastes" is "taste". It also knows that it's a verb, so that's the role in the sentence. Whether it's a noun, a verb or an adjective, spaCy can find these things out.

This concept is called token attributes, and several attributes exist in two forms, like the POS. Yeah, POS, by the way, is not "piece of shit". No, it's part-of-speech tagging. That's what it's about.
I had to look it up when I saw it the first time. So you have a variant with an underscore, which I think is hard to see here. This should be an underscore, but the font somehow destroyed it. This is the text variant. And then you have a variant without an underscore, and this is just a number. The reason for that is that the text variant is easy to read and the number is easy to store and fast to compare, so it's about performance. You usually work with the number variant and only print the text variant, and there are a couple of functions and utilities to convert between them.

spaCy does have a couple of limitations you have to be aware of. It's not perfect, because what spaCy does here is really difficult. The tokenizer and the other components use probabilistic models. So there's "taste" as a verb and there's also "taste" as a noun, and sometimes it can get that wrong. Yeah, that just happens. The lemma and the part of speech can also be wrong. For example, when I have the German word "Tisch", like table, it thinks the lemma is "Tischen", which is not really a word, but never mind. Most of the time it doesn't matter, but if you really run into trouble, then you have to build your own model, and spaCy provides tools for that.

But how do we actually find these topics and ratings now? That's what we are looking for: a topic like food, a rating like bad.

Topics: yeah, you have to define your own topic system. There are several ways to do that. You can look at what other people are doing. You have topic modeling, which is a whole area of expertise, and there are tools like Gensim, but in my experience that only works if you have a lot of data, which we didn't have. You can build a tag cloud and look at it. Or you can just ask the domain expert what is important to them.
In our case, we did all of this and mixed the results together. So the topics we used for our innkeepers were: the ambience, like the decoration, the music, the light. Food and beverages, everything that is related to eating and drinking. Hygiene, like the toilets, or if something smells bad or is dirty. Then you have the service: the waiters, are they polite, are they friendly? And there's the value: how is the price, and are the portions big enough? With that we jumped into it.

This is easy to represent in Python. I've used an enum, and the topics are just enumerated: ambience, food, hygiene. There you go.

With the sentiment, that's a bit more difficult. The literature says you can do whatever you want. You can do a five-star system. You can do "great" and "it sucks". You can also do numbers from one to ten, or a number between zero and one, whatever floats your boat. We decided to go with this: it's again an enum, and essentially we have good and bad, but in three different variants. We have somewhat good, normal good and very good, and the same for bad. In hindsight, I think it would have been enough to have four values, bad and very bad, and good and very good, but that's what we ended up with.

And now we want to assign sentiments to words. For that we need a lexicon. So what's in the lexicon? Basically, it's just a table where you have certain words. You can assign a topic to a word, and you can assign a rating to it, but it doesn't have to have both. It's okay if it only has a topic or only a rating. If it has neither, there's no point in adding it to the lexicon. We also added a variant where we allowed regular expressions. So whether it's schnitzel or [inaudible] schnitzel, both are about food, and the difference doesn't really matter.

So how do you get the lexicon? That also depends, but in our case: you take words that are obvious, for example from the menu.
On the menu, they always have to talk about food. You find the most common words in existing feedback and see how they relate to the sentiment. Then you very quickly have a first version of your analyzer, and you just throw it at the text. If you find a sentence where you cannot extract a topic or a rating, then you look at it: what are the words in this particular sentence? And if there's something interesting, you add it to the lexicon. So that's a simple approach to do it.

If you want to model such a lexicon entry in Python, it's again mostly a data container, because we just have a word, a rating and a topic. This is essentially what it looked like: we have a lemma, a topic and the rating as parameters to the constructor, and a little bit of messing around for the regular expressions.

But what we really want is to find out whether a lexicon entry matches a certain token we found in our text. So we have a matching function here where we can pass the token, and this essentially compares the token with the text stored in the lexicon entry. It looks at the plain token as it is in the text. If that doesn't match, it compares the lexicon entry with the lemma, then it does a lower case transformation, and it looks at the regular expression syntax, and you could go even further.
You could do fuzzy logic, whatever. So at the end, this matching function (it's almost finished here, but the screen is a little bit too small) returns a number between zero and one, and that's it. So for every token you can check the lexicon entry.

The lexicon itself is just a collection of entries, so basically a list. And what you really want is: of all the lexicon entries I have, I want to find the one that matches my token best. This is even simpler. You have a constructor which just initializes the entries. Then you have an append function where you can append entries. Usually you get them from a database, or from a comma-separated values file, or from an Excel file, whatever. We actually just used LibreOffice, to be more politically correct.

If you want to find the lexicon entry, you just iterate over all the entries, compute the score for each one, and remember the one with the best score. If you find one where the score is one, then you can already terminate the loop. Otherwise, in the worst case, you have to iterate through the whole lexicon. Performance-wise this is not very efficient. It can be improved, but then you have more code, and it was fast enough for us.

So this is how you build a small lexicon for our simple example. As I said, we got ours from the database. The indentation is not PEP 8, but I think in this case it's easier to read. Yeah, nothing special. You have a word "waiter", and the waiter is about the service. Then you have a word like "tasty": it's about food and it also has a rating, good. Or the word "quick", which is always nice: it doesn't have a topic, but if something is quick, it's good.

So here we have the first piece of code that iterates over the sentences, extracts the lexicon entries and looks at them. So it says, okay, the word "the": I don't know anything about that. "Music": yes, this is about the ambience. And "loud": okay, this is bad.
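As a rough illustration of the lexicon described above, here is a minimal sketch in plain Python. It is not the code from the slides: the method names `matching` and `best_entry`, the score values, the `is_regex` flag and the use of plain strings in place of spaCy tokens are all assumptions.

```python
import re
from enum import Enum


class Topic(Enum):
    AMBIENCE = 1
    FOOD = 2
    HYGIENE = 3
    SERVICE = 4
    VALUE = 5


class Rating(Enum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3


class LexiconEntry:
    def __init__(self, lemma, topic=None, rating=None, is_regex=False):
        if topic is None and rating is None:
            raise ValueError("an entry needs a topic, a rating or both")
        self.lemma, self.topic, self.rating = lemma, topic, rating
        self._regex = re.compile(lemma, re.IGNORECASE) if is_regex else None

    def matching(self, text, lemma):
        """Score between 0.0 and 1.0; the weights are arbitrary assumptions."""
        if self._regex is not None:
            matched = self._regex.fullmatch(text) or self._regex.fullmatch(lemma)
            return 0.6 if matched else 0.0
        if text == self.lemma:
            return 1.0
        if text.lower() == self.lemma.lower():
            return 0.9
        if lemma == self.lemma:
            return 0.8
        if lemma.lower() == self.lemma.lower():
            return 0.7
        return 0.0


class Lexicon:
    def __init__(self):
        self.entries = []

    def append(self, lemma, topic=None, rating=None, is_regex=False):
        self.entries.append(LexiconEntry(lemma, topic, rating, is_regex))

    def best_entry(self, text, lemma):
        """Linear scan; stops early once a perfect score is found."""
        best, best_score = None, 0.0
        for entry in self.entries:
            score = entry.matching(text, lemma)
            if score > best_score:
                best, best_score = entry, score
                if score == 1.0:
                    break
        return best


lexicon = Lexicon()
lexicon.append("waiter", Topic.SERVICE)
lexicon.append("tasty", Topic.FOOD, Rating.GOOD)
lexicon.append("quick", rating=Rating.GOOD)
lexicon.append("music", Topic.AMBIENCE)
lexicon.append("loud", rating=Rating.BAD)
lexicon.append(".*schnitzel", Topic.FOOD, is_regex=True)

# (text, lemma) pairs stand in for spaCy tokens in this sketch.
for text, lemma in [("The", "the"), ("music", "music"), ("loud", "loud")]:
    entry = lexicon.best_entry(text, lemma)
    print(text, entry.topic if entry else None, entry.rating if entry else None)
```

With real spaCy tokens, `text` and `lemma` would come from the token's text and lemma attributes; they are passed as strings here only to keep the sketch free of dependencies.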
So if the music is loud, the ambience is bad. And here we go, that's basically our first simple sentiment analysis. I have a sentence, "The music was very loud", and it says, yeah, the ambience is bad. The end. Yeah, not quite.

What I want to talk about next, because I think you have to deal with it one way or the other, is intensifiers, diminishers and negations. Intensifiers and diminishers are basically words that modify a rating. A diminisher is, for example, "slightly": if something is slightly bad, it's not as bad as the normal bad. Intensifiers are, for example, very bad, really bad, terribly bad, terribly loud. If you find one of these words, then you get a different rating than with the word alone: loud is bad, very loud is very bad.

You can represent this in Python easily with simple sets and maybe some upper case and lower case transformations, and then you can make a function to check if the token is a diminisher or an intensifier. So that's how you find out if the word "very" is an intensifier. Yes, it is. So it's simple enough.

What's interesting about intensifiers and diminishers is that you want to change a rating, so you need a couple of functions. In the end, you have a rating and you want to have the diminished variant of it. This is basically a little bit of math with signum and plus and minus, and if you have enums, then it's even less code. And that's about it. Okay, you can see it now, sorry. That was an example: if I call diminished for good, I get somewhat good as a result. That's an example of a diminisher.

Negations: negations turn the sentiment into the opposite, for example a word like "not". So if I have "tasty", okay, it's good. "Not tasty", it's bad. So it turns it around, and it can of course be combined with intensifiers and diminishers. If you have "very good", the rating is very good. But if you have "not very good", it's not the opposite of very good, because if it's not very good, it's not very bad. It's actually somewhat bad. That's something you have to keep in mind.
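The intensifier and diminisher handling described above might look roughly like the following sketch. It is not the code from the slides: the function names, the word lists beyond the words mentioned in the talk, and the numbering of the ratings from -3 to 3 without a zero are assumptions.

```python
from enum import Enum


class Rating(Enum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3


# Illustrative word lists; a real project would maintain its own.
INTENSIFIERS = {"really", "terribly", "very"}
DIMINISHERS = {"slightly", "somewhat"}


def is_intensifier(word):
    return word.lower() in INTENSIFIERS


def is_diminisher(word):
    return word.lower() in DIMINISHERS


def signum(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def diminished(rating):
    """Move the rating one step towards neutral, but never across it."""
    if abs(rating.value) > 1:
        return Rating(rating.value - signum(rating.value))
    return rating


def intensified(rating):
    """Move the rating one step away from neutral, up to the extremes."""
    if abs(rating.value) < 3:
        return Rating(rating.value + signum(rating.value))
    return rating


print(is_intensifier("Very"))     # the word list is compared in lower case
print(diminished(Rating.GOOD))    # "slightly good"
print(intensified(Rating.BAD))    # "very bad"
```

Because the enum has no value for zero, both functions clamp at the weakest and strongest rating instead of ever producing a neutral result.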
So it's not just turning minus two into plus two or, in this case, minus three into plus three. You have to do a little bit of magic. Negations are easy to represent, similar to diminishers and intensifiers, and if you want to negate the rating, that's just a typical mapping problem. So I get the rating very bad, and the negated version is somewhat good. Two of those combinations are pretty hypothetical, I don't think you actually use them. "Something is not somewhat bad": you usually don't say that. But if it comes along, we turn it into good.

And again, now this fits on the screen: we have the negated rating of good, and it turns out bad, and the negated rating of very bad turns out somewhat good. "Not very bad" is somewhat good.

All right, so now we have building blocks for words, basically. We can look at a word and tell which topic it is about, whether there's a sentiment in it, and we can figure out if it's a diminisher, an intensifier or a negation. But that's not enough yet, because we can't operate on the single word level anymore. We need to combine this somehow. One approach would be to just use a list of tokens and mess around with it, but spaCy provides something that is very nice for this task. It's called the spaCy pipeline.

So what's a pipeline? When you throw a text at spaCy, it has separate steps where it tries to find out what's a sentence and what's a token, where it tries to assign the part-of-speech tags and things like that. And that's something you can extend. What you can also extend are the tokens: you can add additional attributes to them. There's an article explaining this very much in depth, but I'm just going to look at the things that are important to us.

So we have a token, and you can call Token.set_extension, and we want a rating, a topic, and Boolean flags for whether it's a negation, an intensifier or a diminisher. And then we can work with these new attributes. There's a funny syntax.
So you have token dot underscore dot, and all the extensions are in this underscore attribute. That's something you have to get used to. That's spaCy, and I would say it's a trick from the spaCy people to make it easy to extend. So I can say the token's topic is food, and when I print the token's topic, then I get food. So I can say a schnitzel is about food, and I don't have to introduce a new data structure myself. I just use spaCy's token.

Just as an intermission for the following slides, there's a little debugging function. Nothing interesting, just so that we can print what the token contains.

And now we want to extend the pipeline. At the end of the pipeline we want to look at the tokens, compare them with our lexicon, compare them with our lists of intensifiers, diminishers and negations, and assign this information to the token. And that's quite easy. We have a small function which iterates over the sentences and over the tokens, checks whether a token is a negation, a diminisher or an intensifier, and if it's none of those, it also checks the lexicon. If it can find the word in the appropriate tables, then it stores the information in the token. And there's one line of code to add our own function to the spaCy pipeline, and we end up with annotated tokens when we iterate over the document the next time.

And we have a simple function here.
So if we have tokens and we're only interested in the tokens that are essential for our sentiment analysis, we want to look at the tokens that have at least one of our attributes set. Here's a function that just filters for those, and if we call it with our sentence "The schnitzel is not very tasty", it gets reduced to "schnitzel", which is about food, "not", which is a negation, "very", which is an intensifier, and "tasty", which is about food and has a good sentiment. And all that just by extending spaCy a little bit, so this is very nice.

We can do the same with the rating, but we have to apply the diminishers, intensifiers and negations to the rating. So what we basically do here is this: when we have our four tokens, we first look whether there is a rating somewhere. Okay, it's good, yes, "tasty" has a rating. Then we walk to the left and find the intensifier "very", okay, so it turns into very good. But then there's the "not", so it's not very good, and we have to turn the not very good into somewhat bad. That's basically the function we created before, where we can pass in a rating.

And this is basically the Python code to do that. Essentially it's checking the various conditions and combining them. It doesn't really fit on the screen, but it's not much longer, and it's not that difficult to understand. Most of it is handling the loop and special conditions, like when there is no rating and things like that.

So here's an example of how to call our combine ratings function. First we extract the essential tokens from our sentence like we did before, and then we combine them, and only two tokens remain: the schnitzel, because it's about food and there's no rating in it, and the other three tokens, which got summarized into one.
It's still called "tasty", because you can't change the text of a token in spaCy, but instead of the rating good it now has the rating somewhat bad, and the other two tokens are removed. So "The schnitzel is not very tasty" gets reduced to schnitzel, somewhat bad.

All right, that was the wrong button. We were there, yeah, combine ratings. Okay, so we can reduce it to that.

From that we can quite easily build a function that extracts topics and ratings from a sentence. Here's an example that prints the topic and rating of a certain sentence, "The schnitzel is not very tasty", and it works out, okay. But here's a better one, where we want to do it with a whole feedback text. So we have opinions and a longer text: "The schnitzel was not very tasty. The waiter was polite. The football game ended two to one." You come up with three sentiments. The topic food and the rating somewhat bad, because of the schnitzel. Then there was the service: okay, the waiter was polite, so service is good. And then there's a football result, but football results are not interesting to us, so both the topic and the rating of the football result are None.

So that's essentially it with the code. It's a Jupyter notebook. I'm going to publish the link and you can play around with it, so all the code is available for you. And of course, when you talk about code, you can't really understand everything, I guess, because it's way too fast. So the main information I wanted to get across to you is that all of this is basically small parts of code. Most of the functions are just a few lines, and it's spaCy standard functions and Python standard functions.
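The combination step walked through above can be sketched without spaCy as follows. This is not the notebook code: the `Token` dataclass stands in for a spaCy token with custom extensions, the tiny word tables are illustrative, splitting on dots and blanks is the naive approach used only to keep the sketch short, and the negation table fills in the pairs the talk does not spell out by symmetry.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Topic(Enum):
    FOOD = 1
    SERVICE = 2


class Rating(Enum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3


# Negation is a lookup, not a sign flip: "not very good" is only somewhat bad.
NEGATED = {
    Rating.VERY_BAD: Rating.SOMEWHAT_GOOD, Rating.BAD: Rating.GOOD,
    Rating.SOMEWHAT_BAD: Rating.GOOD, Rating.SOMEWHAT_GOOD: Rating.BAD,
    Rating.GOOD: Rating.BAD, Rating.VERY_GOOD: Rating.SOMEWHAT_BAD,
}
MODIFIERS = {"not": "negation", "very": "intensifier", "slightly": "diminisher"}
LEXICON = {
    "schnitzel": (Topic.FOOD, None), "tasty": (Topic.FOOD, Rating.GOOD),
    "waiter": (Topic.SERVICE, None), "polite": (Topic.SERVICE, Rating.GOOD),
}


@dataclass
class Token:
    text: str
    topic: Optional[Topic] = None
    rating: Optional[Rating] = None
    modifier: Optional[str] = None  # "negation", "intensifier" or "diminisher"


def shifted(rating, step):
    """Move a rating away from neutral (step=1) or towards it (step=-1)."""
    sign = 1 if rating.value > 0 else -1
    return Rating(sign * min(3, max(1, abs(rating.value) + step)))


def essential_tokens(sentence):
    """Annotate words and keep only those with a topic, rating or modifier."""
    tokens = []
    for word in sentence.lower().strip(" .!?").split():
        topic, rating = LEXICON.get(word, (None, None))
        token = Token(word, topic, rating, MODIFIERS.get(word))
        if (token.topic, token.rating, token.modifier) != (None, None, None):
            tokens.append(token)
    return tokens


def combine_ratings(tokens):
    """Fold the modifiers directly left of a rated token into its rating."""
    result = []
    for token in tokens:
        if token.rating is not None:
            while result and result[-1].modifier is not None:
                modifier = result.pop().modifier  # nearest modifier first
                if modifier == "negation":
                    token.rating = NEGATED[token.rating]
                else:
                    step = 1 if modifier == "intensifier" else -1
                    token.rating = shifted(token.rating, step)
        result.append(token)
    return result


def topic_and_rating(sentence):
    """Return the first topic and the first rating left after combining."""
    tokens = combine_ratings(essential_tokens(sentence))
    topic = next((t.topic for t in tokens if t.topic is not None), None)
    rating = next((t.rating for t in tokens if t.rating is not None), None)
    return topic, rating


feedback = ("The schnitzel was not very tasty. The waiter was polite. "
            "The football game ended two to one.")
opinions = [topic_and_rating(s) for s in feedback.split(".") if s.strip()]
for topic, rating in opinions:
    print(topic, rating)
```

For the rated token "tasty", the loop first applies "very" (good becomes very good) and then "not" (very good becomes somewhat bad), which mirrors the walk to the left described in the talk.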
So there's nothing really special about it, and I think you can dig into this quite easily to come up with something that is at least powerful enough to work with intensifiers, diminishers and negations. In my experience, that is already enough to analyze feedback from restaurants and get about 80 percent, or a little bit more, right. And that's enough to find out: where does my business hurt, and where are my USPs?

But is there something else? Yes, of course, there's plenty. There are modals like "could" and "should" that typically also indicate a negative rating. There are idioms that indicate a rating, for example "this leaves a lot to be desired": plenty of words that basically say bad, actually very bad. There are back references: you have one sentence where I talk about the waiter, and then the next sentence just uses "he", so you have to kind of connect that again. One simple solution is: if there's no topic, use the topic from the previous sentence. That works surprisingly well with simple feedback.

You can add a topic hierarchy, like we said before. It's not just about food: the schnitzel is small, the schnitzel has a price, the schnitzel has a taste, so there are different aspects. You can replace my carefully handmade combine ratings function with abstract rules and a grammar parser, and this is what you will have to do when you really want to get down and dirty, because then you're much more flexible. There's lots of linguistic jazz. Still, you should now have the basic building blocks to start your own sentiment analysis and extend what's provided here.

Yeah, so as I said, special topics. I'm going to skim over this, just to give a few comments, because I think you're going to find it helpful.
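The back-reference heuristic mentioned above (if a sentence has no topic, reuse the topic of the previous sentence) could be sketched like this. The function name `inherit_topics` and the use of plain strings for topics and ratings are assumptions, and only sentences that actually carry a rating inherit a topic in this version.

```python
def inherit_topics(opinions):
    """Fill in a missing topic from the most recent sentence that had one.

    opinions is a list of (topic, rating) pairs, one per sentence,
    where either part may be None.
    """
    result = []
    previous_topic = None
    for topic, rating in opinions:
        if topic is None and rating is not None:
            # "He was rude." says nothing about a topic on its own.
            topic = previous_topic
        if topic is not None:
            previous_topic = topic
        result.append((topic, rating))
    return result


# "The waiter came late. He was rude. The game ended two to one."
per_sentence = [("service", None), (None, "bad"), (None, None)]
print(inherit_topics(per_sentence))
```

A sentence with neither topic nor rating is left alone, so unrelated remarks such as the football result do not pick up the previous topic.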
So, emojis of course can also include a rating: a smiling emoji is good and an angry one is bad. There's an extension for spaCy that can combine different kinds of emojis. There are western and eastern emoticons, and there are Unicode emojis, which are actual smiley characters. And there's a list from Unicode you might find helpful, because some of the emojis there already have a suggested rating.

There are also slang terms, if you compare Scottish English and Oxford English, or Austrian German and real German. These are basically synonyms, so you just look for those special words, and you also look only for those that are relevant to you for the sentiment. So "nix" is like "nothing" as a negation, things like that.

Unknown abbreviations: yeah, you can add them to spaCy or treat them like synonyms. Typos: again, handle only those that are relevant to you, to the rating. There's FuzzyWuzzy, for example, to do a fuzzy search. There was a talk about it yesterday. We didn't use it.

Just a couple of references: if you really want to get into sentiment analysis, I recommend the book by Liu. It goes much further than what I showed you. And there's again the spaCy extensions, with a very good blog entry about them.

So, summary: sentiment analysis is challenging, but it can be done to some extent. Python and spaCy help a lot with the development part, and the code complexity, as you could see, is quite manageable if you apply the simple basics of software engineering.

Thank you. It's a great talk, and we have time for questions.

Hello, I have a question about negation, for example. If you have multiple negations in one sentence and only one is supposed to be connected with your token, how do you handle it?

The way it is now, it works nicely for simple sentences. It doesn't work for complex sentences.
I looked at about a thousand feedbacks, and most of our feedback was simple sentences, because people had to enter it in a mobile interface, and people use very nice, short, condensed language for that. If you want to analyze social networks, you have to do more than that, and I recommend the book by Liu. He talks about these topics.

Thanks a lot for the talk. I was wondering: in your use case it's probably all right, but how do you handle performance? Because you have a lot of for loops, and I guess this is really slow with text.

It depends on the size of the dictionary. We were happy with about 500 entries in the dictionary, because it's only a few things that people talk about when they give feedback to an innkeeper. We also get the feedback one by one and can analyze it when we get it, and you don't get that much feedback all the time, especially if you're an Austrian startup that focuses on regional innkeepers. If you want to improve the performance, as I said before, yes, you have to do something about it. But it could analyze the thousand feedbacks I had in, I don't recall exactly, less than a minute at least. So, okay, that was fast enough for me.

Thank you. Did you think at all about automating the lexicon building, using some sort of machine learning or stuff like that?

What would you gain with that?

You would gain a generalized approach. I mean, you could not only analyze reviews for inns but also, I don't know, reviews for hotels or reviews for other places.

Are you talking about the rating or the topic now?

Yeah, well, you could automate building the topic index, for example.

Yeah, you can do that, but I think in our case it would have been more trouble, because then you always have to look at the decisions the machine made. When the machine looks at a sentence that already contains a couple of negative words, and then it finds words that also come up very often in negative sentences. That's the thing you mean, I guess.
Yeah, then you try to learn: ah, these are also negative words. Yeah, and so if it works, it's nice, you can generate the lexicon. And if it doesn't work, then you have to fool around with the machine learning algorithm, and in this case it was much easier to just maintain a manual lexicon. But of course, it's a valid approach in different scenarios.

I have lived in Graz and have been looking a bit at the restaurant reviews, and you know the Austrian humor. I have seen many reviews, so I was wondering: how would you handle irony? Because I remember one very specifically: "The food was so good, we ended up in hospital."

The interesting thing about this is that when there is no public forum for people to present themselves, they stay very factual, and we had very few feedbacks that actually used irony or sarcasm. It was less than a percent, so we just ignored them. But it's a big topic when you are in social media, of course.

Yeah, I know your program works for the German language, but what if you had comments in Korean, for example, or Arabic text? To what extent would it be possible to adapt spaCy or train a model for it?

I unfortunately have no idea. I don't know how these languages work. I know there are different directions to read them, and you can combine them. There was a little bit about it in the talk yesterday, but it also said that you have to deal with them differently. So that's probably a limitation, yes. This works with Latin-based languages, I guess. So it works with English and German. It probably also works with French, Italian and Spanish, because as far as I know they have a very similar structure. But if you have radically different languages with radically different grammatical systems, then you have to do something else. That's correct.

One small question: what if people put something like "plus one" in their comments? That's common on some platforms.
In your case I think it's different: people just write, press the button and send their comment. But what about a comment that is "plus one" and it's not the emoji?

You can treat it as a synonym. That would be a classic preprocessing step: you look for "plus one" and then you translate it to the rating good, for example.

We don't have time for more questions. I believe that Thomas is happy to answer questions, but I need to let people go to lunch. Please, a round of applause for Thomas again.
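The preprocessing step suggested in the last answer, which also covers the slang synonyms mentioned earlier, could be sketched as a simple word replacement before the text reaches the analyzer. The function name `with_synonyms` and the table entries are assumptions; the sketch maps "+1" to the German word "gut" on the assumption that the lexicon rates that word as good.

```python
# Hypothetical synonym table: left is what guests type, right is a word
# the lexicon or the negation list is assumed to know already.
SYNONYMS = {
    "+1": "gut",      # assumed to be rated "good" by the lexicon
    "nix": "nichts",  # Austrian slang for "nothing"
}


def with_synonyms(text):
    """Replace whole words found in SYNONYMS, compared in lower case."""
    words = [SYNONYMS.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


print(with_synonyms("Schnitzel +1"))
print(with_synonyms("Nix zu meckern"))
```

Splitting on whitespace keeps the sketch short; a word followed directly by punctuation, such as "+1!", would not be replaced and would need a more careful tokenizer.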