Well, hello, everybody. Before I start, I want you to know that this is not a technical talk. I'm a computational linguist, so I prepare the data: I pre-process it for modeling later on. So this is a practical approach to preparing and pre-processing data so that it can be used for modeling. For this recipe, you first need to gather the corpus, then you pre-process it, then you do the text modeling, and at the end I'm going to tell you about the pros and cons of supervised learning. So I'm going to talk about pre-processing text from social media.

First we need to know what social media is, and basically there are four things you need to take into account: it needs to be a web-based app; the content needs to be user-generated; users must be able to create profiles; and users must be able to connect with other users. With this, we have the development of social networks, which are basically social media. So what counts as social media? With the definition we had before, we can think of Twitter, Facebook, Instagram, maybe Pinterest, LinkedIn, but also Amazon, because on Amazon you have a lot of review content. Booking.com, the same; TripAdvisor; and also Wikipedia, if we think of it as a content site where you can share ideas. The types of content we find in social media are text, as on Twitter or any other web-based application; images, as on Instagram; and videos, on YouTube or Vimeo, for instance.

The steps for text analysis are the ones I mentioned before, and we start with gathering the corpus. To gather the corpus, you can take corpora from free online sources or from web scraping; those are the two main places where you can find data. For free online corpora, you have the ones listed here. For instance, with the first one you can type import nltk. I'm just showing this on the slide; I'm not writing any real code.
Then you write nltk.download(), and you get this graphical user interface where you can download all the data. It's just so that you know the tools you have; I'm not coding. But if you have a look, you can download a lot of packages which are really useful when starting out, especially for NLP-based applications, because they come with labels and a lot of resources that are really useful for beginners. Also, Brigham Young University has really good corpora in English, and also in Spanish and Portuguese; other languages are difficult to find. The British National Corpus you can either access online or download. And there's this guy called Martin Weisser, who also has really good corpora of online English and really good resources, too.

Then you can do web scraping of social media and information resources. Social media, you know: Facebook, Twitter, whatever. As for information resources, in my case, for instance, I wrote a script that retrieved information from the web page of the Spanish Royal Academy. It was useful because I needed to check whether a list of words really existed there or not, and I could do it automatically, so it was really helpful.

The kinds of text we find in social media are tricky, for reasons I'll get to later, but they determine the way we're going to analyze the data. We have posts; we have tweets; we have the tags and comments on the posts, and you see there are a lot of comments; and in tweets you also have the hashtags. All of this is really valuable for text analysis, because all those tags and comments are going to be really helpful when organizing and classifying text, and in all the other tasks we're going to perform. So now we have the corpus, and we need to move on to pre-processing.
When pre-processing, you can do a lot of tasks, but I'm going to explain these three because they are the most important and also the most useful ones.

Tokenization is separating a text into smaller units: sentences, words, or whatever unit you need. You might think this is easy, but you will find examples like "City of Bombay", where you have to decide whether you want to keep it as a whole unit or treat each word as a separate unit. This is something you need to think about before you do the processing. You will also run into cases such as "ex-Malaysian Prime Minister", where again you need to decide whether to keep it as one unit or as separate ones. The same goes for forms like "won't" and "theirs": you need to decide whether to keep the verb and the negative together or separate them, because negatives in particular are tricky in sentiment analysis and text analysis. All of this you need to think about beforehand so that you end up with the data you need. It also depends on the language, not only on the text itself: Japanese and Chinese write everything together, so you cannot use spaces as word breakers, and Japanese also has four different writing systems, so it's not that easy to decide how to separate the tokens.

Stop word removal is also really useful for dropping words that carry little meaning, such as pronouns, determiners, and possessives. Basically, it's a list of words you want to remove from your text because they would be noise in your algorithms and models, so it's very useful to have a list of the words you don't need.

Finally, lemmatization and stemming are useful because you can group words by lemma much better than if you keep all the inflectional endings, which are also noise.
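As a rough illustration of these three steps (this is not the speaker's actual code; the multiword list, the stop-word list, and the suffix list below are invented miniatures), a pre-processing pipeline might look like:

```python
import re

# Made-up mini resources for the sketch.
MULTIWORD_UNITS = ["city of bombay", "prime minister"]
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was"}
SUFFIXES = ["ing", "ed", "es", "s"]  # checked in this order

def tokenize(text):
    """Lowercase, protect multiword units, then keep word-like tokens."""
    text = text.lower()
    for unit in MULTIWORD_UNITS:
        # Join the unit with underscores so it survives as one token.
        text = text.replace(unit, unit.replace(" ", "_"))
    return re.findall(r"[\w_-]+", text)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(word):
    """Naive stemmer: chop the first matching suffix.
    Stems need not be real words."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = remove_stop_words(tokenize("The mayor of the City of Bombay was smiling"))
print([stem(t) for t in tokens])  # ['mayor', 'city_of_bombay', 'smil']
```

Note how "City of Bombay" is kept as one unit by the tokenizer, and how the stemmer happily produces "smil", a stem that is not a real word, which is exactly the trade-off against dictionary-based lemmatization described next.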
For instance, lemmatization removes inflectional endings by going back to the original lemma. For this you basically need a dictionary, or a set of rules, that takes the word with the inflectional ending, say "smiling", back to its lemma, "smile". Stemming is easier because you just chop off word endings, so you don't need a dictionary: you keep a small list of endings, you chop them off, and you're done.

So what are the problems we find in social media text? The most important one is time sensitivity. As you know, Twitter, Facebook, Amazon reviews: all of these sites have dynamic content, which means it is constantly changing. People are constantly creating new content, and it's really difficult to build a model, because you cannot settle on a set of parameters and expect them to keep working when the data is constantly changing. This is the main problem when analyzing social media text.

There's also the short length of the texts. If you analyze Twitter, you might have, for instance, Superman and Clark Kent, and you know they are the same person; but you might have one tweet about one name and another tweet about the other. When you come to cluster those words into groups, they might not end up related in the same cluster, because there is no contextual information telling you they are the same person. This is a problem for text analysis, and it brings you to the semantic gap, which is exactly what I just explained.

And there is the problem of unstructured data, which has two aspects. First, the variance in content quality.
You see, there are people who write really well, in a really polite manner, but there are also people who write the way words come out of their mind, without trying to make the sentences make sense. This is a problem. There are also acronyms and abbreviations, like "u" for "you" or "2" for "to", and misspellings: you might see a form like "were" and have to decide whether it stands for "where" or "we're". You'll run into all of these problems a lot in text analysis. And you also have abundant information: there are tons of data, and you need to cut somewhere in order to be able to process it.

Now, applications in the real world. You can use all these strategies for event detection, for instance: to know which news item is popular right now, or to predict what kind of information is going to be trending next week. You can also take advantage of collaborative question answering on sites like Stack Overflow: if you scrape the web, you can find really important, specific information, more than if you just Google your search. You can also use Wikipedia to fill in the semantic gap I was talking about before: since Wikipedia is a trustworthy resource, you might be able to create relations between words that are apparently unrelated, because in Wikipedia you find the relations between the two. It's also useful for sentiment analysis, and I'll show an example later. And it helps to identify influencers, by seeing whether there are a lot of mentions of a person, and for quality prediction: on Amazon, you might want to know whether a review is trustworthy or not, and these kinds of NLP tasks are very useful for deciding whether to trust that user.

So now we're going to go through text modeling. I'll go through it very quickly, because I'm not an expert, but with the pre-processing we did, we should be able to come up with a proper data set to perform it.
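The shorthand and abbreviation problem is often tackled with a small normalization table applied right after tokenization. A minimal sketch, with an invented mapping (not a standard resource):

```python
# Toy normalization table for social-media shorthand; the entries
# are invented examples, not an established lexicon.
NORMALIZATION = {
    "u": "you",
    "2": "to",
    "gr8": "great",
    "pls": "please",
}

def normalize(tokens):
    """Replace known shorthand tokens with their full forms,
    leaving everything else untouched."""
    return [NORMALIZATION.get(t, t) for t in tokens]

print(normalize("u have 2 see this gr8 movie".split()))
# ['you', 'have', 'to', 'see', 'this', 'great', 'movie']
```

A real system would need a much larger table and would still face the ambiguous cases mentioned above, where one surface form maps to several possible words.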
So if you go to Amazon and look at the reviews, you might want to separate positive from negative comments (well, Amazon already does, but you might want to do it on your own page). To find positive and negative comments, you need a vocabulary of positive, negative, and neutral words. Going back to the recipe: first you need to define the task. Maybe you want to group the words into clusters, and you need to decide which kind of clusters you want. Once the task is defined, we need to decide the strategy we're going to use for sentiment recognition. There are mainly two ways, supervised and unsupervised, and unsupervised learning is really, really difficult in this case. For supervised learning, you will need labeled corpora, which is time-consuming, and pretty fine-grained categories, and you need to go through all the data before you decide on the categories; you can also use sentiment lexicons. For unsupervised learning, you will use unlabeled corpora, which is easy, and you can use k-means for category discovery, which is also useful. But you'll have the problem I mentioned before with time: the content is constantly changing, so you never know whether your model will stay good.

As I said before, we need vocabulary lists of positive, negative, and neutral words, and also the list of comments we want to analyze. The slide shows a very basic algorithm, Naive Bayes in this case, but you can use any one you want; the input will be mostly the same. The result you get is generally a positive score and a negative score, and then you can classify the text depending on the result. This is a very basic task. In my opinion, the best way is using sentiment lexicons, because you have one word list of positive words and one of negative words, in a binary fashion, so it becomes an easy text classification task. The problem is... well, sorry, first: these two on the slide are really good lexicons.
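The lexicon-based scoring described here can be sketched as a simple word count against positive and negative lists. The tiny lists below are stand-ins for a real sentiment lexicon:

```python
# Miniature stand-ins for a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "love", "nice"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def sentiment(tokens):
    """Return (positive score, negative score, coarse label)."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label

review = "great product but terrible delivery and awful packaging".split()
print(sentiment(review))  # (1, 2, 'negative')
```

This is exactly the binary-fashion classification the talk describes: the output is a positive score and a negative score, and the text is labeled from their comparison.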
You also have WordNet, and you can use it from Python; it's very useful, and I'll add the link afterwards if you want. Those are really simple resources, and they come already classified in this binary fashion, so they're pretty useful.

Another approach, which I thought was really interesting, although it's not that new, is polarity lexicons. What those researchers did was start with a small list of positive adjectives, on the assumption that any adjective connected to an adjective in the list by "and" (which is coordination) would necessarily share its polarity. So they searched a lot of text (rather manually at the time) and found many pairs of adjectives connected by "and", and those new words were added automatically to the already-built list of positive words. They did the same for negative words with "but" or "however", so they could automatically grow a bigger list of negative words. This is a really good semi-supervised approach to the task, because you get the best of both worlds: like unsupervised learning, it is less time-consuming, and it's easier. This example is from Dan Jurafsky's Stanford NLP course, which is a really good course for a lot of NLP tasks; they also have a really good book, whose third edition, I believe, is going to be ready in a few months. It's one of the best approaches I've seen so far.

The results basically depend on the task you want to perform. As I said before, every NLP task depends a lot on the text you have: if you are analyzing Twitter, you might need strategies really different from the ones you would use if you were analyzing, I don't know, the text of a novel. This is very important to bear in mind, because if your task is different, you're going to need a completely different strategy, and also a very different algorithm.
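The "and"-coordination idea described above can be sketched with a regular expression over a toy corpus. The seed list and the sentences below are invented for illustration:

```python
import re

# Seed list of known positive adjectives (invented miniature).
seed_positive = {"nice", "helpful"}

# Toy corpus standing in for scraped web text.
corpus = [
    "The staff was nice and friendly.",
    "A helpful and generous host.",
    "The room was nice but noisy.",
]

expanded = set(seed_positive)
for sentence in corpus:
    # Find "X and Y" pairs; if one side is a known positive word,
    # assume the coordinated word shares its polarity.
    for left, right in re.findall(r"(\w+) and (\w+)", sentence.lower()):
        if left in expanded:
            expanded.add(right)
        elif right in expanded:
            expanded.add(left)

print(sorted(expanded))  # ['friendly', 'generous', 'helpful', 'nice']
```

Note that "nice but noisy" correctly contributes nothing to the positive list; in the full approach, "but" pairs are used to grow the negative list instead.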
K-means is a good one for clustering, if you have a bag of words and you need to group it into clusters. But if you want to build, for instance, a spell corrector, or check whether a sentence is grammatically well formed, you might need a completely different approach, and you might want to use n-grams or another strategy. So, using a lexicon for sentiment recognition is, for me, the best approach, as I said before. It has its pros: topic discovery is a challenge, so already having a small set of words is really helpful, and you can even increase the number of words automatically somehow; also, the performance is almost always better, in all the cases I've seen. On the other hand, dynamic language is difficult, and building the lexicons and labeling the corpora is really time-consuming. So depending on whether you need the task finished in a short time or you have more time, you can use one strategy or the other.

This is the bibliography; I'm going to upload the slides in case you want to check it. "Mining Text Data" is a really, really good book: they explain a lot of algorithms for each kind of task, so it's very useful. And just to finish: I'm from Mallorca, I'm a co-organizer of PyData there, and I'm a computational linguist. I couldn't find a proper definition of what a computational linguist is anywhere on the web, so when I found this slide, which I think is awesome, I decided to copy it here; the attribution is there, it's from a presentation. Well, thank you very much for listening.

Thank you very much. We have a few minutes for questions.

How do you determine your stop words? Is there a common set of words that tends to work, or is it based on your application?

Well, basically, anything without lexical information will do.
For instance, if you have the word "house", you know you might need to keep that information, because you can find synonyms and related words, such as "roof" or whatever. But a determiner, for instance, has no semantic meaning, which means you cannot find related words for it. That is basically what you take into account when listing stop words. It also depends on your task: you might decide that "house", in your case, makes no sense in your bag of words, so you won't use it. There are lists of stop words that are already built; NLTK, for instance, ships some with the package, and I know many other packages have lists too. But basically, that is the main principle when deciding whether a word is a stop word or not. And no, you tokenize first, and then you remove whatever you don't need, usually.

Thanks for the talk. I was wondering how you apply k-means to a bag of words. There must be a step in between, right?

Well, as I said before, I don't do a lot of the modeling myself; I usually prepare the data. But I know that in that case the problem with k-means was that the clusters resulting from the first run made no sense at all, because, as in the example I gave, you might know that two words are related, but when it comes to clustering, the algorithm somehow doesn't find the similarity. My job was to decide how to improve the data so that the algorithm performed better; I don't know exactly how they implemented the algorithm itself.

OK, I'm not so much asking about k-means itself; it's more about how you represent the data to an algorithm like k-means. Usually you have a set of vectors, and then you compare the vectors, if that makes sense. And it's difficult: you need to encode the word into something that is meaningful to a machine, right?

Yeah, I understand, but I don't code the algorithm.
So I don't know whether they did a step in the middle. I had a set of data, a list of words in that case, and when I got the results I saw that the clusters made no sense, and I had to come up with a way to improve them. From a linguistic point of view, I saw that words that should be related were not related at all. So I started reading a couple of articles, and I found out that the problem with k-means, for instance, is that you somehow need information that relates the words, and this was missing. This is why I was saying that if you use another resource, like Wikipedia, you can somehow find relations between the words. But I'm not coding the k-means algorithm in this case.

OK, thank you. Another question?

Hi, thanks for the talk. These days on social media, especially if I'm trying to measure sentiment, what I see is that a lot of people respond or review with emojis, or they use sarcasm, or they just react with a GIF of something they're feeling. So when you're preparing data to measure sentiment, for example, do you take these things into account somehow?

Yeah, of course, and it's really difficult. Most of the time you want to remove all the noise you can, so when it comes to sarcasm, and you find that some words are really not helping and are just noise, you basically remove them. You can also do another analysis, one that is linguistically deeper, and try to label everything; with this you improve a lot, but it takes a lot of time. It's time-consuming, and you need a lot of people working on it, labeling the data and helping with the task.

So, like the lists of stop words: is there a resource out there where you can say, OK, this emoji reflects this emotion, this one is mostly negative, and so on?

Not that I know of; so far I haven't found such a list.
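The "step in between" raised in this exchange, turning texts into vectors before a distance-based algorithm like k-means can compare them, can be sketched as a bag-of-words representation. The vocabulary and documents below are invented, echoing the Superman example from the talk:

```python
import math

def bow_vector(tokens, vocabulary):
    """Term-frequency vector over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["superman", "clark", "kent", "hero", "reporter"]
doc1 = "superman is a hero".split()
doc2 = "clark kent is a reporter".split()

v1 = bow_vector(doc1, vocab)  # [1, 0, 0, 1, 0]
v2 = bow_vector(doc2, vocab)  # [0, 1, 1, 0, 1]
print(cosine(v1, v2))  # 0.0
```

The two documents share no vocabulary words, so their similarity is zero even though they describe the same person: this is precisely the semantic gap the speaker describes, and why external resources like Wikipedia are needed to relate the two names.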
You mentioned that one of the difficulties is deciding whether to split a combination of words. In your practical experience, what kind of metrics do you use to determine whether you need to split or not? Do you do it manually, or do you use rules? And what kind of rules?

Well, of course you can use algorithms and try to do it automatically, and you can get good results. But if you want to be really, really specific and make sure that all the words are exactly the way you want them to be, you usually build a dictionary, or you can use lists.

So my question is: how do you determine whether you want to split or not?

It depends on what you want to do, on the task. If you need the names of people, say, you might want to focus on that specifically, and then you perform that task better. I usually work with lists, because in the tasks I did, specific words needed to be found without any other interference. But I don't love doing it that way; there are many ways to do it better. I mean, I'm a linguist, and my job is to solve the problems that the algorithms cannot solve. So my job is not as glamorous as writing code where everything works perfectly: I need to find the bugs in the output and resolve them, the problems the algorithms have, which is obviously a lot of fun too. But basically, I work with lists, or with rules like regular expressions, or whatever I can.

Thank you.

We have run out of time for questions, so let's thank Olaia once again for the talk. Thank you.