Hello, my name is Dr. Julia Kazmaier, and my role is to deliver computational social science training as part of the new forms of data training on behalf of the UK Data Service. Today I will be talking about text mining, specifically about the basic processes of text mining, including cleaning and preparing your data, your corpus, and some of the first natural language processing steps that you'll probably want to do.

Before we go much further, I do want to point to some of our other resources that you may find useful, like previous webinars on being a computational social scientist, the introduction and theory webinar on text mining, a whole series on web scraping for social science research, and some code demos. You can find these in the past events tab on the UK Data Service website and on our YouTube channel. There are also upcoming events that you may be interested in: an advanced follow-on to this webinar in just under two weeks; a health studies user conference, which is potentially quite useful if you're interested in how data plays into the health and care sector; and a multi-day social data and third sector event in July.

If you attended the introductory webinar that I just referenced, you might remember that text mining is about turning unstructured or semi-structured input into structured output. Today's webinar will demonstrate how some of that transformation actually works in practice. You will probably also remember that text mining has four basic steps: retrieval, processing, extraction and analysis. Although that suggests a clear, one-directional path to analysis, it's actually a little bit iterative, especially processing and extraction, which we will be focusing on today. These steps are not always linear, especially if you're doing something new, either new in the sense that you've never done text mining before, or new in the sense of a new text mining process. You will probably go around a few times before you really get the pipeline that you want. Specifically, today we will cover processing steps like tokenization, standardizing, removing irrelevancies and consolidation; some basic natural language processes like part of speech tagging, named entity recognition and chunking; and some basic extraction: word frequency, similarity and discovery. Don't worry too much if you don't know what these things are yet, because I will go through them all.

Okay, so the first thing you do in processing is turn your raw data, which is what you got out of the retrieval step (which I'm not covering), into something that you can do text processing on. Let's assume you have one great big file with hundreds of newspaper articles in it. You might want to break it into smaller files with one article each; or insert a line break or some other clear delimiter after each article, so that you could import the file into a spreadsheet program and each article would have its own row; or you could turn each article into a dictionary entry, possibly with key-value pairs for the typical article metadata, like author, title and date. All of these are potentially useful ways to process a raw data file into something suitable for text mining, and they lend themselves to different kinds of analysis.
The first, with separate files, is useful if you want to compare the documents, that is, if you want to dig into the text of each file and compare it to other texts. The second, in a spreadsheet, is maybe more useful if you want to analyze the articles as entities: an article on this date by this author as one entity, rather than as blocks of text. The third, the key-value pairs, is useful if you want to discover relationships between those features: not necessarily this article with this date by this author, but how all of the different articles by that author change in their content over time. So there are a lot of different ways to focus text mining, and that influences how you break up your raw text into files suitable for text mining.

However you do this, you will probably want to continue breaking things up, specifically into tokens, because text mining requires a basic unit of natural language processing analysis, which is called the token. Lots of things can be tokens, including whole documents, chapters, paragraphs, sentences, words, word stems and more; the most common choices are words and sentences. So let's look at an example. Imagine that one of my small files, or one of the rows in my spreadsheet, or one of the values in my dictionary, is this string of text. If I tokenize this string as words, it looks like this. Notice that the apostrophe-s counts as its own word, and punctuation counts as its own word too when you tokenize by words. I could also tokenize by sentences, at which point the full stops at the end of the sentences are included as part of the token.

Once you have chosen your particular tokens, and that depends, again, on what kind of analysis you want to do, there are no right or wrong answers here, the next step is standardization. Standardization is about replacing multiple forms of the same token with a single form, and this improves analysis: if you want to count things, you want all the variants that you really consider to be the same thing to be counted together. For example, if colour appears in both American and British spellings, you want to standardize on one spelling system; or if miles per hour is written out in full in some parts of the text but as MPH in others, you want to standardize on one particular form.

One tool that I find useful for standardization is regex, which is short for regular expressions, and it works like a find and replace. It's useful for standardizing terminology, acronyms and things like that. For example, if I know that somewhere in my text both cats and poody cats are likely to appear and I want to standardize on just one, it's perhaps a strange choice to standardize on poody cats, but I'm going to do it. So wherever the text says cats, I run a regex operation to replace it with poody cats. That way I know it will match, and be counted together with, the places in the text that already say poody cats. Now this is going to sound really tedious, and you do have to go through your text lots of times to find out what's in there and what should be treated as the same thing. That is a big part of the exploration and the processing.
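To make that concrete, here is a minimal sketch of word and sentence tokenization plus a single regex replacement in Python, using NLTK and the built-in re module. The sample sentence is a stand-in for the example string on the slides, and you would need NLTK's punkt tokenizer data downloaded (nltk.download('punkt')) for it to run.

```python
import re
from nltk.tokenize import word_tokenize, sent_tokenize

text = "It's raining cats and dogs. It's also raining elephants and becoming a problem."

words = word_tokenize(text)      # "'s" and the full stops become tokens of their own
sentences = sent_tokenize(text)  # each sentence, full stop included, is one token

# Regex find-and-replace for standardization: every whole-word "cats" becomes "poody cats"
standardized = re.sub(r"\bcats\b", "poody cats", text)

print(words)
print(sentences)
print(standardized)
```

Several re.sub calls like that last one can simply be run one after another when you want more than one replacement.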
But once you know these things, you can do multiple operations at once. So I can replace cats with poody cats, dogs with doggos, elephants with rhinos and problem with kerfuffle, and change this sentence into that sentence in one operation. This is not a particularly useful example, but I wanted an example that fits on the screen, which is why I'm using a couple of sentences rather than a proper analysis-sized text.

There are other kinds of standardization tools: things like switching whatever case the letters appear in to all lower case or all upper case, and spell checkers that will standardize to British or American spelling, for example. There are others as well, but they all work a little bit like regex in that they iterate over the text, compare what's in the text to an established dictionary, and swap things in and out depending on how the operation is defined. Tools like the case converters and the spell checkers already have established dictionaries built into them. Each of these tools tends to target a particular kind of standardization, so you might need several different standardization passes.

Now, depending on your analysis, you may or may not want to remove some things that are present in the text but irrelevant to your purposes. Punctuation is a good example if you're looking at words as your unit of analysis. If you remember the word tokenization, the punctuation appeared as tokens of its own. Punctuation is useful and carries a lot of meaning within sentences and paragraphs, but it doesn't necessarily carry much meaning as a token on its own, so you probably want to remove those tokens, to get something like this. In this case, I've highlighted that the apostrophe has been removed from the 's'. That's a choice I made in this particular example; it is not required. You could remove punctuation but leave that apostrophe if you thought it would be useful to you.

Another example of the kind of thing you might want to remove in this phase of processing is stop words. Stop words are usually determiners, conjunctions, adverbs and words like that. Like punctuation, they matter a lot for the sentence-wide or paragraph-wide meaning, but they have little actual content as words in and of themselves. What's more, for any given language they tend to have more or less the same distribution in text regardless of author or style, so they don't necessarily help you analyze the text. The fact that 'the' occurs with a certain frequency is pretty much the same in newspaper articles as it is in grand literature; like punctuation, it happens as often as it happens, and it doesn't tell us much. You can remove these with stop word removal functions, which take a text like this and give you this. As humans, we can still understand this output; it still makes sense to us. Obviously it's not grand literature, and it sounds like someone who doesn't really speak English very well, but we would still understand the content. That is something to consider when you're working through these text mining processes: do you need to run processes that will make the output hard to read? If so, you may want to run two different streams of processes in parallel, so that you also have a version that's a little easier to read.
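Here is a rough sketch, again with NLTK, of how the lowercasing, punctuation removal and stop word removal just described might look in code. The sample text is the same stand-in as before, the particular filtering rules are just one reasonable choice, and the punkt and stopwords data would need to be downloaded first.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "It's raining cats and dogs. It's also raining elephants and becoming a problem."

tokens = [t.lower() for t in word_tokenize(text)]   # standardize to lower case

# Remove punctuation tokens (this simple rule also drops leftovers like "'s")
tokens = [t for t in tokens if t.isalpha()]

# Remove stop words using NLTK's built-in English stop word list
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

print(tokens)
# roughly: ['raining', 'cats', 'dogs', 'also', 'raining', 'elephants', 'becoming', 'problem']
```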
Next is consolidation. This gets a little bit into linguistics and the linguistic forms of the same word. Rather than 'miles per hour' written out in full versus 'MPH' as an abbreviation, this is about things like rain, raining and rains: different forms of the same word. There are a couple of different ways you can do this.

The first is stemming, which is a really aggressive way of stripping back word markers, things like verb endings and plurals. A stemmer works according to rules, for example 'remove -ed from words that end in -ed' or 'remove -s from words that end in -s', and it can sometimes be a little too aggressive. If we run this text through a stemmer, we get something like 'it rain cat dog it also rain eleph become problem'. Again, we can probably more or less still understand this, but it's not clear why 'elephants' was stemmed all the way down to 'eleph'; possibly a rule is trying to catch an adjectival form, I'm not sure. But 'cats' has been reduced to 'cat', 'dogs' to 'dog', 'becoming' to 'become', and so on. If we're happy with this, if this is enough standardization and consolidation and we're ready to move on, then great, we can do that. But sometimes we want to count the verb forms and the noun forms, or the adjective forms, separately, and then the rules for stemming are perhaps too inelegant for us: they don't account for irregular plurals or irregular tenses, and they reduce rain, raining and rains all to 'rain', at which point you can't tell verbs and nouns apart.

Another option for consolidation is lemmatizing. Lemmatizing is similar but a bit more sophisticated: it reduces nouns to singular noun forms and reduces verbs to root verb forms, but it keeps nouns and verbs separate. If we put our processed text through a lemmatizer, we get a new list of tokens, a little different from what we got out of the stemmer. We still have 'cat' and 'dog' instead of 'cats' and 'dogs', but we have 'elephant' instead of 'elephants' or 'eleph'. You'll also notice we have 'raining' and 'becoming' rather than 'rain' and 'become', because unless a part of speech is given to the lemmatizing process, the lemmatizer assumes everything is a noun and effectively becomes just a de-pluralizer. So that's not entirely great. Then again, maybe this is fine, because we do have 'elephant', and 'raining' and 'becoming' are still readable; maybe we're okay with this.
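As a rough sketch of what those two consolidation options look like in code, here is NLTK's PorterStemmer next to its WordNetLemmatizer (with no part of speech supplied), run on a stand-in token list. The wordnet data would need to be downloaded first, and the exact stems you get depend on which stemmer you choose.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["raining", "cats", "dogs", "also", "raining", "elephants", "becoming", "problem"]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# Aggressive and rule-based: plurals and -ing endings get stripped, sometimes too far

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# Without a part of speech, lemmatize() treats everything as a noun, so it mostly just
# de-pluralizes: 'cats' -> 'cat', but 'raining' and 'becoming' are left untouched
```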
But there are other things we can do, which brings us on to our basic natural language processing, starting with part of speech tagging. What lemmatizing couldn't do in the last step was properly reduce 'raining' and 'becoming' to their root words, because we didn't tell the lemmatizer that those are verbs. Part of speech tagging is a basic natural language processing function; sometimes it will count as part of your processing and sometimes as part of your extraction. It uses linguistic structure and linguistic information, so it is text mining and language analysis rather than just cleanup and preparation of data, but it is not usually a goal of natural language processing in and of itself. It's more of an intermediary step that allows you to do much more sophisticated natural language processing. If we start with something like this, which has already gone through some of our processes, and put it through a part of speech tagger, we get something like this: each word paired with a part of speech tag, which here includes pronouns, verbs in gerund form, nouns in plural form, nouns in singular form, and adverbs as well; that's the RB tag for 'also'. And if we take our part of speech tagged text and put it back through the lemmatizer, this time telling it to check the part of speech tags, we get a properly lemmatized output, with the verbs reduced to a root verb form and the plurals properly singularized. We can also choose what to output: the output from the lemmatizer could be string and part-of-speech-tag pairs like we see here, or it could just be strings. If the part of speech tags have served their purpose, we don't have to keep them in the output.

Another basic natural language processing function that you're probably going to want to use at some point is chunking, which takes word tokenized and part of speech tagged text as input and builds it back up into larger structures that have logical relationships. Here I've taken just the first sentence, 'It's raining cats and dogs', and put it through a chunker, and it gives us a chunk like this, where it recognizes that all of these words belong to the same sentence. That's pretty good: we've taken the text apart into words and then built it back up into something that is recognized linguistically as a sentence. I've represented this visually so that it's easy for you to read as people, but what is actually returned from a chunking step like this is a little harder to read. It looks like this; computers can read it, but it's a bit tricky for us, which is why I represented it differently.

A particular kind of chunking operation that you will probably find useful at some point, or at least interesting to play with, is named entity recognition, which aims to group the word tokenized and part of speech tagged tokens into sentences, but also into noun phrases that are associated with people, organizations, places and other kinds of proper nouns. To demonstrate this I need a new sample text, because there were no named entities in my raining cats and dogs example, so I have this one: 'Bruce Wayne is the CEO of Wayne Enterprises, but is also Batman.' If I put this through a named entity recognition chunker, we get this (my formatting has gone a bit squiffy there, never mind). Here we see PERSON for 'Bruce', so it has usefully recognized that that's the name of a person, and ORGANIZATION for 'Wayne'. That's not quite right, but it's probably influenced by the fact that 'Wayne Enterprises' here is definitely an organization. 'Batman' is also recognized as a person, which is very useful, and all of this still sits within the sentence category here. So this is pretty useful. It hasn't worked 100%, but because this is an automated process, you should never expect it to work 100%. If it's really important that it be close to 100%, you will probably have to do a lot of manual revision, or a lot of training of your own named entity recognition chunker: you can start with one that's freely available and then train it to get better at the kind of task you want to give it. But anyway, named entity recognition is pretty useful.
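Here is a hedged sketch of those three steps with NLTK: tagging, lemmatizing with the tags supplied, and named entity recognition. The small wordnet_pos helper that maps Penn Treebank tags onto the categories the lemmatizer expects is my own convenience function rather than anything from the slides, and the relevant NLTK data packages (the tagger model, wordnet, the named entity chunker models and punkt) would need to be downloaded first.

```python
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

tokens = ["it", "raining", "cats", "dogs", "also", "raining", "elephants", "becoming", "problem"]
tagged = pos_tag(tokens)   # e.g. [('it', 'PRP'), ('raining', 'VBG'), ('cats', 'NNS'), ...]

def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag onto the category codes the WordNet lemmatizer expects."""
    return {"V": "v", "J": "a", "R": "r"}.get(treebank_tag[0], "n")

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)   # with the tags supplied, 'raining' and 'becoming' reduce to 'rain' and 'become'

# Named entity recognition is a chunking operation over part of speech tagged tokens
sentence = "Bruce Wayne is the CEO of Wayne Enterprises, but is also Batman."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)   # a tree with PERSON / ORGANIZATION chunks nested inside the sentence
```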
That brings us more or less to the end of the processing section, and it should be clear at this point that there are a lot of different processes, but I can't tell you which ones you will need or what order you will need to do them in, because it really depends on your research, your specialty, how you're planning to approach the analysis, what kind of text you're using, all of these things. There are some rules of thumb. Some processes have clear requirements for their input: things like chunking, or part-of-speech-aware lemmatizing, require text that has already been tokenized and part of speech tagged. Regular expression steps are probably best run before standardizing to lower or upper case, because acronyms and abbreviations often rely on capitalization to be recognizable and distinct. And if you change the way you do things, say you started out by lowercasing, then ran your regex operations, found that didn't quite work, and swapped the order, you should document exactly what you did and start the whole thing again from the beginning, because replicability is important. For example, if you started out by tokenizing, removing punctuation, removing stop words and stemming, and then later decided that actually you want to tokenize, part of speech tag, lemmatize, and then remove punctuation and stop words, you should always start the whole process again with the raw data. That might seem like a huge hassle, but you can write code in Python or R that chains these processes together, so that it's not a huge faff to start from the beginning every time. It might still be a bit of a faff, but it will be an important and replicable faff. And this is an iterative process: you might start out thinking this is the pipeline you want, and by the end it might be this. You will probably have to go through the pipeline a number of times before you find out exactly what pipeline you want. That's just part of it, especially if you're doing a brand new thing, either new because you've never done natural language processing before, or new because you're trying to achieve a new kind of analysis.

So now we're on to basic extraction. You might be surprised, but simply counting words to find out word frequency is actually a really useful thing in text mining. In a very broad-brush way, it identifies the most frequent terms and concepts in a set of data, so it's useful for analyzing, in a broad way, things like mood, opinion, category or what a text is about. For example, if you're analyzing a set of customer reviews for an online service and the words that come up most frequently are expensive, overpriced and overrated, you probably need to adjust your prices or aim at a different target market. You don't necessarily have to read all of the reviews, or do really complex, clever analysis, to get that kind of insight; you can just count words. So let's look at a couple of examples, first a trivial one using our raining cats and dogs and elephants sample.
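As a rough illustration of both points, chaining the steps into one function that can be re-run from the raw text every time, and then counting words, here is a minimal sketch. The preprocess function, its particular steps and their order are one illustrative choice of mine, not the exact pipeline from the slides.

```python
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords

def preprocess(raw_text):
    """Raw string in, cleaned-up list of word tokens out: one re-runnable pipeline."""
    tokens = [t.lower() for t in word_tokenize(raw_text)]   # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha()]             # drop punctuation tokens
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in stop_words]       # drop stop words

raw = "It's raining cats and dogs. It's also raining elephants and becoming a problem."
freq = FreqDist(preprocess(raw))
print(freq.most_common(10))
```

Exactly the same function could then be pointed at a much larger text, for example the raw string returned by nltk.corpus.gutenberg.raw('austen-emma.txt').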
So here's a little visualization of the pipeline I used to get this output, which tells me 'it' occurs twice, 'raining' occurs twice, and 'cats', 'dogs', 'also', 'elephants', 'becoming' and 'problem' all occur once. Now that might be quite obvious to us; the text we're working from is pretty small, and we could have counted it up manually, just jotting down tally marks on a piece of paper. But that would not be a reasonable approach to analyzing the entire text of Emma by Jane Austen. I would not want to go through that, identifying all the words and counting how many times each one occurs. Emma and many other classic works, by the way, are available as a corpus that you can use through NLTK's Gutenberg corpus functions, and I cover this in the code demos; I'll share the link at the end, so you can see the code, run it, and play around with Emma yourself. So again, here's the pipeline I used, and this is the output: the 10 most common words when I ran this analysis on Emma. 'Mr' occurs more often than anything else; 'Emma' is still pretty common; then 'could', 'would', 'miss', 'must', 'Harriet', 'much', 'said' and 'think'. All right, pretty good. But we don't necessarily just want all the words, or even the most common words; sometimes we want to know how many times one particular word occurred, and that's an option too. You can get back the count of a particular target word. In this case, the target word occurs 142 times, so it's outside our 10 most common words, but it's still pretty common.

Now, word similarity. If we're not interested just in how many times a word occurs, but want to know how many times similar words occurred, maybe we can count them together, or we can contrast them: this field of research uses this word, whereas a very similar word occurs in this other field of research, and these two research communities are not talking to each other. That might be the kind of analysis you want to do, and for that you will want word similarity. This uses the concept of word vectors, which are built into packages like spaCy. Each word included in the word vectors package has a vector: a score on each of 300 dimensions. These scores are derived from how the word is used in large corpora of natural language: what part of speech it is, or is most frequently used as; what words are typically found before and after it, which is called collocation; and other kinds of linguistic analysis. You then compare the word vectors between words, and when you do, you get a score between zero, which is no similarity, and one, which is identical.

As an example, let's say we have three words: troll, elf and rabbit. All three are nouns, so they're going to be somewhat similar, but two of them are fictional creatures, so those two should be more similar to each other than either is to rabbit. And if we look at the pairwise word similarity, we find that to be true: troll and elf score 0.4 on that similarity measure, whereas troll and rabbit score only 0.29. Interestingly, elf and rabbit are more similar than troll and rabbit. I assume that's because they occur in similar kinds of texts, or they both live in the woods, or they're both seen as positive; I don't know exactly why, but according to the word vectors, rabbit and elf are more similar than rabbit and troll, and that kind of makes sense.
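Here is a minimal sketch of that word similarity comparison with spaCy. It assumes a model that actually ships with word vectors, such as en_core_web_md or en_core_web_lg (the small model does not have real vectors), and the exact scores you get will depend on the model and version.

```python
import spacy

# Install the medium English model first with: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

troll, elf, rabbit = nlp("troll elf rabbit")   # each token carries a 300-dimension vector

print(troll.similarity(elf))      # two fictional creatures: expect a relatively high score
print(troll.similarity(rabbit))   # expect a lower score
print(elf.similarity(rabbit))     # exact values depend on the model and version
```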
You can do something very similar for document similarity, and this is a really interesting option. It works like word similarity, but instead of using preloaded word vectors, it builds a vector based on your documents. This is really sensitive to how your documents are processed: whether you removed the punctuation, whether you stemmed or lemmatized, all of that will influence the document vector in a big way. But what the analysis does is build a vector for a document and then compare the vectors between two or more documents, and again it returns a value between 0 and 1.

As an example, Emma and Persuasion are both novels by Jane Austen, and they score 0.99 similarity: they are very similar, not least because they're books by the same author. Emma by Austen and Julius Caesar by Shakespeare score 0.97, which is less similar, but still very similar; they're both English-language, both classic literature, both fiction, things like that. Emma and the Firefox selection from NLTK's Webtext corpus score only 0.86. While Emma and the Firefox text are both in English, the grammatical structure of web text is very different from the grammatical structure of fiction written a long time ago, so they score much lower in similarity. You could use this as a way of testing the likelihood that two texts were written by the same author, or in the same timeframe, or something like that. This is fundamentally the basis for how plagiarism checkers work, except that instead of scoring similarity, they score a sort of originality, and rather than using just a simple document vector, they also look at sub-document sections to see whether things are quoted too closely, whether particular phrases appear identically in each text, or whether they're just very similar. So there is much more complex work that can be done, but this is just a basic natural language processing webinar.

Okay, and finally, discovery. We got a sense from word frequency counts of which words are important to a document, and we can extend those to collocation counts: how many times does word A occur within one word, or within two words, of word B? You can ask, for example, how many times 'cat' and 'hat' appear within a four-word window of each other. That's collocation, and it is one kind of discovery. You might, for example, want to know whether a particular noun is more often paired with positive adjectives or with negative adjectives, and that's a useful kind of discovery that collocation gives you. But I want to show you a slightly more sophisticated approach that is based on patterns. First, you define a pattern to discover context and use; in this case, the word 'like', followed by the word 'a', followed by a noun. If we apply this to Emma as a text, we get 'like a look', 'like a merit', 'like a gentleman', 'like a job', 'like a woman', 'like a bride', 'like a brother', 'like a daughter'. So we know that all of the things that come after 'like a' are going to share something in common; in fact, quite often they refer to people: gentleman, woman, bride, brother and daughter are all people. And that's a useful thing.
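Here is a rough sketch of both ideas with spaCy: a document similarity comparison and the simple 'like a' plus noun pattern search. The model name and the two text snippets are stand-ins of mine, and the scores and matches you get depend on the model's vectors (use en_core_web_md or larger) and on how the documents were processed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")

# Document similarity: each document gets a vector, and the vectors are compared
doc_a = nlp("Emma Woodhouse, handsome, clever, and rich, had lived nearly twenty-one years in the world.")
doc_b = nlp("The browser keeps crashing whenever I open too many tabs at once.")
print(doc_a.similarity(doc_b))   # between 0 (nothing alike) and 1 (identical)

# Pattern-based discovery: the word 'like', then 'a', then a noun
matcher = Matcher(nlp.vocab)
matcher.add("LIKE_A_NOUN", [[{"LOWER": "like"}, {"LOWER": "a"}, {"POS": "NOUN"}]])

doc = nlp("He looked like a gentleman and she walked like a bride.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # e.g. 'like a gentleman', 'like a bride'
```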
So we know that, for this author anyway, this kind of structure tends to produce comparisons to people. But I'd argue it can get much more sophisticated than this if we use a more complex pattern. For example, if we define the pattern as a verb, followed by the word 'like', followed by the word 'a', followed by up to three modifiers, followed by a noun, we get things like 'looked like a sensible young man', 'argued like a young man', 'appear like a bride', 'seems like a perfect cure', 'enters like a brother', 'writes like a sensible man'. These tell us more about how the author uses the phrase 'like a'. Again, these are mostly about people, but we also find that certain verbs and certain adjectives go with certain kinds of people. For example, just from this short list we get the idea that men are associated with arguing and writing, as well as with being young and sensible in various combinations. If we continued this analysis, we might check whether women are also associated with arguing and writing, or with other things like, I don't know, playing the piano or walking in the park. Likewise, we could check whether women are associated with being young and sensible, or with being old and insensible, or middle-aged and mathematical, or any other combination of adjectives that might be interesting to look at.

You can see how this is a little bit like collocation, in that we could also find instances of 'young men' or 'sensible men' as simple collocations. But 'man argued' or 'man writes' are just not very likely word combinations to appear in the text; instead you get 'Mr. Darcy writes' or 'he argued' or something like that. The subjects are still male, and we can tell that with other kinds of linguistic analysis, but these are not pure collocations. You can use a structure like this, these kinds of 'like a' phrases, and interpret it to mean that certain stereotypes exist, either for this author or at the time of writing, or that there are ideals, like that young men should be sensible, or expectations that young men are likely to argue. These findings are a little more complex and a little more interesting than just whether the words occur together: the fact that they occur in this structure tells us a bit more about how the author thinks, what these words mean, and how society interprets these concepts.

Now, really we've just dipped our toe into natural language processing, and a big chunk of today's webinar was taken up with the processing part, so we're about to wrap up. I do want to provide you with some links. The first one will take you to our GitHub page; if you go to the code branch, you can see a Jupyter notebook for processing and a second Jupyter notebook for extraction, which work through interactive code examples for all the different things I've shown you today, plus a few more, like how to actually remove punctuation and different ways to analyze text using the basic concepts from today. There are also links to some resources: the Natural Language Toolkit and its corpus functions, where you can get the web text or the classic literature books, and spaCy, which is similar to the Natural Language Toolkit but more recent and arguably a lot more powerful.
So depending on whether you feel up to taking on something a little more powerful, that might be interesting. Also, if you're interested in how word vectors and document vectors work, there's a semantic vectors package on GitHub, and an excellent book called Geometry and Meaning by Dominic Widdows, in which he explains how you get meaning from words by creating these word vectors of 300 or more dimensions, and how you can use a geometric interpretation of words to analyze similarity and function and even translation. It's a very interesting book, but it is absolutely chock full of maths and theory. If you're into that, it'll be right up your street; if you're not, it will be a coaster.