So, hello. My name is Dr. Julia Kasmeyer, and my role is to teach computational social science with the New Forms of Data training team on behalf of the UK Data Service. That is quite a mouthful. But more importantly, today I will be talking about text mining, specifically about just two of the more advanced options that you might want to explore with the text mining and natural language processing skills that you have, hopefully, developed through this text mining course, though I will allow that you might also have gained your skills elsewhere.

Before we dive directly into that, I do want to draw your attention to a couple of recent and upcoming things that may be interesting to you. Recent webinars: on being a computational social scientist, the first two sessions of this text mining series, a web scraping webinar series, and then some code demos that focus on being a computational social scientist and also on web scraping and APIs. You can get those through the past events under the news and events tab on our website, the UK Data Service. You can also get the recordings on our YouTube channel. And upcoming: tomorrow we have a health studies user conference, and in July we have a multi-day event on social data in the third sector. I haven't scheduled the dates yet, because everybody's unclear about what's happening, but as soon as we get some dates I hope to advertise text mining code demos, in which I will live-stream as I work through the code notebooks and take questions and answers about how that works.

So if you attended either of our previous two webinars in this text mining series, you might remember that text mining has four basic steps: retrieval, processing, extraction, and analysis. Today we are focusing on extraction, especially on two advanced options. These are more advanced than the ones you saw last time, but they are similar and in fact use a lot of the same skills; they build on them and then go a step further. Specifically, today I cover classification tasks, including sentiment analysis, and I cover entity extraction, especially as it relates to creating social networks and network graphs. The processes, the basic NLP, the basic extraction, those are covered in the first two webinars in the series; advanced extraction is what I'll be doing today.

So let's think about classification as a task, as a thing to achieve. We all understand the concept of sorting things into categories. We have been doing this most of our lives, quite often subconsciously, and not always to our benefit. But let's set that aside and imagine that we wanted to sort some really difficult-to-sort things, like, for example, 100,000 old scientific articles, and we want to sort them into the modern scientific fields or disciplines, because maybe we want to study the development of those fields over time. We want to see how people influenced each other, which fields maybe had an influence on other fields. This is a laudable goal, but it is difficult, because depending on the age of those articles, they may or may not have keywords or abstracts. They might not have been published in journals at all. They might have been published in journals that no longer exist or that don't have any clear link to the modern scientific fields that we understand today. They may use terminology that we don't use. They may not even have titles in the way that we're used to having titles.
So basically, all of the easy things that we'd usually use to sort articles quickly may not be available to us. Now, we certainly don't want to read all 100,000 old articles, especially because to get a good judgment about which category, which field, those old articles belong to, we would need multiple people to read them and then tally up the different people's judgments to see which ones match. That's a lot of work. But how about we get a computer to do it for us? More accurately, we get a computer to estimate which field it thinks we should put each article in. This means that we can review a percentage of them and see how closely we agree with the computer's judgment. And if we agree very, very closely, then we can just accept all of the computer's judgments, or we can accept all of the ones that the computer is very confident about and only focus on those that the computer is unclear or a bit borderline about.

But how do we do this? How do we get the computer to do that for us? Well, that is a classification task. And to accomplish a classification task automatically, you need a set of documents. In the example we're talking about here, this is the 100,000 old scientific articles. You also need a set of categories into which the documents belong. In this thought experiment, this would be a list of the modern scientific disciplines we want to match the old articles to. It could be a simple, flat list, or it could be a hierarchical list; these are both options. Finally, you need a tool to do the classifying. Generally we're talking about machine learning or deep learning tools, and these are typically naive Bayes classification tools, though there are other options; that's the one I'll be talking about today. And to be terribly honest with you, I'm not entirely sure I'm saying that right, because it's the kind of thing I've only seen written down.

Nevertheless, what a classification task gives you, if you get a computer to classify your things automatically, is a prediction about which class it thinks every document should belong to. Typically that's a number between zero and one. So, for example, our classifier has suggested that this particular scientific article is 0.54 likely to be a physics paper, 0.25 likely to be a maths paper, 0.05 biology, and 0.001 bananas. I just quite liked bananas as an icon there. It would give us a different set of probabilities for different papers. So, this one is likely to be biology, although it could be bananas, and not really close to physics or maths. This one, very, very clear that it is almost definitely bananas, couldn't be biology, really unlikely to be physics or maths. So, this is the kind of output that you might get from an automatic classification task.

Now, this is a matter of machine learning, as I said, and that means that you take a set of training documents that are already correctly classified. This is probably a manual process of classifying those, but you could take a set of documents that's been classified by another machine learning tool that you're confident is doing it correctly. What you do then is provide these training sets, with the correct classifications, to a machine learning algorithm, which can use a few different processes, but the one we'll be looking at is sort of a fancy-pants version of word frequency counts.
So, what it does is it breaks the document down into words, finds the most frequent words and frequent word combinations, and associates those words with the category. For example, words like quantum or boson or quark probably only occur in physics articles. So, if a document has one or more of those words, it is very probably a physics document. Likewise, biology articles have their own words and phrases that are more or less unique to biology, maths has its own words and phrases, et cetera, et cetera.

So, once you have taught your algorithm how to recognize the words in your articles and to link them with categories, you need to test it. That is, you provide it with a new set of documents. These are also correctly classified, but none of these documents will have occurred in the training set. This time, you ask it to guess what the classification should be, and you can test its performance, ideally against a benchmark. That benchmark might be the decisions that humans have made on which category these documents should be classified into, but it could be another machine learning algorithm. Maybe you're trying to improve on the standard, at which point you only have to do better than the standard, not necessarily as well as people.

There are three measures of performance that you ought to be aware of. Accuracy refers to how many of the predictions your model makes are completely correct. Precision is the ratio of true positives to true positives plus false positives in your predictions. And recall is the ratio of true positives to true positives plus false negatives in your predictions. These are different ways of describing performance, and whether you want to try to improve on one of them, or whether you really don't care about precision and recall and just want accuracy, depends on your research question: whether you're willing to tolerate more suggestions for a category even though some of them are wrong, or whether you'd rather have fewer suggestions, maybe missing out on some that are correct, but with absolutely no wrong suggestions. This depends on your research question, your goals.

Now, sentiment analysis is a classification task; it is one example of a kind of classification task. But instead of classes like physics or maths or biology, the classes are things like positive or negative, and sometimes also neutral. And there are training and test sets for these too. There's actually quite a lot of sentiment analysis training and test data already available; you can just download it. But you can make your own. So if you really want to do a very specific kind of sentiment analysis, maybe rather than just positive or negative you want to test sarcastic or not sarcastic, or something like that, you'll probably have to make your own training and test sets. Then you need a learning algorithm. There are lots of these available, and they are potentially quite useful, certainly as a starting point, as a benchmark to compare your results to. But if you're doing an entirely novel task, or you're using entirely novel training sets, benchmarks may or may not be relevant; performance metrics still are. You have to know your performance metrics and publish them as part of your results and your work, but you don't necessarily have to beat a benchmark if you're doing something very, very novel. Essentially, you're setting the benchmark. Now, improving those performance metrics involves re-training, or adding new training data to your task in response to its faults.
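If you want to see what that whole loop looks like in code, here is a minimal sketch of training, testing, and measuring using NLTK's NaiveBayesClassifier. The toy documents and the physics/maths/biology labels are invented for illustration; real training and test sets would be far larger.

```python
import nltk

# Toy labelled documents standing in for our classified articles
# (invented examples; real sets would have thousands of items).
train_docs = [
    ("the quantum boson decays rapidly", "physics"),
    ("we prove the theorem by induction", "maths"),
    ("the cell membrane regulates proteins", "biology"),
    ("quark interactions in the collider", "physics"),
    ("a lemma on prime number distribution", "maths"),
    ("gene expression in plant tissue", "biology"),
]
test_docs = [
    ("boson collisions in the detector", "physics"),
    ("enzyme activity in the cell", "biology"),
]

def features(text):
    # Bag-of-words features: each word that is present maps to True.
    return {word: True for word in text.lower().split()}

train_set = [(features(text), label) for text, label in train_docs]
test_set = [(features(text), label) for text, label in test_docs]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Accuracy: the proportion of test documents classified correctly.
print(nltk.classify.accuracy(classifier, test_set))

# Precision and recall for one category, computed by hand:
# precision = TP / (TP + FP), recall = TP / (TP + FN).
predicted = [classifier.classify(feats) for feats, _ in test_set]
actual = [label for _, label in test_set]
tp = sum(p == "physics" and a == "physics" for p, a in zip(predicted, actual))
fp = sum(p == "physics" and a != "physics" for p, a in zip(predicted, actual))
fn = sum(p != "physics" and a == "physics" for p, a in zip(predicted, actual))
print("precision:", tp / (tp + fp) if tp + fp else 0.0)
print("recall:", tp / (tp + fn) if tp + fn else 0.0)

# prob_classify gives the per-class probabilities described above.
dist = classifier.prob_classify(features("quark and boson experiments"))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
```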
So let's take a little bit of a closer look at how sentiment analysis and model training work. Imagine that we have a prepared training set and that it looks like this: each item in this set is a string, and each item has a class, either pos or neg, at the end of it. Similarly, we have a prepared test set, and it looks like this. It is more or less the same as our training set, but it shares no documents in common: there's no sentence that is in the training set that is also in the test set.

All right. Now, we have these prepared training and test data sets, but we do need to train our own classifier, which is likely, as I said, to be a naive Bayes classification tool, though there are other options. More or less, what these do is take the training data that you pass them, extract features from the strings, and associate those features with either pos or neg. Then, for testing, the trained classifier sums up the features that it thinks are going to be pos and the features that it thinks are going to be neg, and it counts up whether pos outweighs neg. This is a bit like trying to reverse-engineer marking criteria from marked exam papers: it takes the marked papers in the training set, tries to work out what features of those lead to the mark, and then tries to apply those criteria to the test set. Accuracy for both, for a sentiment analyzer like this or for reverse-engineered marking criteria, comes from applying the criteria to the test set and then comparing against the real results.

So let's look at a really trivial example. Imagine my entire training data set was these two sentences, one positive and one negative. Each sentence has some unique words and some shared words. If we're talking about extracted features, that comes down to words, and we have love and sandwich as positive, while can't, deal, and with are negative. I and this are neutral words, as they appear equally often in positive and negative sentences. If our test sentence is "I deal with sandwiches", the sentiment analyzer would find one neutral word, two negative words, and one positive word. It would probably return a prediction of neg, which numerically would come out as -0.25. This is because "I deal with sandwiches" isn't strongly negative, but it is more negative than it is positive; "deal with" implies kind of a hassle or a bother. So the prediction matches the real class assigned to the sentence, and our accuracy in this case would be good.

You can specify more sophisticated classifiers: ones that look at co-occurrence rather than individual words, that look at semantic roles, that look at key words and give them more weight rather than using simple word frequencies, or that pay more attention to first or last words. There are a lot of options you can use to make your classifier pay attention to the things that you want it to pay attention to.

But let's talk for a moment about efficiency. The sentiment analysis training and test data sets I've shown you here are really small, like trivially small. Real data sets have tens of thousands, hundreds of thousands, or even millions of items. Training a classifier on very big data sets means associating a score with every extracted feature in the training set, which can be very computationally expensive. Many of those features won't even be relevant, like a full stop at the end of a sentence: it applies equally to positive and negative sentences, so full stops are just not useful features.
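To make that trivial example concrete, here is a minimal hand-rolled sketch of the word-polarity arithmetic. The crude normalisation step, so that "sandwiches" matches "sandwich", is my own simplification for illustration; a real classifier would extract and fit features properly rather than averaging word scores like this.

```python
from collections import Counter

# Toy training set: one positive and one negative sentence.
train = [("I love this sandwich", "pos"), ("I can't deal with this", "neg")]

def stem(word):
    # Deliberately crude normalisation so 'sandwiches' matches 'sandwich'
    # (illustration only; a real pipeline would use a proper stemmer).
    word = word.lower()
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

# Count how often each word appears in pos and neg sentences.
counts = {"pos": Counter(), "neg": Counter()}
for sentence, label in train:
    counts[label].update(stem(w) for w in sentence.split())

def polarity(word):
    # +1 for words seen only in pos, -1 for words seen only in neg,
    # 0 for words that are shared (neutral) or unseen.
    in_pos, in_neg = word in counts["pos"], word in counts["neg"]
    if in_pos and not in_neg:
        return 1
    if in_neg and not in_pos:
        return -1
    return 0

def score(sentence):
    # Average polarity across the words of the sentence.
    stems = [stem(w) for w in sentence.split()]
    return sum(polarity(w) for w in stems) / len(stems)

print(score("I deal with sandwiches"))  # -0.25: one positive word,
                                        # two negative, one neutral
```

The -0.25 falls straight out of the averaging: one positive word minus two negative words, over four words in total.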
You can process your training data to reduce the number of hopefully irrelevant features, like punctuation, stop words, and different forms of the same word. This just reduces the number of extracted features that will turn out to be neutral. Your computer will be much less happy if you tell it to extract and associate the features of unprocessed text than if you give it processed text. This is really important if your research is large in volume and computationally intensive, but also limited in resources, which, to be fair, is most of our work. So consider all of the pre-processing things that we covered in previous webinars.

Now, that's roughly it for classification tasks and sentiment analysis, so I want to move on to network graphs. Network graphs are like a map of relationships between things. The things, which are usually called nodes, can represent just about anything you want: they can be people, businesses, cities, social movements, rock bands, anything really. The relationships between those things are called links, or often edges, in network graph terminology, and they can represent just about any kind of relationship: social relationships, financial relationships, influence, oppositional relationships like competition, basically anything. You just have to define these things for yourself.

There are a few main features of networks that I want to explain, because it's important for you to understand what you need to extract from text in order to build these network relationships. Network graphs can be undirected, meaning the edges are bi-directional, which is the same as saying they are directionless. This is meant to demonstrate a reciprocal, equal, directionless kind of relationship. For example, Noah and Cherie here work together, so they have the same relationship: Noah has a colleague relationship with Cherie, and Cherie has a colleague relationship with Noah. But network graphs can also be directed and represent non-reciprocal relationships. You can actually have edges going in both directions, but they would indicate that the different people have unequal or imbalanced relationships. So, for example, the woman at the center, who I'm going to assume is a medical doctor because she has a stethoscope and a notebook, and all the people around the edges are her patients. She has a blue relationship with them, indicating that she is responsible for them as their GP, whereas the people have a purple relationship with her: they're not responsible for her; they're asking her for advice, or relying on her for information and management of conditions. So this shows the difference between directionless and directional graphs.

Network graphs can also be unweighted, meaning that the edges all have equal weight. In this case, Noah, Cherie, and their other colleague, who I'm going to call Marisol, all have the same weight of relationship: they're all colleagues, and they're all equally colleagues. Maybe they all joined at the same time, or they work with each other about equally often, so there are no closer ties there. On the other hand, if they don't work together equally, or they joined at different times, you might find that they have weighted relationships. So in this case, we might find that Cherie and Marisol work together three days out of the week,
Cherie and Noah only work together two days out of the week, and Noah and Marisol hardly ever work together, maybe only one day a week, maybe only every other week. So in this case, the edges represent not just that they have a relationship, but how strong that relationship is, or how frequent the interactions are, or maybe how meaningful or important or old the relationship is. The weights can indicate whatever you want them to indicate, but they do need to be consistent.

All right. So, having understood a bit about what network graphs are and which features you might need to extract, let's look at how to actually extract those features. Let's say our data set included sentences like "Archibald walked through Manchester with Beryl", "Tariq saw Beryl when she was playing tennis", and "Archibald shares a house with Beryl and Cherie". Should be Carys, shouldn't it? Anyway, the basic process, just like we did in the basic natural language processing webinar two weeks ago, is to break each sentence into words and to label those words with their part of speech. Then we want to put it through a named entity recognition chunker. This divides those part-of-speech-labeled words into chunks that are labeled with, for example, sentence, which is the whole chunk, or person, which is only this chunk. So Archibald is correctly labeled as a person, as is Beryl here. That's the next step.

Carrying on, we want to extract the relevant chunks. So from each sentence, I have pulled out the names of the people involved. And then once you have the people, you want to find the unique ones. The first list was the people by sentence, whereas we want all of the people just once, regardless of how many sentences they appear in. These are the unique chunks, and they will become the nodes in your graph.

Now for the edges, we want to go back to the people by sentence, because we're going to use occurring within the same sentence as indicating a relationship. That is, we want an edge between Archibald and Beryl, because they occur in the same sentence. And if we read the sentence, it's clear that they do have a relationship: they walked through Manchester together. Obviously, this one's more clear: the three of them are housemates, so they have a housemate relationship. That's a pretty clear relationship, and it certainly deserves an edge in our network graph.

So now we want to find co-occurring pairs, and to do that, we pass the people by sentence through a pairwise permutation function that finds every combination of named entities per sentence. For the first two sentences, we get two results each: (Archibald, Beryl), (Beryl, Archibald), and then (Tariq, Beryl), (Beryl, Tariq). The third one, because it had three people, actually gives you six edges, because each person is connected to the other two people. Now, in this setup, the way we have this interpreted, all of these edges are bi-directional and unweighted. That's not terrifically realistic, because sharing a house with someone is clearly a much stronger relationship than seeing them play tennis in the park. So we could add weights, and that's not necessarily very difficult: you just add a comma and a number after the extracted entities, so (Archibald, Beryl, 1) says the link between Archibald and Beryl has a weight of one. Now, clearly, seeing someone play tennis in the park is not as strong a relationship as going for a walk together, and for the second of that pair, we could actually eliminate the reverse edge altogether, making this a clearly one-directional relationship.
Or we could leave it but give it a very, very small weight, because we could assume that if he saw her playing in the park, they were in the same physical, geographic area, and that's still a kind of link, depending on what kind of analysis you want. It's not a social link, but it is a geographic co-occurrence link, and maybe that's what you're interested in. On the other hand, sharing a house is clearly a much stronger relationship, so all of those links get a weight of 20. We could also skip all of this weighting of individual links and say that they're all one, but we would still be able to weight the edges in the graph, because Archibald and Beryl share more than one link. So even if every link were weighted one, we could get some relationships being stronger than others by summing: with every new link that's added, we check whether it already exists, and if it does, we add more weight to it. So there are different ways to add weight and direction to extracted edges. And for things like "saw", part of your extractor can use semantic information about a verb like saw that has a clear subject and object. If you wanted to use that kind of feature to drive the weights or the directions of your edges, it would probably suggest only this one edge; it wouldn't return the full pairwise set. But again, these are choices that depend on your research question. Do you want every possible relationship between these people, but weighted? Or do you want only clear, obvious, certain kinds of relationships? It depends on your work.

So let's look at our basic network graph with undirected and unweighted links. In this case, we see Beryl in the strongest position in this network, because she has the most links. She's quite central, whereas Tariq has only one link, Archibald has two, and Carys has two. So there are already some network features we can extract, even from this trivially small graph, like average shortest path or centrality, different kinds of features like this. Now take the same network graph, but this time with directed and weighted links. Beryl still comes out as really important, because she has a really strong relationship with Archibald, a medium relationship with Carys, and a weak relationship with Tariq, whereas Carys has two medium relationships. So in this case, we get all the same features, plus extra ones that account for things like the weight of the edges.

Now, I do want to encourage you, if you're interested in this, to go to the Jupyter Notebook, the link to which will be shared at the end of the webinar, and have a look at how I've created this graph. It still has Carys here, and there's Tariq, Beryl, and Archibald, but this time there are a lot more things, because the set of natural language data that I used to extract these nodes and edges was more extensive. You can also go through how to create a circular layout graph like this, or this sort of spring layout graph. This one is weighted: you can see this relationship is stronger; it's got a different quality to the link. That Jupyter Notebook will walk you through all of the steps: extracting the names, putting them into a nodes list, feeding that nodes list to the network graph visualizer, extracting the edges, and things like that.
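If you want a condensed sketch of that pipeline in code (this is not the notebook itself, just an illustrative reconstruction), it might look something like the following, with NLTK doing the tagging and chunking and NetworkX holding the graph. I use itertools.combinations rather than full permutations because this graph is undirected, and note that there is no guarantee NLTK's default chunker will actually tag these invented names as PERSON.

```python
import itertools
import networkx as nx
import nltk
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('maxent_ne_chunker'), nltk.download('words')

sentences = [
    "Archibald walked through Manchester with Beryl.",
    "Tariq saw Beryl when she was playing tennis.",
    "Archibald shares a house with Beryl and Carys.",
]

def people_in(sentence):
    # Tokenise, part-of-speech tag, then NER-chunk, keeping PERSON chunks.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "PERSON"
    ]

people_by_sentence = [people_in(s) for s in sentences]

G = nx.Graph()  # undirected; use nx.DiGraph() for directed edges

for people in people_by_sentence:
    # Every pair of people co-occurring in a sentence becomes an edge.
    for a, b in itertools.combinations(people, 2):
        if G.has_edge(a, b):
            # Edge already exists: repeated co-occurrence adds weight.
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G.nodes())           # the unique people: the nodes
print(G.edges(data=True))  # co-occurrence edges with summed weights
```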
So there are all of the different steps in there, and a lot of them will be familiar, because it still breaks the original documents down into words, part-of-speech tags those words, and named-entity-recognition chunks those tagged words. That should be fairly familiar, but there are extra steps added on that allow you to do more complicated, more interesting things.

So, just as a reminder, this is our GitHub page, which takes you to our text mining code notebooks. These are the notebooks that you can launch in your browser to work through the different code sections that I have. You can see which packages I use, which features I use, the commands, all of that stuff. I also want to recommend, of course, the Natural Language Toolkit book, which is available online. It's free to access, very cool, along with the different corpora that you can get through it. spaCy is more up-to-date; it's a more recent alternative to NLTK, and some people might find it does things a bit more flash-bang-whizzy. Semantic Vectors is a package that you can get on GitHub, and it's actually from Dominic Widdows, who wrote a really good book about geometry and meaning: that is, what it means to extract features from words based on text, to create a word vector that has its meaning embedded in it, and to compare or combine those word vectors. It's really abstract, but absolutely fascinating. Finally, the NetworkX Python package. I really recommend it; it's what I used to create those visualizations, these two. And there's a lot you can do: you can change the size of the nodes, you can change the way the edges are represented and their color, you can even change the font of the node labels. All kinds of really cool stuff if you want good graph visualizations.
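As a final illustration of that styling, here is a minimal NetworkX sketch, with invented names and weights, showing a layout choice plus node sizes, edge widths, and label fonts:

```python
import networkx as nx
import matplotlib.pyplot as plt

# A small weighted graph like the one built above (invented weights).
G = nx.Graph()
G.add_edge("Archibald", "Beryl", weight=20)
G.add_edge("Beryl", "Carys", weight=10)
G.add_edge("Beryl", "Tariq", weight=1)
G.add_edge("Archibald", "Carys", weight=10)

# Choose a layout: circular, or a force-directed 'spring' layout.
pos = nx.spring_layout(G, seed=42)   # or nx.circular_layout(G)

# Size nodes by degree, scale edge width by weight, style the labels.
sizes = [300 * G.degree(n) for n in G.nodes()]
widths = [G[a][b]["weight"] / 4 for a, b in G.edges()]
nx.draw_networkx_nodes(G, pos, node_size=sizes, node_color="lightblue")
nx.draw_networkx_edges(G, pos, width=widths, edge_color="grey")
nx.draw_networkx_labels(G, pos, font_family="serif", font_size=10)
plt.axis("off")
plt.show()
```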