OK, thank you very much, Edward. Thank you, everyone, for coming. As Edward mentioned, I'm working at CeADAR on a Marie Curie / Enterprise Ireland Career-FIT fellowship. My interest is in natural language processing and text analysis as applied to data from the social sciences and humanities: things like historical texts, political texts, social media text, and legal text. As my partner project in CeADAR, I work with a company called Corlytics, who work on analyzing financial regulations and regulatory documents. But today I'm going to talk to you about natural language processing in general, then look at some examples taken from social media text, along with some ways of quickly developing prototypes and dashboards for analyzing texts like that.

The package we're going to talk about at the end, for visualization, is a component of the R ecosystem called Shiny, which lets you very quickly develop dashboards and interfaces that include sliders, radio buttons, dropdown boxes, and text areas, and hook machine learning components into them without writing any JavaScript. I'm sure a lot of people here work in Python, and Python is excellent for natural language processing. But in the social sciences R is very widely used, so people might not know JavaScript or Python, and there are now really good tools in R for doing this. So if you know R but don't know Python, it's not as necessary as it once was to switch.

If you put NLP into Google, the first results you get are often about neuro-linguistic programming, which is a kind of business psychology concept. It has nothing to do with the academic area I'm talking about here. This is natural language processing: the automated, computational processing of natural language. There's a slight distinction to draw there. Natural language means English or other human languages, like the language I'm speaking now. It's the analysis of that, and not of computer languages or other types of information encoded in text. For example, there's a lot of work on analyzing genetic data in text form; that is text analysis, but it's not natural language processing.

And there are lots and lots of applications for this: a huge pipeline going from when you first speak into a system, such as your phone, down to the responses and the interaction you get back. Voice recognition, automatic translation, parsing, recognizing entities: these are all really dense research areas in their own right. If you look in Google Scholar for any one of them, there are hundreds of new papers every year. I'll show you some quick examples of those, then focus in on classification in particular, and then on exploring word meanings, an area called distributional semantics.

On the packages I mentioned before: if you're working in R, two commonly used text packages are quanteda and tidytext. In Python, NLTK used to be very widely used; I think it's been overtaken a little by two other packages, spaCy and the Stanford CoreNLP toolkit. The Stanford toolkit is actually a Java toolkit, but it has Python bindings too. This is an example of the output from spaCy's syntactic parser: an automated parse of a sentence.
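If you'd like to reproduce a parse like this yourself from R, here is a minimal sketch using the spacyr package, which wraps spaCy; it assumes spaCy and a small English model are already installed on your machine.

```r
library(spacyr)

# connect to a locally installed spaCy model (assumes spaCy is set up)
spacy_initialize(model = "en_core_web_sm")

# tokens, lemmas, parts of speech, and dependency relations in one data frame
spacy_parse("We are building a better health service and providing more care.",
            lemma = TRUE, pos = TRUE, dependency = TRUE)

spacy_finalize()
```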
If you read down from the top there on the left, the sentence is "We are building a better health service and providing more care", and in this table you can see all the structured information that a modern NLP system is able to extract. The second column, the lemma, comes from a process similar to stemming, where the inflected form of a word, like "providing" or "building", is reduced to its root morpheme or lemma, so "build" or "provide". Stemming just cuts off the inflection at the end of the word; lemmatization is a little more clever. You can see on the second line that it converts the verb "are" to its root form "be": whether it's "are" or "was" or "is", it gets converted to the root "be". The final three columns encode the syntactic information about the sentence. You've got the detected part of speech of each word, and the last two columns are easier to understand if you see them as a tree: for each word, they tell you the type of syntactic relation it's in with another word in the sentence. So "we" is the noun subject of the verb "build". That's really useful for extracting propositions, predicates, and logical expressions from texts, because the same proposition can be realized in surface language in many, many different ways.

Okay, so you're probably fairly familiar with the idea of classification, even as it's applied to documents or texts. Email spam detection is a type of classification: you want to say whether a document in your inbox is spam or not spam. We might also want to do a kind of hierarchical classification by placing a document into a taxonomy. For example, we might want to say that a comment on a social media website is spam, or that it's unwanted in a different way, like being too toxic or abusive. That's the same problem as classification, just arranged hierarchically. And the task can also be continuous; there's very little difference in practice in how you implement this, because your output function can be something that converts a toxicity score into a binary yes/no decision for spam or not spam, abusive or not abusive. Sentiment, which you've probably heard of, means the positive or negative emotional content of a comment or a product review. Whatever the specific outcome, classifying or placing on a scale, the method is fairly similar: we extract information from the text, like words, phrases, and syntactic relations, and use these as features in a model to get an output.

Then there's an unsupervised version of this, where we extract features from the text, again things like word frequencies, phrases like bigrams or trigrams, or syntactic properties, and, without any particular classes in mind, simply scale the documents on a 2D or 3D map and see which documents are naturally similar or close to one another.

In the supervised case, this has all of the usual evaluation metrics of a classification task. So you have to consider, as always, your business case. If you're a social media company, it's not just the accuracy of your abusiveness detector you need to worry about; you need to worry about which kind of error is worse for your business. Is it worse to delete a comment that actually wasn't abusive, or to allow through a comment that was abusive? They can both have bad consequences, so accuracy alone is not always the best score.
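To make that supervised pipeline concrete, here is a minimal sketch using quanteda's naive Bayes model; the tiny training set and its labels are invented purely for illustration.

```r
library(quanteda)
library(quanteda.textmodels)

# tiny invented training set, just to show the shape of the pipeline
txt <- c("win a free prize now", "claim your free money today",
         "meeting moved to 3pm", "minutes from yesterday's meeting")
lab <- factor(c("spam", "spam", "ham", "ham"))

train_dfm <- dfm(tokens(txt, remove_punct = TRUE))   # bag-of-words features
nb <- textmodel_nb(train_dfm, y = lab)

# score a new comment; dfm_match aligns its features with the training matrix
new_dfm <- dfm_match(dfm(tokens("a free prize for your money")),
                     featnames(train_dfm))
predict(nb, newdata = new_dfm, type = "probability")
```

The same shape works for toxicity or sentiment: only the labels and the choice of output (class versus probability) change.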
Okay, so if any of you have been following NLP in the last few years, you might have heard of topic modeling. Topic modeling is a way of discovering topics or issues from a group of texts in an unsupervised way, without having pre-specified classes in mind. It exploits the way that words are distributed differently across different documents to extract groups or clusters of words that each represent a particular topic. These rows of words here are output from a model trained on British newspapers. Each row corresponds to a particular topic, and you can see that some topics seem to correspond to healthcare, others to the media, others to courts and the police. Nobody puts this information into the model: you tell it how many topics you want, and these related groups of words emerge from the way that word usage varies across documents. There are good off-the-shelf packages for doing this in both Python and R.

Okay, and then an area that I'm personally interested in, one that's got a bit more attention in recent years, is distributional semantics: trying to learn about the meanings of words, the semantics of words, from statistics of how they occur in text. This has got a lot of attention recently because of progress with word embeddings and word2vec, but it's not a new idea as such. There's a history in linguistics and the philosophy of language that says words don't have some completely objective meaning handed down from the sky. The meaning of a word comes from how it's used in society, in ordinary language, and we can model what a word means in a particular linguistic community by looking at how speakers use it: the way the word is used is the definition of its meaning. As I said, this is not a new idea; of the references I have at the bottom of the slide, some predate word embeddings, and there were even older techniques in the 80s and 90s called latent semantic analysis and latent semantic indexing. It's just that the availability of very, very large text corpora, and much more processing power, has made these methods more successful and more accessible.

Depending on what you count as a word, most English speakers have about 50,000 words in their vocabulary, with a wide range around that. But if you want to look at the co-occurrence of each word in the vocabulary with every other word in the vocabulary, you're already starting to fill up a matrix that's 50,000 by 50,000, and you need a lot of text to run across examples of many of those co-occurrences. So until recently there were data and computational obstacles. Now everyone's using word vectors and word embeddings, which are like the previous techniques; they just use a different method, either a neural network or SVD, to reduce the dimensionality of the word co-occurrence matrix.

You might be thinking that in everything I've said so far, I haven't mentioned deep learning. So where does it fit in, in the context of all these things? Well, word vectors and word embeddings are not a deep learning technique; they use a shallow neural network to reduce the dimensionality of the co-occurrence matrix. Where deep learning really is applied a lot these days is in the earlier tasks I showed you, like translation, speech recognition, and dialogue systems.
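Before going on, here is the topic-modelling step above as a minimal sketch, using quanteda for the document-feature matrix and the topicmodels package for LDA; `news` is a hypothetical character vector of newspaper articles.

```r
library(quanteda)
library(topicmodels)

# `news` is assumed to be a character vector of newspaper articles
dfmat <- dfm(tokens(news, remove_punct = TRUE)) |>
  dfm_remove(stopwords("en")) |>
  dfm_trim(min_termfreq = 10)

# you only choose the number of topics (k); the word clusters emerge from the data
lda <- LDA(convert(dfmat, to = "topicmodels"), k = 20)
terms(lda, 10)   # the ten highest-probability words per topic, one column per topic
```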
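And here is the co-occurrence-matrix idea as a sketch with the text2vec package: count co-occurrences within a window, then factorize the matrix into dense vectors with GloVe. `texts` is a hypothetical character vector, and the settings (a 5-word window, 50 dimensions) are arbitrary choices for illustration.

```r
library(text2vec)

# `texts` is assumed to be a large character vector of raw documents
tokenize <- function(x) word_tokenizer(tolower(x))

vocab <- create_vocabulary(itoken(tokenize(texts)))
vocab <- prune_vocabulary(vocab, term_count_min = 5)   # drop very rare words

# the word-word co-occurrence matrix, counted within a 5-word window
tcm <- create_tcm(itoken(tokenize(texts)), vocab_vectorizer(vocab),
                  skip_grams_window = 5)

# GloVe factorizes that matrix into dense 50-dimensional word vectors
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 10)
```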
To pick up that deep learning point: for a lot of those tasks, the best performing systems now are ones that just take in very, very low-level features, like words or even characters, with a deep learning system that goes completely end to end to do translation or question answering or something like that. But for the applications I'm interested in, like social science, law, and the humanities, interpretability is very important, and people in these areas are used to working with interpretable regressions. While there is definitely progress being made on interpreting deep learning models, it's still not as straightforward as looking at the value of a parameter in a regression or some other linear or near-linear method. And often in these cases a regression, or naive Bayes, or a linear SVM on a bag of words or a bag of trigrams, gives very close to as good performance as the deep learning method. So it's at least worth trying the more traditional statistical learning approach before going to deep learning.

So now I'm going to show you an example of one application of distributional semantics, which is to look at how word meanings have changed over time, and what particular words meant in the past. This is a tool that was developed years ago in the UK called the Sketch Engine, and the idea is to produce, from a corpus, a sketch of what a word means. This example is from an 18th-century corpus, which is why you have these funny long S's in some of the spellings. The idea is to get an overall picture of what the word "liberty" meant in this time period by looking at how it occurred in syntactic relations with other words. "Liberty and property", "liberty and privilege", "liberty and rights" were common associations at the time, and then there are other types of syntactic relationships, like the prepositional patterns "friend of liberty", "cause of liberty", "blessings of liberty". You can use association scores other than raw word frequency to control for the background frequency of words, and play around with the scores in different ways to see what the corpus suggests the word means. That tool is called the Sketch Engine, if you'd like to look it up online.

Okay, so I'm going to show you an example, and hopefully a demo, of working with some modern political text from social media, from Reddit. Reddit is a very good source of text data. It has a lot of problems too: there's a lot of really terrible content on Reddit, terrible communities, a lot of abusive content. And it's not a random sample of the population; about 70% of Reddit users are men, and it's demographically skewed younger as well. But a really nice property of the text from Reddit is that it's very neatly divided into subsections. You have lots and lots of specific communities: there's a community for Manchester United, a community for vacuum cleaners, and a community for every political affiliation you can think of. These are good test grounds for political text, because you can see the rules for each subreddit, and they clearly say what's allowed. These tend to be places where people who consider themselves to be, say, socialists come to talk about socialism among themselves. So it's not an argument with lots of different views mixed together; they're fairly clean examples of people with a given ideology using the language. And then I also have a second demo.
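Staying with the word-sketch idea for a moment: quanteda gives you simpler versions of the same thing, concordance lines around a target word and statistically scored collocations. A minimal sketch, assuming `corp` is a quanteda corpus of historical documents:

```r
library(quanteda)
library(quanteda.textstats)

# `corp` is assumed to be a quanteda corpus of historical documents
toks <- tokens(corp, remove_punct = TRUE)

# concordance: every occurrence of the target word with five words of context
kwic(toks, pattern = "liberty", window = 5)

# two-word collocations, scored so that background word frequency is controlled for
textstat_collocations(toks, size = 2, min_count = 5)
```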
I don't have time to show you both, but I'll put the URLs up for both of them. One of the R packages comes with a corpus of speeches from the US presidential primary debates for the Republican Party, so for this kind of data, at least in the modern context, there's a lot of it out there.

This is an example of some of the discussion from the Reddit socialism community, just to show you what it looks like. This is an anonymous online community talking about politics, so you might expect it to be pretty terrible, but a lot of these communities are fairly well moderated, and the text often does contain good political discussion, or at least as close as you can get online to ordinary people talking about politics. It's not all terrible and abusive. And there's a lot of metadata with it: you can get the content of the comments, their score, and their time of creation, so you can filter the data in different ways.

What I did with this data was to extract co-occurrence relations between words by looking at how often one word co-occurred in the same comment as another word. Then you can use the co-occurrence frequencies to build a score. You could just use the raw frequency, but if you do that, the words with the highest scores tend simply to be the words that are most frequent in the language generally. If you know the measure of mutual information: pointwise mutual information, or something like it, is a way to control for the background frequency of the words, so that for every pair of words you get a score for how often they tend to co-occur in the corpus overall, relative to chance. And you can use that score between each pair of words to build a network: if the score is above a certain threshold, you draw an edge between the two words. Then you've got a semantic network, a word co-occurrence network, and you can use all the tools of network science on it; I'll show a short code sketch of this in a moment.

This is a list of the terms with the highest centrality in the networks extracted from the libertarian and socialism communities, using two different network centrality measures. You can see that, at least in terms of face validity, it really clearly reflects the terms you would think would be most important to those communities. You might get something like this out of topic modeling, but this is a different approach, and it also allows for flexible exploration, as I said, with all the tools of network science.

Okay, and then you can build a dashboard to explore this with the packages that R provides, and I'm going to hopefully show you a demo of that now. As I said at the start of the talk, Shiny is an R package that allows you to build dashboards with panels, input boxes, and sliders for exploring data quickly. If you've got a huge customer-facing product, you probably have designers to build this kind of thing in JavaScript and Python for you, but for an academic researcher, or for someone prototyping internal tools, being able to develop these interfaces quickly without having to use JavaScript is really valuable. There are also a lot of really good visualization packages in R; the network visualization and the range of data visualization packages R provides are, I think, richer than what Python offers. And it's really responsive: the data visualization components are hooked into the input components and respond to new input really quickly.
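Here is that co-occurrence network step as a minimal sketch, assuming `comments` is a character vector of comments from a single subreddit; the PMI threshold of 2 is an arbitrary illustration, not a recommendation.

```r
library(quanteda)
library(igraph)

toks <- tokens(tolower(comments), remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))

dfmat <- dfm(toks) |> dfm_trim(min_termfreq = 20)     # prune rare words first
toks  <- tokens_keep(toks, featnames(dfmat))

# comment-level co-occurrence counts (tri = FALSE gives a full symmetric matrix)
co   <- as.matrix(fcm(toks, context = "document", tri = FALSE))
freq <- featfreq(dfmat)[colnames(co)]                 # marginal counts, matching order
N    <- sum(freq)

# a rough pointwise mutual information score: log of p(w1, w2) / (p(w1) * p(w2))
pmi <- log((co / N) / outer(freq / N, freq / N))

# draw an edge wherever the score clears the threshold, then use network tools
net <- graph_from_adjacency_matrix(1 * (pmi > 2), mode = "undirected", diag = FALSE)
sort(eigen_centrality(net)$vector, decreasing = TRUE)[1:20]   # most central terms
```

A network like this is what sits behind the dashboard in the demo.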
So you can immediately enter new terms and see how the network spreads out, and there are 2D and 3D visualization packages. You can also use the things that igraph offers you: for instance, it will detect the shortest path between two nodes and render that as a network. This is all adjustable; maybe the labels are a little bit large there, but you can adjust all of that with igraph.

And then there's also a classification example, if it loads. This is the data from the Republican primary debates; the quanteda package comes with these speeches preloaded. I trained two regression models on this text. To say a bit about how this task works: you're trying to train a model to predict the political party and the speaker based on the words in the speeches. As I said, the vocabulary might have 50,000 words, and if you include two-word phrases, bigrams, as well, then you've got a huge, huge number of dimensions. But it's not hard to find ways to reduce these dimensions: you can exclude the most common words, stop words, and you can exclude words that occur fewer than a certain number of times in the text overall. Then, depending on how many observations, how many documents, you have, you can fit an ordinary linear regression, or a penalized regression like ridge or lasso, which will do some feature selection for you and discard the features that aren't needed.

Again, these types of models are really responsive. The reason it starts off with the Republican prediction higher is just that there's more Republican text in the corpus, so the prior, the intercept, is higher for Republican. But if you start to type text in here, the prediction changes live as you type. I typed in "we need to help the poorest people in society" and it balanced up a little; I think Clinton's bar came up a bit.

And this is really simple to do. This is the code for the classifier. The UI code is literally ten lines that render two bar plot outputs, and the equivalent of the backend, although they all sit in the same place on the server, is less than thirty lines of code to split the input text into words and pass it to the predict method of the trained model. And if your app is below a certain size, you can host it for free: the site is called shinyapps.io, and I think it's linked to RStudio; I saw they have a stand out there. There's a button in RStudio where you can click publish and it will immediately send the app to a URL that you can try right away. So it's a great system.

Okay, so I'll leave you just with those links, and I think that's it. Thank you.
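For reference, here is roughly the shape of an app like that. This is a minimal sketch rather than the actual demo code, and it assumes `model` is a classifier already trained with quanteda.textmodels (for example textmodel_nb) and `train_dfm` is the document-feature matrix it was fitted to.

```r
library(shiny)
library(quanteda)
library(quanteda.textmodels)

# UI: one text box and one plot, wired together by Shiny
ui <- fluidPage(
  textAreaInput("speech", "Type some text:"),
  plotOutput("party_probs")
)

# server: re-runs automatically every time the text input changes
server <- function(input, output) {
  output$party_probs <- renderPlot({
    req(input$speech)                                   # wait until there is input
    toks <- tokens(input$speech, remove_punct = TRUE)   # split the text into words
    newd <- dfm_match(dfm(toks), featnames(train_dfm))  # align with training features
    probs <- predict(model, newdata = newd, type = "probability")
    barplot(probs[1, ], ylim = c(0, 1), main = "Predicted class probabilities")
  })
}

shinyApp(ui, server)
```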