This is Text Mining: Introduction and Theory. We won't be doing any coding in this webinar, but there are two more webinars you can see here in the upcoming section: Basic Processes on the 16th of June and Advanced Options on the 29th of June. I will be demonstrating code in both of those webinars, and I will also provide you with Jupyter Notebooks so that you can work through the code on your own or copy and paste it into another platform as you like. Before we move on, I'll just say my name is Dr. Julia Kazmier. I'm a research fellow at the Cathie Marsh Institute and the UK Data Service. Please feel free to send me emails or tweets with questions you have about this webinar or any other work that you see me do. I'm happy to engage with the topics on various platforms. Diving into the topic today, I just want to start out and say text mining is one kind of data mining that, as you might have guessed, looks at text. That might seem blindingly obvious, and that's fine, but we're going to dive into that a little bit more and maybe I can cover some things you wouldn't have thought about. Other kinds of data mining might look at images, sound, movement, lots of other kinds of data out there, and we may do webinars on those in the future, but today we're just focusing on text mining. Importantly, data mining, or indeed text mining, is not just a buzzword. It's actually a large category of research methods based around the idea of transformation. Stuff goes into one end of a sort of transformation machine and it comes out the other end. But what is it that goes in and what comes out? Lots of things can go in: text, obviously, but also recordings of speech, videos, images, seismographic information, who knows. What goes in is unstructured data, usually in chunks. A chunk could be something like a blog, which is a relatively unstructured piece of writing, or an image, but the data can also arrive as a stream, like a live video feed, rather than in chunks.
So what comes out of the transformation process? Well, data, specifically structured data. Let's talk a little bit about that. What do I mean by structured data? Most people, when they think about structured data, think of the kind of data that comes in databases or Excel sheets, vectors or arrays or a matrix or something like that. It is arranged in structured ways, usually depicted as rows and columns or as tables. For example, if I want to track my online education sessions, I would probably want to record basic information, like the title of the session I gave, maybe the date I gave it, and the number of attendees, say 75. There are actually many more than 75 attending today, so that's great. Thumbs up to all of you. Now, I could write this information on a sticky note and tack it to my corkboard, and that is indeed how lots of people record information. But if I can assume that I will be giving lots of online education sessions, it's probably safe to assume that I want to do something a little bit more interesting than just tack it to my corkboard. I might want to calculate my average attendance or plot whether my sessions are becoming more or less frequent over time. I might want to plot the evaluations that people give. These are all things that I can do once I have a certain volume of data. Knowing this, it's much more sensible to record the data for a given education session in the same place and in the same order as the information that I collect on other education sessions. So obviously I can do better than sticky notes tacked to my corkboard. I can record the names and the dates and the attendance in rows and columns. This brings us back to Excel spreadsheets, databases, arrays and that kind of thing. Now, if I do this for all the different sessions, I have the same format in each column and the same related content in each row.
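The rows-and-columns layout just described, and the average attendance it makes easy to calculate, can be sketched in a few lines of Python. This is only an illustration: the session titles, dates and attendance figures below are invented placeholders, not real records from these webinars.

```python
# Each row is one session; every row has the same fields in the same order.
# All values here are hypothetical placeholders for illustration only.
sessions = [
    {"title": "Session A", "date": "day 1", "attendees": 75},
    {"title": "Session B", "date": "day 2", "attendees": 60},
    {"title": "Session C", "date": "day 3", "attendees": 90},
]

# One "column" is the same field taken from every row.
attendance_column = [row["attendees"] for row in sessions]

# Average attendance: add up the values in the column and
# divide by the number of rows (sessions).
average_attendance = sum(attendance_column) / len(attendance_column)

print(average_attendance)
```

Because every row has the same fields, the same two lines would work unchanged for three sessions or three thousand.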
So in the columns, this column, for example, has all the titles, this one all the dates and this one all the attendance. And in the rows are all the things that relate to each other. So this title and this date and this attendance are all for the same event. That's pretty obvious. I think we're all pretty good with stuff like that. But there are maybe less obvious examples of structured data. Calendars are structured data because they're in rows and columns for days of the week and in a sequential order. Database systems are obviously very, very structured, but not necessarily in rows and columns. There can be tables and linked tables and dictionaries and things like that. Filing systems are obviously quite structured. So are libraries, but there are also less obvious examples like a grocery store, where things in the same aisle, at least in principle, go together or share something in common. Or your wardrobe, where hopefully you don't have your business suits and your hiking boots and your underpants all shoved into the same drawer. Instead you have them structured so that the things that need to hang are hanging and things that go together are all in the same drawer. Even train stations count, because the platforms are structured and the people move in certain ways and the information is shared out in an organized fashion. So these are maybe less obvious examples of structured data. The thing about structured data is that, from a statistical point of view, it's really familiar, it's really easy to work with, and it's really easy to demonstrate how we worked with it and how we came to our conclusions. For example, if I wanted to show what the average attendance was for these sessions, it's easy to see that you add up the values in the attendance column and divide by the number of rows. That's great. But what about unstructured data? Now this is very familiar to us in our lives.
We move through a world in which data is largely unstructured, but it's less familiar or easy or demonstrable to work with in the way that data scientists are used to working with data. The actual information is probably about the same, but the lack of structure means it's all jumbled up. In this first case, it's all put together on one sticky note. In the second case, it's the same information but spread across three different sticky notes. In the third case, some of the information is missing. This one's missing a title, I have no idea why, I must have lost the sticky note. Things are also in different formats, so I've written the dates here in three different ways. I've written the number of people who attended in three different ways. One is a key-value pair where 75 is the value and attended is the key. In one the number is written out in words, but there's no key. And one is even a numerical range rather than a specific number, also with no key. So this is more or less the same information but in a very unstructured, inconsistent way. Now that can be a problem. There's also superfluous data, like this picture of the shoes that I want to wear when I give this session. They're great shoes, but the picture doesn't need to be there. So the key point to take away when you're looking at unstructured data is that it's actually semi-structured or semi-unstructured. There's still a lot of structure here, but the structure is less coherent and orderly and tidy, so it's less immediately accessible. So this semi-structured or semi-unstructured data, depending on whether you're a glass-half-full or glass-half-empty kind of person, is less accessible and more difficult. It requires a lot of intuition and common sense to work with, and intuition and common sense are not really available to computers.
So trying to get a computer to do this is very difficult, but that's fine because that's what text mining or data mining is about. It is important to say that, historically, intuition and common sense have been applied to unstructured or semi-structured data, and that's fine. It just means that getting insights and conclusions and results out of this kind of data requires a lot of time, a lot of expertise developed slowly, and a lot of gut instinct. And when people want to share their conclusions, it's really hard for someone else to reach the same conclusion because it's such a slow and difficult and personal process. It's hard for you to show your work, it's hard to show how you came to your conclusions, and it's hard to convince someone else to come to the same conclusion. That's why, historically, the sciences have worked with structured data: temperatures and dates and numbers, different kinds of values that can be recorded in a very structured way. People can show their work, they can publish their results, and it's easy to see how they reached their conclusions. Whereas the humanities have involved a lot of discussion and interpretation, and a humanities paper tends to have different sections than a hard sciences paper, for example, just because showing the work is difficult. So why does this matter, though? The division used to be that we just gave structured data to the sciences and unstructured data to the humanities, and people did separate work on different kinds of data.
Well, that's not entirely great, because the division is getting a bit blurry and really we don't want to limit ourselves in how we approach things, especially in a world where semi-unstructured data is now absolutely everywhere around us. Social media, fitness trackers, geolocation data on your phone: there's all kinds of data that is kind of structured and kind of unstructured, and it says something about how we live our lives, but it's hard to work with in traditional ways. It's also being produced much too quickly for anyone to become an expert on it in the traditional humanities approach of reading all that there is to read, thinking critically about it and writing another book. So what are we going to do? Well, you can't just treat the two kinds of data the same, because the tools just won't work. You can't force a blog into an Excel spreadsheet and expect to do anything useful with it. The process of forcing it into an Excel spreadsheet is also going to be really difficult. It's going to be a waste of time. It's not going to be easy to describe how you did that or why you did it, so no one else will be able to replicate your work anyway. First you need to turn the semi-structured data into structured data, and there are tools to help you do that, and I recommend you use them, which is what we're going to deal with now. Just a quick look back at this image: what goes in is semi-structured data, what comes out is structured data, and the machine that does this transformation is, for our purposes, text mining, but it could be data mining more broadly. It does that by capturing the structure that already exists in semi-structured data, amplifying it, and cutting off the bits that don't fit into the amplified structure. There are four basic steps, which I will walk you through now. The first is retrieval, and you will be familiar with this if you watched any of our web scraping webinar series, because it is about acquiring data.
So you can scrape it off the web, or you can download it through an API, or in fact you could digitize some records from the local library. There are a lot of different ways you can acquire data and retrieve the information to work with. For example, let's say that I want to identify the language that politicians use when they are talking about projects that are unlikely to actually get off the ground, as opposed to projects that are very likely to genuinely happen. First I would want to retrieve a set of texts that include speech or writing from politicians about projects. That set must include some projects that I know happened and some that I know didn't happen. I might, for example, do a search through LexisNexis or some kind of news source search engine and specify a source, a date range and keywords to look at. In this case, rail electrification in the north of England was the topic I was looking at. And I want to get at least a hundred or a thousand texts or something like that, newspaper articles in this case, on this topic. That would be a good retrieval method for getting texts to work with in text mining. Now, I would also want to have some projects that I know were successful, but this is just an example of the kind of thing. I see a typo in the date range, which is not going to help. Anyway, it's always preferable to keep a copy of your raw acquired data. So whatever you get from this search, you run the search, download a big data file, always keep that separate, and work through the rest of the steps on a copy of that data. That way you can provide your raw data to anyone who wants to replicate your steps. The next step is processing. This is a bunch of practical steps that turn your potentially quite messy raw data, that big downloaded file, into something a computer can read and work with.
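The advice above about keeping the raw download separate and working only on a copy can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the folder names, the file name `search_results.txt` and the helper `make_working_copy` are all hypothetical, and a temporary directory stands in for a real project folder.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(raw_path: Path, work_dir: Path) -> Path:
    """Copy the raw download into work_dir and return the path of the copy."""
    work_dir.mkdir(parents=True, exist_ok=True)
    working_path = work_dir / raw_path.name
    shutil.copy2(raw_path, working_path)  # copies content and file metadata
    return working_path


# Demo in a temporary directory (stand-in for a real project folder).
base = Path(tempfile.mkdtemp())
raw_file = base / "raw" / "search_results.txt"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("Article 1 ...\nArticle 2 ...\n", encoding="utf-8")

working_file = make_working_copy(raw_file, base / "working")

# All later processing touches only the working copy; the raw file stays as downloaded.
lowered = working_file.read_text(encoding="utf-8").lower()
working_file.write_text(lowered, encoding="utf-8")
```

Keeping the two folders apart means the raw file can be shared as-is with anyone who wants to replicate the later steps.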
You can, in theory, do this manually, but if you're working with any kind of volume, you will want to use computational tools. There are some very good off-the-shelf tools that you can use and adapt, for example in R or Python, and you will want to become at least familiar with them if this is the step that you're going to spend a lot of time working on. The practical steps include things like dividing and renaming files. So you'll take your raw data download and maybe cut it into a series of documents: one document per web page, or one document per newspaper article, or one row in a database per tweet, something like that. You want to cut it up into bits. Then you'll want to do basic natural language processing things, like maybe correcting spelling or removing capitalization or substituting acronyms, so that the United Kingdom, whether it is spelled correctly or not, written with or without capitals, or written as the UK or UKGB or something like that, is always counted as the same thing. You have to process the text to make sure that the things you're interested in are counted properly. And then there's more advanced natural language processing as well: classifying words by grammatical category, for example, disambiguating meaning by context, parsing sentences and marking up structure. We'll go into this a little bit under extraction, because the processing step and the extraction step can sometimes run into each other, and you can sometimes need to loop back and forth between them a couple of times to make sure you know what you're doing. If you're going to do these steps on very large volumes of data, or do them frequently on new data sets, you probably want to devise your own scripts to automate the processes. That may require a lot of time at this point, but it will be time saved overall, depending on the volume of work that you're planning on doing with this step in the process.
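The basic processing steps described above (cutting a raw download into documents, removing capitalization and punctuation, and counting different spellings of the same thing together) can be sketched like this. It is a simplified illustration using only the Python standard library; the `=== ARTICLE ===` separator, the `VARIANTS` table and both function names are assumptions made for the example, and a real download will need its own rules.

```python
import re

# Hypothetical table of variant spellings mapped to one canonical form.
VARIANTS = {"united kingdom": "uk", "u.k.": "uk"}


def split_documents(raw: str, separator: str = "=== ARTICLE ===") -> list[str]:
    """Cut one raw download into a list of individual documents."""
    return [part.strip() for part in raw.split(separator) if part.strip()]


def normalise(text: str) -> str:
    """Lowercase, unify known variants, then drop punctuation and extra spaces."""
    text = text.lower()
    for variant, canonical in VARIANTS.items():
        text = text.replace(variant, canonical)
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()


raw_download = (
    "=== ARTICLE ===\nThe United Kingdom plans rail upgrades.\n"
    "=== ARTICLE ===\nThe U.K. delays Rail plans!\n"
)

documents = [normalise(doc) for doc in split_documents(raw_download)]
print(documents)
```

Note that the variants are replaced before punctuation is stripped; doing it the other way round would turn "U.K." into "uk" anyway, but would miss any variant that depends on its punctuation. Whatever order you choose, it is worth recording it as part of documenting your processing steps.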
So extraction is about running the analysis on the processed data. To see what processed data looks like, say our raw data was a sentence like this, maybe from a message board or a tweet or something like that. It includes capitalization, punctuation and a spelling error. We would process that to take out the capitals and the punctuation and the spelling error, and then we would put it into genuinely machine-readable format, which here is this nested set of brackets with part-of-speech tags, in which, for example, "cook" is classed as a singular, present-tense verb and things like that. Now this final sentence looks like gobbledygook to us. We're much better as humans at reading the first sentence despite the spelling error, but a computer is really going to need this sort of output, this nested series of tagged language. This process does involve some choices. Do you want to remove the spelling error? Maybe the research question that you're interested in means not correcting spelling errors, because they're part of slang or they're part of the research topic that you're looking at in some way. Either way, you'll want to document all the steps that you took. So did you correct the spelling errors? Make sure you say how you corrected them, with what program and at what step in the processing, that kind of thing. So, moving on to extraction. Extraction is about running the statistical analyses and methods relevant to your research question. This might be fairly basic things like counting the relative frequency of different words in your body of work. For example, people who write recommendation letters in support of job candidates statistically use different words at different frequencies when they're writing about men or women candidates.
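The relative word counts mentioned above can be sketched with the standard library's `collections.Counter`. This is a toy illustration only: the two groups of "letters" are invented strings, not data from any study, and the function names are hypothetical. A real analysis would also need a statistical test before claiming a difference.

```python
from collections import Counter


def relative_frequencies(texts: list[str]) -> dict[str, float]:
    """Return each word's share of all the words in a group of texts."""
    words = [word for text in texts for word in text.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_gap(word: str, freq_a: dict[str, float], freq_b: dict[str, float]) -> float:
    """Difference in relative frequency of one word between two groups."""
    return freq_a.get(word, 0.0) - freq_b.get(word, 0.0)


# Invented toy "letters", already processed (lowercase, no punctuation).
group_a = ["excellent leader excellent researcher"]
group_b = ["helpful leader helpful teacher"]

freq_a = relative_frequencies(group_a)
freq_b = relative_frequencies(group_b)

for word in ("excellent", "helpful", "leader"):
    print(word, frequency_gap(word, freq_a, freq_b))
```

Using relative rather than raw counts matters because the two groups of texts will rarely be the same size.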
So using something as basic as relative word counts, you can show a statistical difference between letters written about men and letters written about women. But the extraction could just as well be identifying patterns that are more complex than relative word counts. There are equivalency suggestions. You could show that people are using two words in the same way but in different contexts, or the same word in different ways in different contexts, so that the word has different meanings. These are the kinds of questions that are a little bit more involved than just counting words, but they're still about counting, and about counting in context. There's relationship discovery. This one's quite interesting because you can calculate the relationship between terms or entities or events or places or lots of things. For example, you could graph the social distance between people or between organizations based on how often they are or are not mentioned together in documents. You could put events extracted from newspaper articles into timelines or onto maps based on this kind of relationship discovery, because articles will say things like "nearby" or "previously", and you can use text mining to capture those kinds of words and turn them into maps of the relationships between the entities. There's automatic categorization. This is quite a popular one because it's a real time saver. It helps classify documents by sentiment or by topic or by author or time period or gender. It can help recognize plagiarism. It can generate automatic responses to customer emails. There are a lot of automatic things that can be done by getting text mining extraction processes to look into documents or questions or tweets and recognize things automatically. And finally, prediction. This one is quite difficult, but it is also very interesting. It's a growing area of application.
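Going back to the relationship discovery idea above, counting how often two names are mentioned in the same document is one simple starting point. The sketch below is an assumption-laden illustration: the documents and names are invented, the function name is hypothetical, and matching names as plain substrings is a deliberate simplification (it would, for instance, also find "Bob" inside "bobsled").

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents: list[str], entities: list[str]) -> Counter:
    """Count, for each pair of entities, the documents that mention both."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        # Simplification: an entity is "mentioned" if its name appears anywhere.
        present = sorted(e for e in entities if e.lower() in text)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented toy documents and names.
documents = [
    "Alice met Bob in Manchester.",
    "Bob wrote to Carol.",
    "Alice and Bob spoke again.",
]
pair_counts = cooccurrence_counts(documents, ["Alice", "Bob", "Carol"])

print(pair_counts.most_common())
```

Pairs that are mentioned together more often could then be drawn closer together on a graph, which is one rough way to approximate the "social distance" described above.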
A paper by Choi and Varian in 2012 described contemporaneous forecasting, or what they called nowcasting, and the same idea can be applied to predicting crimes or terror attacks based on intercepted communications. You would take intercepted communications as the data you retrieve, process it, and extract statistical indicators that an event is more or less likely to happen at different points in time. That's quite useful, potentially, if it's a very important event that you're trying to predict. All of these kinds of extraction use more or less familiar statistical analyses to show significant differences or significant relationships or significant factors of influence. This step is more or less about your regression models and your p-values and your error bars and all the statistics that you're probably already familiar with. Exactly how they apply to text is perhaps less familiar, but that's just an application issue. We can learn that. And finally, insight. Insight, like step three, extraction, should be broadly familiar territory. It's about showing why the outcomes from the statistical methods matter. It also involves justifying all the steps you took: why you retrieved the data you did, why you processed it in the way you did, why you extracted the things that you chose to extract, why you ran the statistical tests that you did. This will be quite familiar as a process in many of the hard sciences, because it is the methods section and the results and discussion sections. It's perhaps less familiar in some of the humanities. Social scientists are probably also familiar with a lot of these steps, because they have to justify why they chose the sampling method that they chose or ran the statistical tests that they did. But for the more abstract, expertise-based research, you can't just say "building on the theories of this previous person, we now add aspects of this." That won't fly.
You need to be specific. What theories are you building on? How have you added to them? Why have you added to them in the way that you have? This step also includes presentation or visualization of your results, which can be very difficult when there are not many well-accepted examples of how to present or visualize those results. For example, this is a word cloud of one-word reactions to one of President Obama's State of the Union addresses. The size of the word represents frequency, but location on the apparent map is irrelevant. Now, this visualization has a very different impact than you would get with a simple table of the words in one column and the frequency counts in the next. But you have to be prepared to consider that the impact could be misleading, because people might interpret the location of a word as showing where the people who used that word were from, and that's not accurate in this case. So is it useful? Is it misleading? Is it distracting? Is it just unfamiliar? Is it perhaps quite encouraging, because it shows very visually something that's very hard to show with a table? I mean, people would probably not look at a table nearly as much as they would look at a visualization like this, and that's potentially useful. I seem to have an error in my animations here, but I'll just walk through all four steps of one text mining example that I did consider. Say I wanted to download 10 days of tweets from 20 users, and also the trending hashtags for those same 10 days. I would remove everything that isn't a hashtag and store the individual hashtags in a data frame labeled by date and author. Then I would compare the tweeted hashtags to the trending hashtags to calculate a trendiness score for each author, based on the degree of match between the trending hashtags and their tweeted hashtags, and also on the timing.
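The hashtag example above can be sketched as follows. This is only one possible reading of a "trendiness score", chosen for illustration: the share of an author's tweeted hashtags that were trending on the day they were tweeted. It ignores finer timing within a day, the data and user names are invented, and a plain list of rows stands in for the data frame.

```python
from collections import defaultdict

# Invented rows of (date, author, hashtag), as if already stripped down to hashtags.
tweeted = [
    ("day 1", "user_a", "#textmining"),
    ("day 2", "user_a", "#python"),
    ("day 1", "user_b", "#cats"),
    ("day 2", "user_b", "#Python"),
]

# Invented trending hashtags for the same days, stored in lowercase.
trending = {
    "day 1": {"#textmining", "#data"},
    "day 2": {"#python"},
}


def trendiness_scores(tweeted_rows, trending_by_date):
    """Share of each author's hashtags that were trending on the day of the tweet."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for date, author, tag in tweeted_rows:
        totals[author] += 1
        if tag.lower() in trending_by_date.get(date, set()):
            matches[author] += 1
    return {author: matches[author] / totals[author] for author in totals}


scores = trendiness_scores(tweeted, trending)
print(scores)
```

A score like this cannot by itself say whether an author causes hashtags to trend or reacts to them; that would need the timing of each tweet relative to when the hashtag started trending.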
So my insight would be about what this trendiness score indicates. Does it indicate that someone is causing hashtags to trend, or that they are quick to react to hashtags that are starting to trend? Does it indicate that they follow trending hashtags quite closely and start participating in things as soon as they become trending? Something else? Just to let you know, I did not actually do this bit of research, but this is the kind of research you could do if you had a research question about how to demonstrate that people cause hashtags to trend or respond to trending hashtags. If you had a research question like that, you could do these steps. Now, a more advanced example, and one that I actually did participate in, was to download news articles with words like Manchester and Commonwealth Games and put them through some very extensive natural language processing to pull out the nouns, dates and known structures and relationships. The aim was essentially to create timelines of events: when investment went into an area in order to host a mega event like a Commonwealth Games, and what kind of regeneration events occurred long after, even 20 years after. One of the outcomes of this research was to compare whether we could do this automatically better than people could do it by actually reading all of the articles and scoring things. It turns out we did okay, but it was much faster than people, so there is a bonus there. The insight was about whether we could use automated event extraction and timeline creation to support policies on how investment and regeneration around events go together. Feel free to ask me more questions about this at the end if you'd like. Every text mining choice has pros and cons, and text mining in general has pros and cons, which I did address there a little bit. On the positive side, it gives you a very large-scale approach to difficult stuff.
We are really not going to get traditional humanities experts to read thousands of newspaper articles looking for tiny little details, like whether businesses closed just after the stadium opened or whether they relocated or whether they opened closer to the stadium. That's the kind of tedious detail that is not really worth the time of really critical thinkers, but it is the kind of tedious detail that computers are happy to spend their time on. Because it's such a large-scale approach to really tedious, difficult, boring stuff, we can get into details that would not be accessible if we tried to get real people to read all those things. We can support novel applications like trying to predict criminal acts or trying to justify different kinds of policies. So far, people have not necessarily been able to justify their decisions very well. On the negative side, you do need a lot of material, and not all of the research questions that people have can be matched with a large enough body of material. You may also need a lot of manually created training data. That is something I haven't gone into, but we can talk at the end about how that process goes if you're interested, especially if you're going to run a process on huge volumes of data or run a process multiple times over quite long time frames. If you want to analyze the previous month's tweets every single month, you're going to want a lot of manual work to go into the automation process. There is a lack of human interaction and supervision, so it's not always easy to know what to make of it when the computer comes out with something. Sometimes it can be hard to believe and sometimes it can just be hard to interpret. And so far, partly because this is new, we're not clear what questions text mining or other data mining processes are useful for addressing. So it might just be that people are trying to apply text mining to questions that there are other or better tools for.
Another issue is that, in forcing semi-structured data into a more rigid, coherent, uniform structure, we do cut off some of the information and structure that is hard to capture or amplify. When we correct spelling, for example, we make it easier to count instances of a word, but we lose out on the fact that maybe some people are using words ironically or as slang or in new ways, or that some spellings are more likely to come from people who are using mobile devices as opposed to keyboards, or something like that. That information is harder to think about and capture and amplify. It may become important. It may not. It kind of depends on your research question. So it is very important to say that text mining cannot yet provide the expert-level insight that a lot of humanities research has. Even though text mining can read all of the books in the library, it will not be the same. It will not have the same expert-level insight afterward as a human who had read all of those books. It just won't work. If a text mining algorithm processes a book and tells you how many times the word "the" appears in the book, that's not the same as a human who writes a literary criticism of the book or something like that. They're very different processes and they don't produce the same output, which is fine because text mining has other benefits. It very quickly sifts through a whole lot of data, and it can help you notice interesting things that you can then apply your expert-level insight to. For example, just as a bit of a lark, I searched for Mario and Luigi on the Google Books Ngram Viewer. This shows how popular these words were over time in all of the books that are in Google's collection. We see that they're pretty close for a long time. Luigi has a good spike just before the turn of the 20th century. They go back to being close, and then Mario takes right off in the 50s and Luigi is left behind.
If I had to read all of those books, I would not be interested in producing this graph manually; it is not that kind of research question. But I'm happy to produce it through the Ngram Viewer and other text mining methods because it is quick and easy. Is it worthwhile? That is difficult to say. But it does show that something happened before the turn of the century that made Luigi much more popular, and that something happened in the 50s that made Mario much more popular. So we can use text mining in this case to find patterns that are worth exploring through other methods, maybe even other text mining methods. The fact that it's automatic and easy and fast is very, very useful for pointing us towards more interesting research questions to approach. We're also less likely to waste a whole lot of time becoming an expert on a topic that seemed promising but actually turned out to be unimportant, if we can do a quick text mining analysis of it and show that the pattern we thought we would find just isn't there. So, text mining is good for exploration and good for saving time. It's potentially not applicable to all research questions, but throw it at the wall and see if it sticks. You never know.