Right, we're live and recording, so let's get started. Hello everyone, my name is Nadia Kennar. I'm a research associate at the Cathie Marsh Institute, working with the UK Data Service. We're also joined by Louise Kaplanar, who will be running the second half of this webinar, and Julia Kasmire, who will be answering questions throughout. I'll be turning off my camera for the remainder of this workshop, but please feel free to use the Q&A function or leave questions in the comments, and we'll do our best to answer them at the end of each talk. So today I'll be talking about text mining: specifically, an introduction to the theory behind structured and unstructured data, and why that matters for learning how to do text mining and other forms of data mining. Before we get into the content, I just want to double-check that you can all hear me properly. There should be a poll launching on screen now, so fill that in and we'll give it a couple of seconds to make sure everything is okay. Great, it looks like everyone can hear us; if you have any issues with audio, just double-check the things on the screen. Just to let you know, we are recording this webinar, and it will be posted to our YouTube channel, which is the UK Data Service channel. It's also being live-streamed on YouTube at the moment, so hello from the live stream if you're there. Before we go much further, I did want to point you to some of our other resources that you might find useful. Recent trainings include an introduction to mapping crime data in R, which is an introduction to GIS and spatial data. We also recently ran an introduction to machine learning course, a three-part series that ended with a live code demonstration. You can find the slide decks and recordings of those on the past events page of the UK Data Service site, or on the UK Data Service YouTube channel. We also have some upcoming events, including, obviously, part two of this talk, which will be on Friday the 11th of November and involves a code demonstration. Aside from that, there are other events to look out for, including How to Become a Computational Social Scientist. This looks at the ways in which the social sciences can utilise computational methods in research; it's very much aimed at beginners and those who want to do something more computational but aren't sure where to start.
There's also the Crime Surveys User Conference. And finally, I'll highlight our monthly data drop-in events, the computational social science drop-ins. You can stop by to get some free help or advice on any kind of computational project you're undertaking; it's a good chance to chat with us as a team, maybe learn something new, and share some ideas in discussion. Just to outline the shape of today: we have a two-hour session in which I'll be speaking for the first hour, introducing the main concepts of text mining. We'll look at capturing and amplifying existing structure, we'll look at the differences between structured and unstructured data, and then I'll take you through the four steps of text mining. In hour two, which will be run by Louise, she'll take you through the steps involved in a text mining analysis: some of those processing methods, basic natural language processing, and some extraction processes as well. And as discussed, we have our session on the 11th of November, which is also a two-hour session; the first hour will be a Python tutorial and the second hour will be an R tutorial. So we look forward to seeing you then as well. So, what is text mining? Text mining is one kind of data mining that, as you might have guessed, looks at text. That might seem really obvious, and that's fine, but we're going to dive into it a little more and maybe even uncover some things you wouldn't have thought about. Other kinds of data mining might look at images, sound, movement; there are lots of other kinds of data out there, and we may do webinars on those in the future, but today we're just focusing on text mining. Importantly, text mining is not just a buzzword. It's a large category of research methods based around the idea of transformation: stuff goes into one end of a sort of transformation machine, and something comes out of the other end. But what exactly goes into the machine, and what exactly comes out? Well, lots of things can go in. Text, obviously, but you might also input recordings of speech, video, images, even things like seismographic readings. Typically, unstructured data is input here in the form of chunks, something like a blog post, but you can also input streams as well as chunks, such as live video feeds. The question is what comes out of this transformation process. Data, specifically structured data, which leads us on to our next topic. What do I mean by structured data? Most people, when they think about structured data, think of the kind of data that comes in databases or Excel sheets: vectors, arrays, matrices, arranged in a structure usually depicted as rows and columns, or as a table. For example, if I wanted to track my online sessions, maybe the machine learning sessions I recently ran, I'd probably want to record some basic information: the title of the session, maybe the date I gave it, and the number of attendees. I could write this information on, let's say, a sticky note and tack it to my whiteboard, leave it on my desk, or pin it to a corkboard. And yes, this is indeed how lots of people record information.
But if I can assume that I'll be giving lots of online education sessions, it's probably safe to assume I'll want to do something a little more interesting than just pinning notes to my whiteboard. I could calculate my average attendance, or plot whether my sessions are becoming more or less frequent over time, or plot the evaluations people give them. These are all things I can do once I have a certain volume of data. In that case, it's much more sensible to record the data for a given session in the same place, and in the same order, as the information I collect on other sessions. So we can do one better than individual sticky notes attached to a whiteboard or left on a cluttered desk: I could record the titles, dates and attendee numbers in rows and columns, similar to a standard database or an array, something that looks like an Excel spreadsheet. And if I do this for each of the different sessions, I have the same format in each column and related content in each row. So one column holds all the titles, one holds all the dates, one holds all the attendee counts, and each row represents a different session. There are also some less obvious examples of structured data that I'd like to mention, just to paint the bigger picture. One example is calendars: calendars are a form of structured data because they're arranged in rows and columns for the days of the week, and they have a sequential order. Database systems are another example. These are very, very structured, but not necessarily as rows and columns; there might be linked tables or dictionaries or things like that. Then there are filing systems, which are obviously quite structured: different reference numbers represent a different shelf or a different section. Libraries have a similar structure to a filing system, in that you have different categories of books, different themes, different authors. An everyday example would be a grocery store, where things in the same aisle, at least in principle, go together or share something in common. That's quite similar to your wardrobe: shirts organised in one section, trousers in another, shoes in another, and so on. And the last example, which I find quite interesting, is train stations. The platforms are structured, people move in certain ways, and information is shared out in an organised fashion. These are just a few examples of the more abstract structured data you might see day to day. The thing about structured data, from a statistical point of view, is that it's really familiar and easy to work with, and it's easy to demonstrate how we come to conclusions from it. For example, if I wanted to show the average attendance for those sessions, it's easy to see that you add up the attendance column and divide by the number of rows. Data scientists have a wealth of experience with structured data: we know how to store it, record it, graph it, run statistical tests on it, show our work, and argue about it.
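To make that concrete, here is a minimal sketch of the sessions table as structured data in Python, using pandas; the titles, dates and attendance figures are invented for illustration:

```python
import pandas as pd

# Each row is one session; each column holds one kind of information.
sessions = pd.DataFrame({
    "title": ["Intro to ML, part 1", "Intro to ML, part 2", "Intro to ML, part 3"],
    "date": pd.to_datetime(["2022-10-05", "2022-10-12", "2022-10-19"]),
    "attendees": [48, 41, 55],
})

# With the data in this shape, summary statistics are one-liners.
print(sessions["attendees"].mean())  # average attendance: 48.0
```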
And with this example, I could easily run some basic statistical tests: which session was the best attended, what my average attendance was, or whether my online sessions were getting more or less popular over time. Importantly, if someone wanted to disagree with my results, they could take my structured data, rerun the tests, check my work, and we could discuss it. So what about unstructured data? Unstructured data is very familiar to us: the way we move through the world is largely unstructured. But it's less easy to understand and much harder to work with; the lack of structure also makes it harder to organise and reason about. In this example, I've jumbled up my sessions into three different formats. In the first format, we have all the information on one sticky note. In the second format, the information is split between three separate sticky notes. And in the third format, you can see that some information is missing, and there's also some additional information. So it's the same information, just spread out across three different notes. For example, if we look at the dates, we can see that the date format differs between our three sessions. We can also see that attendance was recorded differently: in one instance it's written out in words, in another we're given a numerical range. It's just not consistent, and then there's that extra piece of information sitting in our third format. So it's more or less the same information, but in a very unstructured, inconsistent form, and that can be a problem. The key point to take away when looking at unstructured data is that it's actually semi-structured. There's still a lot of structure here; it's just less coherent and less immediately accessible. Semi-structured data is difficult because working with it requires a lot of intuition and common sense, and that level of intuition is not available to computers. Getting a computer to do this can be very difficult, but that is exactly what text mining is about. It's important to say that, historically, intuition and common sense have been applied to semi-structured data. It just means that getting insights, conclusions and results out of this kind of data is very time-consuming. It can be hard to show your work, hard to show how you came to your conclusions, hard to convince someone else to come to the same conclusion; it's all just a little bit more difficult. So what's the point? Well, historically, the sciences have worked with structured data like temperatures and dates: people can show their work, publish their results, and it's easy to see the conclusions. In the world of the social sciences, or even the humanities, it becomes a bit more difficult. Discussion and interpretation make up a large proportion of the process, and humanities papers tend to have different sections than hard sciences papers, for example, precisely because just showing the work is difficult.
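As a small taste of what capturing and amplifying that hidden structure can mean in practice, here is a hedged sketch of normalising inconsistent, sticky-note-style dates into one uniform format, using the third-party dateutil package; the date strings are invented examples:

```python
from dateutil import parser

# Three dates for the 'same kind' of record, written three different ways.
raw_dates = ["05/10/2022", "12 October 2022", "Oct 19, 2022"]

# dateutil guesses the format of each string; dayfirst=True resolves
# ambiguous day/month orders the British way.
uniform = [parser.parse(d, dayfirst=True).date().isoformat() for d in raw_dates]
print(uniform)  # ['2022-10-05', '2022-10-12', '2022-10-19']
```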
So why does this matter? It used to be that we just gave structured data to the sciences and unstructured data to the humanities, and people did separate work on two separate kinds of data. But that's not entirely great: the division gets a bit tedious and a bit limiting, and we really don't want to restrict how we approach these things in a world where semi-structured data is now absolutely everywhere. Social media, fitness trackers, things that track your location: these are all forms of semi-structured data, and they say something about the way we live our lives. In the world of digital health specifically, there have been real advances. In fact, I read a study recently that looked at automatically extracting data from electronic health records using text mining, and they were able to identify trial participants and to collect baseline information automatically, which I think is really cool. There have been loads of applications in digital health, such as evaluating treatment effectiveness; predictive medicine has been used to help save patients' lives; plus healthcare management at different levels, managing customer relationships, and even detecting waste, fraud and abuse within a system. But why can't you just treat semi-structured data the same as structured data? The real reason is that the tools just won't work. You can't, for example, force a blog into an Excel spreadsheet and expect to do anything with it. And the process of forcing it into a spreadsheet would be so difficult that no one else would be able to replicate your work anyway: documenting the process is hard, and it becomes hard to understand the methods and to visualise the results. So the first thing you need to do is turn the semi-structured data into structured data, and there are tools that can help you do so. This brings us back to that first image earlier in the slides. What goes into the transformation process is semi-structured data, and what comes out is structured data; the machine that does this transformation, for our purposes, is text mining. It works by capturing the structure that already exists in semi-structured data, amplifying it, and cutting off the bits that don't fit into the amplified structure. This leads us on to the four steps of text mining. We'll go through all four steps and then discuss some examples towards the end. The first step is retrieval. This is basically about how you acquire data, and there are plenty of methods to do so: web scraping, for example, or digitising older records or older library books. Let's say you wanted to use tweets to explore sentiment about COVID-19, and maybe you wanted to do this across two channels, comparing mainstream media outlets with smaller media channels. You would first want to identify some of the language these news outlets use when they tweet. So first I'd retrieve a set of tweets from the two different news sources, which we can find using things like hashtags.
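For what that retrieval step might look like in code, here is a sketch only, assuming you have registered for Twitter API access and hold a bearer token; the account name, query keywords and variable names are illustrative, not part of the talk:

```python
import tweepy

# Hypothetical credentials: you would supply your own bearer token here.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Recent tweets from one (illustrative) outlet mentioning lockdown-related terms.
query = "from:BBCNews (lockdown OR fatigue OR covid) -is:retweet"
response = client.search_recent_tweets(query=query, max_results=100)

# Keep a copy of the raw results before any processing.
for tweet in response.data or []:
    print(tweet.id, tweet.text)
```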
You would then specify a source and a date range, and specify some keywords to look for; maybe these words could be negative, fatigue, lockdown. And I say this because discussion surrounding lockdown could be really interesting: there might be a fluctuation in tweets around when new lockdowns were announced. So you could get, say, at least 100 or 1,000 tweets, and that would be a good retrieval method for getting text to work with. A little tip: it's recommended to keep a copy of your raw acquired data. Whatever you get from the search, the original thing you downloaded, it's important to keep, so that you can provide your raw data to anyone who wants to replicate your steps. The next step is processing. The aim of this step is basically to turn your messy data into something a computer can work with. You can in theory do this manually, but if you're working with any kind of volume, any kind of large dataset, you will definitely want to use some sort of computational tool. This could be R, or this could be Python. The practical steps include things like dividing and renaming files: you take your raw data download and cut it into a series of documents, maybe one document per webpage, or, relevant to our example, one row in a dataset per tweet. Then you do some basic natural language processing, things like correcting spelling, removing capitalisation, and substituting acronyms. For example, 'United Kingdom' might be spelled out, or written with capitals, or abbreviated as 'UK'; these are the things you check during those basic steps. There are also more advanced natural language processes, such as classifying words by grammatical category, disambiguating meaning by context, parsing sentences, and marking up structure, and we'll get into these a little in the second half of this webinar. It's also interesting to note that processing and extraction, which is the third step, can overlap. There isn't always a linear path between the two; you'll find yourself working back and forth. And if you're going to run these steps on very large volumes of data, or run them frequently on new datasets, you'll probably want to devise your own scripts to automate the process. That might require a lot of time at this point, but it will be time well spent, depending on the volume of work you plan to do. So, for example, here we have some raw data containing a sentence; maybe this sentence was collected from a tweet, a message board, or an article. This raw data includes capitalisation, punctuation, and a spelling error. What we would do is process it: take out the capitals, take out the punctuation, and correct the spelling error. Then we put it into a machine-readable format, which here looks like a nested set of brackets carrying part-of-speech tags. The final sentence at the bottom of the slide indicates the computer's interpretation.
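Here is a minimal sketch of those basic processing steps in Python; the example sentence, the misspelling, and the correction table are invented for illustration:

```python
import re
import string

raw = "I'm feeling awful, it's probbably COVID!"

# 1. Remove capitalisation.
text = raw.lower()

# 2. Strip punctuation.
text = text.translate(str.maketrans("", "", string.punctuation))

# 3. Correct known misspellings and standardise acronyms
#    (a real project would use a spellchecker and a fuller mapping).
corrections = {"probbably": "probably", "uk": "united kingdom"}
for wrong, right in corrections.items():
    text = re.sub(rf"\b{wrong}\b", right, text)

print(text)  # "im feeling awful its probably covid"
```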
Obviously, we as humans are much better at reading past spelling errors, but a computer is going to need this output of a nested series of tagged language. And this process does involve some choices. Maybe you don't remove the spelling errors, because your research question is interested in keeping them as part of slang. Again, you'll want to document all the steps you take, so if you did choose to correct spelling errors, make sure you say why, and at which step of your processing you did it. The third step is extraction. Extraction is about running the statistical analyses and methods relevant to your research question. This could be, for example, counting the relative frequency of different words. An example: when people write letters of support for candidates, in the sense of recommendation letters, they statistically use different words, at different frequencies, when writing about men than about women. This is the kind of analysis where something as simple as relative word counts becomes effective: you can show a statistical difference between letters written about men and letters written about women. But extraction can equally mean identifying patterns more complex than relative word counts. Another example is equivalency suggestion. This aims to show how people use language: to demonstrate that people are using two words in the same way in different contexts, or the same word in different ways in different contexts, and therefore that the word has different meanings. Off the top of my head, an example could be the word 'tackle', as in to tackle someone in a football game, versus fishing tackle, the sporting equipment. We also have relationship discovery, which is quite interesting, because you can calculate the relationship between terms, entities, events, places, and lots of other things. For example, you can map how closely two entities are related based on how often they are, or are not, mentioned together in documents, and from this you could build a map or a timeline. Text mining can capture those co-occurrences and turn them into maps of the relationships between entities. There is also automatic categorisation, which basically helps classify documents by sentiment or topic. This might be categorising by author, by gender, by career, by time period, and so on. Interestingly, plagiarism detection uses automatic categorisation, which is just one example.
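To give a flavour of the simplest of these extraction methods, relative word frequency, here is a minimal sketch; the two tiny 'letter' corpora are invented stand-ins for real documents:

```python
from collections import Counter

# Invented examples standing in for two sets of recommendation letters.
letters_about_men = "he is a brilliant and outstanding researcher a brilliant mind".split()
letters_about_women = "she is a caring and dedicated teacher a dedicated colleague".split()

def relative_freq(tokens):
    # Counts per word, divided by total words, so corpora of
    # different sizes can be compared fairly.
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

print(relative_freq(letters_about_men)["brilliant"])    # 0.2
print(relative_freq(letters_about_women)["dedicated"])  # 0.2
```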
The last example is prediction. Prediction is quite new in the field of text mining; the obvious example would be predicting crime rates. But in relation to digital health, we see it in the advancing field of predictive medicine. Predictive medicine basically evaluates the probability and risk of an individual developing a disease in the future. It utilises specific laboratory tests and genetic tests, analysing an individual's health and social data and reviewing it against research outcomes to determine the probability of that individual developing a disease, which is really cool. One example of predictive medicine: shortly after birth, blood samples are taken from newborns to identify potential genetic disorders as early as possible, and this is one of the most widespread forms of predictive medicine. All of these extraction methods use more or less familiar statistical analyses to show differences or significant relationships. This is where your regression models, p-values, coefficients, error bars, and all those terms you're probably already familiar with come in. This moves us on to the last step, insight. This should be broadly familiar territory: this step is about showing the outcomes of your statistical methods and why they matter. It includes justifying all the steps you took so far; from step one, why you retrieved what you did, why you processed it the way you did, why you extracted the things you did, and why you ran the statistical tests you did. It's all about conceptualisation. Many of the hard sciences, and some parts of the social sciences, do this already: you would normally see it in the methods or methodology section of a research paper, or possibly in the results or discussion sections. It's perhaps less familiar in some of the humanities, whereas social scientists tend to know these moves better, because they have to justify and conceptualise why they chose their sampling methods, why they chose to run certain tests, and so on. This step also includes the presentation, or visualisation, of your results. This is difficult, because there are not many well-accepted conventions for presenting results in text mining. In the example on the slide, a word cloud of one-word reactions to one of President Obama's State of the Union addresses, as visually appealing as the word cloud is, some of the information can be misleading. The size of a word represents its frequency, but its location in the cloud is irrelevant. This visualisation has a very different impact than a simple table with words in one column and frequencies in another, and you have to be prepared to consider that the impact could mislead: a reader might interpret a word's location as saying something about where the people who used that word were from, and that's not accurate in this case. So: is it useful, is it misleading, is it distracting? These are the kinds of questions you should be asking when presenting your results, because people may be less willing to examine a table for as long as they would look at a word cloud.
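If you do want to produce a word cloud like the one described, a minimal sketch using the third-party wordcloud package might look like this; the input string is a placeholder for a processed corpus:

```python
from wordcloud import WordCloud

# Placeholder: in practice this would be your processed corpus,
# e.g. thousands of one-word reactions joined into a single string.
text = "hopeful hopeful proud proud proud tired inspired inspired"

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("reactions.png")  # word size reflects frequency; position does not
```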
This moves us on to some more specific examples. I have here a fairly simple example of a research project as it moves through all four steps. In the first step, we decide to download 10 days of tweets from 20 users, and we also decide to download the trending hashtags for those same 10 days. In step two, processing, we remove everything that isn't a hashtag, including punctuation and trailing whitespace, and we store the individual hashtags in a data frame labelled by date and author. In the extraction step, we compare the tweeted hashtags to the trending list; we could do this by time or by volume. We might calculate a 'trendiness' score for the authors based on the degree of match and the timing. And then we have insight. Here we might have to explain what the trendiness score actually measures: is it affected by influencer status, or is it just a tendency to jump on the bandwagon? When running through the four steps and planning your research, it's important to question why at each point. In our retrieval step: why did we choose these users? Why did we choose 20 users? Why did we choose to look over 10 days? Is 'by date' precise enough; how about date plus time? What about time zones? And in terms of insight: is trendiness boosted if a hashtag later goes on to trend, or must it be trending at the time? These are the kinds of questions you should be asking yourself when planning. I've also got a more complex example, taken from a 2010 paper on semantic relations for problem-oriented medical records. Their objective was to describe semantic relation classification on medical discharge summaries, focusing on relations targeted at the creation of problem-oriented records. In short, they aimed to study the relations of patients' problems with each other, and with concepts identified as tests and treatments. For their retrieval step, they downloaded two sets of medical discharge summaries from two separate medical centres. (They did give dates for the summaries they downloaded, but I forgot to include them here.) The processing step involved examining the discharge summaries, breaking them down into sentences and then into tokens, and then processing those using structures and relationships. Specifically, they used surface, lexical and syntactic features. I won't go into too much detail about what those are, but they are features that represent style in a text. A surface feature is simply the surface meaning of a word, what that word means at face value. Lexical features are the expressive elements behind a word; an example of a lexical feature could be word frequency or word length. And syntactic features are the way that sentences are structured. So they looked at three different kinds of language feature. This covers a fairly specialised corner of NLP that we won't directly cover here. During the extraction stage, they compared tokens within sentences, and tokens against concepts, to identify the relationships relevant to problem-oriented records. And finally, in the insight step, they found the results promising for semantic indexing of medical records: they were able to describe a semantic relation classifier that effectively identified a set of relationships within patients' medical discharge summaries. We'll now move on to discussing some of the pros and cons of text mining. As already mentioned, it provides a very large-scale approach to difficult material. In our previous example, we were not practically going to get medical staff to read thousands of those summaries and then examine the similarities. That's the kind of tedious detail that is not really worth the time of researchers, right? But it is the kind of tedious detail that computers can do, and are doing.
And because it's such a large-scale approach to really tedious, difficult material, we can get into details that would not be accessible if we tried to get real people to read all those things. This means we can support novel applications: trying to predict criminal acts, for example, or trying to predict symptoms from social media, or even justifying kinds of policy decisions that so far have been hard to justify well. However, there are always cons to discuss. The first is that you need a large corpus, a lot of material, and not every research question can be matched with a large enough body of material. You may also need a lot of manually created training data. This isn't something I've gone into detail on in this talk, and we can talk about it later if anyone has questions; it wasn't covered in depth in our machine learning talk either, but it basically means a lot of manual work has to go into the automation process, which can be very difficult. There's also a lack of human interaction and supervision, so it's not always easy to say why the computer came out with something, which means the output can sometimes be hard to believe, or even hard to interpret. It's a pretty new area, and we're not yet clear which questions text mining, or other data mining processes, are actually useful for addressing. It may be that people are applying text mining to questions where other, better tools are more easily available or accessible. The last issue on the list is that forcing semi-structured data through the transformation machine into a more coherent, uniform structure means we lose some of the information, some of the structure that is hard to capture or amplify. For example, when we correct spelling, we make instances easier to count; it enables things like relative frequency. But we lose the fact that some people use words ironically, or, as mentioned before, as slang, or in new ways, such as the new usages coming from people on mobile phones and new forms of social media, moving away from keyboards. That means we lose out on some forms of human communication. So it's very important to note that text mining does not yet provide the expert-level insight that a lot of humanities research can. Text mining can read all of the books in the library, and it can read them almost instantly, but it doesn't provide that same expert level; it doesn't provide the human judgement we get when someone critiques a book. We might only be able to provide a relative word count, for example. Text mining, or data mining more widely, just cannot produce the insight and expertise that comes with more traditional, human-intensive approaches to semi-structured data. The analysis of a book by an expert critic is simply not the same as a statistical analysis of how many times each word appears in that book. I think it's fair to say that text mining cannot provide deep insight, at least not yet. But there are efforts to explore more advanced text mining applications that may not be so far away.
For example, Julia provided a really clear example from the application known as the Google Books Ngram Viewer. This is a text mining application that shows us, very quickly, that Luigi was the more popular name until about 1945 or so, at which point Mario became much more popular. This result is fast and easy, and it allows us to ask those underlying, exploratory questions about which name was more popular, Mario or Luigi. However, the result doesn't really tell us why the switch happened, why both names used to be fairly close but no longer are, why there was a spike just before the turn of the 20th century, or what that spike was about. So although it's quick and easy to produce this through the Ngram Viewer, and there are loads of other text mining methods that are similarly quick and easy, the question to ask is: is it worthwhile? I'd say it is, because it shows that something happened before the turn of the century that made Luigi more popular, and that something later happened that made Mario much more popular. With this, we're less likely to waste a whole lot of time becoming an expert on a topic that seems promising but actually turns out to be unimportant. If we can do this sort of quick text mining analysis and show the pattern, then text mining works as a good exploration tool. It's potentially good for saving time, though it may not be applicable to all research questions, and it opens up room for further analysis and further questions. So why not try it? And that is the end of this talk. Thank you all for listening; citations and recommendations are on this slide. The slides will be shared via our GitHub link, which can be found here. Louise or Julia, could you share that link in the chat? All the information will be added there in the next few days. With 15 minutes to spare, I thought we'd stop for a little break and open things up for questions, maybe have a little discussion. Julia will be assisting with this, so use the chat or the Q&A function. I want to address a couple of questions that have come up in the Q&A during the session. The slides used in this session will be available afterwards; they will definitely be on our GitHub repo. They may be available through other channels as well, and we'll certainly provide them to the other organisations involved, though I don't know where they store slides from their presentations. But if nothing else, they will be on our GitHub repo. Also, both of the sessions today and on Friday are being recorded, and the recordings will be made available soon-ish after the sessions finish. So if anyone has to leave early today, or is not able to attend one of the sessions on Friday, there will be a chance to catch up. We have another question: are there any good text mining corpora available, specifically with an aim towards NHS data? That's a bit of a tricky one. At the UK Data Service, we do have some health diaries and surveys and things like that on our site, and you can access those and use natural language processing on them, but they may be centred on a topic that's not necessarily of interest to you. For example, we have one about foot and mouth disease.
If that's not your particular area of interest, then it might feel like it's not exactly what you want. But I'll keep having a look around the web, see if I can find some more resources, and post them in the chat. And do chime in if you have more to say on any of the questions I've answered in the Q&A, or if you know of any resources that people might want to use NLP on. I can't remember the name of the resource, but Louise and I discussed a corpus of medical terms yesterday. Louise, do you have the link for that? Yeah, I was just going to say I'll go and try to find it now. It should be in a chat somewhere, so I'll share it in a minute. Cool. I think it was a corpus of, was it, I'm not sure if they were NHS terms or just more generic? Yeah, I'm not sure if it was specifically NHS. Let me see, where would it be now? I think part of the problem is that this is really hard to search for. We might think we know unstructured data, but resources are not usually listed as structured or unstructured. They might be listed as free text, or as a survey, or as open answers, or something like that. There's a lot of terminology people might use, but probably they're not classified that way at all. They're just recorded as 'the results of a survey on hospital waiting times' or 'the results of a survey on prescription charges' or something like that. They really tend to focus on the topic rather than the format, and that can be a real challenge in finding the data you want. Okay, so a few more questions have come in. One is: what are the main challenges of NLP, and which algorithms generally work best for performing an NLP task? In my experience, the main problem of NLP is getting access to good data. For example, I did a project in which I looked at abstracts submitted to a conference over 20 years, and I wanted to see whether the people who wrote those abstracts tended to use phrases like 'people with autism' or phrases like 'autistic people'; that is, person-first or identity-first language. The biggest challenge was not analysing those terms. It was scraping the abstracts, which were provided to me in PDF format: putting them into text format, encoding that text with the right sort of encoding, and then tidying it up into comma-separated value files. If your material is already in a nice structure, then you're over the main challenge of NLP. That partly addresses a later question as well: how easy or difficult is it to build your own corpus? It is easy, if you have access to semi-structured sources of text. If you use the Twitter API, you can quite easily build a corpus of tweets. On the other hand, if you only have access to PDFs, you will be furiously typing away at the keyboard and getting really frustrated, because converting PDFs to structured text is an uphill battle. Louise or Nadia, do either of you have anything to say about the main challenges of NLP, or how difficult it is to build your own corpus? I think one of the main challenges, at least that I've come across, is just trying to get my data in the right format: sorting out my files, making sure things are imported correctly. I think we mentioned yesterday, when we had one of our drop-ins, that one of the hardest parts of some text mining we did recently was trying to get rich text files into Python to do some analysis.
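For anyone facing that document-conversion step, here is a minimal sketch of pulling text out of a PDF with the third-party pypdf library; the file name is a placeholder, and real PDFs usually need further tidying afterwards:

```python
from pypdf import PdfReader

# Placeholder path: one conference abstract provided as a PDF.
reader = PdfReader("abstract_2003.pdf")

# Extract the text of every page; layout quirks, hyphenation and
# encoding issues usually still need manual tidying after this.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```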
That was a real finicky, fiddly process, which you'll often find with natural language processing: you've got a big unstructured mass of words, and you've got to really figure out how to build it into a database, and what you want from it. Yeah, I'd agree. Another challenge, which is arguably a bit vague, is that NLP requires you to have some basic knowledge of a programming language, whether that's R or Python or any language; it's arguably not a beginner topic. Yeah, definitely. It can be quite hard, and you have to get used to certain data structures. For instance, pandas is often used to manipulate data frames, so you've got to get used to using that to manipulate your data, along with other tools and other functions, which, again, is a bit of a learning curve. Yeah. What else have we got? I think another of the main challenges of NLP is defining your research question, and to some extent that comes down to knowing what kinds of research questions other people have asked, in order to know whether your research question makes any sense. With the person-first or identity-first language project, for example, my first research question was just: which of these is more common? And then I expanded on that research question by asking: are there kinds of people that use one or the other preferentially? It turned out that the kinds of people who wrote 'autistic people' also tended to be the kinds of people who didn't work in practice-facing roles; they worked in labs, and they talked about autistic genes and autistic assays and autistic variants and things like that. They were really used to the adjective-noun structure, and they applied that to people too. Whereas people who worked face to face with the people involved in this research tended to use 'people with autism', a very different structure, because they have a different kind of function and experience. And that came out of my first-round research question, 'which is more common'. So you may need to choose different algorithms based on the different questions you're asking, and it's difficult to know what questions you're asking without first knowing what's possible with natural language processing. The first thing to look at is quantifying: just describing the text in ways that make sense to you. What's most common here? What's the distribution of this kind of term, or that kind of structure, throughout these texts? Those basic numerical questions may then lead you on to more interesting, insightful questions that require more specific algorithms. I hope that helps; feel free to write another question if that doesn't quite hit the point you were asking about. Next: are there libraries or modules to use for extracting features such as semantic relations between terms? Yes. Extracting semantic relations between terms is actually one of the more fundamental features of natural language processing, so any NLP package or library, in either R or Python, will have some features like this. What you need to do, though, is specify what kind of relations you're looking for: synonym relationships, antonym relationships, or structural relationships, as in what kind of noun most frequently follows an adjective like 'autistic', for example. That would be a structural relationship.
So synonym or antonym relationships would be more corpus-wide relationships. These are actually fairly basic operations, and I think Louise will address some of these basics, cementing the concepts of what NLP can do, in the second part today. If not, please do write in again and tell us more about what you want to know. We've also had another question, which I think has just been answered in typing: it was about building your own corpus and how hard that is. There's a really good article that goes through the steps you need to consider, but I guess the advice is that if you can find a corpus that's already available, use it; there's no need to answer a question that's already been answered. Okay, nice. If you want to paste a link to that article into the answer to that question, that's probably a good way to go about it. Okay, I've got another question: how do you choose between different NLP methods? A blend of accuracy metrics, project purpose and previous research, or solely accuracy? It really depends on your research question, and also your corpus. If your research question is about distinguishing, for example, the probability that a given text of unknown authorship was written by person A, B or C, accuracy is probably what you're looking for. If your research question is about whether a text is likely to be used for some kind of nefarious purpose, like whether there is some kind of code embedded in the text that people might be using to send information in a hidden way, accuracy is really hard to judge there. In that case, it's probably more about following patterns, or maybe even drawing on more than the text: the text plus what time of day it was sent, or who sent it, or how many people read it, these kinds of features. At that point you're really into a complicated, probably machine-learning kind of algorithm, built around clustering or probability. You really have to match the research question, the analysis method, and the data or texts you're using; it's a real skill, and I don't pretend it's an easy thing to learn. Practice is the best way. There's one question left, and then I suppose we can move on to the second part of this webinar. Do we want to leave a little break, or should we just crack on? Yeah, we can probably leave a few minutes. People can submit more questions during those few minutes, and I highly recommend that everyone stand up and have a stretch, get a drink, scratch the kitty behind the ears, and step away from the computer for a while. In that case, I'll stop sharing so Louise can share her slides. Start when you're ready, really. I'd like to point out that it is pouring rain and also sunshine outside my window right now; a real strange mix of weather. All right, I'm going to turn off my video. And what time are we coming back, just five minutes past? Yeah, let's leave it at that. Okay, super. Thank you everybody, we'll see you in five minutes. Hi everyone. It's just gone five past, so let's get started with the second part of this text mining workshop. Louise, sorry to interrupt, but you've got your presenter view screen sharing. Okay, is this better? Yes, there you go. Perfect. Sorry about that.
So hi everyone. My name is Louise Kaplanar, and I'm a research associate at the UK Data Service. In this session, I'm going to take you through the basic processes of text mining, including the cleaning and preparation processes and some of the first natural language processing steps. I'm going to go ahead and turn my camera off now, and then we'll get started. Okay. Those of you who attended the first session just before might remember that we said text mining is about turning unstructured or semi-structured input into structured output. What we're doing now is demonstrating how some of that transformation actually works in practice. You might also remember that text mining has four basic steps: retrieval, processing, extraction and insight. In this session, we're going to be focusing on the processing and extraction phases. We've already said that this is not always a linear process. This is especially true if you're new to text mining, or you're doing a new kind of analysis; you should expect these steps to be iterative. What you'll find is that you'll probably go around them a few times, jumping forwards and backwards, before you really get the pipeline you actually want. So we're going to cover topics in processing, basic extraction, and basic natural language processing, which lies somewhere between the two, depending on your research approach and needs. We'll cover processing steps like tokenisation, standardising, removing irrelevancies, and consolidation; some basic natural language processing like tagging, named entity recognition and chunking; and some basic extraction, which includes word frequency, similarity and discovery. And of course, don't worry too much if right now that seems like a completely foreign language, because I'm going to go through it all in detail, break it down, and make it a bit easier to understand. First off, as we talked about, you need to turn the raw data you got during the retrieval step into something you can work with, something we can do some text mining on. Let's say we have one massive file with hundreds of newspaper articles in it. We might want to break this into smaller files, so we have one article each. Or we could insert a line break or a delimiter after each article, so you could import the file into a spreadsheet program with each article on its own row. Another way of making this useful is turning each article into a dictionary entry, with key-value pairs for the article contents; you could do this for the typical kinds of article metadata, like author, title and date. All of these are potentially useful ways to process a raw data file into something more suitable for text mining, and each method lends itself to different kinds of analysis. So let's consider each of these three methods with a few examples. The first, where we've broken the big file down into many smaller files, can be useful if it's the text, rather than the author or the date, that you want to focus on. You'd want a division like this if you want lots of examples of text to work with, because it allows you to compare the contents of each file quite easily: you can just dig into the text and then compare it to other texts.
So maybe you want to look at which of these newspaper articles contain interviews with people talking about long COVID, and use the files to compare them. The second method, where each article is on its own row in a spreadsheet, is good if you want to analyse the articles as document entities. For example, you might want to treat an article on a specific date by a specific author as one entity, rather than working with things as whole units or blocks of text. This could be useful if you want to look at mainstream coverage of the coronavirus outbreak from a certain date, for instance. The third method, where we've got our key-value pairs, is useful if you want to discover relationships between those features and the text. Maybe you want to see if you can discover a language or a style unique to a publication, or see how a topic like COVID-19 vaccination has been approached by different publications over time. So there are lots of different ways to focus your text mining, and that's going to influence how you choose to break up your text into text-mining-suitable files. That's something you'll really want to keep in mind. But even after you break that big dataset up into suitably sized pieces, however you choose to do it, many different kinds of text mining will require you to break each of those pieces down further, to your lowest unit of analysis. This process is called tokenisation. Lots of things can be tokens: whole documents can be tokens, as can whole chapters or paragraphs, sentences, words, or even word stems. But the most common tokens you'll come across are words and sentences. So let's look at an example. Imagine that one of my small files, or one of my rows in a spreadsheet, contains this string of text: "I'm feeling fluey, tired and ill. It's probably COVID but I'm still testing negative." I could tokenise this string as words. This means transforming it from one long string into a list of many small strings. You'll also notice that the "'m" from "I'm" has been counted as its own word, and any punctuation has also been counted as its own word; that's what happens when you tokenise by words. Another way you can choose to tokenise this string is by sentences. So let's see what that looks like. This also transforms your one long string into a list of smaller strings, but in this case there are fewer of them; here, we've got two. And at this point, the full stops at the ends of the sentences are included as part of each token. In terms of whether you want to tokenise by words or by sentences, there's no real right or wrong answer; it largely depends on what kind of analysis you want to do.
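As a minimal sketch of both kinds of tokenisation in Python using NLTK, assuming nltk is installed and its tokeniser models have been downloaded:

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")  # one-off download of the tokeniser models

text = "I'm feeling fluey, tired and ill. It's probably COVID but I'm still testing negative."

print(word_tokenize(text))
# ['I', "'m", 'feeling', 'fluey', ',', 'tired', 'and', 'ill', '.',
#  'It', "'s", 'probably', 'COVID', 'but', 'I', "'m", 'still', 'testing', 'negative', '.']

print(sent_tokenize(text))
# ["I'm feeling fluey, tired and ill.",
#  "It's probably COVID but I'm still testing negative."]
```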
Next, we have standardisation. This is about replacing the multiple ways a given word or phrase might be written with a single option. It improves analysis because otherwise those many different forms would be counted as different words. For instance, think of the different American and British spellings of 'colour', or 'miles per hour' written out in full in some instances, as 'mph' in lower case in others, and in upper case in others still. You're going to want to standardise these terms. And one tool that I really want to highlight for standardisation is regex, which stands for regular expressions. It's really useful for standardising terminology and, as I said, for standardising acronyms, and it works very much like find-and-replace in a word processor. As an example, maybe I want to replace 'ill' with 'unwell' in my text example, so my original text becomes a new text with 'ill' replaced by 'unwell'. Now, this process can sound quite tedious and boring, and you are going to want to scan through your text and get a real insight into it, because you need to consider which terms need to be standardised. For instance, if you're looking at COVID, how has it been written? Is it 'COVID-19'? Does it have the more scientific spelling, with 'SARS' at the front and then the hyphen? So yes, you're going to need to explore your data, and that's a big part of text mining. But once you find out which terms you need to correct and which you want to replace, you can actually do multiple operations at once, so you don't have to go through the tedious process of making one swap at a time, which would obviously take quite a while on a really large dataset. What you can do is define a set of regular expressions to find and replace together. So I could create a regex dictionary that replaces 'ill' with 'unwell' and 'tired' with 'fatigue', and let's say I also want to replace 'COVID' with 'COVID-19'. Then my original sentence is transformed: I can change it, in one operation, into a completely new sentence. There are many different tools you can use for standardisation, and most of them target a particular kind of standardisation, but in general they can be understood to work a bit like this regex dictionary. For example, another tool worth mentioning is a case converter. That comes with a built-in dictionary that already defines the lower-case and upper-case forms of all the letters; you simply apply the case converter to iterate over the text, which switches all the upper case to lower case, or vice versa. Likewise, a spellchecker finds commonly misspelled words and replaces them with correct ones, and these spellcheckers draw on established dictionaries as part of the process. Each of these tools tends to target one particular kind of standardisation, and you might need several different standardisation loops to get your text into the format you desire.
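Here is a minimal sketch of that regex dictionary in Python, mirroring the replacements just described:

```python
import re

text = "I'm feeling fluey, tired and ill. It's probably COVID but I'm still testing negative."

# A small find-and-replace dictionary; the \b markers restrict
# replacements to whole words only (so 'still' is left alone).
regex_dict = {
    r"\bill\b": "unwell",
    r"\btired\b": "fatigue",
    r"\bCOVID\b": "COVID-19",
}

for pattern, replacement in regex_dict.items():
    text = re.sub(pattern, replacement, text)

print(text)
# "I'm feeling fluey, fatigue and unwell. It's probably COVID-19 but ..."
```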
Back to the punctuation: you can see the apostrophe has been removed from the 'm and from the 's. You can change the punctuation-removal process so that these apostrophes don't get removed, if you want to keep them, as obviously they carry some meaning that we might want to preserve. Another example of the kind of thing you might want to remove is stop words. Stop words are usually determiners, conjunctions, adverbs and the like. Just like punctuation, they matter a lot for sentence-wide meaning, but they have little actual content as words. Also, for any given language, they tend to have more or less the same distribution in text regardless of author or style, so they don't really help you analyze text that much. So many text mining approaches just remove them, which, if we do it to our text here, produces a new list of word tokens that is just about interpretable for humans. This is now a list that can be analyzed statistically. In practice, the texts you're likely to be working with will be much longer, and so they'll be much more suitable for statistical analysis; we're just using a short example here that will actually fit on the screen. You can see that "and" has been removed and "what" has been removed. We don't have a whole lot of stop words here, but it's still important to get rid of them. Next we have something called consolidation. This is about turning the various linguistic forms of the same word into just one word, so they can be counted as the same thing. There are a few ways to do this, one of which is called stemming. What stemming does is aggressively strip back word markers like verb endings and plurals, in a very basic way, using rules like "remove ed from words that end in ed" or "remove s from words that end in s". So let's have a look at our sample text again, and say we put it through a stemmer. You can see that "testing" has become "test", "feeling" has become "feel", and for "probably" the y has been removed. So yeah, any plurals and verb endings will be stripped like this. And if we're happy with this stemming process, we might step back and say, okay, we're done with the cleaning, and we can now dive into the text mining. But maybe this is too much consolidation. Sometimes we want the different forms of a verb to be counted together, but not counted with nouns or adjectives. Plus, the rules associated with stemming are a little bit inflexible, so they don't account for irregular plurals or irregular tenses. So another option that we have for consolidation is lemmatization. Lemmatization is similar to stemming, but it is more sophisticated: it reduces nouns to the singular form and verbs to the root verb, but it keeps these separate. For this consolidation method, I'm going to use a different example sentence, because my original one didn't have any words that needed lemmatizing. So here we've got a different example, which reads: "My uncle ached a lot yesterday. I went to the doctor and he said it's not broken, just sprained, and I'm becoming more optimistic." Not the most pleasing sentence to read, but we're just using it as an example. So if we put our processed text here through a lemmatizer, we get a new list of text tokens. And we can see that "ached" has changed to "ache", "went" has been changed to "go", "broken" has been changed to "break" and "becoming" has been changed to "become".
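As a rough sketch of that contrast in code, again using NLTK (exact outputs depend on the stemmer's rule set, so treat the comments as indicative):

```python
# A rough sketch contrasting a rule-based stemmer with a lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)  # lexicon used by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["testing", "feeling", "probably", "ached", "broken", "becoming"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# Aggressive rule-based stripping, e.g. "testing" -> "test",
# "becoming" -> "becom"; irregular forms like "broken" are left alone.

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# Treating each word as a verb: "ached" -> "ache", "broken" -> "break",
# "becoming" -> "become": real root forms, not just chopped endings.
```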
Coming back to the slides: the output we've produced here is a little bit different from the output we would have got if we'd stemmed this sentence. If we had stemmed it, "becoming" would have just had its ending chopped to give "becom", and "ached" would have had its "ed" removed to give "ach". So you can see the lemmatizer has been a little more sophisticated in preserving some of these forms. Now let's explore some basic natural language processing, and we're going to look at something called part-of-speech tagging, which is a basic NLP function. I find that this sometimes counts as processing and sometimes as extraction. It does use linguistic structure, so it is text mining rather than just cleanup and preparation of data, but it is not usually a goal of NLP in and of itself. POS tagging can start with raw text or processed text. Oh, sorry, I said we'd return to the COVID sentence; we're not, we're going to stick with the one we used before. So if we put this through the POS tagger, it produces a new list of text tokens, each paired with a POS tag. We can see singular noun tags, verb past-tense tags, verb past-participle tags and basic adverb tags. And that gives us the POS tagging. We can also choose to POS tag our words before we perform lemmatization, and this produces a properly lemmatized output, with the verbs reduced to a root verb form and the plurals properly singularized. So let's have a look. Here we've got a POS tagging of our terms, and now we're going to do lemmatization. Not much difference, but we can see that "more" has changed from being a comparative form of the adverb to the root adverb "much". We can go back and see what it was like when we just did the lemmatization before we POS tagged. Let's see. Yeah, you can see it was just kept as "more" there. But when we POS tag first, that comparative form of the adverb is reduced to the root adverb "much". We can also choose to output just the lemmatized strings, rather than getting the output as these POS-tag pairs, which, you know, aren't always very pretty and aren't always what we want. Another useful basic NLP process is something called chunking. That requires word-tokenized and POS-tagged text as input, and what it does is build it back up into larger structures that have logical relationships. Importantly, this works best if you have not put the text through all of those standardization and consolidation processes. So you can see I've still got my sentence here. If we put this through a chunker, it returns a chunk like this, where it recognizes that all of these words belong to the same sentence. So we took the text apart into words when we tokenized, and then we're able to build it back up into something that is recognized linguistically as a sentence. You can see I've displayed this visually to make it easy for us to read; the S here denotes that it's a sentence. What's actually returned by the computer isn't always as easy to read, but it's the format that our computers find easy: you can see here we get the POS-tag pairs, and this is the output it would produce. And a particular kind of chunking operation that you might find useful when you're text mining is something called named entity recognition. This aims to group the tokenized and POS-tagged words into sentences, but also into noun phrases that are associated with people, organizations and places.
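Before we get to that, here's a minimal sketch of the POS-tag-then-lemmatize step from a moment ago, again with NLTK. The small tag-mapping helper is our own illustrative code, since the Penn Treebank tags that pos_tag produces need converting to WordNet's POS categories:

```python
# A minimal sketch: POS tag first, then lemmatize using the tags.
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def to_wordnet_pos(treebank_tag):
    """Illustrative helper: map Penn Treebank tags to WordNet POS codes."""
    return {"J": wordnet.ADJ, "V": wordnet.VERB,
            "N": wordnet.NOUN, "R": wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

tagged = pos_tag(word_tokenize("I went to the doctor and I'm becoming more optimistic."))
print(tagged)  # token/tag pairs, e.g. [('I', 'PRP'), ('went', 'VBD'), ...]

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged])
# Verbs come back as root forms, e.g. "went" -> "go", "becoming" -> "become".
```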
To illustrate named entity recognition, I'm going to use a new sample text. Here we have the sentence tokens from the sentence "Bruce Wayne is the CEO of Wayne Enterprises, but he's also Batman." So let's see what happens if we put this through a named entity recognition chunker. It returns this output, which is a more complex output with levels of grouping. We still have the big sentence-wide group denoted by this S here, but we also have some groups within the sentence. So we have Bruce recognized here as a person, and Wayne Enterprises has been recognized as an organization. But we can see that it's not been 100% accurate: here, "Wayne" on its own has also been falsely recognized as an organization. That's not quite right, of course, but it's probably influenced by the fact that Wayne Enterprises is definitely an organization. And it hasn't worked 100% because this is an automated process, so you should never really expect it to be 100% correct. If it is really important for you to be close to 100%, you will probably have to do a lot of manual revision, or maybe even train your own named entity recognition chunker. But good advice is to start with the ones that are freely available, and then you can train them to get better at the kind of tasks you want to give them. This is a really useful technique, and it brings us to the end of the processing section. So I've demonstrated a lot of different processes, but what I can't do is tell you which ones to do or in what order. There are some basic rules of thumb, though, and some common-sense steps to take. As I've said, chunking and POS-aware lemmatizing require text that is already tokenized and POS tagged, and regex might be best applied before removing uppercase, to better catch those acronyms or abbreviations. But most of the time, you're going to have to do some experimentation, some thinking and a bit of strategizing. For example, you might get a long way through your analysis before you notice an abbreviation or an acronym that would have been better solved with regex early on. So, like I said at the start, it's best to consider this an iterative process, which means you probably want to arrange your steps into a pipeline, where each step takes the output of the previous one as its input, and re-run the whole pipeline from scratch after you insert a new step into the middle somewhere, because otherwise it'll be hard to keep track of. When I'm doing my own text mining projects, I'll often do my pre-processing and processing, and then take the output from that as the input for the next stage, which is extraction. And another note here: you should always keep track of what you're doing. Write down, keep notes of, what you're doing and why you're doing it. This is really useful so that others can reproduce your workflow, which matters if you want to share it, and to get your methodology right you're going to want to explain why you chose to do things the way you did. So the pipeline can change throughout. Here we have a pipeline where we've tokenized, got rid of our punctuation, removed our stop words and stemmed. But maybe it changes, and you decide that the actual pipeline you're going to settle on is tokenizing first, then POS tagging, then lemmatizing instead, and then removing punctuation and stop words.
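Before moving on to extraction, here's a minimal sketch of that named entity recognition step using NLTK's off-the-shelf chunker, one of those freely available starting points. Don't be surprised if it reproduces exactly the kind of misclassification just described:

```python
# A minimal sketch of named entity recognition with NLTK's built-in chunker.
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

from nltk import ne_chunk, pos_tag, word_tokenize

sentence = "Bruce Wayne is the CEO of Wayne Enterprises, but he's also Batman."

# Tokenize -> POS tag -> chunk into a tree with PERSON/ORGANIZATION groups.
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)
# The S node is the sentence; labelled subtrees mark the entities,
# and, as on the slides, some of them may well be mislabelled.
```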
And like I've said, it is a very iterative process. Now we're going to move on to some extraction techniques, some of the things you can actually do in your text mining. First, I'm going to talk about word frequency. That can be used to identify the most recurrent terms or concepts in a set of data, and it's useful for analyzing, for instance, customer reviews, social media conversations or feedback, or just getting a sense of what a text is actually about. So, for instance, if you keep coming across the words "expensive", "overpriced" and "overrated", if they're frequently appearing in your customer reviews, this might indicate some important information: maybe you want to adjust your prices, or maybe you need to adjust your target market. Let's look at a couple of examples. First, we're just going to look at a trivial one, using the COVID sample text. We've got a little visualization of the pipeline, so this is what it went through in processing, and once we've stemmed it, we then perform this frequency count. You can see it returns a dictionary of each word matched with a number, which indicates how many times it occurred in the text. Mine is not very interesting or insightful at all, because I'm just using this one tokenized sentence: you can see my most frequent token is "I", and then we've just got counts of one for the rest of the words. But say you're using a less trivial example, like the entire text of a novel, or maybe an entire set of medical papers. You obviously don't want to do that manually, and if we use this same basic pipeline, we're going to get too many words and counts to show. So instead, if you are working with a lot of data: here we have the entire text of Emma by Jane Austen, which is a really good test text to work on. You can get it with the NLTK corpus, and if you Google that you'll see there are a lot of other example texts too. So what you could do is print the ten most common words in the text. In this case, the ten most common words in this novel include "Mr" and, unsurprisingly, "Emma", along with a bunch of others. We can also find the count for any target word; maybe you want to see how often the word "common" appears, and in this text it appears 142 times. You could of course pick any other word instead of "common". This is just to demonstrate that you can get a count of any specific word, either in a big list or from a targeted query. Now, we might not just be interested in how many times a word occurs, but also in how many times similar words occur. Maybe we could then group them together, or we could contrast them. Perhaps one field of research uses one word, whereas a very similar word occurs in another field of research, and maybe these are two research communities that are just not talking to each other, and there's a big push to get them on the same page and homogenize which terms they use. So word similarities can be really interesting. For this you'll need to understand the concept of word vectors, which are built into packages like spaCy. Not sure if I'm saying that right, but we'll go with it. For each word included in the word vector package, it has a vector with a score on 300 dimensions, and these scores are derived from how the word is used in a large corpus of natural language.
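Before we get into the vectors, here's a minimal sketch of those frequency counts in NLTK, using the same Emma text from the Gutenberg corpus (the exact top ten depends on how much cleaning, such as stop-word removal, you do first):

```python
# A minimal sketch of word frequency counting over Emma with NLTK.
import nltk
nltk.download("gutenberg", quiet=True)

from nltk import FreqDist
from nltk.corpus import gutenberg

# Lower-case the words and drop punctuation-only tokens.
words = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]

freq = FreqDist(words)
print(freq.most_common(10))  # the ten most frequent words and their counts
print(freq["common"])        # targeted query: count for one chosen word
```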
Back to word vectors: those scores reflect things like what part of speech the word is, what it is most frequently used as, and what kinds of words are typically found before or after it, which is something referred to as collocation, along with other kinds of linguistic analysis. So we use this package to derive word vectors, and then we can compare these vectors for different words. When you compare word vectors, what you get is a score between zero, which denotes no similarity, and one, which means the words are identical. We can consider this with a really simple example. Let's say we have three words: troll, elf and rabbit. All three are nouns, so they're going to be somewhat similar. But as you've noticed, two of them are fictional creatures, so we'd probably expect those two to be more similar. And that is what we find in the results. Each word returns one when compared to itself, which makes sense, right? And something between zero and one when compared to another word. So yeah, troll and elf are more similar to each other than either is to rabbit: you can see we've got a similarity score of 0.4 for troll and elf. If we compare troll to rabbit, we have a very low score of 0.29, whereas if we compare rabbit to elf, there's a slightly higher score of 0.34. And that's quite interesting, right? The rabbit is more similar to the elf than to the troll. This could be because rabbits and elves are used together, or appear in similar contexts, more often than rabbits and trolls are. Maybe they occur in similar kinds of texts, like fantasy texts, or it could be that they both live in the woods, or they're both considered positive things, whereas when we think of trolls, we think of a troll under the bridge. Who knows what the reason is, but according to the word vectors, rabbit and elf are more similar than rabbit and troll, and that kind of makes sense. We also have document similarity. You can do something very similar for documents, and this is a really interesting option that works much like word similarity. But instead of having preloaded word vectors, it builds a vector based on your documents. It should be noted that this method is really sensitive to how your documents are processed: if you remove the punctuation, or if you stem or lemmatize, all of that is going to influence the document vector in a big way. What the analysis does is build a vector for each document and then compare the vectors between two or more documents, and again, it returns a value between zero and one. So, for example, we can look at the comparison of two documents here. We have Emma and Persuasion, which are both written by Jane Austen, and unsurprisingly, they score quite high in terms of similarity. And that makes sense, right? Probably similar language, similar writing style, so yeah, a very high score. Then we've got Emma by Jane Austen again and Julius Caesar, the play by Shakespeare, which scores 0.97, which is less similar. But, you know, they're still both written in the English language, they're both quite classic literature, and they're both fiction, so things like that are going to influence that ranking.
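Going back to the word-level comparison for a moment, here's a rough sketch of what that troll/elf/rabbit check might look like in spaCy, assuming a model that ships with real word vectors, such as en_core_web_md, is installed. Exact scores vary by model, so it's the ordering that matters:

```python
# A rough sketch of word-vector similarity in spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors
troll, elf, rabbit = nlp("troll elf rabbit")

print(troll.similarity(troll))   # a word compared to itself scores 1.0
print(troll.similarity(elf))     # the two fictional creatures: highest pair
print(troll.similarity(rabbit))  # expected to be the lowest pair
print(rabbit.similarity(elf))    # expected to fall somewhere in between
```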
Returning to document similarity, we then have a final comparison, which is Emma, again by Jane Austen, and Firefox, which is a selection from a corpus of web text called Webtext. This scores much lower, at 0.86. It's still pretty high, because they're both in English, but it's much lower than the other two, because the grammatical structure of web text is very different from the grammatical structure of fiction, so they're going to score lower in terms of similarity. You could use this as a way of testing the likelihood that two texts were written by the same author, or that two texts were written in the same time frame, or something like that. And this is fundamentally the basis for how plagiarism checkers work, except that instead of a similarity score, they score a number for originality. And rather than just a simple document vector, what they also do is look at sub-document sections to see whether things have been quoted too closely, or whether there are particular phrases that are identical, or just very similar. So there's a lot more complex stuff that can be done, but this is just a basic natural language processing webinar, so we're not going to get into that too deeply. And then we have discovery. We can set up word frequency counts to count which words appear next to target words. This is called collocation, which I mentioned before, and that's one kind of discovery. We might, for example, want to know whether a particular noun is more often paired with positive adjectives or negative adjectives. But actually, what I'm going to show you here is another kind of discovery, which is based on patterns. So first, we define a pattern. Here I've defined my pattern as the word "like", followed by the word "a", followed by a noun. And then, if we search Emma with this pattern, it returns these examples. Not amazingly interesting or insightful examples, but quite a good start. What we can do then is build a more sophisticated, slightly more complex pattern. So if we define a more complex pattern like this one, a verb followed by "like a", followed by up to three optional modifiers, followed by a noun, it returns some quite interesting fragments. We can see here that, according to Jane Austen, or at least according to a perception of society at the time, men are associated with arguing and writing, as well as being young and sensible, in various combinations. And we could continue this analysis to see whether women are also associated with arguing and writing, or whether these are considered to be men-only things. So you can see how this is a bit like collocation, in that we could find instances of "young man" or "sensible man". But "man argued" or "man writes" is much less likely to be returned by collocation, so this is a more sophisticated approach than pure collocation. And the fact that these appear in "like" phrases suggests standards or expectations, and possibly even ideals, which are a bit more abstract than plain descriptions or collocations. So it's worth thinking about developing your own patterns when you're looking to do some text mining. And now we've come to the end of this presentation, so what I have is some links to the code for Friday's code session, where we've got an hour in Python and an hour in R.
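As a little taster for that session, here's a rough sketch of the kind of pattern search just described, written with spaCy's rule-based Matcher. The sentence is invented for illustration; you'd run this over the full Emma text for real results:

```python
# A rough sketch of pattern-based discovery with spaCy's Matcher.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: VERB + "like a" + optional modifiers + NOUN.
pattern = [
    {"POS": "VERB"},
    {"LOWER": "like"},
    {"LOWER": "a"},
    {"POS": {"IN": ["ADJ", "ADV"]}, "OP": "*"},  # zero or more modifiers
    {"POS": "NOUN"},
]
matcher.add("LIKE_A_PHRASE", [pattern])

doc = nlp("He argued like a man, and she wrote like a very sensible woman.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# e.g. "argued like a man", "wrote like a very sensible woman"
```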
You can visit our GitHub here if you want to get a little bit familiar with the stuff we'll be going through. We're going to be looking at a foot and mouth qualitative data set, and we'll be doing a bit of processing on that. And we've got some other links here which are really useful; I definitely recommend you give them a visit. And of course you can access the slides after this, so don't worry that I'm just going to move past them. And now I'm going to pop my camera back on. There we go. Howdy everyone. Excellent stuff, thanks so much for sharing. I liked how you included health examples in your chunking and part-of-speech tagging; I think it's useful to see examples that are relevant. I'm not sure how relevant Bruce Wayne and Batman are to everyone, though. Yeah, super. Nadia has just shared a poll in the chat about whether people will be attending the R or Python sessions, or both, on Friday. If you could let us know what you'll be attending. There's no obligation if you haven't decided yet, or if you want to swap your place with someone else and you don't know what they'll be doing. Please don't feel obligated; we're just a bit nosy. Yeah. So we've had a few more questions come in, a few technical things, like some of the links we'd shared hadn't been copied in properly, so we were re-sharing those, and things like that. But we did get one, and I hope it's all right to read it out. One of the people who'd shared a question earlier, I in turn asked: would you mind sharing your research questions? Because I think it's useful to show people how questions link to methods link to data sets. And so Roberto has shared that he's trying to understand the impact of urban green and blue spaces on non-communicable diseases. They're hoping to use the reviews from Google Maps left by local guides to extract some insights about why people are or are not using these spaces. For example, are they not accessible? Are there no facilities, etc.? And they're working with local councils to improve them. So that's a really interesting example. The research question is: are these spaces suitable, and if not, how can we make them more suitable? The data set then would be Google Maps reviews and comments. And the methods would be natural language processing, either to identify keywords, maybe to classify by sentiment, or to topic model, things like that. So you can identify, you know, is accessibility a problem, something that people talk about a lot? Are these spaces dirty or dangerous or pleasant or quiet? These would be useful things to apply natural language processing to. Thank you very much for sharing. Anyone else who has that kind of research question, method, data set, or even maybe just two of those three, and wants to share them: I'm happy to talk about them a little bit and see, you know, does this make sense to us? How might we go about things like that? Very interesting stuff. Also, if people have particular packages or libraries in Python or R that they use and think are very good, please do feel free to share those as well, because experience and recommendations never go amiss. Especially at introductory level, with things like this where we're all just sort of feeling it out, it can be really useful to hear what other people are up to and what they've done. Yeah.
The Natural Language Toolkit is always a pretty good place to start with Python, I think. Yeah, it's a really comprehensive package that's fairly straightforward and logical. It's pretty good. I think, certainly for introductory or intermediate levels, it has probably everything you need, though it may not have everything you need to do the entire process. It doesn't include web scraping, for example, so the retrieval step you'll have to do somewhere else. But yeah, really useful stuff. So please let us know if you have any other questions, or if in fact you're sick of the sound of us and you want to go lie down. Understandable. Yeah. Oh, great. Roberto has shared his GitHub, which I will be following, because I'm quite nosy and interested to know how this goes. Yeah, thanks for that, sounds really interesting. Yeah, I mean, this is also an interesting use of public data. You might also incorporate, with this sort of project, trying to encourage people to add new Google Maps reviews. You might write to disability groups or accessibility groups, or even local schools and things like that, and say: could you make it part of your recommended activities to take a few minutes to review the spaces that you use most? Yeah. So this is contributing a little bit to the issue of how you get a good corpus. Sometimes you might need to make it, and that might include nudging people to engage with a platform or fill out a survey. Yeah, because I was going to say, I'm not sure. I've never thought about reviewing a space on Google, so I wonder how many people do use it, and I wonder what the demographic is like of the people that do. Yeah, it's unlikely to be the most neutral set of reviews. People either absolutely love something or absolutely hate something if they're going to review it, while all the people who thought it was nice enough don't bother. But with that in mind, you can certainly use what is there. Yeah, definitely. I think it is hard sometimes, I guess, because most people don't go out of their way if it's a good experience. If somewhere's been nice, I don't normally write a review unless it's absolutely blown me away, but some people, if they've had the worst experience ever, that'll be the first time they ever write a Google review, won't it? Yeah. And sometimes you might need to combine it with other kinds of data. So if there's a tram stop nearby, for example, you might be able to get data on how many people check in or out of that tram stop on a given day, and get a sense of how many people are moving through the space. Then you might be able to set the reviews against the number of people moving through the space and say, okay, only three people left these comments, but actually 3,000 people are moving through this space every day and only three left a comment. So maybe it's actually more of an issue, because lots of people won't be motivated. Yeah. Yeah, I don't know. We'll take any other questions, or anything else people want to tell us. Have we got any results from that poll yet? We've got eight responses in, and all say both. Oh, okay. That's interesting.
I think it's because people maybe haven't decided yet how they want to go about their research. Someone has asked when the hour in Python is coming up. That's coming up Friday, at the same time as this, so from one to three o'clock. Is it Python or R first? Python first, right? Yeah. So the first hour, approximately, I mean there'll be a bit of a break in between, will be Python, and then the second will be R. And are they covering more or less the same steps? Yeah, we're going to try and largely do the same. But it'll be interesting, I suppose, if you are joining both, to see the contrast between the two, because we encountered different problems while we were building these notebooks, and it would be quite interesting to see which one people find the more attractive option. It's slightly out of scope for this topic, but it does show that there are some very fundamental philosophy-of-data differences between the languages. And that's an interesting thing, although it may not be what people want to hear. Do we have the sign-up link? That's an interesting question. I think it's the same sign-up link as today, so if you registered for today, you're already signed up. It should be the same Zoom link as well, if I'm not mistaken. Yeah. Okay. And we'll also be live streaming it, so tell your friends, anyone who wants to come; even if they don't have the link or haven't signed up, they can watch it live. Yeah, super. Thanks so much everybody, positive stuff. Please don't feel obligated to stay around if you've had enough; go ahead and go. But we will be here for at least a few minutes more, answering questions if you have further questions. Generally positive response. Thanks, everyone. Yeah, thanks guys. How did you both feel about this presentation? Do you feel like you understand the concepts better for having explained them out loud? No? You understand it less than you did before? I'm also still live streaming, I'm not sure. Yeah, it was good. I feel like I learned a lot. And then, you know, I did a bit of research within digital health as well, which was quite interesting. It's not a field I'm particularly familiar with, but yeah, it was really good. Yeah, super. Recording. So go ahead and stop the recording. I think we've finished the.