Yeah, we will do, Granny, yes. Right, okay, let's close this code to the cloud. Okay, so I'd just like to introduce everyone here to the code demo of our text mining health workshop. So yeah, this is the second part in that series. Just to let everyone know, this webinar is being live streamed at the minute, so if there's anyone watching in our YouTube chat, a big hello to them. And it's also being recorded as well, so you will be able to go back and rewatch this if you need to. My colleague Nadia is going to be facilitating in the chat, so if you've got any sort of technical issues, you can send a message. And yeah, on that note, we should probably do a little poll to see if everyone can hear us. I'm just going to launch that now. Nice one, looks like everyone can hear us. That's what we want to hear. And yeah, if you've got any specific questions about the content, there is a little Q&A button, so again, Nadia will be on that while I'm presenting. So I'm just going to go ahead and share my screen now and we can make a start. Okay, so for those of you that aren't on this at the minute, you can go to our GitHub repo link and you will see there's a little binder icon. If you go ahead and click on that, it'll take you to this notebook. If you then click on Processing, you should be able to see what I see, and then you can run the code at the same time as me. Nadia, if it's all right, could you just pop that repo in the chat? So yeah, I'm going to go ahead and get started because we have got quite a bit to go through. What we're going to be working with today is this foot and mouth data set. It's called Health and Social Consequences of the Foot and Mouth Disease Epidemic in North Cumbria, 2001 to 2003. This data collection includes 42 individual semi-structured interview transcripts, 40 semi-structured diaries, six focus group transcripts, and it's also got some audio transcripts as well. The topics covered in the interviews and the focus group discussions include perceptions of the foot and mouth disease crisis and its effects on life and livelihood in Cumbria. So yeah, it had a really big effect on economic, social and political life in rural Britain, so we can look at the human health and social consequences of the epidemic. If you want to know a little bit more, you can read the rest of this, but I'm not going to spend any more time on that just for now. So you can see here that the first step, which is important when you're doing any sort of programming, is to import your required modules. Now, the software that I'm using, this binder, means you basically don't have to install anything, because you're running it virtually. So if you did want to replicate the code in your own coding environment, all you'd have to do is uncomment these lines of code here and then you can install the modules that you need. I'm just going to go ahead and comment them again. So let's go through which packages we're using. We've got the os module, and this just provides functions for interacting with the underlying operating system, so if you want to change your working directory or locate files. For any text mining, you're going to need to read in files, right? So this is a really good library for it. We've got NLTK, which stands for natural language toolkit, and this has loads of great functions for NLP and text mining. We've got re, which stands for regular expressions, which we talked about on Wednesday.
We've got pandas, and what we've done is we've just imported it as pd, so anytime we use a pandas function, instead of typing out pandas dot, we can just type out pd dot. It's pretty useful shorthand, and you can do that with anything that you import. And we have this xlrd; we need it to read the old .xls files, because pandas isn't old school enough to do that on its own. And we have autocorrect, which provides functions for the spell check that we'll be doing. So the first thing we want to do is read in a CSV that we've created. Now, I do have another notebook available on this binder called read in data. The foot and mouth data is actually in rich text format, and it can be really fiddly and annoying to get it into the right CSV format. So if you are ever working with these rich text files, I'd really recommend you go to the read in data notebook and get used to how you actually put them into the correct format. What I did is I created a text.csv out of the files, and we're going to look at this in a bit more detail now. But like I said, if you do want to get to grips with rich text format, then go ahead and access that notebook in your own time. So yeah, we want to read in the CSV, and the way that we do it is with pandas. You can see here we've got this pd, which denotes that this is a function from pandas, and we create a variable df, which stands for data frame, to convert the CSV file into a data frame. So let's go ahead and run that. Oh, that's because I didn't run the cell above. So yeah, always make sure that you do that, because otherwise you can have issues. And we can use this function head, which lets us view the first five rows of the data frame. It's really good just to get a snapshot of how your data looks. You can also pass a different number, so if you want to see the first 20 rows you can put that in, but the default is just to print the first five rows to give you that little snapshot of what your data looks like. So we can see here that we've got our file name, and then we've got our text information that comes with it. So this is the name of the file and what is included in the file. You remember at the start that we have a bunch of diary files, we've got some focus group files and some interview files. So now we're going to go on to the preprocessing. Data preprocessing is a technique used to convert the raw data set into a clean data set. In other words, whenever data is collected from different sources, it's collected in a raw format, which isn't feasible for the analysis, right? This isn't very useful to us at the minute. So certain steps are followed and executed in order to get that data into a small, clean data set. And we can already notice some things that we might want to change here, right? We've got these column names, which aren't very intelligible. We've also got this first column here, which just numbers the files, but you can notice that we've already got a numbered index automatically with our data frame, so we're not going to need that first column. So let's go ahead and get rid of that.
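If you're following along in your own environment rather than on the binder, here's roughly what the set-up cells we've run so far might look like, before we go ahead and drop that column. This is a minimal sketch: the file name text.csv and the assumption that it sits in your working directory come from the demo, and everything else is standard pandas.

```python
# If you're in your own environment, run the pip installs first, e.g.
#   pip install pandas nltk autocorrect xlrd
import os              # interacting with the operating system (paths, files)
import re              # regular expressions
import nltk            # natural language toolkit
import pandas as pd    # pandas, shortened to pd
from autocorrect import Speller   # spell checking, used later on

# Read the prepared CSV into a data frame and peek at it
df = pd.read_csv("text.csv")   # assumes text.csv is in your working directory
print(df.head())               # head() defaults to five rows; head(20) would show twenty
```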
You see what we're doing here is assigning to our data frame variable again, and we're just dropping the original first column from it. I use this command drop, I access my columns, and I want to drop the unnamed one. And when you are accessing a column, it's important that you put it in these square brackets. So let's go ahead and do that, and let's check if that's worked. Nice. You can see we've got rid of that useless column. But we also want to give these columns some intelligible names, so let's assign our columns these new names. It just works in order: you put the name that you want for the first column and then the name you want for the second one. So let's call the first one file name, because it denotes the file name, and let's call the second one text. And then let's take a quick look at our dataset to see if that's worked. Great, it's making a bit more sense to us now. But if we go ahead and have a look at the first 50 rows, as I've said, we've got different files here. You can see we've got a lot of diary files to start with, and then you've got these that have FG, which are the focus group files, and then we've got the int files to denote that these are the interview files. So we might want to split our data frame up, because maybe we want to do different sorts of text mining on the different files, right? We might want to pull different sorts of information, and we just want to make things a bit easier for ourselves. So what we're going to do is split this data frame. We can see that rows 0 to 39 are our diary files, and then we've got the focus groups and interviews. So we're going to create a variable called diary files, and this will be a new data frame that will contain rows 0 to 39 of our original data frame. You can see we're using this pandas function loc, which is used to access rows or columns in a data frame. The number before the colon indicates the start position of the rows that we want to access, and the number after the colon indicates the end position. And there's just a little note here to say that if there's no number before the colon, that means you're telling it to access everything up to and including this end value. I will just change that comma to a colon. So let's go ahead and have a look, see if that's worked. So yeah, we've told it to access everything from the start up until and including that number, and we've got our diary files there. Let's go ahead and do the same for our rows containing the focus group and the interview files; I want to group these together. This is using the same method. Just to note, if there's no number after the colon, that means access everything after and including the start position. So I want to access everything from row 40 up until the end of the data frame. Let's go ahead and see if that's worked. Nice. You can see that we've got from row 40 up until the end of our original data frame, which was row 86. Another thing you'll notice here is that we can't actually see a lot of this information. So sometimes, if you want to see a little bit more, you can mess about with the maximum column width of your data frame, and you do this using the pandas function set option. You insert max colwidth and you can change this number here to determine how much you want to show. So what I'm going to do here is access, with iloc, my very first row. You might be wondering, if it's the first row, why is it zero? Well, in the computer science world, we start counting from zero. So remember that to access your first row, you're never going to be doing one; you want to do zero. And what I want to access is the text column.
And I want to print the first 200 characters from that. So let's go ahead and see what we get. So this is just the first 200 characters in my very first diary files row. You can see we've got these '\n' slashes, which denote a line break. So we've got information about the diarist, new line, date of birth, 1975, gender, et cetera, et cetera. So we can understand it, but we want it in a better format. And what you might want to do as well, particularly if it's important to your research: say you're building a data set on a particular disease, like long COVID, and you want to document the symptoms that people are experiencing. You might want to pull some socio-demographic variables from the text. So you might want the date of birth of your participants, it's probably going to be important that you get their gender, and it might be interesting also to get information about their occupation. And we can pull this information out here, because we can see we've got it in the first few lines: we've got the date of birth, we've got the gender, and we've also got the occupation. So how are we going to do this? We talked on Wednesday about using regex, these regular expressions, and we talked about using them with dictionaries to do that standard sort of find and replace. So yeah, it stands for regular expression, and it denotes a sequence of characters that specifies a search pattern in the text. I'm going to show you a slightly different way that you can find the information you want from a big bundle of text. And again, like I said on Wednesday, when you're first approaching these regular expressions, it's going to seem like an absolutely foreign language, but with enough practice and a strong grasp of the basics, you'll be able to incorporate this method into your text mining endeavours. I'm not going to go through everything here, but I've included some little resources that are really helpful. The first one just gives you the lowdown on regex and what some of the expressions mean, and then this other site, regex101, is really useful for testing your regex patterns. So we can basically search for these patterns in our text and then pull that information out. So yeah, the lowdown is that regex is a set of characters used to create patterns, and we can then use those patterns to search, find, replace or validate text. When you go to this website, you can see here I've got this regular expression: I want digits in the range of zero to nine, and I want four of them. What this site lets you do is test a pattern on a small bit of text, because normally you'd be scanning your whole corpus, and that's not always useful when you just want to check that a regex expression works. So if I type a bunch of things here and then I have, let's say, 1998, right, you can see that it's got a match here. So it tells you that your regex expression works. Whereas if I do 838, it doesn't match. But if I do another year, say 2002, it does match. And it also gives you a little description of what these things do. So that's really useful if you want to test out and make your own patterns and see if they're working. It's definitely helped me out of a few tough spots. So let's see how this works. The first thing I want to do is extract information about the date of birth, in particular the year that the participants were born in, because you can see that in each file we've got it at the start. So that's what I want to pull out.
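Here's a rough sketch of the tidying steps described so far, if you want to follow along in your own environment. The column name "Unnamed: 0" is what pandas usually calls a stray index column when it reads a CSV; check your own df.head() output for the exact name.

```python
# Drop the redundant numbering column (the data frame already has an index)
df = df.drop(columns=["Unnamed: 0"])

# Give the remaining columns intelligible names, in order
df.columns = ["file name", "text"]

# Split the frame: rows 0-39 are diaries, row 40 onwards are focus groups/interviews.
# With .loc the slice is inclusive of the end label.
diary_files = df.loc[:39]
group_int_files = df.loc[40:]

# Widen the displayed column width and peek at the start of the first diary
pd.set_option("display.max_colwidth", 400)
print(diary_files.iloc[0]["text"][:200])   # first 200 characters of row 0's text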
First thing I'm going to do is go ahead and create a new column in my diary files data frame called DOB. Then I want to access the information in my text column and extract from it. So we use this method str.extract in combination with a regular expression. Now, this little r here marks the pattern as a raw string, which is how we write regular expressions in Python. The brackets denote the capturing group, so the information that you want to capture; this is the pattern that you're looking for. I'm looking for digits, so \d stands for a digit, and I want four of them, because we're looking for a year of birth. And this \b here just denotes a word boundary, and that makes sure I only match the first occurrence of four consecutive digits. Because if I have a bunch of years in my text column, I don't want to end up with a lot of them; I just want this year of birth. So that makes sure we only select the first one. So if we go ahead and run this, we can then look and see if it's worked. You can see I've changed the maximum column width again, just so we don't see too much. And great, it's worked. You can see that I've got the date of birth created as a new column in my data frame. The next thing you might want to pull out is gender. We're going to do the same thing here: we've created a new column, we're going to access our text column for each row, and then we use this str.extract function in combination with a regular expression. Now, you can see here that the gender is preceded by a colon (I keep wanting to say comma, but it's a colon, I should know that), and then this \s here just denotes white space. So our gender marker, M or F, is preceded by a colon and then a white space. I don't put my brackets around all of that, because I don't want to capture that unnecessary information. What I want to capture is this M or F, and the square brackets here just denote that I want either M or F. So look for that, then go ahead and capture it and put it in each row. Let's go ahead and see if this works. Don't worry, by the way, about these messages; they look like errors, but they're not, it's just what happens when you do stuff like this. So let's view the first five rows. And you can see that I've got the gender markers here. So great, I'm building up some really useful information here about the socio-demographic makeup of my participants. The next thing I want to do is the same thing again, but I want to extract which group they are in, so which occupation they have. If you go to the data folder, you'll see there's information about what each group is, so which occupation it corresponds to, and we'll look at that in a minute. You can see here, much like the gender marker, each group number is preceded by a colon and then white space, and it's a number ranging from one to six. So that's what we want to capture, right? We just have our brackets around this bit, because we don't want anything to do with the colon. Let's go ahead and run this, then let's have a look at our first five rows. Nice, we can see that it's worked. So we've got group six, and then the fifth row is group five. Now, what you could do, if you find that that information on its own is a bit useless to you, is say: I want to see at a glance what each of these groups stands for.
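Before mapping the groups to labels, here is roughly what those three extraction lines might look like. The patterns below are reconstructions from the description rather than the exact ones in the notebook; in particular, anchoring the occupation pattern on the word "Group" is my assumption, to avoid picking up the first digit of the year of birth.

```python
# Year of birth: the first run of four digits ending at a word boundary
diary_files["DOB"] = diary_files["text"].str.extract(r"(\d{4})\b", expand=False)

# Gender: an M or F preceded by a colon and whitespace
diary_files["gender"] = diary_files["text"].str.extract(r":\s([MF])", expand=False)

# Occupation group: a digit 1-6 after the (assumed) "Group" heading
diary_files["occupation"] = diary_files["text"].str.extract(r"Group:\s([1-6])", expand=False)

# Assigning into a .loc slice is what triggers those pandas warnings mentioned above
print(diary_files.head())
```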
You can create a dictionary like I've done here. I just looked at the information in the data folder; there's a file that tells you what each group corresponds to, and I've put it here. So I've created a dictionary where the group numbers are my keys and the occupations are my values. Then what I do is reassign everything in the occupation column, so I'm overwriting the values that are in there, and I want to map these values. So every time it encounters any of these groups in the occupation column, it's going to map those values, and we should see them change. Nice, we can see that it's mapped them correctly. And if we have a look at, let's see, the first 20 rows: nice, you can see we've got some pretty important information here, and we're building up a pretty good data frame. So now that we've done that, let's get on to some processing. This includes the following steps that we mentioned on Wednesday. We've got tokenization, which is splitting raw data into various kinds of short things that can be statistically analyzed. We've got standardizing, which includes converting case, correcting spelling, and finding and replacing words and abbreviations with regex expressions. We've got removing irrelevancies, which covers anything from punctuation to stop words like "the" or "to". We've got some consolidation, so stemming or lemmatization, which strips words back to their root. And we've got some basic NLP, which is all that stuff we talked about, like tagging, named entity recognition and chunking. And just a note here that, in practice, most of your text mining work will require that your text corpus undergoes multiple steps, but, like we said on Wednesday, the exact steps and the order of them will depend on the desired analysis. You can read the rest of this; I'm not going to read it all out, because we want to do some coding rather than read everything here. So our first step: let's cut this text into tokens. We're going to go back to our main data frame now, the one from the beginning, before we split the files into diaries and interviews. We're just going to work with that big original data frame, and then at the end maybe we can join some things up. So the first thing I want to do is take this text and tokenize it, so for each row I want the tokens. First thing I do, again, is create a new column, and I want it to be called tokenized words. And now you're going to see a lot of this: I use this apply and lambda combination. What apply does is apply a function along an axis of the data frame; in this case, I'm going to use axis one. And lambda might be a bit of a weird function to get your head around, it definitely was for me at first, but it's basically an anonymous function. It doesn't have a name, but it can take any number of arguments. What lambda does here is ensure that the function word_tokenize, which we're accessing from the NLTK package, is applied to every row in the text column. So it does this for each row. You will start to get used to what it's doing, but definitely don't feel weird if you're confused about it at the minute, because I definitely was at the start.
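Roughly, the mapping and tokenizing cells might look like the sketch below. The occupation labels here are placeholders, since the real descriptions live in the data folder, and the nltk.download line is the fix we end up needing in a moment.

```python
from nltk import word_tokenize

nltk.download("punkt")   # the tokenizer models word_tokenize relies on

# Map the extracted group numbers onto occupation labels.
# These labels are placeholders; the real ones are listed in the data folder.
group_labels = {
    "1": "occupation for group 1",
    "2": "occupation for group 2",
    "3": "occupation for group 3",
    "4": "occupation for group 4",
    "5": "occupation for group 5",
    "6": "occupation for group 6",
}
diary_files["occupation"] = diary_files["occupation"].map(group_labels)

# Tokenize the raw text of the main data frame, one list of tokens per row
df["tokenized words"] = df.apply(lambda row: word_tokenize(row["text"]), axis=1)
print(df.iloc[0]["tokenized words"][:100])
```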
So let's go ahead and run this. Hmm, why is it saying that? Let's see if it's actually worked. KeyError: tokenized words. Okay, so it looks like it's not recognizing this word tokenize. I'm just going to make sure that I've done the right imports at the beginning. From NLTK import word_tokenize, that seems to be correct. Is it because it doesn't recognize the df? That should already be defined. Let's see what's going on. Is it because I've spelled it wrong? Nope, it's spelled okay. "Use the NLTK downloader to obtain the resource." Okay, maybe it wants me to download this. Sorry about this, this is just standard coding problems, but it was working yesterday, so I'm sure it will work. Let's see if it's worked now. Yeah, so that's worked. It turns out I hadn't downloaded one of the requisite resources, and it will often tell you in the error message what it is you need to do, so that's what I've done here. So I've tokenized the terms in each row of the text column and put them in a new column called tokenized words. Let's have a closer look at the results, and we do this by focusing on one row of our text column. I'm going to focus on row one, and I want to print the first 100 items. You can see it prints out a list of tokenized words. So great, we can see that it's worked. We can also see some things that we might want to change: we've got colons here, we've got punctuation, we've got apostrophes. So yeah, we can see that the output is a list of strings, and we know it's a list because it starts and ends with square brackets, and we know that the things in the list are strings because they're surrounded by single quotes. And we can see some other interesting, weird things. So "horses... whatever" all counts as one token, despite having full stops in it. Like I said, we've got a few things that we might want to change. Let's move on to standardizing. This is a really important step if you want to look at the frequency of certain terms, because in this case we don't want terms such as horses with a capital H to count as a different word to horses with a lowercase h. So what we can do is convert everything to lowercase with a built-in Python command, and then we can use the same combination of apply and lambda functions to create a new column of lowercase tokenized words. So again, we create a new column with the name lowercase tokens, and then I'm doing this apply and lambda combination again. We access the row we want; in this case, x is each row in our tokenized words column. So we're going into the list in each individual row, and I put t here to denote token. We want to lowercase each token in x, with x being each row, and we use this for here. You'll see this quite a lot in Python in for loops; it just iterates over each token in the list. Again, don't worry if it's a bit foreign to you at the minute. So let's see if it's worked, and we'll look at the first 10 elements in row one. Nice, we can see that information, date, and gender are all now lowercase. I will say, when you are using this lambda and apply, what I always find difficult is remembering stuff like putting these square brackets here to say that, okay, in each row of my tokenized words column, the tokens are in a list.
So if I didn't have those square brackets there, it would throw an error message, because it couldn't access the elements it needs in the list. Stuff like that is really annoying to get used to, but it all just takes a bit of practice, really. The next thing we want to do is another bit of standardizing: let's make sure that we have no spelling errors. There are several really good spell checking packages written for Python, but they aren't automatically installed and ready to import in the same way that the os package was. So what we need to do, if you're working in your own coding environment, like I said at the beginning, is install the packages first and then import the functions, and we do that through the installer called pip. As I said at the beginning, if I go all the way back up here, if you uncomment this and run it, pip will install it. You don't need to do that at the minute; you can just run this, because you're using the virtual environment here. Okay. So first, we want to access the Speller function from the package autocorrect. You can see here we're accessing the Speller function and setting the language to English, and I'm going to assign this function to a variable called spell. Creating that one-word command saves a lot of time, and it's pretty important if you're working on text mining every day for weeks on end to look out for good ways to save time. So let's run this. But as I've said here, speaking of time, spell checking each token contained within each row actually takes about 12 minutes. So for the sake of time, I'm not going to perform the spell check on every single row; I'm just going to do it on the first row of the lowercase tokens column. What I have done is included the code here, just commented out, so that you can run it in your own time if you want to. If you uncomment this, you can then run it, but it does take quite a long time, and we can't really spare 12 minutes. So, to do this on the first 50 elements of the very first row, I'm using something called a list comprehension. Again, that's another topic that took me a while to get my head around, but there are loads of really good resources on it, and if anyone is interested, I can post a bunch of links at the end. You see what I'm doing is using this spell, the variable I created up here, and it takes an argument x. The x we're accessing is the terms in the lowercase tokens column, the very first row, just the first 50 of them. So that's what I want to do: spell check each of those terms, just the first 50. So let's see what this returns. You don't actually notice that much of a difference, but I've gone through this file before, and if I compare it to the very first row of the lowercase tokens, the first 50 elements in that, what I notice is that it's changed "defra" to "der", because it's not going to recognize that defra is an organization. So we probably want to change that back. To change that, we could do it for each row of a spell check column. But remember, we're not creating that spell check column here. If you do want to change it in each row, and you have taken the 12 minutes out of your day to run that function, I have created another function here, which you can apply to the spell check column.
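Roughly, the lowercasing and spell-check cells might look like the sketch below. The Speller usage is the standard autocorrect API; the simple "der" to "defra" substitution is the quick version discussed here, and, as comes up in the Q&A later, it will also hit "der" inside longer words like "gender".

```python
import re
from autocorrect import Speller

# Lowercase every token in every row (the built-in lower() command)
df["lowercase tokens"] = df.apply(
    lambda row: [t.lower() for t in row["tokenized words"]], axis=1
)

# The one-word spell-check command
spell = Speller(lang="en")

# Spell check just the first 50 tokens of row 0; the whole corpus takes ~12 minutes
l = [spell(x) for x in df.iloc[0]["lowercase tokens"][:50]]

# The spell checker mangles "defra" into "der", so substitute it back.
# (Caveat: this also replaces "der" inside longer words; see the Q&A discussion.)
l = [re.sub("der", "defra", term) for term in l]

# Commented out: the full spell check over every row, if you have the time
# df["spell check"] = df.apply(
#     lambda row: [spell(t) for t in row["lowercase tokens"]], axis=1
# )
```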
If you just uncomment that cell at the bottom, you can make sure that it changes "der" to "defra" in each row. But for now, we're just going to correct it in our spell checked list. So what I did is take that list comprehension I used before and assign it to a variable called l, and I'm going to use another list comprehension with something called re.sub. Remember, re is our regular expression package; sub stands for substitute, and it returns a string with replaced values. So we want to replace "der" with "defra", and we want to do this for each term in l, the list we made with the list comprehension up here. So let's see if it works. And what we should notice is, yep, "der" has been changed to "defra". So great. And like I said, if we wanted to make sure we corrected this in each row of the spell check column, if we created it, we could write a function to do that. Whenever you create a function, you use def, which stands for define, and you can give it a parameter; in this case, I've just called my parameter x, and this is the thing we supply an argument to when we call the function. What we do then is have an if statement, which means: if the word I'm checking is equal to "der", return this here, so substitute it for "defra". Otherwise, if it's not equal to that, just return it as is. So that's what that's doing. Of course, like I said, I'm not going to run it, because we haven't created that column. And you might have a bunch of acronyms and such that you need to handle when spell checking. In this data set, when I went through it and had a little look around, I found the following important abbreviations: FMD for foot and mouth disease, FMC for foot and mouth crisis, and TB for tuberculosis. So what I do is, if I create that spell check column, I can use these if statements to check whether it has corrected them, because in some cases the spell checker won't change them, it will just leave them as they are. What I found is that it only changed FMD and defra. So if you wanted to correct multiple spell checked words in that spell check column, you could use this function here. But I'm not going to linger on this too much, just because I'm conscious that I only really have about 10 minutes left if I want to leave time for the Q&A. If you do have questions about this, or you just don't understand these functions or what I've just said, you can send me an email or ask a question at the end and I can go through it. But let's look at removing irrelevancies. So, yeah, punctuation isn't always useful for understanding the text, especially if you look at words as tokens, because lots of the punctuation ends up being tokenized on its own. We could use regex to replace all the punctuation with nothing, but just for variety's sake, I'm going to demonstrate another way here. What I've done is first define a variable with all the punctuation that I want to remove; you can see here I've got all of the following. But I am keeping some things. You can ask yourself, do we really want to remove hyphens, for things like '96 or words like lactose-free? And for full stops in things like U.K., you might not want to remove them, although in this case we are going to remove all the full stops. At the same time, we're not going to remove apostrophes, because these do actually contribute meaning to words, so I'm going to keep them. So let's print out this variable that we created.
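A rough sketch of that punctuation cell and the removal function described next is below. The exact punctuation string in the notebook isn't visible here, so the one below is an assumption; it keeps apostrophes and hyphens, as in the demo.

```python
# The punctuation we want to strip: everything except apostrophes and hyphens.
# This particular string is an assumption; check the notebook for the exact set.
punctuation = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~'
print(punctuation)

def remove_punctuation(from_text):
    # maketrans builds a table mapping each punctuation character to None,
    # and translate() applies that table to every token in the row
    table = str.maketrans("", "", punctuation)
    return [token.translate(table) for token in from_text]

# Apply the function to every row of the lowercase tokens column
df["no punctuation"] = [remove_punctuation(row) for row in df["lowercase tokens"]]
```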
So this is all the punctuation that I want to get rid of. What I've done here to remove it, you can see because you've got this keyword def, is create a function with a parameter I've called from_text. What this function does is iterate over the strings in a row, so it goes through each one. The maketrans function here creates a table that maps the punctuation marks to None, and then we'll print the table to check that it's worked. So I'm creating a new column, no punctuation, and I'm using this list comprehension here, where I'm calling my function. It takes an argument, which fills that from_text parameter, for each i in my data frame's lowercase tokens column. So it's going to go over all of those tokens and remove the punctuation from them. It's not really important that you understand all of this for now, just the basics of it. My advice, if you don't get a function, is to Google the individual functions and see if you can make sense of them, and test them out on a really small bit of code. It'll help you understand things better, because it's going to be hard to go through everything in this demonstration. So let's see if that's worked. I'm accessing row one, and I want to check the first 50 elements in my no punctuation column. So let's see. We should see that it's removed the punctuation, and it has: we can't see any colons, we can see that we kept our apostrophes, and we said we wanted to keep hyphens as well. But what you will notice is that we have these empty strings; you can see there are a few of them in here. And that's what it means when we said we'd map the punctuation marks to None. So we can look at dealing with these now. Since these empty strings are recognized by Python as empty values, Python can find them and then filter them out. So what we do is create a new column with a new name, no punctuation no space, which gets rid of these empty strings, and we use a list comprehension again here to filter out each empty string and remove it from each row. We do this for the list that we encounter in each row of the no punctuation column. Let's go ahead and run that, access row one and look at the first few elements. Nice. You can see we've got no empty strings here, which means it's worked, and we've still got the punctuation that we want to keep: our apostrophes and our hyphens, because we want them for the meaning they contribute. Now we can look at removing stop words. You'll remember stop words are typically conjunctions, like "and" or "or", and prepositions like "to" or "around". They're really common in all languages, and they tend to occur in about the same ratio in all kinds of writing, regardless of who's doing the writing or what it's about. So yeah, they do contribute meaning, because obviously, if you say "freeze or I'll shoot" versus "freeze and I'll shoot", those are two very different sentences. But for many text mining analyses, these words don't have a whole lot of meaning in and of themselves, so we're going to remove them. We start by downloading the basic stop words list built into NLTK, and we store the English language ones in a list called stop words. So let's download that. Sometimes it will just return True, which says you've already downloaded it. Then, from the NLTK corpus, I want to import stop words.
So I've created a stop words variable that contains all of the English stop words in a list, and you can see I've just printed it out. I've used this function sorted, which just makes sure that it prints the list in alphabetical order. So you can see these are all the stop words that we want to remove. But it does contain quite a few words that would change the meaning of a sentence, so I'm going to customize the stop words list to make it a bit individual to my research, so it doesn't include these important words. I want to keep words like no, couldn't, didn't, doesn't, et cetera. So I take the original stop words list, and if these words are in it, I leave them out of my list, so they won't be removed from the text. That's what this means here. So let's see. Yeah, I've now got 159 stop words. So let's remove these stop words by creating another column called no stop words. What we do is iterate over the no punctuation no space column, look at the words one by one, and append them to our no stop words column if and only if they don't match any of the items in the stop words list, because we obviously don't want to include those. So let's go ahead and see if it worked, and to see if it has, we'll compare it to the no punctuation no space column, row one. It's hard to spot them, but I only spotted a few in this; the only ones that we got rid of were "was" and "the". Now let's move on to doing a bit of consolidation. Remember I said that includes stemming and/or lemmatization, stripping words back to their root. Again, you'll be getting familiar now with how things are done: we import a specific tool from the natural language toolkit and we apply it to a pre-existing column. So we're importing this Porter stemmer and assigning it to a shorter word, so it's quicker for us to access. This is our stemming function, and we've saved it in a new variable called porter. Then we create our new column, which we'll call stemmed, access the no stop words column and use this combination of apply and lambda, applying it to each word. So y is our word and it's in our row list, which is denoted by x. And then we can see if it works. Yeah, you can see here, this means the cell is executing, but because it's got so much to do, stemming each word, it does take a little minute to run. So yeah, you can see some words have changed: information has changed to inform, occupation has had the end of the word chopped off, and it's done the same with geographic. So you can see that it's changed those. The next thing we could do is move on to lemmatization, but you'll remember on Wednesday I said that if you want lemmatization to work effectively, you need to know the context of a word, and the way we do that is with part of speech tagging. So we're going to do some part of speech tagging before we actually lemmatize. And you can see that we've got this new section, basic NLP. If you want to see where we're up to right now, what our data frame is looking like, we can just do df.head.
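Before looking at that df.head output, here is roughly what the stop word and stemming cells might look like. The list of negations to keep is illustrative, and the sketch assumes the empty strings have already been filtered into the no punctuation no space column.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")

# NLTK's English stop words, minus a few negations we want to keep in the text
# (this keep list is illustrative; tailor it to your own research)
keep = {"no", "not", "couldn't", "didn't", "doesn't"}
stop_words = [w for w in stopwords.words("english") if w not in keep]

# Drop stop words from each row's token list
df["no stop words"] = df.apply(
    lambda row: [t for t in row["no punctuation no space"] if t not in stop_words],
    axis=1,
)

# Strip the remaining words back to their roots with the Porter stemmer
porter = PorterStemmer()
df["stemmed"] = df.apply(
    lambda row: [porter.stem(t) for t in row["no stop words"]], axis=1
)
```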
Looking at that, you can see we've done a whole bunch of things and saved them in their own columns, because at the end, before you get to your extraction phase, it might be that you don't yet know which of these you actually want to use in your pipeline. That's why I'm doing it this way, just to show you all of the different steps. So you can see we've done a lot already, and we've got a new column here, stemmed, which we just created. Now what we want to do is create a part of speech tagging column. So again, we create a new column called pos tag, we access the no punctuation no space column, and we apply the pos_tag function to each row in the column. And pos_tag works on a list of strings, so you might be wondering why I didn't put the square brackets here; it's because we've already got a list in each row, and pos_tag works on that whole list, so we don't need them. Again, these are the little annoying things about lambda and apply that can get a bit confusing, but you've just got to have a look at your data, have a look at the functions, and see what the documentation says. So let's do that and see if it's worked. Have I done that thing again where I've not downloaded something? Okay, so you can see it's telling me that I need to download that averaged perceptron tagger. If you get the same error as me, just copy and paste this, put it at the top of the cell and run the cell again. You can see it's executing now, it's doing all the downloading, all that good stuff, and then it should print out the first 50 elements in row one. Again, like I said, it takes a little minute to run. The reason it takes so long is because it's having to do some pretty computationally intensive stuff, right? It's got to figure out the context of each word in the row that I'm giving it. So we can maybe just scroll down a little bit. I'm not going to be able to get through all of this anyway, so if we don't get up to the extraction stage, I'll just quickly whiz through the stuff that we actually do. We do have another repo, from a text mining webinar that we've run before, and that has an extraction notebook with a lot of really useful functions that you could then apply to this dataset if you wanted to, but obviously we've not got time to go through that today. But let's just have a look here. So you can see it's done that, and we've got all of these POS tags. What I would do, if I had more time and wasn't keeping a bit of time back for the Q&A, is lemmatize next. So I can quickly just run through this: I'm importing the WordNet lemmatizer and WordNet. I probably won't have time to go through this function; I create this function here, but I'm not going to walk through it, because it would just be too overwhelming and I wouldn't be able to explain it properly. But yeah, you can run all of these in your own time. And all I do at the end is... oh yeah, it has this annoying error again, let's just download wordnet. Sorry about this; I did check that everything worked yesterday and it was all fine, but sometimes you just forget the odd download. And this should then work. Again, it's taking a little bit of time to run. All I do at the end is join my data frames together.
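For anyone who wants to run the tagging and lemmatizing in their own time, a sketch is below. The tagging line follows what was just described; the get_wordnet_pos helper is a common pattern for mapping Penn Treebank tags onto the classes the WordNet lemmatizer expects, and is not necessarily the exact function in the notebook.

```python
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

# Tag each row's token list with parts of speech (pos_tag takes a list of strings)
df["pos tag"] = df["no punctuation no space"].apply(pos_tag)

# Map Treebank tags onto WordNet classes -- an assumed helper, not the notebook's own
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    elif treebank_tag.startswith("V"):
        return wordnet.VERB
    elif treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN   # default to noun

lemmatizer = WordNetLemmatizer()
df["lemmatized"] = df["pos tag"].apply(
    lambda tagged: [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
)
```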
That join gives us the socio-demographic information alongside the information we've pulled out from the text. You can go ahead and run that in your own time; I'll leave a little bit of time for the Q&A. But if you have any questions about it, or you've run it in your own time and you're a bit confused, please just get in touch with me. Okay, so I'm going to stop sharing my screen, pop my video back on, and see what everyone's saying. Thanks for that, Louise. That was so good, very, very detailed. There were definitely a few questions, some of them just about the Python itself. I think the earlier question was what icon you press to view all of this, but I suggested just launching the binder from GitHub. Yeah, if there's anything else to add, please do. I think it was just about how you run the code, was it? Yeah, I can go through that if you think it's useful. Yeah, probably; maybe I should have demonstrated that at the start, so I can do that now. Sorry about that. I'm going to show you quickly. So you should be able to see my GitHub page here. Can you see it? Yeah. All right. So you go on to text mining health, and then you launch binder. It just saves you from creating your own coding environment and having to download stuff, because you might not know how to do that and it can be a bit fiddly. So yeah, it should load in a second. And then you can see you just go to Python code, and we were using the processing notebook. And then what you can do is, if you go to Kernel, Restart and Clear Output, it means you can start from the beginning. I'm sorry, I probably really should have said this at the beginning. And then you can just go through using this button here, or whatever keyboard shortcut, to run the cells. I'll go ahead and stop there. Actually, I'll also say there is that read in data notebook as well, so if you want to learn how to handle the multiple rich text format files, you can go ahead and use that one, because that's probably one of the hardest parts of text mining. That's it. I think the only other question that hasn't been answered yet is, I think this was from Ian. He says that it has replaced the "der" anywhere with "defra", so there was an issue with "gender", because it includes "der". Yeah, I did have this problem actually. I know we had this problem when we worked with Lila on the code, and I didn't actually consider it when I was creating this demonstration, but there was a way that we solved it. So leave it with me, Ian, and I'm going to have a look at the code and see how we get around that, because there is a way. It must be searching and capturing every time it encounters "der" and then switching it, but obviously we don't want that. Surely you might be able to apply like an if loop or a for loop, just... Yeah, I think that's probably how we did it. So yeah, you need a way of only correcting "der" when it is on its own. Yeah. I think the way we'd probably do that, Ian, is through an if statement. We worked on this project with an intern, and this is an issue that she came across, and we did manage to solve it, but the demonstration didn't consider it. So I'll get back to you on that one, because that was a really good point, you know. Yeah, I don't think I'd honestly noticed that before it was pointed out.
But I suppose we can edit the repo and then push the changes, so you can have that bit of code as well on how to do that. Yeah. Yeah, no worries, Ian. We'll get that sorted. Anyone else, any sort of questions? We've got a few minutes. And I'm also going to leave, if it's alright with Nadia, maybe a little five minute break; I need to get a drink. Yeah, yeah, no doubt. Let's see if there's anything else I can add to. So, the issues with the Python: you just need to add in the extra download lines. Yeah. I'm not sure why it was throwing that issue, because I thought I'd downloaded everything correctly. I was running these functions fine the other day, but it's just one of those things; when you go live, you encounter errors that you wouldn't any other day. But yeah, just those download lines. When you actually run it, it will tell you what you need to download. It says the resource is not found, and then it gives you the nltk.download line for whatever package it is. You just copy that, whack it in at the top of the cell, and run it again, and you'll see that it works it all out. And again, I'll fix those changes and push them so that it updates that GitHub binder, and then you won't have to fix it manually; you can just run the cell and it will work. I'll watch this back to correct it; it's probably best to do that myself. Yeah, fair enough. Do you know what, it's good, that's a good little practice for yourself, encountering an error and seeing what you can do about it, because usually the error does give you something to work with. Sometimes it'll be really annoying and it'll be something vague like ValueError, and you'll have to paste it into Google and see what comes up. But yeah, it's pretty good to just try and tackle it. So, does white space before a token count, for changing "der" to "defra"? Would a white space before "der" or "defra" count? No, white space before a token... any white space it comes across when it tokenizes a word, it gets rid of that white space, if I'm right. And the only time we had to remove so-called white space was for those empty strings that we had: when we removed our punctuation, we ended up with a bunch of empty strings. But when we tokenize our words, we don't have to worry about that white space. So no, the white space before "der" or "defra" isn't really an issue in that sense. I hope that makes sense. Does that make sense to you? Because yeah, when you tokenize, any of those line breaks that I mentioned, you know, where it has the '\n', it gets rid of them. So all right, nice one. So yeah, Fiona says that makes sense, but you need a way of correcting the issue without having to guess what other words might be affected. Yeah, so that would probably be making sure you implement an if statement. Because otherwise, especially when you're working with short abbreviations like "der", which you find at the end of words like "gender", that's going to be one that you know will affect lots of things. Whereas if you're correcting, say, MPH for miles per hour, maybe that's not as much of an issue, or maybe that's not the best example. But if you've got an abbreviation which you're not going to find in the middle of a word, like Z, X, Y or something, right, you might not have to implement an if statement.
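A minimal sketch of the kind of exact-match correction being discussed is below. It's illustrative only, not the code from the repo that gets mentioned, and it assumes you have run the full spell check so that a "spell check" column of token lists exists.

```python
# Only swap "der" for "defra" when it is a whole token on its own,
# so words like "gender" are left untouched.
def correct_der(token):
    if token == "der":
        return "defra"
    return token

# Assumes the "spell check" column was created by the commented-out full spell check
df["spell check"] = df["spell check"].apply(lambda row: [correct_der(t) for t in row])
```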
But yeah, the way of correcting the issue is to implement an if statement like that, if you think it's necessary. A lot of this, as well, you'll just be able to tell by having a glance at the data afterwards, and it is a lot of going back and forth and exploring things again: oh, has that messed things up? Okay, I probably need an if statement here. What would the statement look like? I can actually have a look now and see if I can find the example that I did, so just give me a second. It's hard to think off the top of your head, but it would be something like: for x in, and then you'd access your column, and if x equals equals... I'll write it in the chat, it will be something like this. If x in, let's say we had a data frame and the column is text, if x in df, blah, blah, blah, equals equals... I'm just writing this out now. So that would be your first line. Sorry, one sec. That would be your first line, and then in Python you have to remember indentation, so you'd want to indent the second line. So if it equals "der", then you would give it some command, and we essentially want that command to change it only when it exactly matches "der". But I'm going to have to find that command from the GitHub repo, so I don't just give you an absolutely horrendous line of code. I'll see what we find, because we did have this issue and we managed to fix it. Ian's added: it's the command which isolates "der" on its own as different to "der" as part of a longer word. That's all. Okay, I think I've found something along the lines of it. I mean, Ian has just said that he thinks it's a good example of how using text is messy and correcting one thing can cause other issues. That is very true. And I feel like you're going to see this issue when we try to run this text mining in R, because it also gives a different output to what we receive in Python. But we'll discuss the benefits and the pros and cons of which software to use at the end. But yeah, if you find that code, Louise, maybe you could just email Ian. I was going to say, I don't want to really cut into your time, so I'm probably not going to find it now, but I can have a look while you present yours. But yeah, I don't want to cut into your time too much. What are you thinking, do you still want the five minute break or do you want to just head on? Yeah, we'll take a couple of minutes, just let people stretch their legs, and I'll be back maybe just before 10 past. Yeah, that sounds good. I'm going to just mute and turn my video off while I get a drink. Okay, so it's just reached 10 past, so we'll get started with the second half of this code demonstration, where we'll be working in R. Before I go in, Louise, can you just give me a thumbs up if you can see my screen? Everything's clear? Yeah, you're all good. Thank you. Yeah, so we'll be working in RStudio, specifically on the file called foot underscore mouth dot Rmd, if it chooses to load. There we go. Maybe just some information: you can clone this R Markdown onto your own computer if you'd like, and I'll quickly show you how to do that. If you head over to the repo, which looks like this, you click the green button and copy this HTTPS link, you head back over to your RStudio, you go to File, New Project, I'll just leave that for now, and this will open up a pop-up menu where you can then paste this URL link in, if it decides to load.
You then click version control, because Git is a type of version control: you're going to clone a project from a Git repo. You can simply paste that link in, save this anywhere on your computer, give it a name, and typically open it in a new session, and then you can create the project, and this will clone everything that you see on the GitHub repo onto your own computer. But I'll cancel that for now, because obviously I've already got the data. So yeah, we're going to get started now. The first thing to do is obviously to install and load the necessary packages. There are quite a few that we use in this code demo. There are numerous packages in R that allow for text mining and analysis, and I think the packages tidytext and stringr here are probably the most effective when it comes to data manipulation and applying basic text mining analysis. So my aim is to introduce you to a few of the most useful packages, or at least the ones I find most useful, and explore some of the functions within them for text mining and text analysis. So go ahead and install these; this might take a couple of seconds, and then you want to load in your packages. I'm going to skip the install, because obviously I've already installed these, and once a package is installed in R, you don't need to install it again. So I'll go ahead and just load all the necessary libraries, I mean packages, using the library function. The next step is to load in the dataset. As Louise mentioned, we're going to be using that text.csv file, which can be found in the code folder. Just to let you know, to run a chunk in R, you can press this little green arrow that faces to the right, and that runs the current chunk. You can also run code line by line by pressing Command and Enter on a Mac, or Control and Enter on Windows. So that's just run that first line of code. Just to let you know, I've used that read underscore csv function to read in the dataset, which is from the readr package, which is up here. And then I'm going to be using the head function to look at the first few rows of the dataset, just to see what we're dealing with and how this looks in R. So as you can see, we have 87 observations and three variables, which have been stored in our environment. These columns are labeled one, zero, and two. And here is that long variable that includes all the diary files and the interview files from our dataset. So the first thing to do is to basically clean your dataset and run some manipulation, so that you'll be able to run further natural language processing on a clean dataset. Here we're simply using the assignment operator to call on the column number and replace these with a desired name. So in my instance, I'm calling on column one and naming this number, then calling on column two and naming this file name, and then column three, naming this everything else. If we run that head function again, you will now see that those columns, whatever they were called, one, zero and two, have now changed to number, file name and everything else. If you had more than three columns, I might suggest using a different function, such as the rename function from the dplyr package, however you pronounce that package, just because this obviously isn't the most convenient if you had 20 different columns with inconsistent names. But for now, this works. Okay. So yeah, the next step, what we're going to do, is basically split our data frame into two separate data frames.
And this is because we know that the data contains both information about the diary entries and information about the interviews, and we want to split these up. We can do this by again using an assignment operator and calling on a new data frame called diary file. We are then calling on rows one to 39 and extracting these, and as you can see in our environment, we have a new diary file with those 39 observations from the diary entries. I'm then going to do the same thing for the interviews, and I've called this group int files, standing for group interview files, used the assignment operator and called on rows 40 to 87. If you run this, you'll then see a new data frame has been added to your global environment. Now we can move on to some of the more complicated pre-processing. The first thing we're going to do is extract the dates that we saw in our variable everything else. To extract information in R, you can use the str_extract function from the stringr package. str_extract is used to extract matching patterns from a string. You need to supply an input vector, which would be our variable everything else, and you also need to supply a pattern, which is that regex pattern that was mentioned earlier. So I'm first going to extract the date column from just the diary files, and then we're going to replicate this on the interview files. Just for a bit of information, a regular expression is simply a pattern that describes a set of strings. And in R, there are two types of regular expressions. You have an extended regular expression, which is typically the default, and then you have a Perl-like regular expression, which is a little bit beyond the scope of this talk; a regular expression in Perl is basically just a special text string for describing a search pattern within a given text. In our instance, we're going to be using the default. So regular expressions are constructed by using various operators to combine smaller expressions. I guess the fundamental building blocks are the regular expressions that match a single character: most characters, including all letters and digits, are regular expressions that match themselves. Then there are the metacharacters with special meaning, which could be brackets, curly brackets, dollar sign, plus sign, question mark; these are all used in regular expressions. In this instance, I have a rather long regular expression which details how to extract all the different date formats, because we know that in our file we have date formats written as full words, date formats written as abbreviations, and date formats written in numbers. So this regular expression aims to extract all the different formats of a date and pull them into a new variable called date. So let's go ahead and run this. We can have a quick look at it again by using that head function. We scroll along one more time, and now you can see that we've got a column that includes our dates. So let's go ahead and do this with our interview files as well. Again, we're doing the same thing: calling on our dataset called group int file, assigning a new variable called date, using the str_extract function, calling on our input variable from our dataset, which is everything else, and using that really long regular expression to then extract the dates. And again, we're going to just view this file as we go. Scroll along.
As you can see, we now have dates filled in here too. The reason the two sets look different is that the two groups of files used different date formats. The next step is to join these files back together, because now that the dates have been extracted we can rejoin them, and you can use the rbind function to do this, which is here. So what I'm doing is creating a new dataset called new_footmouth, using the rbind function, which comes from base R, and passing in the two data frames we've just manipulated. And now, as you can see, we have our 87 observations back in one data frame, with the new date column added to the dataset. Just a quick note about regex in R: when working with character strings in R we tend to use double backslashes, which is different from Python, where a single backslash was enough. This is where the complications around reproducibility and sharing code come in, because regex expressions differ across the two languages, which can obviously be confusing. Moving on, we now want to extract a gender column. We know gender is written in the text, but we want to pull it out into a new variable. So I'm calling on our new dataset, new_footmouth, assigning a new variable called gender, again using the str_extract function with the everything else column as the input and the pattern M-M or M-F. If we run this, view the dataset and scroll along, you'll see a new gender column with a value assigned for each RTF file. Lastly, we want to extract the occupations, and we can do this in a slightly different way, because I'd like to introduce another little function that's useful for extraction. We know str_extract works, but the sub function in R is used to replace a string in a vector or data frame with a specified string. When you're dealing with large datasets, it's impossible to look at each line to find and replace the target word or string, but sub can do the replacement for you. So as you can see, we're creating a new variable called occupation, and here's that sub function doing the replacement. I've also wrapped it in as.integer, which converts the categorical values into integer codes, so that we end up with groups one to six, I think it is, in that variable. Let's run this and quickly view the dataset; scroll along, and now we have an occupation column with a group assigned to each file. That covers the really basic pre-processing steps, but it leaves you with a cleaner dataset, or as the R world calls it, a tidy dataset, which comes in really useful when applying some of the functions in the NLP packages.
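A rough sketch of those three steps, again assuming the object names used above. The gender codes and the occupation pattern are guesses at what the transcripts contain, purely for illustration; check what the labels actually look like in your copy of the data.

```r
# Rejoin the two data frames now that the dates have been extracted
new_footmouth <- rbind(diary_file, group_int_file)

# Extract gender codes; "M-M" / "M-F" is the pattern mentioned in the talk
new_footmouth$gender <- str_extract(new_footmouth$everything_else, "M-M|M-F")

# Extract occupation with sub(), then turn the category into an integer
# code via factor(); the regex and capture group here are illustrative
occ <- sub(".*occupation:\\s*(\\w+).*", "\\1", new_footmouth$everything_else)
new_footmouth$occupation <- as.integer(factor(occ))

head(new_footmouth)
```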
So let's get on to the processing itself. The first thing we're going to do is tokenization, which is simply a way of splitting text into short units that can be statistically analysed. We're going to cut the everything else variable up into a list of individual words, and in R we can do this using the unnest_tokens function from the tidytext package. As you'll see, this function takes all the sentences from the everything else column and breaks them down into a format with one word per row, and far more rows than before, so our new data structure is one step away from what tidytext calls a tidy format. Let's explore how this looks. I'm creating a new dataset called token_list and applying unnest_tokens, and there are three arguments you need to supply. The first is the output, which is the name I want for the new column, in this case our tokenized words. The second is the kind of token you want; here we want words, but you could also tokenize into sentences, which we won't cover today, simply by swapping words for sentences. The third is the input variable, the column going into the function, which in our case is everything else. If we run this and view the new data frame by printing token_list, you can see all the tokenized words in a list. I'd also like to point out that unnest_tokens, as you might have noticed, has transformed all the words to lower case and removed the special symbols, so that's one less step to do, and it obviously becomes important when it comes to cleaning the data. However, unlike in Python, where you divided the corpus into words and the word tokenizer split the string into substrings as it went, in R we need another step to turn this long list of tokens back into one string per file name. If you view token_list, you'll see we have a huge number of observations rather than 87, because every token now sits on its own row within each file. To collapse this back, we use the FUN argument here, which lets you apply a function over each group of values. We also use a function called sprintf, which returns a character object containing a formatted combination of its inputs; sprintf is essentially a wrapper around the C library function of the same name. You'll notice a slightly odd-looking format string next to sprintf as well: the percent sign marks a slot, which is a placeholder for a value to be formatted, the remaining arguments passed to sprintf are the values that fill those slots, and the letter s simply says the formatted value should be a string. So I've created a new dataset called footmouth_token from the token_list we had before, and I'm using the aggregate function, which is actually part of base R rather than dplyr, to group by file and turn the token list into a single string per file that fits back into our original dataset. Let's run this and see what we end up with. Again I'm using head, because printing the whole dataset would take far too long to load in R. So we still have our original variables, including the file names, and here we are: we now have our tokenized words.
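Here is one way the tokenize-then-collapse step could look, assuming the names used above. Note that with base R's aggregate.data.frame method, the by argument must be wrapped in list(); passing a bare vector is exactly what produces the "'by' must be a list" error that comes up later in the session, so this sketch may be a useful reference point, though it is not the notebook's exact code.

```r
library(tidytext)

# One row per word; unnest_tokens lowercases and strips punctuation
token_list <- unnest_tokens(new_footmouth,
                            output = tokenized_words,
                            input  = everything_else,
                            token  = "words")

# Collapse the tokens back into one string per file
footmouth_token <- aggregate(
  token_list$tokenized_words,
  by  = list(filename = token_list$filename),   # by must be a list
  FUN = function(w) sprintf("%s", paste(w, collapse = " "))
)
names(footmouth_token)[2] <- "tokenized_words"

head(footmouth_token)
```

An equivalent formula interface, aggregate(tokenized_words ~ filename, data = token_list, FUN = paste, collapse = " "), avoids the by argument altogether.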
So we now have this single string of tokenized words per file in our dataset, which is very similar to the structure we saw in Python. Just a little detour: I also want to show you how you can create a word frequency. We can write a function to count the tokens in our dataset; it's a slightly complicated function, but it works really well, and it also includes the code to produce a ggplot. So we can run that function, and in the next line of code we can display the frequencies on a graph. For this we use the unnest_tokens function from the tidytext package and anti_join from the dplyr package. anti_join is used to find unmatched records: it keeps the rows of the first table that have no match in the second table, which is exactly what we want for dropping stop words before counting. And a bit of background I didn't really explain earlier: unnest_tokens splits a column into tokens, flattening the table into one token per row, and it supports non-standard evaluation. Let's run the function, and then we can plot by calling on reviews_tidy, which was created here, and passing it to word_frequency, the function we wrote to count the tokens. Give that a minute and we get quite a simple plot, but this is one way of plotting a word frequency, which is a basic step in pre-processing, and you can see which words are most common. It's not at all surprising that the word day is the most common, because we have information about what day each diary entry was written and when each interview took place. If you were interested, you could simply remove that word, and we'll show you how to do that a bit later on. The last step here, which you can do now or later, but I like to do as I go so I don't lose track, is to merge the new datasets we've created: our original data frame and footmouth_token. If you look in your environment, you can see they've been merged, so the tokenized words have now been added back onto our original dataset. Cool. Now we're going to move on to standardizing our dataset. Standardizing is basically how you deal with upper and lower case. You may have noticed that our dataset has already had these changed to lower case, but if you're still interested in doing this a different way, there's a function called tolower that will do it for you. What I'm going to do is create a new variable called lowercase and use the tolower function, which simply changes all the tokenized words to lower case, bearing in mind that they're already lower case because tidytext did that for us during tokenization. Let's run that, and we can view it again using the head function.
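The notebook's word_frequency helper isn't reproduced here, but a minimal version of the same idea, using the stop_words list shipped with tidytext, might look like the sketch below; the lowercasing and merge lines mirror the steps just described, with the column and object names assumed as above.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

data("stop_words")   # stop word lexicon bundled with tidytext

# Drop stop words, count the remaining tokens, and plot the top 20
token_list %>%
  anti_join(stop_words, by = c("tokenized_words" = "word")) %>%
  count(tokenized_words, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(tokenized_words, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "frequency")

# Standardizing with tolower() (already done by unnest_tokens) and
# merging the per-file token strings back onto the original data frame
footmouth_token$lowercase <- tolower(footmouth_token$tokenized_words)
footmouth_df <- merge(new_footmouth, footmouth_token, by = "filename")
```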
Nadia? Yes. Sorry to stop you, I've just got a question saying that when they run the aggregate function they get an error saying 'by' must be a list. Sorry, which part was this from? I think it's where the aggregate is, yes. The error message says error in aggregate.data.frame, then FUN equals function, and then 'by' must be a list. You can see it in the Q&A, although you might not be able to at the minute because you're sharing your screen. We've got a participant who says they can't create footmouth_token, so it must be this bit, where you've got token_list and then you pipe it into the aggregate. I'm actually not too sure why that's happened; has anyone else had that issue, or is it just the one person? That's the only person saying it so far, but I suppose it might be that they're not running the whole chunk, so they might be running it line by line. Maybe, possibly. Oh, Ian says he's getting the same as well, and apparently the error appears whether you run the chunk or go line by line, someone said. I'm not sure what the issue actually is. Where are we for time? Coming up to 22. No worries if we can't get past it; what we could do is have everyone watch Nadia run it, and then we'll correct the code and push it, so if you download it again you'll be able to get it working. Sometimes these things just happen. Yeah, apologies about that, I'm not too sure, and it also looks like my R has now decided to freeze, so let me try to stop this. Okay, that's strange. Someone has asked whether there's another way to create footmouth_token, because it's needed later in the script, but we are having a bit of an issue with that. Sorry, I was just going to say: if I could ask participants to follow along and watch Nadia execute the code, we'll correct it, push it and update the repo so you can run it in your own time. Yes, truthfully I'm not sure why there's an error, but I'll continue running the code and we'll fix it at a later date; apologies for that. We've just run the standardization now, which created the lowercase column, but just to recap, the tidytext package, back up here in the tokenization step, does this automatically for us, and it also removes blank spaces. The next thing you might want to do in your pre-processing is run a spell check. In R there's a package called hunspell, which includes two functions called hunspell_check and hunspell_suggest. These can test individual words for correctness and suggest correct words that look similar to a given misspelt word. Unfortunately, you'll see this code is commented out, and that's because the step is incredibly slow; I'd estimate it could take about 20 minutes to run, so out of respect for this workshop's time I won't run it now, but you can go ahead and explore it yourself.
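For anyone who wants to try the spell check later, the basic pattern with hunspell looks like this; the sample words are invented purely for illustration, and in practice you would pass in the (much larger) vector of tokens, which is what makes it slow.

```r
library(hunspell)

words <- c("epidemic", "livelyhood", "quarantine", "Cumbira")

hunspell_check(words)                             # TRUE/FALSE per word
hunspell_suggest(words[!hunspell_check(words)])   # suggestions for the misspellings
```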
Now, in retrospect, I could have run this beforehand, stored the corrected words as a data frame and saved that as an R file for you to read in, but the output is exactly the same as what we saw in Python. So if you have time, between making lunch or whatever, do give it a go. For now we'll move on to removing some irrelevancies, which covers punctuation, empty spaces and stop words. Punctuation isn't always useful when you're text mining, and you'll often want it excluded, especially when you're working with word tokens, because a lot of the punctuation ends up being tokenized on its own. We can use regex to replace all the punctuation with nothing, so we're throwing it back to the str_replace_all function from the stringr package. We're creating a new variable called no_punct, short for no punctuation, and calling on our tokenized words, which I realise will be an issue if the previous code didn't run for you, but you can still see how it works here. str_replace_all takes three arguments: the string, which is the input vector; the pattern you're looking for, which in this instance is a regular expression that matches all punctuation in a given text, much simpler than the date regex from before; and the replacement, which here is just nothing. So let's run this and view the data. Oh, is that not working? Typical, very strange; let's just try that again. Ah, that seems to have worked now. I have to say, R can feel very temperamental when it comes to working with text, and working with text is tedious in itself, but I have had issues with R randomly crashing because there was just too much data, or refusing to display an output. But this has worked: we now have that new variable, if it loads. So there's our tokenized words variable, there's our lowercase variable, and, oh, I skipped a line, didn't I, apologies, but as you can see the no punctuation variable is present here, so it has worked. If you're interested in removing non-alphanumeric characters instead, the regex would look like this; all you'd have to do is swap it in for the punctuation regex here. I'm not going to do that now, but you could simply uncomment this and try it yourself. We also want to remove empty spaces. Now, I showed you earlier how stop words can be removed quite easily with the tidytext package, but I also want to show you how you might go about this a different way. So I've created a new variable called stop_words_regex. The stopwords function comes from the tm package and gives you all of the English stop words, and I've then used the paste function with a collapse argument to combine them into a single pattern, so we now have one variable containing all the stop words expressed as a regular expression. We can then use str_replace_all again to remove the stop words from the no_punct variable, using that stop_words_regex variable we created up here.
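A condensed sketch of those cleaning steps, under the same naming assumptions as before; the exact stop-word pattern in the notebook may be built slightly differently, but the idea is the same.

```r
library(stringr)
library(tm)

# Strip punctuation from the tokenized text
footmouth_token$no_punct <- str_replace_all(footmouth_token$tokenized_words,
                                            "[[:punct:]]", "")

# To drop everything that isn't alphanumeric or a space instead:
# str_replace_all(x, "[^[:alnum:] ]", "")

# Build one alternation of English stop words, anchored on word
# boundaries, then remove them and squeeze out the leftover spaces
stop_words_regex <- paste0("\\b(", paste(stopwords("english"), collapse = "|"), ")\\b")
footmouth_token$no_stop <- str_squish(
  str_replace_all(footmouth_token$no_punct, stop_words_regex, "")
)
```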
It's a bit convoluted, I know, but I think it's useful to demonstrate just how many methods and packages there are in R for running fairly simple processing steps. So let's run that and see how it works. It will take a few minutes, because removing stop words is quite a slow process for R: it needs to check every token in every row, and we have 87 observations containing large text files. If this takes much longer than a couple of minutes, I'll stop it, because I'm noticing the time. And there we go, let that load. Yep, that gives us the no stop words variable. Cool. Now we can move on to consolidation, which covers some basic NLP methods. I'll run through this fairly quickly, but stemming, as discussed, is a way of reducing words to their root form so that variants of the same word are counted together across documents. The tm package in R provides the stemDocument function, which you can see here, and it stems each word back to its root. The function can either take a character vector and return a character vector, or take a plain text document and return a plain text document. Interestingly, there's also a function called stemCompletion, which looks like this, that can reconstruct those stems back into complete, known terms, but I'm just going to run the stemming and hope it doesn't take too long. Perfect, it did not. Then we'll have a quick look to make sure the variable has actually been created; scroll along to find it. RStudio is being very temperamental today, but we do have all our previous variables: there are our stop words, and there are our stemmed words. Next we have POS tagging. For POS tagging in R there's a package called udpipe, which contains a function called udpipe_download_model. This lets you download a model, provided by the UDPipe community, for a specific language of your choice, and I'm choosing to download the English one. Just to let you know, udpipe is built around the Universal Dependencies treebanks, a collection of annotated corpora covering many languages that work well with R. We then load the model using the udpipe_load_model function, passing in file_model, which is just the path to where the model was downloaded. So let's run these two lines. You'll get a big warning message, but that's absolutely fine; it's just letting you know the model has been downloaded. Then we load the model, and then we want to apply it to our text, which we can do with the udpipe_annotate function. This is a tool for tokenizing, lemmatizing, tagging and dependency parsing of raw text, so if we run it, it should create our POS tags for us, hopefully without taking too long either. I suspect we might have trouble displaying the output from this code, just because my R does seem to be struggling today; it might help if I clear some things from my environment, but for now we'll give it a few minutes to see if it loads. While we wait, I'll talk through the lemmatization function and then we'll call that the end of the talk. Lemmatization is the last method we're going to show you for basic natural language processing: we can use the lemmatize_words function from the textstem package to lemmatize our tagged words. Obviously R is still working, which means that if I try to run this now, it will probably just crash.
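Those consolidation steps, sketched out under the same assumed names; the talk mentions lemmatize_words, which works on a vector of individual words, and its sibling lemmatize_strings, used below, is the version for whole strings of text.

```r
library(tm)
library(udpipe)
library(textstem)

# Stemming with tm (stemCompletion can later map stems back to full terms)
footmouth_token$stemmed <- stemDocument(footmouth_token$no_stop)

# POS tagging with udpipe: download and load the English model, then
# annotate the text (tokenising, tagging, lemmatising, dependency parsing)
ud_model   <- udpipe_download_model(language = "english")
ud_english <- udpipe_load_model(file = ud_model$file_model)
pos_tags   <- as.data.frame(udpipe_annotate(ud_english,
                                            x      = footmouth_token$no_stop,
                                            doc_id = footmouth_token$filename))

# Lemmatisation with textstem
footmouth_token$lemma <- lemmatize_strings(footmouth_token$no_stop)
```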
So I'm just going to let this do its thing for a minute. But as you can see, it's really quite easy to run some of these basic natural language processing methods on a dataset; it doesn't take much code at all. You might also have noticed that I've been creating a new variable for each new method, and that's simply to show you what each step looks like. If you were working on a real project, you'd probably want to overwrite the variable rather than creating a new one every time. And that brings the code demo to a conclusion. If you wanted to tidy up your dataset, you could run a left join, joining new_footmouth, which holds our gender, occupation and date columns, onto the footmouth data frame that contains all the pre-processing and processing output, joining by the file name so you have everything in one place. You could also choose to save the result as an R data file, because running all of this every single time before analysis would obviously take far too long; there's a short sketch of that tidy-up step at the end. But I will stop sharing my screen now, because R is just not cooperating, so I do apologise. I'll stop sharing and see if we have any questions to take. All right, yeah, thanks for that, Nadia. Classic, right? When you do a live demo, sometimes you're just going to encounter issues that weren't there the day before. It looks like it's a problem with the FUN = function(token_list) line, and I think it's not recognising the grouping as a list, but there are ways we can test that; we'll just have to go back and check what variables it's being given. It is strange that it works in Nadia's R, and I didn't change that line of code before I pushed, so I'm not too sure, but I'll definitely look into it and push some changes to GitHub with comments so you can all see what's going on. But yeah, as you can see, RStudio can be a bit of a pain, especially when you're working with complex data like text data.
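The tidy-up step mentioned above might look like this, assuming the object names used in the earlier sketches; saving the result means the pre-processing doesn't have to be rerun before every analysis session.

```r
library(dplyr)

# Join the extracted metadata (dates, gender, occupation) onto the
# processed tokens by file name, then save the combined data frame
footmouth_final <- left_join(footmouth_token, new_footmouth, by = "filename")
saveRDS(footmouth_final, "footmouth_clean.rds")

# Later sessions can simply reload it:
# footmouth_final <- readRDS("footmouth_clean.rds")
```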