Thank you, Lucinda. Thanks for inviting us here today. I'm going to give you all a general overview of what natural language processing is, before Tommaso takes us through a really, really interesting project he did using mixed methods in natural language processing — very topical, with International Women's Week being next week. For any of you coming here from the tech school or from a similarly data-focused background, some of this will probably sound quite familiar, but hopefully I'll give you some new info on language itself. And if anybody's coming from a linguistics background, please don't call me out on my lack of very deep language knowledge — it's a little bit of an intro on both sides. If you've got questions, as Lucinda said, please do pop them in the chat; we'll have plenty of time for questions and discussion at the end, and lots and lots of time for "can I do this with natural language processing?" The answer is almost always yes. So today I'm going to start by talking to you about human language, because to understand how computers process language, we first have to understand humans. We'll talk about the difference between humans and computers, how we teach computers about humans, and how we can get them to do what we want with natural language processing. Then, like I said, I'll hand over to Tommaso, who will take us through a really interesting study — not that I'm biased, and I definitely wasn't his supervisor or anything. And then we'll bring Lucinda back, because Lucinda and I are currently working on a really interesting project, so we'll talk about what we've got coming up as well. So: what is language? You can ask yourself this too — what is language to you? Typically when we talk about language, we mean a method of communication. My primary language is English.
I know English — and I know bad English — and I can say hello in a few different languages, but really nothing meaningful. So today the examples I use are going to be British English examples, but the things I'm talking about work in all languages, and they've been advanced across lots of different languages, especially recently. Now, humans primarily use spoken language, but we also use gestures — sign language is well established, with several different types of sign language — and body language, and tone, and all those kinds of things. Computers speak in ones and zeros. That's called binary. Ones and zeros make up bits and bytes and nibbles — you may have heard of those, and of terabytes. But essentially, computers only speak ones and zeros. Everything else we get them to do is interpretation: models built on top of that, programs built on top of that. So you being able to see me on your screen really boils down to a series of ones and zeros and what we make those ones and zeros do. Right from the get-go, we know that human beings are capable of an awful lot more than computers — but now we see things like ChatGPT, the new advances in ChatGPT, and the generative AI videos you may have seen. All of that boils down to getting computers to behave like humans. And to get them to behave like humans, we need to understand humans and how humans work. So we can talk about words. This is probably bringing you all back to your GCSE English — and I know we've got some mixed disciplines here, and there's an awful lot more to words than what I'm showing on the screen — but we're probably all familiar with verbs and nouns and adjectives. Then there are adverbs: quickly, back, ever. Prepositions. Pronouns: me, you, she, he. Interjections, all right?
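To make that "ones and zeros" point concrete, here's a tiny Python sketch (Python is the language we'll come back to later) showing how even a short greeting reduces to bits — each character becomes a byte, and each byte is eight binary digits:

```python
# Every character a computer "sees" is ultimately stored as bits.
# We encode a short string to bytes, then render each byte as
# eight binary digits (a byte is 8 bits; half a byte is a nibble).
def to_bits(text: str) -> str:
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))

print(to_bits("Hi"))  # 'H' is 72, 'i' is 105 in ASCII
# -> 01001000 01101001
```

Everything a computer does with language is layered on top of representations like this one.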
So these little words — sometimes we call them stop words if we're talking computers. We have all of these words, and we can break them down into types of word, and this matters because we need to tell the computer what type of word each one is. We can also talk about bits of words. These are called stems. I'll give you an example: victory, factory and victorious. Now, victory and factory, if you look at them straight up, are actually more similar as strings than victory and victorious — but we know that in human language, victory and victorious are more meaningfully related than victory and factory. So what we do is use the word stem: the core part of the word that carries the meaning. In this case we'd take "victori-" as the stem shared by victory and victorious, and "factor-" as the stem of factory. So we've got the type of the word, and we've got the part of the word. These are all meaningful things — not things you think about when you're having a conversation with your friend down the pub, but all part of the complexity of language. So we can think about which words carry meaning in communication. We've got these stop words, as we call them in computing — linguistics has a different name for them — and these words are absolutely essential for a sentence to make sense to a human: if, and, but, the, of. However, take them out and the sentence can still make sense to a computer. They've nothing to do with the subject of the sentence, but they're really important when it comes to our understanding of the sentence. "The cow jumped over the moon" becomes "cow jumped moon". We don't speak like that out loud in human language — we say "the cow jumped over the moon" — but if you were to take a passage from a book and make a word cloud (we love word clouds, don't we? I love word clouds), the biggest words would be "the", "of", "if". "If" is not a word.
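The word-cloud problem above is easy to demonstrate: count word frequencies with and without stop words. A minimal pure-Python sketch — the stop-word list here is invented for illustration (real NLP libraries ship much longer lists):

```python
from collections import Counter

# A toy stop-word list for this example only.
STOP_WORDS = {"the", "of", "if", "and", "but", "a", "over"}

sentence = "the cow jumped over the moon"
words = sentence.split()

all_counts = Counter(words)                                   # every word
content_counts = Counter(w for w in words if w not in STOP_WORDS)

print(all_counts.most_common(1))      # [('the', 2)] -- 'the' dominates
print(content_counts.most_common(3))  # [('cow', 1), ('jumped', 1), ('moon', 1)]
```

With stop words removed, the counts reflect what the sentence is actually about: the cow, the jumping, the moon.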
I just made that up. Anyway — if, and, but, the, of: if you count the number of instances of those inside a book, they're rapidly going to be the biggest words on your word cloud. But if you're wanting to digest a book, you want to know about the cow jumping over the moon — you want to see how many times the cow is mentioned, how many times the moon is mentioned. So we get rid of these words, which are essential for humans to understand, but which are not the core of what we're wanting to say. Those are the stop words. Now, this is one of my favourite things about language: there are actually distances between words. This is something called hyponymy — nothing to do with hypnosis, which does make me slightly sad — and it means we can map the entire dictionary and measure the distance between words in it. Take the example on the screen: we've got colour. Underneath colour we've got purple and blue; underneath purple we've got violet and lavender; underneath blue we've got navy. The positions words sit in relative to each other have distances we can put on them, so we know that violet and lavender are more closely related than lavender and navy, because we measure the distance along these lines: lavender goes up to purple, and you find the common ancestor. Just as you get the genetic distance between a human being and a gorilla by going to our common ancestor, or between a horse and a zebra by going to theirs, words have that same ancestry, that same commonality. Which means that when you're sorting words, you can say: right, the word violet is closely related to lavender — but how far is the word violet from the word tree? So we've mapped the dictionary. In fact, I worked with a person who classified the entire dictionary — a very, very cool project done by the
Oxford University Press. Other universities and organisations have mapped other dictionaries, because the British dictionary is different from the American dictionary, and of course there are dictionaries for other languages as well. (I've now said the word "dictionary" too many times and it's become meaningless.) All right, so: we've got types of words — nouns, adjectives, verbs. We've got parts of words — the stems. We've got words which carry meaning and words which don't, and we've got words in relation to other words. And then we have sentences. So we build up our sentences: we've got our noun phrase, "the cat", and our verb phrase, "plays the piano", and these come in specific orders. What does the cat do? The cat plays the piano. What happens to the piano? "The piano, the cat plays" — that sounds a bit weird. It's technically correct — something has happened to the piano — but because the piano sits in the verb phrase, pulling it out front sounds wrong. So this is just a quick whirlwind introduction to the complexities of language. Now try teaching all of that to a computer. We've got to know it and understand it ourselves, because if we don't, it doesn't translate over to the computer — remember, we've got words, sentences, inflection, gesture; computers have got ones and zeros. "The cat plays the piano" is meaningful: each one of those words plays a role, and its position in the sentence plays a role. And we haven't even got to conversations yet — we've got a slide on that coming up. We've even got nuances like adjective order. In English it's an inherent thing that you say adjectives in a certain order: quantity, quality, size, age, shape, colour. Until I saw the tweet on the screen, I didn't even realise I did that — but "the strange green metallic material" is how we would say it, and the order even dictates where you put an "and" or a comma. And again, all of this is meaningful.
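That adjective-order rule is regular enough that you can sketch it in a few lines of code. Here's a toy illustration — the category assignments are hand-written for this example, not a real grammar:

```python
# The conventional English adjective order described above.
ORDER = ["quantity", "quality", "size", "age", "shape", "colour"]

# Hand-labelled categories for a handful of example adjectives.
CATEGORY = {
    "three": "quantity", "lovely": "quality", "big": "size",
    "old": "age", "round": "shape", "green": "colour",
}

def order_adjectives(adjectives):
    """Sort adjectives into conventional English order."""
    return sorted(adjectives, key=lambda a: ORDER.index(CATEGORY[a]))

print(order_adjectives(["green", "old", "three", "big"]))
# -> ['three', 'big', 'old', 'green']  ("three big old green ...")
```

Native speakers apply this ordering without ever being taught it explicitly — which is exactly the kind of implicit rule that has to be made painfully explicit before a computer can use it.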
Okay, it's specific, and it's deliberate, the way the language is used and has developed. We love seeing people on the internet get called out for using "then" rather than "than" — and actually, with those sorts of things, we generally do understand what people are going for. But you will see people get really finicky about the way certain language is used, because, you know, "eating with your children and your dogs" versus "eating your children and your dogs" — there are lots and lots of examples of misuse which changes the inflection, which changes the meaning, and next thing you know you've accidentally eaten your grandma. All of which is to say: it's important, but it's really specific and really nuanced, and it can be very difficult to explain, let alone write rules for, let alone translate into ones and zeros. So how do we communicate with computers? We use programming languages. Computers will do exactly what you tell them to do. We turn human language into computer language: we say to the computer, print "hello world", and the computer will do exactly that. You give it a series of commands — they're wonderful because they do exactly what you say, and they're terrible because they do exactly what you say. The computer turns that "hello world" into ones and zeros, which is fine, but not especially useful on its own. So we need to build on that: we need to turn it into a series of commands that means the computer can interpret human language. We need to teach the computer, and we teach it in a few different ways. Machine learning splits into three particular approaches. Supervised machine learning is where we say: here are all the right answers, figure out why. Unsupervised learning: here's a whole bunch of information, figure it out for me,
tell me where it splits. And reinforcement learning: we tell the computer when it's doing well and give it a score. Fraud detection, customer segmentation and ChatGPT are examples of these different ways we use machine learning. Where we have big, historic data sets, we can feed those to the computer, it works out the rules, and then it says, "this is likely to be the case." Or we can say, well, we don't really know what outcome we want here, but tell me who is similar to whom — that's customer segmentation. Or we can say, right, have a go at predicting this, I'll tell you how well you did, and you learn from that. And that's it — machine learning 101, you're done, you're good to go. That's obviously a very simplistic take, but that's how it all boils down. Right, a quick history of natural language processing — which is the computer processing of human languages. A lot of interest began around World War II with translation models: machine translation between languages, rather than translation done by humans. The early models and their challenges — you may have heard of Noam Chomsky — showed us that it's not as simple as translating an English word into a French word or a German word or a Chinese word or any other language, because you need to look at the wider structure of the sentence, at how things are phrased together. And that really led us into what's called an AI winter. AI goes through hype cycles just like lots of other things, and the first sort of AI winter, when everyone kind of put it aside, came in the 1960s. One of the real issues is that natural language is very, very complex, which means we need really powerful computers to deal with it. Our very
first sort of conversational AI was called ELIZA. ELIZA was built in the mid-1960s and mimicked, essentially, a therapist–client interaction through structured questions. And it works — it's obviously very basic, and anyone who's used ChatGPT will realise it's nowhere near what that can achieve now. In the 80s and 90s, statistical models and neural networks began to get developed. These neural networks are essentially fancy predictive text, but what we can do with them is something called encoding and decoding — I've got a picture of that on the screen. It means we can capture the context of words when they get encoded, so that when they get decoded, that context stays with each word as the sentence gets rearranged in the right way. Our models of the 50s and 60s might render "the pink house" word by word as "la rosa casa", when actually in Spanish it's "la casa rosa". With encoding and decoding, the relationships between words — those great big nets of words, those mapped dictionaries — stay with the words, making translation much better. Then, of course, we need to train our models: these learning algorithms need data to learn from, and the computing power behind the ChatGPTs of this world has increased substantially. ChatGPT's underlying models went from 117 million parameters to 175 billion parameters between 2018 and 2020 — in the space of a couple of years, a huge leap, and one that's only possible with really advanced computing power. If you remember your little desktop systems — even those little Apple computers with the green see-through backs, the iMac G3s — there's no way they could cope with what even your phone does now, let alone what cloud computing can do. So when we actually get into building these models, it always starts the same way: the first thing you do is clean your
data. You get rid of your funny characters, your emojis, your misspelled and mistranslated words — clean them up, make them all neat and tidy. That's always the part that takes the longest. You can ingest your data in almost any format — PDFs, Word documents, Twitter, Reddit, whatever you like, there's always a way to get it in — but then you have to clean it up. (Even your beautiful transcription tools, if your accent's a little bit unplaceable like mine, will struggle to pick up some words.) So you clean up your data; it's nice and neat, it's ready to go. Next you need to build a corpus. The corpus is which words you're going to learn from — lots of our programs have built-in corpora — and you say, right, here's a whole heap of words, train yourself on that. The next thing we do is lemmatize our words: we reduce words to their stems, as we talked about, keeping the meaningful part of the word. What that does is prevent flooding the model with words which are actually all the same word used slightly differently — otherwise your model is going to think, ooh, this one's super important, and you'll get a model focused on iterations of a single word. We get rid of our stop words — if, and, of, the, but. So again, we're boiling our text down to its most basic parts, which means we're focusing on the really important parts of the words we've been given. Then we build our models. Some of the most popular approaches used in natural language processing: your reinforcement learning algorithms — "well done, computer, ten out of ten for this one" — that's your ChatGPT. ChatGPT is based on the same family of techniques that learned how to play chess: essentially, you reward the computer when it does something right. The more you do
it, and the more you train it, the better it gets at predicting. It's like a very enthusiastic puppy: you train it, you give it treats, you tell it what it's done well, and then it will do everything you want. (I clearly have not trained my dog all that well.) Sentiment analysis is very cool — Tommaso is going to tell you a lot more about sentiment analysis and topic modeling — but as an example, Nike used sentiment analysis to monitor public opinion when it sponsored the footballer Colin Kaepernick. Colin Kaepernick is the one who very famously took the knee during the national anthem, and Nike used sentiment analysis, which tells you whether someone is saying something happy, neutral or sad. It's really commonly used to monitor customers and customer opinion about products. Topic modeling doesn't often make it into the public domain, but from a researcher's perspective it's really valuable. Topic modeling showed us, in an analysis of news articles about coronavirus, that things like the markets, company drama and cancelled events were written about much more frequently than the health impacts of the disease. You can take from that what you will, but that study analysed 35,000 news articles. So, very cool: topic modeling is essentially thematic analysis at a crazy scale. Within topic modeling, the most common model we use is called latent Dirichlet allocation. You take all your documents — a document might be a tweet, a book, an interview, a single-line response — and the model adds this extra layer: it takes all of the words and says, right, these words actually feature in these topics. That's how that news-article analysis picked things out: it looks across all of the documents and finds the common topics discussed in them. And Tommaso is now going to talk to you all about how
he's actually used sentiment analysis and topic modeling together in the project that he did. I'm going to keep driving the slides for you, but I'll switch over my mic and my camera. — Perfect, thank you. I'll just do the old "next slide, please" as we go. So yes, thank you, everyone. I'll introduce myself, because I guess I've been fairly silently lurking in the background: my name's Tommaso, and I also work at the university. I completed a project — I say recently; it actually feels like a fairly long time ago now — on natural language processing, and more specifically on how we can use the techniques that Sam has just talked us through to detect gender bias, or to test whether that risk and that bias are in place. This ultimately led me to a couple of key objectives for the project. One was to check whether there is a significant difference in the sentiment and the semantic composition of Twitter data — and at the time it was still called Twitter, this was just before it rebranded to X, so I will continue to call it Twitter, because I don't think my brain can handle the change at this point. The second objective of the study was to check whether we could create a tool that we could lift and shift — adapt to different kinds of media, different uses and subjects — and see if it could provide the same sort of analysis for other purposes. So that's the basis of my project. Sam, next slide, please — oh yes, I've got fancy things in there. To start us off, I'll go through some of the stats I found researching this topic. I started off looking at the tech sector, and at the time of writing, a few years ago now, a couple of studies found that about a third of UK and US jobs in technology were occupied by women, as opposed to men, for whom the figure was over 70 percent — in fact, I'll caveat that I'm recalling it from memory.
There's also a real trend — or rather an inverse relationship — between the prestige of an academic research paper and the placement of any female authors in its author list. Basically, what I mean by that is: the more valued and the more prestigious the publication, the further down the list of authors you're likely to see female names. Additional analysis looked at gendered descriptors — the words we use to describe people who have a real drive to take the lead, take control, get us out of sticky situations, people who are born leaders — and there was a real absence, a real discrepancy, between the genders: those descriptors, those adjectives, were far more likely to be used for males. Very similarly, in news topics we see continuous reinforcement of these gender stereotypes in terms of the spheres of topics covered. In this case, it was identified that articles focusing on women tended to cover what the authors described as the private sphere — themes of family, home and relationships — and that was across a number of different outlets. Lastly, and most relevant to the rest of my study, was social media: it was found that female social media users in high-visibility roles — in these cases, politicians — were more frequently subjected to challenges on their professionalism and had their credibility disputed publicly. I think in both of the cases studied it was on Twitter, so everyone was able to see it. Next slide, please. That, essentially, gave me the idea of what to look into. I'm a sports fan — I like all kinds of sports — and at the time I thought, I'll choose tennis. And I
essentially found ten very famous tennis players, all with active Twitter accounts and a large following, and using the Twitter API I extracted the tweets in which they were directly tagged — in which they were addressed — over a two-week period. I collected hundreds and hundreds of tweets over that period, and that formed my big data set; that's what I ended up testing. Next slide. These are the two techniques that helped me analyze this information — Sam mentioned sentiment analysis and topic modeling earlier. There are lots of different algorithms and programs available for sentiment analysis. One of them — what a cool name — is called VADER, and to reiterate what Sam said, it essentially looks at a body of text and tells me whether it's positive, negative, or neither: it takes in those important words and looks at whether they ultimately formulate a positive message within the tweet. Afterwards I carried out topic modeling, which essentially looks at what the people addressing these tennis players were talking about — what exactly they were saying when they addressed the users we studied. Here I've summarized the program I developed to test this: we start with the API, we collect all these tweets, we have this nice database, and then, crucially, there was a lot of cleaning to do — lots of emojis, images, hyperlinks, and lots of, frankly, for this purpose, useless things in there, so I removed all of those. Then, if you cast your memory back to the lemmatization and the stemming, I carried out all of that to create divided datasets based on the gender of the users, which we then analyzed. So that's the program as a whole. So,
next slide, please. So, what did I find? This chart summarizes the sentiment scores fairly well. You can see, really visibly, that there isn't a huge amount of difference. What's quite nice to see is that there are a lot more positive tweets being sent to these tennis players than negative ones, followed by the neutral — that's always lovely to see. I then carried out a couple of statistical tests to see whether there really was a significant difference across the sets, and there was: there are significantly more positive and neutral tweets than negative ones, so there is a tendency for Twitter users to reach out to these tennis players with positive or neutral messages rather than negative. However, as you've probably deduced, there wasn't really a significant difference in the distribution between the genders — it's more or less the same. Nonetheless, I wanted to investigate a little further, because a difference is worth looking into. Next slide, please, Sam. Looking in a bit more detail, you can see that the proportion of negative tweets the female tennis players received was higher than the male tennis players' by about three percent. Not a huge difference, but a difference nonetheless — and in an ideal world we wouldn't see any. Now, the way these sentiment analysis algorithms work is that they have thresholds: they allocate a score to the text you've provided, and based on that score it's categorized as positive, negative or neutral. To be a little bit stricter, I carried out some post-tuning, which essentially means making those ranges smaller — being stricter in the categorization — so if there was a score that was only just neutral or only just negative, I'd
remove it from consideration, just to look at the more extreme cases. The difference remained more or less the same — still two percent, so fairly small, but there was something there, clearly a difference worth looking into. The next step was to look at what exactly people were saying, and this is where the topic modeling comes in. On the next slide you can see an example of the sort of dashboard you're presented with when you carry out this topic modeling technique, this LDA analysis. You can see lots of circles on the chart, and that is very useful for telling you how many distinct topics the model identifies: there's one on the top right, one on the bottom left, and then a whole bunch on the bottom right, which tells us there are probably three distinct topics being identified by the model in that particular case. On the next slide we can see the differences between the gender datasets. On the left we've got the negative datasets. We can see that there are more negative tweets being addressed to female tennis players in general, but we can also see that, proportionately, more of those negative tweets were categorized as aggressive language — in some cases really quite abusive language — which wasn't identified in the case of male tennis players. We can see that negative tweets also addressed the female tennis players' physical attributes and, in line with some of the research I pointed to earlier, mentioned topics categorized here as family. And we can see a similar trend in the positive sentiment dataset. Interestingly, there are fewer aggressive tweets addressed to female tennis players there — generally, that tells us that when Twitter users address these tennis players positively, there is, as we'd imagine, no aggressive language — but the differences in physical attributes and, more
specifically, the family themes increase. So we can see there's a tendency for Twitter users — X users — when engaging with these tennis players, depending on whether the player is male or female, not necessarily to alter the language, but perhaps to be more ready to address them on certain topics than they would the other gender. Those are essentially the findings. On the next slide we have a bit of a summary. Yes, we can create a tool, we can create a program — and I did this fairly easily by collating publicly available algorithms and code from the internet, so it's relatively straightforward to do — and we can then feed it this information, this text data, and analyze the sentiment and the topics to gain some really valuable, really interesting, and sometimes quite hard-hitting insights. The LDA analysis alluded to a greater consistency of negative language — the female dataset reported higher for pretty much every category there — and it also showed us that there is a significant disparity (all of this was statistically tested) in the themes and in the choice of words used by Twitter users to address each gender. This in turn gives us some fairly clear directions for future research. The first place to start is by expanding the sample: as I mentioned, this was only two weeks and only ten tennis players, so maybe it needs to be expanded over the course of a year, to however many players we can manage, to increase the population size. From there, there's continuous improvement that can be made to the model: the program I created is now, I think, about two years old — maybe a year and a half — so no doubt improvements have been made to the algorithms I used at the time, or there might be new ways to analyze the data available
out there that I didn't have available at the time. So this can continue to progress in line with developments in this era of technology. And of course the whole idea is that this tool then gets lifted and shifted, as I mentioned earlier, to be applied to different types of media, different subjects, different themes, to gain this really useful insight. It could be used to look at academic papers; it could be used to further explore the type of behaviour that politicians are subjected to, which is particularly pertinent this year, with what I think is the record number of democratic elections going on at any one time in the world — you can imagine all the information out there that we could study. And then you've got things like marketing material, very similar to what Sam mentioned earlier. So the possibilities are numerous, and it's a very interesting and potentially exciting sphere to work in. I hope that was interesting — that was the whistle-stop tour of the project, and I hope you enjoyed it. Any questions, just pop them in the chat. — Thanks, Tommaso, that was really interesting; like I said, I'm not biased or anything. So, we have had some questions in the chat. What I'll do is finish up this little bit, answer the questions, and then bring Lucinda in. The question was about whether you can do these things in ChatGPT — and Manish has just tried it on an article in ChatGPT — so yes, you can, and there are capabilities; you can also use the ChatGPT API. The difficulty comes with ChatGPT being a bit of a black box: getting statistical validity out of it is hard. There are also a couple of complicating factors. ChatGPT is public cloud, so unless you have a privately hosted instance, you're really limited in what you can ethically send it, because anything you give it, it remembers and uses to train its model, particularly on that reinforcement learning side as
well. It also depends on whether you've got the premium version of the API, the application programming interface, which is how you hit the back end of the service with things. So it's absolutely doable, but it's probably going to cost you money and programmers. Tomasso and I both built our programs in Python, which is free to use, and I'm a big fan of that: they say if it's free, it's for me. So you definitely can use it, but use it with caution, and it might not be quite as versatile as you hope once you start getting into large volumes of data.

ChatGPT itself is also trained on Reddit. Now, I love a Reddit rabbit hole as much as the next person, but Reddit has a primarily American user base, and while it does have moderators, it also has some dark areas, and that can influence the way these algorithms behave. Computers model humans, and humans are most certainly not perfect. Google has just released a challenger to ChatGPT trained on its own data rather than just Reddit; whether that turns out better or worse is yet to be seen, but it doesn't take very long to prompt these algorithms into being quite unsavory, which I think is probably the most polite way I can describe it on a recorded call. These models are really only as good as the data that you feed them, and that does mean that cultures and dialects are not equally represented, so you do need to use them with caution.

There is also, obviously, a great need for a sarcasm font. Some of the more advanced models can be induced to pick up on sarcasm, but they're not perfect, so there is of course a risk, especially with sentiment analysis, that you misclassify something as excellent or happy when in reality it's sarcasm. Just as we have difficulty understanding tone and inflection when we're reading text data, the computer is going to have exactly those same
issues. So, is there any validation done around whether humans agree on the same sentiments as positive, negative and neutral? The way those models are built is supervised learning. We split the data into what's called a test set and a train set. Typically you'll have, say, 10,000 human-classified tweets, where usually some poor postgraduate student has had to sit there and go through them manually: this is happy, this is neutral, this is sad, happy, neutral, sad. We take those 10,000, train the model on some of them, and then test it on the rest. We don't train it on everything, otherwise it's going to be a perfect fit to that data; what we're looking for is whether there's enough consistent variation in the data to be able to say this is going to happen again. The Google car was done with supervised learning too: they drove it around the desert for hours and hours, actually days, weeks, months. So yes, there is validation, and then what we do is backwards validation, where we get a statistical certainty for how confident we are that we got most of these right.

When Tomasso was talking about that threshold, that was him challenging the model to be more precise. You can actually tune this: you can say, I have this much room for error, and you can even say, I have room for particular types of error. When it comes to fraud, for instance, again with supervised learning, what we often find is that the clients and customers who use these algorithms for fraud detection would rather let fraud go than accidentally accuse their customers of fraud. You have your false positives, false negatives, true positives and true negatives, and in some cases you can change that tolerance and say, actually, I would rather have a model that errs towards false negatives than false positives. But we could talk about that all day. So now I'm going to tell you a little bit about what Lucinda and I
have got planned. So, Lucinda, can I invite you back? Absolutely, I was just having a bit of a lag there in my internet. So, I started doing a completely separate project; I hadn't even met Sam when I was doing it. I was looking at legal apprentices' feelings around, and experiences of, academic failure, or what they saw as academic failure. It was a qualitative study, so I ended up with nine long interviews about how they feel about this. As they'd given their permission, when I met Sam I said, I wonder, if we ran this through your program, whether it would pull out the same themes and come to the same conclusions that I did. So currently what Sam and I are doing is looking at whether the computer method (I'll pop my email in the chat in a minute) comes to the same conclusions that I, with a human brain, did. And then we're going to get some other human brains involved as well, both from the School of Law, where I am, and the School of Tech, where Sam is, and see what sort of consensus we get from people. The idea is to see how close it is, and whether or not we could then look at creating a model or method of computer-assisted qualitative analysis, because as anybody who's been involved in qualitative analysis knows, it takes forever and a day to go through everything and do the themes. So that's what we are currently working on. Sam, I don't know if you want to add anything to that?

The really cool part about this study is that topic modeling and things like that are usually done on a big, big scale. We all know that when you increase your sample size, your chances of getting something significant also increase, just because you get more versions of what is common. So what we're really looking for is whether this can be used on a smaller-scale study, where you wouldn't necessarily have the investment to buy a nice tool or a premium subscription, and you
don't want to be using public cloud platforms, but you also want to be able to run it quickly. These kinds of things are really useful for survey responses and the like, any of your open responses. So what we're doing, really, is checking whether it works in this case; I think it was Helen who asked whether it has been tested against human outputs. We're pretty confident these methods work well on a large scale, but can it cope with the small-scale interviews and not get flooded by the interview themes as well? So we're checking to see how sensitive it is.

But yes, it's my turn to thank you all for being a wonderful audience. I've really enjoyed your questions; has anybody got any more? If you want to use it, I can certainly send you a version of the Python program and show you how to use it as well. You'll just need access to Python; I use mine through Google Colab. Now, if you do want to use the Twitter API, I will warn you that it changes like every other week, and you'll need to get yourself set up as a developer on it. But certainly, and I'll speak for myself here, I'm happy to share my topic modeling program, or I'm happy to collaborate as well if you need a programmer to help you with it. I'm always happy to be a background collaborator, and I've also got lots of other colleagues in the School of Tech who know more about programming than I do.

Do we have any other questions for Sam and Tomasso? I think that just leaves me to again say thank you ever so much to both our presenters for coming along and talking about this. I've popped my email in the chat, so anybody who wants to get in touch with us, either for ALT South reasons (and I will add that we are always looking for people to come along and join in on the committee as well at ALT South),
or anybody who wants to get in touch with Sam and/or Tomasso, I can put you in touch with them as well. Thank you very much, all, for attending. Thank you so much for having us; I've really enjoyed it. Thank you very much. Thank you. All right, I will call a halt there. Thank you.
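The supervised-learning workflow Sam describes in the Q&A (split human-labelled data into train and test sets, evaluate with true/false positives and negatives, and tune the error tolerance with a threshold) can be sketched in a few lines of dependency-free Python. Everything here is invented for illustration: the tiny corpus and the word-counting "classifier" stand in for the thousands of hand-labelled tweets and the real statistical models (for example scikit-learn's) that actual projects would use.

```python
import random
from collections import Counter

# Tiny hand-labelled corpus: (text, label), 1 = positive, 0 = negative.
# In practice this would be thousands of examples classified by a
# "poor postgraduate student", as described above.
labelled = [
    ("what a great lecture", 1), ("i loved the examples", 1),
    ("really helpful session", 1), ("brilliant and clear", 1),
    ("great slides and great pace", 1), ("i enjoyed every minute", 1),
    ("this was awful", 0), ("i hated the sound quality", 0),
    ("terrible pacing throughout", 0), ("boring and unclear", 0),
    ("awful slides and awful pace", 0), ("i was bored by the examples", 0),
]

random.seed(0)
random.shuffle(labelled)
split = int(0.75 * len(labelled))   # hold 25% back: don't train on everything
train, test = labelled[:split], labelled[split:]

# "Training": count which words appear in positive vs negative examples.
pos_words, neg_words = Counter(), Counter()
for text, label in train:
    (pos_words if label else neg_words).update(text.split())

def predict(text, threshold=0):
    # Raising `threshold` makes the model more reluctant to say "positive":
    # fewer false positives at the cost of more false negatives, the same
    # tolerance trade-off described for fraud detection above.
    score = sum(pos_words[w] - neg_words[w] for w in text.split())
    return 1 if score > threshold else 0

# Evaluate on the held-out test set with a confusion matrix.
tp = fp = tn = fn = 0
for text, label in test:
    guess = predict(text)
    if guess and label:       tp += 1
    elif guess and not label: fp += 1
    elif not guess and label: fn += 1
    else:                     tn += 1

accuracy = (tp + tn) / len(test)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}  accuracy={accuracy:.2f}")
```

The held-out accuracy is the "statistical certainty" mentioned in the answer; at real scale you would also repeat the split (cross-validation) as a form of backwards validation.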