Welcome to the last spring webinar. We have with us today our colleague from Queen Mary University of London in the UK, Olumide Popoola, a linguist with a very interesting approach to academic integrity: he uses his linguistic knowledge and techniques to do forensic research on authorship identification, ghostwriting and plagiarism detection, and I'm eager to hear more about it. So the floor is yours.

Thank you, Sonja. Let me share my screen here. So this session will share some techniques, and some software, that enable the use of forensic linguistic methods by anybody (hence the DIY angle) in common academic integrity contexts: trying to identify or verify authors, ghostwriting and plagiarism. As Sonja alluded to, I'm an applied linguist and a forensic and investigative linguist, but I'm also an education developer working in the areas of assessment and social justice pedagogy. The "interesting approach" that Sonja kindly described my work as having comes from me wrestling with these different identities: one trying to detect fraud in academia, the other trying to promote a culture of academic integrity and be more supportive and preventive. Out of this, a suite of approaches has developed, and I want to share some of them with you, because part of the point is to democratize these techniques. It's fine to be a forensic linguist, to go after people and to uphold standards, but I'm also interested in education generally, and I think the more these tools and techniques are disseminated, the better for education in general. So this is part of a general promotion of academic integrity, as well as of the culture of academia.

I'm starting with a very basic introduction to forensic linguistics, and hopefully you'll see how it connects with academic integrity. Forensic linguistics is just applied linguistics, but in broadly legal settings. The community works in three areas: the development of language and linguistic analysis as evidence, support for the delivery of justice, and language used in legal contexts, which is sometimes called legal linguistics. As soon as you think of academic misconduct regulations, that side of academic integrity, as a kind of legalese, you can see that there is potential for these techniques to be used to uphold standards by supplying evidence of misconduct, and "delivery of justice" might translate into the delivery of integrity. So there is a lot of potential for forensic linguistics in academic integrity, and I'd like to democratize that rather than have it be the preserve of proprietary tools. That's why I'm sharing this approach today: some basic forensic linguistic things you can do with free, easily accessible, open-access tools.

The areas of academic misconduct we're going to look at today are plagiarism and ghostwriting. I'm distinguishing them, although you don't have to. Plagiarism is an area that forensic linguists, and computational forensic linguists in particular, have tackled for some time, and in that community, I think more than in the academic integrity community itself, certain distinctions are made and tackled.
So, intrinsic and extrinsic plagiarism. The ubiquitous similarity-detection programs look at and identify extrinsic plagiarism, that is, similarity with other documents. Intrinsic plagiarism detection is the attempt to identify plagiarism by looking at a single text on its own: you look for variation within the text, a different style within it that would suggest another author. So it's a kind of authorship analysis, and it's a common area tackled by forensic linguists working on plagiarism detection. The other challenge has been to distinguish between intentional and unintentional plagiarism. That is a legal question: approaches to plagiarism depend on intention, and can linguistics point to that? Some work has been done on it, and I'm going to allude to some of it and show some techniques for starting to think about intention versus lack of intention. And then there is ghostwriting. In terms of detection this is again a kind of authorship issue: the writer is supposed to be a student, but it might be a human ghostwriter, or nowadays it might be a machine, generative AI. So these are the two cases we're going to play around with today, once I've shared some principles of linguistic analysis in this area.

Let's keep an eye on time here. What kind of language might be used as evidence of misconduct? With language as evidence, we're talking about the investigative, evidence-giving stage: something we can use to say that, on the balance of probabilities, the misconduct happened. Language that can identify who the author is, or help to verify authorship, can be used. Language that indicates capability in some way, whether language proficiency or knowledge of a subject, can also be used in a misconduct setting: you can say that on the balance of probabilities, based on your language level, this is unlikely to be your work. Slightly more grey, but possible: on the balance of probabilities this is above your level of subject knowledge, so it's unlikely that you wrote it. That's a little riskier, but it depends on the level we're looking at, from undergraduate through to PhD. And any language that can indicate mental state (mens rea is the legal term): intention can be inferred from deliberate attempts to revise, obscure and modify, so if we can find language that shows deliberate revision, modification or reorganization, we can use that. Some things may not work as evidence: evidence of emotion, for example, there's not much you can do with that, or evidence of a certain genre; that's probably not appropriate. So let's set the bar high and focus on the types of language we can use in a misconduct setting.

So how does it work? What kind of analysis can we do, and how does the analysis work? A big part of it is identifying linguistic features, the linguistic characteristics of a text.
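To connect this back to the intrinsic idea for a moment, a minimal sketch of a within-text check might look like the following: split a document into windows of sentences and flag windows whose style drifts away from the document's own average. The window size, the two metrics and the essay.txt path are illustrative assumptions, not a prescribed method.

```python
# Minimal sketch of "intrinsic" screening: flag windows whose style drifts
# from the document's own average. Metrics and thresholds are illustrative.
import re
from statistics import mean, pstdev

def sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(text):
    return re.findall(r"[A-Za-z']+", text.lower())

def window_profile(sents):
    toks = words(" ".join(sents))
    return {
        "mean_sentence_length": len(toks) / max(len(sents), 1),
        "type_token_ratio": len(set(toks)) / max(len(toks), 1),
    }

def flag_outlier_windows(text, window=5, z_threshold=2.0):
    sents = sentences(text)
    chunks = [sents[i:i + window] for i in range(0, len(sents), window)]
    profiles = [window_profile(c) for c in chunks]
    flagged = []
    for feature in ("mean_sentence_length", "type_token_ratio"):
        values = [p[feature] for p in profiles]
        mu, sigma = mean(values), pstdev(values)
        for i, value in enumerate(values):
            if sigma and abs(value - mu) / sigma >= z_threshold:
                flagged.append((i, feature, round(value, 2)))
    return flagged

# essay.txt is a placeholder path; flagged windows deserve a human look,
# they are not proof of anything on their own.
print(flag_outlier_windows(open("essay.txt", encoding="utf-8").read()))
```

Anything such a script flags is only a starting point for a human reading, which is exactly the spirit of the exercise that follows.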
So here is a text. Have a think, I'll give everybody a minute or two: what kind of characteristics or features of this text might you point to, to profile the author? Any thoughts in the chat? You want to create a profile, a fingerprint or "writeprint", of the author of this text; what kinds of things might you pick up on to do that? Anything that stands out, anything non-standard. What do we have here? Capitalized words, grammar mistakes, common phrases, says the chat. Okay, so you're going for grammar mistakes; that could be a feature. Use of common phrases: if you mean, for example, collocations, that would make sense. Are there any collocations here? Yes, there are, "principal causes" for example. That kind of thing is often done in computational linguistics, finding two-word, three-word, four-word common phrases. And yes, the informality. So these are features: you might take capitalization as a feature, mistakes, common phrases or collocations, and informality, and then you've got a set of features. Let's write these down, because we're going to come back to them; this is going to be our first case. So we had odd capitalization, mistakes, common collocations and informality.

In terms of how it works as linguistic analysis: if you've got an authorship case, you've got a bunch of texts, and you want to know whether the person who wrote this also wrote something else, you look for those features in the other texts. So you look for odd capitalization (however you define that), mistakes, collocations and informality, the features I picked out based on the purposes from the previous slide: identifying authorship, identifying intention, and identifying capability. I also picked up on the informality, especially in the context of academic writing: "special shapes" is a bit informal, and "like" is informal, as has been mentioned in the chat, so I would pick that out. In terms of showing language level or language knowledge, these are good academic collocations, two-word bigrams, what computational work calls two-word bundles, so I might pick those out and look for more of them elsewhere. And another vocabulary feature: these look like domain-specific terms, or domain-specific collocations, so I might pick out domain-specific vocabulary as well. Identifying linguistic features is an essential first step in any linguistic analysis in this area.

I've got a few features here to show you the different types, taken from a chapter I wrote that was published earlier this year. Lexical sophistication: what does that mean and how do we capture it? There are many word lists and corpora that assign a metric to words based on their frequency in the language, so the use of less familiar, less frequent words can be considered, and is considered, in analysis, often as a measure of sophistication. That's one feature, and there are some examples. I'll run through a few more of these features. Vague reference: this is where you have pronouns or determiners and it's not clear what they refer to. I'm showing this to make the point that features can be quite sophisticated; how we capture them is another question, and we'll talk about that.
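For the simpler "common phrases" feature from the exercise, capturing it can be as basic as counting word bundles. Here is a rough, standard-library-only sketch; the sample sentence is invented to echo the "principal causes" example and is not taken from the case texts.

```python
# Hedged sketch of lexical-bundle counting: tally the two- and three-word
# sequences a writer favours, so they can be looked for in other texts by
# the same purported author.
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[A-Za-z']+", text.lower())

def bundles(text, n):
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def top_bundles(text, n=2, k=10):
    return [(" ".join(gram), count)
            for gram, count in bundles(text, n).most_common(k)]

sample = ("The principal causes of failure are, like, unclear. "
          "The principal causes are discussed below for special shapes.")
print(top_bundles(sample, n=2))   # e.g. ('principal causes', 2) near the top
print(top_bundles(sample, n=3))
```

Recurring bundles like these become candidate features: find them in one text, then check whether the same habits show up in the other texts you are comparing.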
Here's another one: adverbials, I've called it, though some people call them transitions. This is a great feature. Those of you who've been looking at generative AI text will know it's a great feature for capturing that kind of authorship, at least with the GPT family, which uses a lot of these. Summarizing nouns are another, often a feature of academic writing: you have a whole stretch of text and then you say "this approach", "this piece", "this research" and so on. Again, this is a feature that can be captured. To run through a few more, there are some further stylometric features, by which I mean not actual words but ratios. Content words divided by function words, the ratio of content to function words, gives you an idea of how dense or sparse a text is; that's a feature that's often used. Formality, which we talked a bit about, is another commonly used feature, and it can be captured quite well by contractions and colloquialisms; there are lists of these. So these are the different kinds of features that I used in that chapter to identify ghostwriting, and they can be used in other scenarios. Features are things you can create, or engineer, for yourself, hence the term feature engineering, and identifying them is an important precursor to any forensic linguistic style analysis you want to do.

So you're going to identify some features. Once you've identified them you could just count them manually, but none of us has time for that anymore; people used to do that. This is where the tools become useful. What tools are around? There are quite a few proprietary tools, which are either or both expensive and black box, i.e. not transparent in how they work or difficult to understand. And, to be honest, the proprietary tools tend to use quite a limited range of linguistic features. Turnitin obviously uses string matching across varying distances. They have an authorship tool, and I've tried to find out which features it uses; I can't find them without having the tool itself, so that's already one layer of non-transparency, and even with the tool they're quite opaque for people who are not applied linguists, or even forensic linguists. There is still a limited range of features. There are also quite a few AI detectors around now, and they use a limited range of features based on the fact that large language models work, in a description I heard and liked, like "autopredict on steroids". They work in terms of prediction, so there are statistical measures of how predictable a text is and how much statistical variation there is. These are useful, but they're quite limited, and, this is where my other hat comes in, they're not the kind of thing an educator can use both to uphold the quality of academic output and to educate, to give formative feedback. You can't really give formative feedback on complexity scores and other black-box figures. So it would be good if there were some open-access, clear, transparent, easy-to-use-and-understand tools that you could also explain to the students and learners you work with: this is why academic integrity is not being upheld in this situation, and this is how you can improve. If you could use the same tools to identify as to educate, wouldn't that be great?
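In that spirit of transparent, explainable measures, here is a sketch of how two of the features just mentioned, lexical density and contraction-based informality, might be computed. The tiny function-word list is only a stand-in; a real analysis would use an established list.

```python
# Hedged sketch of two stylometric features: lexical density (content words
# relative to all words) and informality via contractions. The function-word
# list below is a small stand-in, not a standard resource.
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "for",
    "with", "at", "by", "from", "is", "are", "was", "were", "be", "been",
    "it", "this", "that", "these", "those", "i", "you", "he", "she", "we",
    "they", "as", "not", "no", "so", "if", "then", "than", "there",
}

def _tokens(text):
    return re.findall(r"[A-Za-z']+", text.lower())

def lexical_density(text):
    toks = _tokens(text)
    content = [t for t in toks if t not in FUNCTION_WORDS]
    return len(content) / max(len(toks), 1)

def contraction_rate(text):
    toks = _tokens(text)
    contractions = [t for t in toks if re.search(r"'(s|re|ve|ll|d|t|m)$", t)]
    return len(contractions) / max(len(toks), 1)

essay = "It's dense, isn't it? The analysis doesn't really capture the nuance."
print(round(lexical_density(essay), 2), round(contraction_rate(essay), 2))
```

Because both numbers are easy to explain ("your text relies heavily on contractions for an academic register"), they can serve for formative feedback as well as for comparison between texts.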
So this is the ideal, and I'm going to highlight three tools today that fit this bill: open access, and usable both to uphold standards of academic integrity and to educate. Then we'll have a look at a couple of cases.

The first one is really very simple and accessible: Microsoft Editor. It comes with Word 365, but you can also get it by itself as a Chrome extension, and it's a little bit like Grammarly. It does a few useful things, including a similarity checker, which I haven't tried; but what it does have, and what is very useful, is a built-in readability metric, Flesch-Kincaid, so you can use that. I should say here that it's better to use, and I'll try to use where possible, desktop tools, tools that you download, rather than putting information into online services, because you don't know where it goes. Unless the tool is online but you can trace its provenance and you know who made it; then that can be different, and I've got one of those. The good thing about Microsoft Editor is that it's in Word, so you're not inputting data into the ether, and it has these metrics. Readability metrics can be very useful because they capture, in a single figure, information related to word length and sentence length, which are basic measures of complexity. There's another tool here that computes nine readability metrics. There are differences between them, for instance in what counts as a long word: Flesch-Kincaid counts syllables, others count characters. For general purposes we shouldn't worry about that. There's a question there, but I'll come to it later on.

The second tool I want to mention is AntWordProfiler. This is from a suite of desktop tools made by the linguist Laurence Anthony, who is also an educator; they all begin with "Ant", after his name. This one conducts an analysis of the vocabulary used in a text. You can download it as a desktop app, and basically it counts the words in a text that you input and matches them to word lists, counting and highlighting them. Why is that useful? It comes with preset word lists. It comes with the Academic Word List; I'm sure many of you are familiar with this list of 570 commonly used academic words, often used in language learning and English for academic purposes to assess somebody's vocabulary level. The same with the General Service List: the 1,000 most common words, and then the next band, words 1,000 to 2,000. The proportion of a text's words covered by these lists indicates how familiar or unfamiliar the vocabulary is, so you can get an idea of the sophistication of the vocabulary from this. What's great about this tool is that you can also import your own word lists. There are many word lists around, many subject-specific ones, and you can create your own. I think this is a very useful direction for educators: to create word lists, maybe based on target knowledge or learning outcomes.
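For a sense of what that word-list matching involves, here is a minimal sketch in the spirit of AntWordProfiler. The list and text file names are placeholders for whatever lists (GSL, AWL, or your own discipline-specific list) and student texts you actually have.

```python
# Minimal word-list coverage sketch: for a text, report what share of its
# words falls in each supplied list and what remains off-list. File names
# are placeholders, not bundled resources.
import re

def load_list(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def coverage(text, word_lists):
    toks = re.findall(r"[A-Za-z']+", text.lower())
    counts = {name: 0 for name in word_lists}
    off_list = 0
    for tok in toks:
        for name, wordset in word_lists.items():   # checked in dict order
            if tok in wordset:
                counts[name] += 1
                break
        else:
            off_list += 1
    total = max(len(toks), 1)
    report = {name: round(100 * c / total, 1) for name, c in counts.items()}
    report["off-list"] = round(100 * off_list / total, 1)
    return report

lists = {"GSL 1st 1000": load_list("gsl_1000.txt"),   # placeholder paths
         "AWL": load_list("awl.txt")}
print(coverage(open("proposal.txt", encoding="utf-8").read(), lists))
print(coverage(open("report_part1.txt", encoding="utf-8").read(), lists))
```

The "off-list" share is a rough proxy for specialised, domain-specific vocabulary, which is exactly the kind of figure used in the case below.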
And then, in cases of authorship or ghostwriting, you can track the use of this vocabulary by your student and get an idea of whether a text has been written by somebody else from the vocabulary range of the disputed text. So vocabulary analysis is good for understanding language level, which is evidence that can be used, and domain or subject knowledge, which is also evidence that can be used, and AntWordProfiler helps with that.

The third tool was, I understand, supervised by a member of the audience here. Yes, thank you, Deborah, who has also shared Laurence Anthony's software suite in the chat; lots of tools there. The Similarity Texter is great for detecting patchwriting, because it very quickly conducts a side-by-side text comparison. Unacknowledged patchwriting is often a good example of intentional plagiarism: somebody deliberately putting in synonyms, which you can see. The Similarity Texter detects matching text, but what it ends up showing is the use of synonyms, because the two go together in patchwriting: you get some matched text and some synonyms. So even though the Similarity Texter doesn't detect the synonyms itself, by very neatly showing the matches it makes it easy to see a synonym, and I will show you that. Those are the three tools; there are others, but we'll start with these.

So let's try case number one: is this a ghostwriter? I've got a Padlet with these texts so you can look at them somewhere else; let me just find that link and put it in the chat, although it is quite clear on the screen. There you go. I hope it's accessible; can you access that, or is it blocked? Yes? Okay, good. All the texts I'm talking about are there. So this is a potential ghostwriting case, and a typical assessment situation: text A was a brief proposal for research, and text B is the actual research. We've got text A, a short proposal, "this is what I'm going to do", and then, a bit later in the academic year, we have the research itself. Looking at the text on the left, text A, we've already identified some features. What was it we identified? I wrote it down: odd capitalization, grammar mistakes, common collocations and informality. Can we see those in the abstract of similar length on the right, from the report, purportedly by the same author? Is there any informality? Are there any mistakes? Is there any odd capitalization? How are the collocations? What do you think? I will show you how the tools can help with this, but I'm just asking: do these look like they were written by the same author? No, different authors, says Christy. Why do you think so, off the top of your head? Different headings, informality in text A versus text B, informal style. Okay. So, informality. Can you go to your academic misconduct office with that, or when they come to you and say, can you give us some evidence?
Why do you think that A and B were not written by the same person? "Oh, text A looks more informal." That's not really evidence, and this is where trying to apply forensic linguistic standards comes in. What can we do to provide a stronger argument, to give evidence for our hunches, or against them? Yes, looking at the readability. And yes, you can note that they try to cite in text B but not in A, and look at citation style; that's another feature, but it needs to be systematic to count as evidence. So we're trying to make this systematic.

One thing you could do is use readability. Here is the output from the Microsoft Editor readability check. I've highlighted the paragraph count there because it's odd: the text wasn't read with the formatting I expected, so you do have to be careful with how your text is formatted, as things can be read in a funny way. That doesn't affect the readability figures, though, which are computed at the word and sentence level. Because the report was longer than the proposal, I divided it into three, so we have B1, B2 and B3, and we can see that the reading level of the second text is much higher: in terms of grade level and reading ease, however you calculate it, it is much higher, and the number of passive sentences is much higher. So this is something a little more concrete that you can use. And because readability can vary within a longer text, I didn't rely on a single figure: I compared three sections of the report against the proposal. The reading ease for the original proposal is 50, which is pretty easy to read, while for the report it is 31, 21 and 30, an average of about 27, so much more difficult to read. The same with the grade level: 13, 16 and 14.5, compared with about 7 for the proposal. So you could go in and say, look, the readability is different.

Another thing you could do with these texts is look at the vocabulary range. Here we're comparing the proposal and the first part of the actual report. How many level-one words are there, that is, words from the first 1,000 of the General Service List, the most familiar words? There are more familiar words in text A than in text B. Is that significant? Borderline, but there is still a steer there. And if we look at the words not in any list, at the bottom, which is where domain-specific vocabulary ends up because these general lists don't contain it, there is roughly twice as much domain-specific vocabulary in the report as in the original proposal. So you can say that the vocabulary level is much higher in the report than in the proposal, and that the readability, the reading ease and the complexity, call it complexity, is much higher in the report than in the proposal, and you've got metrics for that. Let's just have a look in the chat. Yes, thank you, Ian: this is AntWordProfiler output. And yes, you have to think about this in the educational context: A is a first attempt and B is the final piece, but I don't know. Here are the vocabulary range stats: most familiar words, GSL first 1,000, that difference is significant; "not in lists", as a proxy for domain knowledge, goes from 10.8 to 22.3, and that is significant.
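If you wanted to reproduce that kind of table yourself, a hedged sketch using the standard Flesch formulas might look like this. The syllable counting is approximate, and proposal.txt / report.txt are placeholder file names standing in for texts A and B, so the numbers will not match Microsoft Editor exactly.

```python
# Rough reconstruction of the comparison above: score the proposal once and
# the report in three sections with the standard Flesch formulas.
import re
from statistics import mean

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())   # crude approximation
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch(text):
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    toks = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in toks)
    wps = len(toks) / max(len(sents), 1)        # words per sentence
    spw = syllables / max(len(toks), 1)         # syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    grade = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return round(ease, 1), round(grade, 1)

def sections(text, n=3):
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    size = max(len(sents) // n, 1)
    parts = [" ".join(sents[i:i + size]) for i in range(0, len(sents), size)]
    return parts[:n]

proposal = open("proposal.txt", encoding="utf-8").read()   # text A (placeholder)
report = open("report.txt", encoding="utf-8").read()       # text B (placeholder)
print("A:", flesch(proposal))
scores = [flesch(part) for part in sections(report)]
print("B1-B3:", scores, "mean ease:", round(mean(e for e, _ in scores), 1))
```

Sectioning the longer text before scoring it is the point here: it shows whether the difference holds across the whole report rather than in one unusual passage.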
So what would you say, Ian, or anybody, if presented with this information? Is it just a difference between a first attempt and a final version, or is there something else going on here? Any thoughts in the chat? No? Okay, maybe they will come later. Either way, the question is: is it development, or is it another author? That can be discussed, but at least there is now a basis for the discussion, and these are easily accessible tools.

So that's one case. I'm going to show you another, and then we'll have a little time at the end for questions and discussion. What about this: is this intentional plagiarism, and can we show it clearly? Have a read of this, either on the screen or on the Padlet. Does anything strike you about this text? I'll give you a couple of minutes to read the first few sentences on the left and then the first few sentences on the right. "Passion for teaching." What about that, Deborah, anything strike you? Okay, so we've found a phrase that looks copied. How much copying is going on here? The second text is a close paraphrase of the first, right? Now, how can we show this? An interesting area of forensic linguistics, and one that needs development if anybody is interested, is the visualization of evidence; work needs to be done on that. What I like about the Similarity Texter is that it can visualize what we can see here. The second text is a close paraphrase of the first, but how do I make that into evidence? Manually it would take a long time and a lot of effort, but this tool makes it very clear and easy to see: you can see all the exact matches. And it's known, you'll probably know this anyway, from the early days of forensic linguistics, that once you get to strings of around six words they will be unique: you can Google a six-word string and it is very likely to be unique. What's great is that when you highlight the exact matches, you can then see the rewording much more clearly. I reckon this is an unintended benefit of similarity checking, but it's very important. So here, "I combine commitment to research" becomes "I couple my research ambition"; "my comprehensive background" on the left becomes "my vast experience" on the right. Second paragraph: "my teaching style is highly interactive", "I strive to maintain a high interest". You can see it, and you can see it in the rest as well. So this is really quite powerful, and I think that's an open and shut case. So those are two examples.
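A minimal sketch of the exact-match step that makes this visible: list the six-word strings two texts share, relying on the observation that strings of roughly that length are rarely shared by coincidence. The two example sentences are invented for illustration, not taken from the case.

```python
# Hedged sketch of shared-string detection for patchwriting: collect the
# six-word sequences of each text and intersect them.
import re

def ngrams(text, n=6):
    toks = re.findall(r"[A-Za-z']+", text.lower())
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def shared_strings(text_a, text_b, n=6):
    return sorted(ngrams(text_a, n) & ngrams(text_b, n))

source = ("I combine commitment to research with a passion for teaching "
          "in every module.")
suspect = ("I couple my research ambition with a passion for teaching "
           "in every module I run.")
for match in shared_strings(source, suspect):
    print(match)   # e.g. "with a passion for teaching in"
```

Once the exact matches are marked, the substituted synonyms stand out in whatever remains, which is the effect the side-by-side visualization gives you.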
I just want to now share a link to all the tools and resources shown here. Some have been shared in the chat already, but I'll put that up and then stop sharing my screen so we can have a chat for the last few minutes. On my blog I have put links to all the tools and the analysis; you've got the Padlet link and the paper there too, so that's a little plug for my blog. I'll stop sharing. Any thoughts, questions, comments?

I can't hear you. Sorry, there is a question in the Q&A. Oh yes, perhaps I can read the question and then answer it. "Recently I have encountered more cases of academic fraud in theses where the detection is characterized by an almost 100% match in Turnitin in the list of literature used, the use of a large number of older peer-reviewed articles behind paid access, and a large number of references for individual short statements. If I check the content of the peer-reviewed articles, the similarity to the analysed text is often only in the title of the article or in the literature review within the article, but the research itself does not contain the statements the author pretends to have paraphrased from it. I suspect the use of an AI tool, or that the author searched for the references and added the sources to the text after the fact, randomly, based on keywords in the title. How could such a text have been created? Is there a tool that could help with the detection? Otherwise it comes down to tedious manual work by the lecturers."

I've seen a couple of these and I did a bit of digging around. There is a kind of dark web of academic research: material that you can't see and that Turnitin can't see, and there is text there. Any text that comes from that hidden place will look plagiarism-free, which is also what generative AI text looks like. It's one of those strange things the modern digital world throws up: it could be generative AI, and it could be text from those hidden sources. That's the quick answer I have, but I have noticed the same thing.

While waiting for other people to ask, I have a question of my own. I know people have really been hoping for a tool that could compare the style of a student's texts over time. What do you think about that, these stylometry-type tools? Absolutely. Not automatically, it's a difficult task to automate, but it can be done. Authorship identification is at a level where it's accepted in court; it can be done to around 95% accuracy, but at the moment it's quite difficult, you need training, and it's not automated. So it's a question of universities investing in it, if they want to. Systematically, a student comes in and writes something in a classroom about something they like, as a starting point, and you collect authentic texts produced in settings where you know they are authentic, to build up the database. You have to build that into your assessment practice, and you then have to invest in the creation of the profile, but statistically and methodologically it absolutely can be done, and it is being done.

Yes, and there was the question about reference tools that can add references to a text; someone has posted an AI tool that might be used for that kind of thing. I haven't seen it before: something that can put references into your text. Oh really? I think we will have more and more problems with AI tools; I've seen tools that can create literature reviews and so on, so more and more tools are being developed that are aimed at students and researchers. And you feel it's risky? Yes, because one mistake by the tool, just as with generative AI, one mistake by the tool and whoever investigates it will find it, and there is the giveaway. I think it's risky, of course. GPT-4 is much better at finding references, much more accurate than earlier models, but it still makes mistakes, and all you need is one mistake. In a way it will be good, for those of us trying to catch misconduct, for these tools to come along, because they will overclaim, people will use them, and we will catch them.
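Coming back to the idea of tracking a student's style across authenticated texts, a very rough sketch of the comparison step could look like the following. The marker-word list, the cosine similarity measure and the file names are all illustrative assumptions; real forensic authorship work uses far more careful methods and expert judgement.

```python
# Illustrative sketch of a style comparison over time: build a frequency
# profile of common words from texts known to be the student's, then see
# how close a disputed text sits to that baseline.
import re
from collections import Counter
from math import sqrt

MARKERS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for",
           "with", "as", "on", "this", "but", "not", "by", "or", "we", "i"]

def profile(text):
    toks = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(toks)
    total = max(len(toks), 1)
    return [counts[w] / total for w in MARKERS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Placeholder file names: known_1/known_2 are authenticated classroom texts,
# disputed.txt is the assignment under question.
known = [profile(open(p, encoding="utf-8").read())
         for p in ("known_1.txt", "known_2.txt")]
baseline = [sum(vals) / len(vals) for vals in zip(*known)]
disputed = profile(open("disputed.txt", encoding="utf-8").read())
print("similarity to baseline:", round(cosine(baseline, disputed), 3))
```

A low similarity on its own proves nothing; the value of the approach is in having authenticated writing collected as part of assessment practice, so that a disputed text can at least be discussed against something.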
We have one more comment, and I couldn't agree more: it's a pity that so much of this issue arises because we have large classes and no longer know our students, no longer able to follow their progress individually as they build their skills and develop their own writing voice. Absolutely, which is why we have to change our assessment practice; but it's a holistic approach. If we want to change our assessment practice, we need to safeguard assessment in a way that supports positive assessment practice rather than policing it, invigilating and detecting and all of those things, in order to restore the culture and the right learning environment. Yes, I think this is going to be a wake-up call for the whole of higher education, basically. In many ways AI is good: I can see in Sweden that we now have much more interesting conferences in academic integrity and so on, which we didn't really have in the same way before ChatGPT. So perhaps this is a good wake-up call, and perhaps we will have to rethink how large our classes are supposed to be and how we create assessments. Olu, thank you so much for today. Thank you. And I would like to invite everyone to our next ENAI webinar: it will be on September 8th, and we will have Elisabeth Bik with us. Thank you so much, it was really interesting. Being a linguist myself, I think your approach is really exciting, seeing how you can connect these two really interesting areas, linguistics and academic integrity. So thank you so much, and thank you for being here with us. Thank you. Bye bye. Thank you for coming. The presentation and the video of this webinar will be posted on our website, and you can also find all our previous recordings on the ENAI website. Thank you so much.