Hi, everyone. My name is Michael Brennan. I'm here from Drexel University in Philadelphia, Pennsylvania. The title of this talk is Deceiving Authorship Detection Tools to Write Anonymously and Current Trends in Adversarial Stylometry. Before I start, I just want to acknowledge the folks who work in our lab. We have a large lab back at Drexel and do a lot of work in this area. A number of them are here with me, including our advisor, Dr. Rachel Greenstadt; Sadia Afroz, who will be talking a little later; and Aylin Caliskan, who's doing work in this area as well. Unfortunately, the two lead developers for the two tools we'll be introducing a bit later couldn't be here, but they are available through email, and I'll put their addresses up later when I'm demoing both of those projects. Before I get into the talk, can I ask how many of you were at my talk here two years ago? Okay, so some of you, but not most of you, which is good. So I'm going to have to go over some background on what stylometry and authorship recognition are, and it might not be as exciting in the beginning for those of you who have heard some of this before, but it's required to make the case for the rest of the talk. I do want to highlight that there is a bunch of new work even in this review, so even if you were there last time, I hope you get something new out of it, including a new dataset, a new method that we're using, and more robust results in general. For everyone else, I'll give a quick overview of what we're going to be talking about. First, I'm going to introduce this concept of authorship recognition and adversarial stylometry, and then we're going to talk about the threat to anonymity that authorship recognition can present. We'll go over the experiments and the research we've done analyzing deception concerning adversarial stylometry, and stylometry in general.
I'm going to introduce two tools that we've created that both help you do stylometry research and help you anonymize your writing style to deceive stylometry methods, and then Sadia Afroz will come up and talk a little about detecting deception in stylometry. So the first basic question is: what is authorship recognition? In authorship recognition, you want to know who wrote some document of unknown authorship. Now, there's a subset of authorship recognition called stylometry. Stylometry deals with determining the authorship of a document purely through linguistic means. So we're not talking about handwriting, we're not talking about where the document was found or the historical context; we're just talking about linguistic features like the syntactic structure of the document, the words that were used, the sentence and paragraph lengths, the grammar: things that would generally be considered less context-dependent. Now, the reason this works is that individuals have unique writing styles, and they have unique writing styles because we all learn language on an individual basis. So we develop nuances in our own style that are unique to us. In this presentation, I sometimes use stylometry and authorship recognition interchangeably, because a lot of this applies to authorship recognition in general. But keep in mind that stylometry is the specific subset of authorship recognition that deals with the linguistic aspect of the problem. So what is adversarial stylometry? Well, this is where we look at applying deception to writing style in order to circumvent methods of authorship detection. And we have to ask these questions: Is it possible to modify your writing style? Is it possible to deceive stylometry by doing so? We'll see that the answer to both questions is yes. And what are the implications of looking at stylometry in an adversarial context?
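To make the kind of measurements stylometry relies on concrete, here is a minimal sketch (my own illustration, not any particular tool's code) of pulling a few of these context-independent features out of a passage:

```python
import re

def stylometric_features(text):
    """Extract a few simple, context-independent stylometric features."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    unique = set(words)
    return {
        "word_count": len(words),
        "unique_words": len(unique),
        # "lexical density" here approximated as the type/token ratio
        "lexical_density": len(unique) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

feats = stylometric_features(
    "The quick brown fox jumps over the lazy dog. The dog sleeps."
)
```

Real systems track dozens to hundreds of features like these, but even this handful already captures measurable habits of a writer.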
So how can stylometry be a threat? Well, there are two basic problems in stylometry: the supervised one and the unsupervised one. I'll explain each briefly and give you a short hypothetical scenario for both. Supervised stylometry is when you have a set of documents of known authorship and an unknown document that you believe was written by one of the authors in that set; it's a supervised classification problem. A hypothetical scenario here might be Alice, the anonymous blogger, versus Bob, the abusive employer. Alice is one of Bob's employees. Bob is an abusive employer, and she wants to publicize facts about the company by posting a long blog post. Bob could potentially use stylometry to identify Alice because he has a set of known suspects in that context: he has his employees, he has writing samples from his employees, and he could probably identify Alice in that scenario to a high degree of precision. The second problem is the unsupervised one, where you're given a set of documents of unknown authorship and you want to cluster them into groups. You don't know how many authors there are or how much writing there is per author; you just want to figure out the layout of the data that exists. A hypothetical scenario here might be an anonymous forum versus an oppressive government, where you don't know how many participants are on the forum or how many messages are attributed to each participant. But if an oppressive government could use a solid form of unsupervised stylometry, they could segment the forum into author groups and then apply those author profiles in a supervised stylometry setting, perhaps compared against members of parliament or something like that, to see if any members of the government are participating. Now, that's a bit of a scarier hypothetical.
With unsupervised stylometry, though, there's a lot more research that needs to be done to make it more effective. Supervised stylometry, however, is very effective, and that's the problem we're concerned with in this talk. So are these scenarios purely hypothetical? Well, interestingly enough, a couple of members of WikiLeaks were at my talk here two years ago, and one of them wrote a book about his experience with WikiLeaks in which he mentions being at my talk. At the time, the organization was essentially just the two of them, and they had all these fake personas for their PR person and their lawyers and their volunteers. And they had a laugh because they thought, wow, if someone actually applied this research to our work, you might be able to determine that it's actually just two of us and not hundreds of us. I thought that was a pretty interesting scenario because it shows a real-world application for this work. It's also been stated publicly by organizations like the FBI that people's writing styles are being looked at actively as a means of identification. So this is something to consider; I don't think it's in widespread use right now, but it has the potential to be, so we should treat it as a concern from the perspective of anonymity and privacy. So let's review the research problem and how we analyzed it, and then I'll get into the tools that we developed. First, we want to understand the threat model in adversarial stylometry. Then we have to build a dataset of our own, because an adversarial stylometry dataset, which I'll explain in a minute, doesn't really exist. Then we want to evaluate current methods of authorship recognition and stylometry against adversarial text, where people are trying to hide their identity, and analyze those results in order to develop tools.
So what's the threat model? Well, the threat, like I said, is kind of obvious: authorship recognition can identify you if there are sufficient writing samples and a set of suspects. If you have about 6,500 words or more per suspect author, 500 words of the unknown document, and 50 or fewer suspects, you can identify the author of the unknown document with a very high degree of accuracy. And this is not a strict threat model. There's plenty of research that looks at shorter messages, and plenty that looks at sets on the order of 100 up to thousands of different authors and still shows very high accuracy in identifying authors. But for the purposes of this research, we're taking a middle ground between the old stylometry problems, which looked at just two or three different authors, and these newer ideas of looking at hundreds or thousands; we'll expand into those later. This threat model is based on an old assumption, though: that writing style is invariant. That it's like your fingerprint and you can't modify it, you can't help it. Well, we think that maybe that's not true. So we came up with a few different circumvention methods. One is the obfuscation method, where an author attempts to write a document in a way that simply hides their own writing style. The second is an imitation method, where an author attempts to write a document specifically imitating another author; in our examples, we chose the author Cormac McCarthy for people to try to imitate. And the third circumvention method is a translation method, where the idea is you take your writing sample, translate it to another language using machine translation, and then translate it back to your own language. I'm not going to be talking about this, and I have an asterisk there, because we actually found this method not to be very useful in anonymizing your writing.
But we are doing some further research into different languages and possible applications of machine translation, and Aylin, one of our colleagues, is here, so feel free to ask her any questions at the end of this talk or afterwards if you have questions about that specifically. So we had to build a corpus, and when I say corpus, by the way, I mean a dataset of documents; that's essentially what a corpus is, a dataset of text. The reason we had to build this is that, yes, there are plenty of writing samples all over the web, but we need writing samples that establish people's writing styles, and then we also need attempts by those same people to hide their writing styles, and that's harder to come by. So we asked people to submit 6,500 words of existing, unmodified writing. We asked them to write a 500-word obfuscation passage where they try to hide their writing style by any means necessary. We asked them to write a 500-word imitation passage where they try to imitate the work of Cormac McCarthy, who, by the way, has written books like The Road and No Country for Old Men. We picked a specific author just so people could sink their teeth into the task and get a feel for it. And the authors here had no formal training or knowledge in linguistics or stylometry. Now, we had our original corpus, which we presented two years ago, that looked at 12 different authors. Those participants we recruited through the university and through friends, so they had a motive to participate properly and follow the directions, because they cared about us, we hope, and we had one-on-one interaction with them. That corpus, which we used for our previous research, is publicly available on our website, and the URL will be up multiple times throughout the talk. You can download it now if you want to do some of your own research on it.
But it was good only for preliminary results; we needed something better, because it was too small and too homogeneous. So we went to Amazon Mechanical Turk to build a bigger dataset. For those of you who don't know, Amazon Mechanical Turk, as my advisor put it earlier, is like artificial artificial intelligence: you source out a problem to a whole lot of humans for some amount of money. Say you have a million photos of either dogs or cats and you want to classify and separate them. You could use computer vision algorithms to do that, or you could farm the problem out to humans for maybe a quarter of a cent per photograph and get through them much faster, and maybe even cheaper, depending on how you're doing it. So we did that here, with the same tasks as the previous corpus. But as you might imagine, when you attach a financial incentive, you also get people trying to game the system: they just want to get paid, and they're not honest participants in your study. So we had to come up with rigid guidelines for which submissions we would accept, and out of the 101 submissions we got, we ended up accepting 45. We established these guidelines before we put the task out there because we wanted to make sure we didn't spoil our dataset with our own bias: we determined what we thought we needed to do this research and applied that strict set of guidelines to all the work we received. Some of the guidelines are up there: we asked that the writing be formal in nature, that people not submit a lot of dialogue and quotations, and that they refrain from submitting small samples, in addition to the other requirements of having 6,500 words or more drawn from multiple source documents. This is released today also, and it's what this work is based on. It's available on our website.
This corpus is large, diverse, and unique; there are no other datasets like it in stylometry. So we hope that if some of you are interested in working with this stuff and doing some research, you'll take it and run with it and share your results with us, because we would love to hear more about it as well. And, just as a quick slide, we evaluated this new dataset against the old dataset and found really similar results: some methods did a little worse, some a little better, but we feel this is a solid representation of writing samples, and the conclusions we draw follow the preliminary conclusions from the original dataset. So I just want to go over a couple of methods and the results, and then again I'll get to those tools. We looked at three different methods of classification. The first method was a basic set of nine features: just nine data points that we extract from the text, including how many unique words there are, the lexical density of the text, and a couple of readability indices. We passed these nine data points into a neural network classifier and classified the results. The accuracy here, when you get to high numbers of authors, is not that strong, but it's still significantly above random chance. The point of this method is to demonstrate that even a simple, very basic measurement of text can give you some pretty strong clues as to the identity of an author. The second approach is a synonym-based approach, where the only thing that is looked at is the word choices in the text. Say you choose to use the word verdant instead of the word green. That might be highly indicative of authorship, depending on how often the word verdant is used in the dataset you're comparing against. This is interesting because it's really just a single-feature analysis of text, but it does a really good job of identifying authors.
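As a toy illustration of that synonym intuition (with hypothetical counts and names, not the actual method): if you know how often each suspect habitually picks each member of a synonym set, a rare choice like "verdant" becomes strong evidence.

```python
from collections import Counter

# Hypothetical counts of which synonym each suspect tends to pick
author_choices = {
    "alice": Counter({"green": 40, "verdant": 10}),
    "bob": Counter({"green": 49, "verdant": 1}),
}

def choice_score(choices, author, word, synonyms):
    """Relative frequency with which `author` picks `word` among its synonyms."""
    counts = choices[author]
    total = sum(counts[w] for w in synonyms)
    return counts[word] / total if total else 0.0

# "verdant" points much more strongly to alice than to bob
alice_score = choice_score(author_choices, "alice", "verdant", ["green", "verdant"])
bob_score = choice_score(author_choices, "bob", "verdant", ["green", "verdant"])
```

Scoring every synonym choice in a document this way and combining the scores per suspect is the rough shape of the approach, although the real method is more involved.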
The final method we looked at is Writeprints, which we consider to be the gold standard of stylometry at this point. Now, the full Writeprints method is very detailed: it has a complex algorithm based on principal component analysis, uses sliding windows over the data, and throws in a whole bunch of other stuff, and the authors also developed some really big feature sets. What we did was take their baseline feature set, which is, I think, about 800 different data points from each piece of text, and apply that feature set to a support vector machine, and we get accuracy that is at least fairly comparable to the published Writeprints results. We have some other research looking at Writeprints as a whole, but it can be cumbersome to run, so for these experiments we thought this was a good approximation of the Writeprints approach. And Writeprints does very well: in their paper they claim up to 95% accuracy across 100 different authors on some datasets, which is pretty astounding. So we look at these three methods through four experiments. First, the baseline: we evaluate these methods on unmodified text to see how good they are. Second, we look at the obfuscation passages and see how they do when someone tries to hide their writing style. Third, we look at the imitation passages and ask how they do when someone tries to imitate someone else. And fourth, we look at the imitation success rate: what percentage of the people trying to imitate Cormac McCarthy actually tricked these methods into thinking they were Cormac McCarthy. So this is the baseline dataset. I'm just going to explain this graph a little; I know there's a lot here. On the x-axis at the bottom is the number of authors.
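The Writeprints baseline pairs a large feature set with an SVM. As a much smaller stand-in that conveys the same pipeline shape (turn text into numbers, compare against labeled suspects), here is a sketch of my own, not the Writeprints code, that attributes a text by character-trigram frequencies and cosine similarity:

```python
from collections import Counter
import math

def trigram_profile(text):
    """Character-trigram counts, a common low-level stylometric feature set."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def attribute(unknown_text, known_texts):
    """Return the known author whose writing is most similar to the unknown text."""
    unknown = trigram_profile(unknown_text)
    return max(known_texts,
               key=lambda author: cosine(unknown, trigram_profile(known_texts[author])))

# Hypothetical two-suspect problem
known_texts = {
    "alice": "I simply adore long, winding sentences; truly, I do.",
    "bob": "Short words. Plain text. No frills at all.",
}
guess = attribute("Short stuff. Plain talk. No frills.", known_texts)
```

A real SVM over ~800 Writeprints-style features behaves very differently in detail, but the structure of the supervised problem is the same.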
So we have this dataset of 45 authors, but we look at randomly selecting five authors, 10, 15, all the way up to 40, because you see the accuracy degrade as the author set grows. The important thing, though, is not the degradation itself but how much it degrades, or in some cases how little. The original nine-feature set starts off at about 65% accuracy with five authors, which is already pretty good compared to the random-chance baseline of 20%, and though it decreases to about 25% accuracy, that's still well above random chance. But what's more interesting is that the synonym-based method and the Writeprints method sustain very high accuracies in determining identity even when faced with 40 unique authors. Writeprints does the best in this case, only dropping from 95% to about 82% at 40 unique authors. So the point to take away from this slide is that these methods are accurate: they do a really good job of figuring out who you are. Now, remember that assumption that your writing style is invariant and you can't get around this because it's ingrained in your subconscious? Well, let's look at what happens when we apply the obfuscation passages. The methods do really poorly. These methods that did really well fail when people who are not experts at all simply try to hide their writing style; non-experts can defeat them with a high degree of success. Now, there are differences between the methods, but they mostly drop to about the level of random chance. Even more interesting, with the imitation passages, accuracy drops to near zero. I'm sorry? Why are they much worse than random chance?
Our thought about why they're worse than random chance, and therefore worse than on the obfuscation passages, is that when you give someone a specific author to mimic, they do an even better job, and it confuses the systems into shifting them towards other authors in the dataset. And when you look at the success rate of people actually imitating Cormac McCarthy, you see that a lot of the time they're pretty successful at it. Even with 40 different authors, when you look at the synonym-based approach, almost half of them successfully trick the method into thinking they are Cormac McCarthy. But there's another important and new result here, which is that the Writeprints method, while susceptible to these passages, is only about half as susceptible as the other ones. Our hypothesis for a while has been that some methods will be more resistant to these obfuscation and imitation passages than others, and this is the first really obvious example of that we've seen. And that's important, because stylometry is a developing field, and we don't want to give anyone the impression that if you can circumvent these methods you are necessarily going to be anonymous; you can't really measure it that way. It's possible that other methods down the line will study your writing style in different ways and still be able to identify you. So that's a recap of the work, some of which I talked about two years ago, but which I hope you see is a lot more robust now, for those of you who were here last time. And now I want to talk about two tools that we have developed. One is called JStylo, which is an authorship recognition analysis tool to help researchers like us do this work.
And the other is Anonymouth, which is an authorship recognition evasion tool. These are both free and open source, under the GNU GPL. The alpha releases are available today at this website; feel free to download them now, or if you want to download them later, come talk to us. Right now they're just tarballs with the source code and the jar files in them, but we'll be migrating over to GitHub soon so more folks can actively participate in development as it goes on, rather than just waiting for us to release tarballs every now and then. And I want to stress again that these are alpha releases, our first releases, so there are issues with them, some bugs and things like that. We welcome all input, whether it's a bug report, a feature suggestion, or "hey, this is how I'd like to use this software and I can't do it that way right now"; please submit that to us. So, developing JStylo: I want to lay out the problem we're trying to address. The main problem is that stylometry-based research is difficult. We think this research is important, but existing tools are limited. You have things like Weka, which is a suite of machine learning classification tools, but it's not tailored for text analysis: there are no built-in feature extractors, so you have to write your own code to extract features and then feed those feature vectors into Weka. It can be a cumbersome process; it takes a computer scientist or a programmer to learn about this stuff and then implement it. And then we have the Java Graphical Authorship Attribution Program (JGAAP), which has a strong basic tool set for stylometry but is limited in some of its features. It does have a strong API, though, and is meant to be extended and built upon, and that's what we did here.
So the nuances of stylometry are not easy to grasp, and there are many open research questions related to authorship. The most common thing that happens, and it will probably happen here today too, is that after I'm done talking I get all these questions like: have you looked at authorship recognition in this domain? What about that domain? What about these documents? What about this book? There are so many questions that could be answered that would be interesting, and we don't have the resources to do it all. We often say, yeah, if you want to get started, just do it and let us know what you find out, but that's really difficult because there can be a steep learning curve to getting involved in this research. Hopefully we can reduce that learning curve and encourage more research and analysis in this area so that other folks can get involved as well. So JStylo is built upon a framework of JGAAP and Weka. It ships with the two existing adversarial corpora that we are releasing here, and there's new corpus-building functionality so you can quickly and easily put together a set of documents, save it, and hold on to it so you can analyze it over and over again. There's a wide selection of feature extractors and the ability to add new extractors. Right now, if you want to add an entirely new extractor, you have to write a function and recompile the code, but you can configure new features based on existing extractors right from the GUI. There is a wide selection of machine-learning-based classifiers, mostly coming from Weka, and there is what we think is an intuitive GUI that walks you through the process of this research step by step, which hopefully is helpful. And again, the alpha release is available now. So I'm going to give a quick walkthrough of it. I hope you don't mind that I made slides and didn't do a live demo.
I'd rather talk and point at things than type and click around a lot, so bear with me on that. The first screen you get to is just where you load in your test documents and your training corpus. Your test documents are the things you want to analyze, and your training corpus is the data you're comparing against: your set of suspects in this supervised stylometry problem. You can then go on and start working with different sets of features. We have three built-in feature sets that we've been working with, but you can add, configure, and make your own. You can inspect and understand each of the individual features; here I have complexity highlighted, and you can see not just the name and description but also the pre-processing steps. Complexity, for example, is the ratio of unique words in a document to the total number of words in the document, and the pre-processing steps we take are to strip out the punctuation and unify the case across the whole document before extracting this feature. Again, this is useful because previously there was no easy way to put all these steps together. There's all this disparate code, and when people said, hey, can we do some of this work, can you send us some of your code, we'd say yeah, sure, and send them a bunch of Python scripts full of research code that was not easy to understand, and it was difficult for people to do that work.
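The complexity feature as described, unique words over total words after stripping punctuation and unifying case, can be sketched like this (my illustration of the described steps, not JStylo's actual code):

```python
import string

def complexity(text):
    """Ratio of unique words to total words, after the pre-processing
    described above: strip punctuation, then unify case."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    words = cleaned.split()
    return len(set(words)) / len(words) if words else 0.0

score = complexity("The cat sat. The cat ran!")  # 4 unique words / 6 total
```

The point of baking the pre-processing into the feature definition is that "The" and "the!" count as the same word, so the ratio reflects vocabulary, not punctuation or capitalization.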
Then we have the classification step, where you pick a classifier and can examine the details of that classifier; soon we'll be adding support for multiple classifiers at the same time. Then you analyze, and essentially you're doing one of two things: either you're analyzing the test documents you loaded to see how your method classifies them, or you're building a new stylometry method that you want to validate on its own. In the first case, for the test documents I put in a bunch of obfuscation passages, and if you zoom in you can see what each document was classified as: these are mostly classified as author bb, and one of them is classified as author c. In the second case, validating a method of stylometry, you can run it through and get an overall accuracy: basically, metrics on how good the classifier you just built is and how effective it is against some dataset.
So we have some development goals for JStylo going forward; that was just a quick walkthrough of it. We want a wider selection of classification methods and features, including Writeprints and the synonym-based method, which are not built in yet. We've used this tool in part to do our existing research, but the graphs I showed you earlier were still created from a hodgepodge of different methods, and we're hoping to unify that in one tool. We want to add ensemble classifiers with weighted averaging. We want it to be easier to understand for non-technical users, we want people to be able to plug in feature extractors that other people built and share them, and we want better visualization, logging, and graphing of results over multiple experiments, such as visualizing the documents, the authors, and the classifications. So, moving on to Anonymouth. Anonymouth is our deception tool, and the problem here is that authorship recognition can be a legitimate threat to privacy and anonymity. Our intuition in changing our writing style goes a long way, but it may not be enough and may not be sustainable over a long document. Some methods are better than others at still determining who you are, and you might need assistance in knowing what to change: what the most identifying aspects of your writing style are and what needs to be done about them. Fully automated text anonymization is an intractable problem: you can't make a piece of software where you click a button and it completely maintains the meaning and content of your writing and still anonymizes it. If you can, let us know; there are probably some awards out there for that. So we need a solution that explains authorship recognition and its nuances, as they're needed, to an individual, and helps assist them in making the most useful changes towards anonymity. The artificial intelligence involved in the
classification problem acts to augment your own intelligence; it's not meant to replace your intelligence and do it for you, but to help you focus on how to modify and change your document. Anonymouth is built on top of JStylo; in fact, the main reason JStylo exists is that we couldn't have Anonymouth without it. We needed to build the analysis tool first and then build a deception tool around it. Anonymouth also uses Princeton's WordNet, which is a large lexical database. It has the same corpus, feature extractor, and classifier functionality as JStylo, which is all really good, but the main selling point, if you will, of Anonymouth is the suggestion system for modifying documents to evade authorship detection. The ideal value for each feature, each data point in a document, is calculated and presented to you. It's calculated through a modified k-means clustering algorithm that finds the clusters in which authors sit for certain features. For example, a bunch of people might average about 20 sentences per document and a bunch of others about 12, but you probably want to go towards the 20, because that's the ideal point for you to be anonymous within the set of authors you've given us. Where possible, we highlight the occurrences of these features and explain to the user how to change each feature to help anonymize their document. It takes an iterative approach to anonymizing writing style at the moment: you make the changes you need, you reprocess the document, it tells you which features still identify you, you make those changes, reprocess, and move forward. And once again, the alpha release is available now on our website, psal.cs.drexel.edu. So I'm going to walk through Anonymouth now. You'll notice that Anonymouth actually looks very similar to JStylo except for the first
and last panels, which are kind of the guts of Anonymouth. In the first panel, you have the document you want to anonymize, you have your sample documents that establish your writing profile, because we need to know what your writing profile is compared to the set of authors you're trying to blend in with, and then you have that set of authors you're trying to blend in with and be anonymous among. You can preview the documents, and you can also create problem sets and corpora here just like you can in JStylo, which is useful. Again, the feature selection and classification screens are basically the same, because you're setting up the methods by which you want your document analyzed. But then you get to the editor and the processing step. You bring your document in and process it, and in this example the red indicates that the system still thinks the document is attributed to you. On the right we have the set of features; for simplicity's sake we just picked that nine-feature set for the purposes of the demo, but you can have much longer and more complex feature sets. As a next step, we start walking through those features. You click on the first one and see that, hey, complexity is actually in a good spot; you don't want to change that, that's not the thing that gives you away. As you go down the list, you find that the character count and unique word count don't give you away either, but what does give you away is your sentence count: you have 19 sentences in this document, and really you want it to be more like 28. So I went through and took this document and did something simple. Again, this is a really simple example; don't think this will anonymize your document just by throwing a bunch
of periods in it I but in this case I threw a bunch of periods in it and brought that target value brought it up to about think about 28 and then reclassified it and now it thinks that I am another author so again a really straightforward example but features can be more complex for example if we were looking at the average syllables in a word in this case it's saying you know you have maybe take it as a compliment right like you're using really complex words there are 1.8 syllables on average in your document you really want to get it down to more like 1.6 so here are all the here are all the words in your document that have too many syllables that we think you can easily change so why don't you go through and try and modify some of those and bring that value down so that you can modify your document and there's much more to this there's more complex feature sets there's better highlighting methods but again I just wanted to take you through a simple example to walk you through kind of like what a non-mouth is intended to do so there are some real interesting research questions and challenges in developing a non-mouth the main one is that features are often not independent so if you increase the number of complex words you will also increase the average syllable kind of your document if you reduce the number of times a specific word occurs you're also going to increase the affect the lexical density so all these features into play against each other in this process and how can we create an algorithm for anonymity for anonymizing a text document that generates an obfuscated document with minimal effort and without circular feature modification this is I mean we think our software does a pretty good job right now but this is still an open problem and something that we need to tackle and that's our one of our main research focuses going forward is addressing this problem and we have some development goes from non-mouth just like Jay Stallow so we want to streamline 
the suggestion system. We want to improve the automation of features that can be more easily automated, and improve the clustering algorithm so you can more easily see the path to anonymity you need to take, in terms of what features to modify, when, and in what order. We want an improved editing interface with better phrase and word support, one that also lets you edit by block of text, not simply feature by feature. Because maybe it says you need a bunch more sentences, so you break up the last paragraph, but that leaves the rest of the paragraphs untouched, and those paragraphs might still have your identity stamped all over them. So we may need to analyze the text at more micro levels. And we want to look at wider sets of features and classification methods, which ties back into the JStylo development goals. We also need to do usability and anonymity user studies. So there's a long way to go, but we think this tool is a good first step towards letting people take some steps to anonymize themselves in text documents, because currently there's no method out there to do that other than your intuition. And your intuition will get a lot better if you study stylometry, but we don't want to subject you to sitting in our lab for a few months just to be able to anonymize a document; hopefully you can just download this tool and do some work on your own. So again, I make no promises that using this tool right now will anonymize your document. It's meant to be played with and toyed with: see where it's good and where it fails, let us know, and let's work to improve it. If anyone wants to join in that process, please get in touch with us. We're going to continue to develop this at our lab, but it's open, it's free, and we are always interested in linguistic experts, security advisors, and user interface experts to help us make this better.
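The feature-and-target loop from the editor walkthrough above (compute a feature such as sentence count or average syllables per word, compare it to a target value, and suggest a direction of change) can be sketched roughly like this. This is a minimal illustration, not Anonymouth's actual code; the vowel-group syllable counter and the 10% tolerance are simplifying assumptions made for the sketch.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    # (An assumption for this sketch, not Anonymouth's real method.)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def extract_features(text):
    """Compute a few of the simple stylometric features from the demo."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "sentence_count": len(sentences),
        "unique_words": len({w.lower() for w in words}),
        "avg_syllables_per_word": sum(count_syllables(w) for w in words) / len(words),
    }

def suggestions(features, targets, tolerance=0.10):
    """Flag each feature that sits too far from its target value,
    the way the editor flags a sentence count of 19 against a target of 28."""
    out = []
    for name, target in targets.items():
        value = features[name]
        if abs(value - target) > tolerance * target:
            direction = "raise" if value < target else "lower"
            out.append((name, value, direction, target))
    return out
```

Splitting long sentences raises `sentence_count` toward its target, but note that the features are not independent: rewriting complex words to lower `avg_syllables_per_word` also shifts `unique_words` and lexical density, which is exactly the circular-modification problem described above.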
So if anyone's interested, again, feel free to contact us. I'm going to turn it over to Sadia Afroz now, who is going to talk a little bit about detecting deception in stylometry, and then I'll come back up and recap.

Hi, thanks Mike. I'm Sadia, and I'm going to talk about stylistic deception detection. As you saw in the previous slides, when people change their writing style it's very hard to detect exact authorship. So here we try to solve a more general and simpler problem: given a bunch of documents, can we distinguish regular documents from obfuscated and imitated documents? We found that with a large enough linguistic feature set, we can. We used three different feature sets. The one that worked best was the Writeprints feature set, which is a really broad set of linguistic, content-specific, and non-content-specific features; with it we got about 85% accuracy in detecting imitated documents and 89% accuracy in detecting obfuscated documents. The second feature set we used was the lie detection feature set, which contains 16 linguistic features that were found to be useful in detecting lies in written documents. The reason we wanted to use this feature set is that when people lie, they generally use simpler language and shorter sentences. That might not be true for very good liars, but it works for most people. So people change their language when they're lying, though that change is involuntary, while in our case we are asking people to change their language voluntarily, and we wanted to see if they make the same kinds of changes. We found that even though the lie detection feature set is much smaller than the Writeprints features, those features are still very salient for detecting writing style change: they detect imitation with 75% accuracy and obfuscation with 60% accuracy. We also used the nine-feature set that is generally used for authorship recognition, and we saw that it does not work very well in distinguishing regular documents from deceptive documents.

Then we looked into the documents to see what people actually do when they try to change their writing style. This graph shows some of the features that were changed the most. We found that in an obfuscation passage, people used more existential "there" and more adverbs, and they used simpler language, which is why the readability index decreased. In imitation passages we also noticed a decrease in the readability index, and people used fewer adjectives and adverbs but more particles and personal pronouns. One problem with our dataset is topic similarity: all the deceptive documents were on the same topic. When we asked people to obfuscate their writing style they wrote about their neighborhood, in the imitation dataset they wrote about their morning, and they all imitated exactly one author. So topic similarity might be a reason why we can distinguish these different kinds of writing. We did use non-content-specific features to check, though, and found that they give almost the same accuracy as content-specific features. To confirm that content similarity is not an issue at all, we used a different dataset, the Hemingway and Faulkner imitation corpus: 36 winning articles from the Hemingway and Faulkner imitation contests, in which participants were asked to imitate Hemingway or Faulkner in a 500-word document about anything they wanted. So these documents are written by 36 different people, they imitated Hemingway and Faulkner in different ways, and they covered a variety of topics, so topic similarity was not an issue here. Even in this case our method works in more than 80% of cases, and we can distinguish regular Hemingway and Faulkner documents from the imitated ones.

In both of the datasets I discussed, people changed their writing style just once, and in our dataset they spent on average 30 minutes to an hour doing it. So we wanted to see how our method does if someone develops a new writing style over a longer period of time and uses that new style over and over. In that case, the simplistic linguistic changes that are evident in our dataset might not show up, because the author had enough time to modify and edit the new style and became fluent in it. To see how our method works on long-term deception, we collected posts from the blog "A Gay Girl in Damascus". This blog was actually written by a 40-year-old American male, Tom MacMaster, who pretended to be a Syrian gay woman, Amina Arraf, and wrote about Syrian political and social issues. He was so convincing in his blog posts that journalists from CNN and The New York Times started quoting him during the Syrian uprising. He opened the blog in 2010, but even before that he had been writing as Amina in Yahoo Groups since 2006, so he had a long time to develop the new writing style. Our deception detection method did not work in this case: as I said, he had enough time to develop a new style, and the simple linguistic cues we relied on were not evident. But when you're maintaining more than one writing style, it's hard to be consistent, so in these cases regular authorship recognition can help detect that there is something wrong with the writing style. When we applied regular authorship recognition, more than half of the blog posts were attributed to Thomas. This is one example of why we need a tool like Anonymouth: so that you can change your writing style more easily and be consistent with it. Thank you.

So, just a quick
recap of our work. What's available now: our original corpus with 12 authors, our new adversarial corpus with 45 authors, and the alpha releases of JStylo and Anonymouth. What you can look forward to coming out of our lab, if you're interested: beta releases of JStylo and Anonymouth that include some of the development goals we were talking about; academic publications of our new results, because we're in academia after all and we need to do that; continued analysis of deception detection and short message classification; and continued research on improving this partially automated anonymization and suggestion system we've been talking about. We think there are a lot of interesting research problems there that we hope to work on. Here's all of our contact information, including our website. One last thing before I take questions: we are looking for interested grad students and postdocs. If you're interested in this area and looking into starting a graduate degree or looking for a postdoc, please come talk to us or send us an email, because we are looking for folks to work on this with us. With that, if anyone has any questions, we'd be happy to answer. I think we have maybe 10 minutes for questions.

First of all, it's online. Thank you for your talk, both of you. We have, I would say, really 10 minutes for questions. We have an audio angel going around and a signal angel in the IRC. But first, before we start with the questions: don't forget to give your feedback in the Pentabarf, and don't forget to take out your trash, especially the three Mate bottles that fell down some minutes ago. And with that, let's start with the questions. Could you please hold up your hand? Can I see anyone here? Thanks.

Okay, not knowing anything about linguistics, I have two questions: are your tools Unicode compliant, and what languages do they work with?

First of all, where are you? I can't see where that came from. Okay. And can you
repeat that question? I'm sorry.

Sure. Are your tools Unicode compliant, and what languages do they work with?

They're written in Java, and I didn't hear the first question.

Are they Unicode compliant?

Unicode compliant? Yes, they are Unicode compliant.

Okay, and languages: English, Swedish, whatever, those kinds of languages?

So there's a lot of work to be done, which our colleague Aylin is working on, regarding how different languages respond to authorship recognition. Most of the research so far has been done on English, but Unicode-encoded text should run through these tools just fine. So if you want, try some other languages and let us know how it works out.

Just a short note for the people in the back who ask a question: could you please stand up? The lighting at the moment is such that we really can't see you. The next one, please.

Okay, I'll take one from the front. I'm thinking about the reverse use case for this stuff. Let's say somebody's writing Harry Potter fan fiction and they want to match the style of the original Harry Potter. It seems like they could use these tools to more or less make the fan fiction more like Harry Potter in style (of course not in content, the content is completely ridiculous), just to match the style. Does that sound like a reasonable use case?

Yeah, it does sound like a reasonable use case. That kind of functionality is not quite built into our tool, but you could certainly spend some extra time and get around that. When you put a document into our tool, it's going to try to give you enough information to anonymize it, so you could keep running it through until it shifts towards the author that you want, even though the suggestions aren't telling you to go towards that author. But certainly, if you were building an anonymization tool like this, I'd be interested if anyone has any
ideas of how to build it where that use case would not be possible. But yeah, that is possible.

Yeah, I may be asking a stupid question, because I missed the first couple of minutes of your talk. This was a pretty technical talk, but when I think of the topic, I have many political and cultural questions in mind. For example, I'm thinking of the mandatory practice in America, when you study there (I don't know if it's done in high school too), where odds are you have to submit your papers to a firm that collects a huge database of writing styles. What are your thoughts about the political implications of your work, or are you a little bit timid to take a position on that?

I think there are implications for this work in all areas, both positive and negative. I think what you're talking about is plagiarism detection, which is really common in universities and colleges. We haven't done any direct work to see how this will affect plagiarism detection, nor have we looked at how much time it would take you to anonymize a 10,000-word document as opposed to just writing it yourself in the first place. So I think there are a lot of implications there, but I don't want to make assertions without data to back them up, and I have not done research in that exact area. I do think, again, that hopefully these tools will allow people to ask and answer questions like that; that's part of the point of us releasing them here.

What level of guidance did you give to the Mechanical Turkers for the obfuscation task? Because presumably the language you use there will affect the techniques they use in obfuscation.

Yes. For the Mechanical Turk task: in Amazon Mechanical Turk you can actually request certain turkers, as they're called, so we requested turkers who were from the United States and whose native language is English.
Yeah, I guess you should probably address that more, if you have any more to say about it.

We just did it the same way we collected our original sample: we asked them to write in a way that they don't usually write, and we didn't give any specific direction.

You didn't guide them towards specific techniques?

No, we didn't guide them towards any specific direction. But after we had the data, we saw that most people changed the same things: everybody used shorter sentences and simpler language, simpler words. But we didn't guide them in any way.

Going back to the example of the American blog writer: could you also detect whether somebody is writing in their native language?

We are working on that, native language detection. Our colleague Aylin can maybe answer that, because she's working on it. They want to know if you can identify the native language of authors.

Yes, you can definitely identify the native language of authors, and you can do more things with it, like identifying the language family if you're not able to identify the native language itself. So this is another research area.

I'm going to take a question from the IRC now. One of the questions from IRC is: if everybody uses the same anonymization software to anonymize their documents, what is the result?

That is a good question. In short, we don't really know. The hypotheses we've thrown around are that people move towards some sort of standard tongue, some more anonymous way of talking, or some sort of simple English, but I'm not really sure. The best answer I can give is that it's going to be context dependent. One major point that's accepted in stylometry research in general is that there is no one-size-fits-all solution: if you want to study a specific area of stylometry and identify an author, you need to
look at the clues and the context of the domain that you're looking at. And my hypothesis (again, we don't have enough data to back this up) is that you will have the same sorts of constraints when you're trying to anonymize your document. The means of anonymization, the steps you take to anonymize a scientific research paper as opposed to a personal political blog, may be different. But we'd like to research that and come up with some actual hard numbers to explain it. Also, we ask people to set their own corpus towards which they want to go, so even if different people are using the same tool, if they choose different corpora they will be led in different directions.

The second question is: what happens if I write a text... sorry, first: are spelling mistakes also considered for comparison?

Yes, spelling mistakes were one of our features; they are specifically included in the Writeprints feature set.

Okay, and then one more: what happens if I write a text together with others? Will it be more complicated to identify a text where more than one author was involved?

There is some research looking at the question of whether it is harder to identify a text if there are multiple authors, and whether you can even identify which authors wrote which parts of the document. The existing research shows that you can kind of identify when there are multiple authors in a document. I think there's more research to be done on how to obfuscate that, or maybe even how to obfuscate the fact that there were multiple authors in the first place, and whether that's even possible.

And do you have free and open lists of the distinct linguistic features of text that you suggest create an individual style, so that we can look at those lists and find out what the distinct features are?
So these lists are in some of the papers we've published, but actually the easiest way to look at the kinds of features we've used is to download these tools: you can look at the existing feature sets and go right through the list. Most of the features have descriptors explaining what steps are taken to pre- and post-process them, all sorts of stuff. So in that case I would recommend downloading JStylo and just taking a look at the different features, so you can get an idea of what's used there.

Do you know if there's any corpus of linguistic samples that is sorted by partially de-identified sociological data? So if someone's trying to move their fingerprint, and the deception fingerprint shifts over time but the identification fingerprint goes back and forth... is there any corpus of data from, say, a native speaker of a lower socioeconomic class, female, from a particular region, or anything like that, compiled in a database, so people can look at the fingerprint styles?

Well, other than our corpus, no. We do have demographic information associated with the samples that we've accepted: demographic questions like what is your education background, how old are you, what's your native language, and things like that, which people volunteer. Maybe the others have something to add. But I do know that there are datasets collected in this way; the Linguistic Data Consortium at Penn has some of them, but you generally have to buy them. They collect different demographics and things like that, so we couldn't include them in these sorts of things. Also, Aylin is working on something related, and you can say more. So: in Belgium there's the ICLE data, the International Corpus of Learner English. These are people that study English all around Europe, from 16 different countries,
and you know their native languages, and they have upper-intermediate English. This is what we used to look at their native languages. There's the L1-to-L2 transfer effect: L1 is your native language and L2 is the language you learn later, which here is English, and from this we can identify the native speakers. So that's still a fingerprint.

I have a question about the Hemingway detection example. Did you tweak your algorithm to detect Hemingway, and if you did, how did you avoid overfitting? Because Hemingway is dead, and you can't get any more cross-validation examples out of him.

No. Our training sample contained Hemingway, Faulkner, Cormac McCarthy, and a bunch of other authors too. We didn't tweak or overfit our algorithm in any way to detect Hemingway.

You've been reaching really hard the whole time. Okay. I'm curious as to how well, or if at all, these stylometry features or methods that you presented here work to get an indication of an individual's non-authorship of a text; I'm thinking of ghostwriting.

So this is one of the situations where we think there's an open research question that we don't have data to support: looking at ghostwriters and how they can be attributed back to the true writer. But it's one that I think is worthwhile, and I think we'd all agree it's worthwhile, and hopefully people can use these tools to do some of that analysis.

So getting back to this question of fanfic: could you imagine, for example, that Harry Potter fan authors, a big, vibrant community, might over time tweak an Anonymouth-style tool so that anyone could Harry-Potter-ize any arbitrary block of text? That might become a great positive externality, the reverse of what Jacob was talking about last night: suddenly we've got this awesome tool, and anyone can make any piece of prose non-fingerprinted, or rather fingerprinted to Rowling. And by the same token, could you frame someone by using an Anonymouth to make terrorist rants sound like their
prose style?

For the first part: yeah, I think that is a potential logical extension of this work, and an interesting, exciting one. As for the second part, about framing: what I would personally hope people take away from this is the following. Writing style and stylometry have been used in court. They're not used so much recently, but there were some older methods, which people ended up realizing weren't that good, that have been used in court. And I hope that before people get the bright idea of using some of these newer methods in court, they realize: wait a second, you kind of can't trust writing style as a true means of identity if the potential for deception is there. So I would hope that framing wouldn't work, because I'd hope people would realize that.

Hi. In the beginning you talked about the possibility of translating something and translating it back, maybe with different tools or into different languages, but you didn't show any statistics or mention it at all afterwards. Did that completely fail, or why didn't you mention it?

So again, it was something I actually had slides for in the talk I gave a couple of years ago, and I probably should have added some addendum slides here, but I don't have them. If you go back and look at the talk from two years ago (I believe the slides are up on the site), you can see some graphs of the obfuscation attacks and the baseline accuracies: the overall accuracy does not change much. I think if it's 65% accurate in determining your authorship, then after you run the text through a translation system and back you get maybe 50 or 55% accuracy. So it does reduce a little bit in some cases; in other cases it didn't reduce at all. I apologize for not having the data here, there's just a lot of stuff I wanted to cover, but it is in the last talk, so feel free to go check it out.

So
we'll take one last question. Anyone? I've seen one. Wonderful.

All right, some practical advice then. If you're with an activist organization and you need to type up a press release or a comment to something really fast, especially if you're not a native English speaker, what's your advice? What kind of mindset should you have if you have to do this in 20 minutes?

Pick an author with a distinct writing style and try to imitate that author. I think that's the best way. The one thing I want to add is that the people in our samples spent between half an hour and an hour doing the whole process, so you can probably say they spent about 20 minutes on each of these 500-word samples. That's a lot of time for maybe 500 words, depending on what you're trying to do. But I think the clearest path to anonymization at this point (and I'm not saying it's going to make you anonymous, but it's the closest to it) is to pick an author with a distinct writing style and try to imitate that author.

Did you test that also for non-native speakers?

No, we have not tested with non-native English speakers, and we would be interested to hear what happens if you do.

Wonderful. Thank you for these many questions.