And now it is my particular pleasure to announce Professor Rachel Greenstadt, Aylin Caliskan-Islam, and Rebekah Overdorf from the Privacy, Security and Automation Lab at Drexel University. They are quite old hands at speaking at the Congress already. So please give them a warm round of applause.

Hello, I'm Rachel Greenstadt. I'm a professor at Drexel University, where I lead the Privacy, Security and Automation Lab. This is joint work with my students, Aylin Caliskan-Islam and Rebekah Overdorf, who will be speaking later. We're going to talk today about authorship attribution in source code and in social media.

First, I'm going to talk about stylometry, which is how we usually do authorship attribution in my lab. The theory behind it is that everybody's writing style, and indeed speaking style, is unique, because we all learn language on an individual basis. Each of us, even though we might speak the same language, speaks our own individual dialect of it. For example, in English there are regional differences: some people may say that a piece of furniture is a couch, whereas other people might say it's a sofa. There are also words with similar meanings that are nevertheless different words, like "although" and "though", and which one a particular person prefers is a stylistic idiosyncrasy. In writing, people may use the same word with different spellings. And there are just many ways to express very similar ideas: someone might say "the fork is to the left of the plate" versus "the fork is at the plate's left". These differences are how, in writing and documents, we can often distinguish authors. That's a lot of the work that we do in my lab.

The Privacy, Security and Automation Lab is a research lab at Drexel University with about 10 students, a mixture of graduate and undergraduate students. In general, we study how to have machines help humans make decisions about security, privacy, and trust, often using machine learning and natural language processing techniques. In particular, we're very interested in what we can learn when we analyze unstructured and semi-structured human textual communication. This is what we've spoken about at CCC in the past: Mike Brennan spoke at 25C3 on privacy and stylometry, on how authorship recognition techniques can be attacked and deceived, and again at 28C3 with Sadia Afroz. Aylin and Sadia spoke two years ago on applying stylometry to online underground markets. This year, we're going to talk about source code and cross-domain stylometry. People always ask us: what about source code? What about tweets? We're going to answer some of those questions in this talk. In the lab we also do a lot of work on social network analysis of online communities, on textual analysis, and on secure machine learning.

Since we're a privacy lab, what is the connection between privacy and stylometry? Well, there are very good techniques for location privacy that the privacy enhancing technologies community has worked on.
You're probably pretty familiar with Tor, on my t-shirt, and mixes and other types of techniques that can hide your IP address from people on the internet. But in some cases, when you're expressing yourself in text online, that might be insufficient, and that's where my research comes in. Stylometry can be used to identify authors based on their writing. This is important because it is a potential threat to people exposing crime and corruption or organizing politically, especially if they're speaking in firsthand testimonial accounts. And it's also important for normal people who want to express their opinions, or write code and share it online, without having the thing they wrote follow them forever through their life like a dossier.

So let's go back to stylometry, and let me give a short tutorial on how it works. Stylometry methods used today rely on machine learning. Say you have two authors, Cormac McCarthy and Ernest Hemingway, both authors with somewhat distinct styles. Cormac McCarthy might write: "What's the bravest thing you ever did? He spat in the road a bloody phlegm. Getting up this morning, he said." And Ernest Hemingway: "He no longer dreamed of storms, nor of women, nor of great occurrences, nor of great fish, nor fights, nor contests of strength, nor of his wife." The question is, how can you tell the difference between these authors? We can't just feed the text straight in; we have to extract features from it. One example of the types of features we use is the frequency of function words, the small words that don't necessarily mean anything by themselves. We might also look at the frequency of punctuation, which tells us something about structure. We use many more features in our work, which we'll talk about later. We feed these into a machine learning model; in many cases we'll use a support vector machine, sometimes a random forest. A good model generally needs around 4,500 to 7,500 words of training data and 1,000 or more features, where many of these may be features of the same type, like word n-grams.

To actually use this, say we have an unknown document, our test document: "Just remember that the things you put into your head are there forever, he said. You might want to think about that." We don't know whether this was written by Ernest Hemingway or Cormac McCarthy. We'd extract features from this document. Now, this is a very short text snippet; for best results we need about 500 words. Then we'd ask the model who wrote it, and it would tell us it's Cormac McCarthy, which indeed it is.

In general, stylometry methods are pretty good, especially when you're dealing with sets of authors in the 100-author range, when that's the world of suspects you have. There, Abbasi and Chen have a method that works at above 90% accuracy. These methods can also be scaled: people have done experiments with 10,000 authors (Koppel et al.) and with over 100,000 authors (Narayanan et al.), and even in those cases the results are much, much better than random chance, which allows you to narrow the world of suspects quite a bit.
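To make the tutorial concrete, here is a minimal sketch of the pipeline just described, in Python with scikit-learn. This is an illustration, not the lab's JStylo implementation; the tiny function-word list and the two-quote training set are toy assumptions, far below the thousands of words a real model needs.

```python
from sklearn.svm import LinearSVC

# A few English function words; real feature sets use hundreds of them.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "he", "it",
                  "that", "nor", "or", "but", "not", "his", "you"]
PUNCTUATION = list(".,;:!?'\"")

def features(text):
    """Relative frequencies of function words and punctuation marks."""
    words = [w.strip("".join(PUNCTUATION)) for w in text.lower().split()]
    n_words, n_chars = max(len(words), 1), max(len(text), 1)
    return ([words.count(w) / n_words for w in FUNCTION_WORDS] +
            [text.count(c) / n_chars for c in PUNCTUATION])

train_texts = [
    "What's the bravest thing you ever did? He spat in the road a bloody "
    "phlegm. Getting up this morning, he said.",        # Cormac McCarthy
    "He no longer dreamed of storms, nor of women, nor of great occurrences, "
    "nor of great fish, nor fights, nor contests of strength, nor of his "
    "wife.",                                            # Ernest Hemingway
]
train_labels = ["McCarthy", "Hemingway"]
clf = LinearSVC().fit([features(t) for t in train_texts], train_labels)

test = ("Just remember that the things you put into your head are there "
        "forever, he said. You might want to think about that.")
print(clf.predict([features(test)]))  # far too little data to be reliable
```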
Previously at CCC, the question we asked in my lab was: how strong are these techniques when people are actually trying to fool them? We found that people in general were able to reduce the accuracy of these techniques by writing in a specific way to try to hide their writing style, or to imitate another author; we actually asked people to imitate Cormac McCarthy in that case. Now, I wouldn't recommend just doing that if you wanted to hide your writing; you'd want to verify in some way that you'd actually done it correctly. We do have some tools in our lab for this: JStylo is an authorship analysis tool, and Anonymouth is an authorship anonymization tool, which is very much a work in progress. These are available on our GitHub page, and we'd love to have your comments, help, thoughts, et cetera, on them.

We also looked at underground forums. This is an excerpt from a carders forum where people trade credit card information. To do this work, we had to extend our analysis tool to German, so JStylo does work in German. These are the types of features we use: the frequency of n-grams, punctuation, special characters, function words (in this case German-specific function words), and parts of speech. That's another case where you need language-specific features.

The question you might wonder about is: is this purely an academic concern? Do people actually use stylometry in the real world to identify people who might not want to be identified? The answer is yes. In a rather sensational case, J.K. Rowling, whom you may well know as the author of the Harry Potter novels, wrote another book under the pseudonym Robert Galbraith. Juola & Associates, a stylometry firm, did some analysis, using tools that are part of our analysis engine as well, at the request of a reporter who'd received an anonymous tip over Twitter. After their linguistic analysis, he felt confident enough to run with the story, and indeed did expose J.K. Rowling as the author of this book. And our doppelganger finder code, which we designed to give the probability that two accounts belong to the same person, is actually used by the FBI. We pointed them at our GitHub, and we don't know exactly what they use it for, but they did tell us that they found it useful. There are also many expert witnesses who use this kind of evidence in forensic and legal proceedings throughout the world. I know the most about US law, where forensic linguistic evidence is covered under the Van Wyk opinion, which speaks to how it can be used and considered.

Okay, so this is all the stuff that we've done, and it gives you some context; hopefully you now have an idea of what stylometry is and how it works. Today we're going to talk about two particularly interesting and difficult cases. The first one: what if you have an unknown Twitter feed? Can you learn its author from blogs, or from comments on a site like Reddit, since you might not have a Twitter feed for that person? The answer to this is yes; however, if you do have a Twitter feed for the suspect, then you should probably use that instead. And then we always get the question: what about source code? Can you detect somebody's source code authorship from their style? The answer is that yes, we can do that too.
And what's particularly neat about this is that even if you run it through an obfuscator, it still works. So I'm now going to turn the talk over to Aylin, who is going to talk about that work.

Hi everyone, I'm Aylin. We'll now be looking at code stylometry, where we're trying to find out who wrote a piece of anonymous code by looking at coding style. There are two common scenarios to think of when source code authorship attribution comes to mind. In the first, let's say Alice's computer got infected and she has a piece of source code left over from the malware, and Bob has a collection of malware with known authors. Bob can look at his collection to identify who Alice's adversary was. The second scenario applies to plagiarism. Let's say Alice got an extension on her programming assignment, and her professor Bob has everyone else's submissions. Bob can compare everyone else's submissions with Alice's new submission to see if Alice plagiarized. These are security-enhancing uses of source code authorship attribution. But unfortunately, sometimes security-enhancing technologies end up infringing privacy. For example, Saeed Malekpour, a web programmer, was sentenced to death after the Iranian government identified him as the programmer of a porn site. He was held in solitary confinement for one year without legal representation. His family says he is also a permanent resident of Canada and that he didn't know the porn site's developers were using his photo-uploading software. Saeed Malekpour said that if he had known it was going to be used by a porn site, he would never have put his name on it, because that's illegal in Iran. Later, under pressure, he said that he regretted his actions, and his death sentence has now been canceled.

When we look at source code authorship attribution, we can define it as a machine learning problem with four main experimental settings. The first is software forensics: here we have multiple authors, which corresponds to a multi-class learner, in an open-world setting, meaning we don't know the suspect set. In the regular case of authorship attribution, which we can also call stylometric plagiarism detection, we have the multi-class case with multiple authors, and we do know the suspect set, so it's a closed-world machine learning problem. We can also apply source code stylometry to copyright investigation, where we have two parties in a dispute, so it's a two-class problem, and it's closed-world because we know both sides of the dispute. And in authorship verification, we would like to answer: did the person who claims to have written this piece of source code really write it, or did someone else? This is a two-class/one-class formulation, which we will look at in detail later, and it's an open-world problem, because the code was either written by the claimed person or by someone we have no idea about.

Here's a summary table of our main results. You can see that in the 250-author task we get 95% accuracy in identification, which is very high compared to previous work. This indicates that we've introduced a new, principled method with a robust syntactic feature set for performing source code stylometry, which had not been done before at this scale and in this way.
In order to understand coding style, we have to look at programming style features. First of all, given a piece of source code, we look at lexical features, like variable names and the use of C++ keywords. Then we look at layout features, like spaces and tabs, and we extract those from the source code. After that, we pre-process the source code to obtain its abstract syntax tree, which reveals structural features; it's the grammar of the code. For that, we use the fuzzy abstract syntax tree parser provided by our collaborator Fabian Yamaguchi, who presented yesterday. Since it's a fuzzy parser, it can even handle incomplete pieces of code. Once we have the abstract syntax tree, we extract syntactic features such as node depths, abstract syntax tree node types, and node-type term frequency and term frequency-inverse document frequency (TF-IDF).

We saw a recurring subset of features coming up in many of our data sets, across hundreds of authors and thousands of program files. Most of the features in this list are syntactic, and these are the most important features, because they have the highest information gain. The syntactic features are mostly node depths in the abstract syntax tree and AST node term frequency or TF-IDF. We also see some lexical features, like C++ keywords, and some layout features, such as the number of tabs used.

This slide illustrates our general method across many different experimental settings. To do experiments, first of all, we need a data set, so we went ahead and scraped the submissions of contestants from Google Code Jam, an international annual programming competition. Since 2008, Google has been publishing the correct submissions online, so we scraped all the correct C++ submissions from 2008 until 2014, and we ended up with a data set of more than 100,000 users. Once we have the source code, we pre-process it with the fuzzy AST parser, Joern. Then we extract lexical, syntactic, and layout features. As a classifier, we use a random forest with 300 trees to avoid overfitting, and these trees do the final classification by majority voting, depending on our task.

I'd like to give some statistics about our Google Code Jam data set. In the 2014 data set, which we used as our main one because it was the largest one in C++, the average solution was 70 lines of code. In this programming contest, everyone implements the same problem, the same functionality, at the same time and within a limited time. Whenever we perform a machine learning task, we always train on the same problems that people answered, and when we test, we choose a problem that was not in the training set. That makes it a harder machine learning problem, because the question was not seen in the training set before. And in the pie chart on the right, we see that C++ was the most common language, which was also true for other years.
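Here is a simplified sketch of the feature extraction and classification step, in Python with scikit-learn. The real pipeline also uses the syntactic features from the Joern fuzzy AST parser, which are omitted here; the keyword list, helper names, and toy training files are illustrative assumptions, not the experimental feature set.

```python
import re
from sklearn.ensemble import RandomForestClassifier

CPP_KEYWORDS = ["if", "else", "for", "while", "do", "switch",
                "class", "struct", "template", "return", "const"]

def code_features(source):
    """Simple lexical and layout features from one C++ source file."""
    n_chars = max(len(source), 1)
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    n_tok = max(len(tokens), 1)
    lexical = [tokens.count(k) / n_tok for k in CPP_KEYWORDS]
    layout = [source.count("\t") / n_chars,     # tab usage
              source.count(" ") / n_chars,      # space usage
              source.count("{") / n_chars,      # brace density
              len(source.splitlines())]         # file length in lines
    return lexical + layout

# In the real experiments these would be Google Code Jam solutions.
train_sources = ["int main() { for (int i = 0; i < 9; ++i) {} return 0; }",
                 "int main()\n{\n\twhile (true)\n\t\tbreak;\n\treturn 0;\n}"]
train_authors = ["alice", "bob"]

# 300 trees vote on the final label, as described in the talk.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit([code_features(s) for s in train_sources], train_authors)
print(clf.predict([code_features("int main() { return 1; }")]))
```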
Now I'll go through some scenarios where we can apply source code authorship attribution, giving examples as I go. I'd like to explain the first one, regular authorship attribution, with the Satoshi example. Everyone is trying to find out who Satoshi is, and we have Satoshi's source code as well, from the initial contributions and commits in the Bitcoin repository. We have his code, but we don't know who this anonymous programmer actually is. So we could train our model on a suspect set and then test on this initial Bitcoin code to see who Satoshi is. In our experimental setup, we took 250 authors, trained on their files, and had 2,250 anonymous program files; when we trained and tested, we got 95% accuracy in correctly identifying these more than 2,000 files. If we had a suspect set for Satoshi, that would be the training part, and then we would use the initial Bitcoin code for testing, and we might be able to predict who the Bitcoin contributor Satoshi might be. Not that we are trying to do this; it's just an example.

In the second case, we'll talk about obfuscation. There are several reasons people try to obfuscate their code to make it unrecognizable. You might have plagiarized and be trying to hide that you copied someone else's work, or you might have malware and be trying to make it unrecognizable so that it won't be detected, or you might just be trying to stay anonymous and hide your coding style. But we saw that our authorship attribution technique is not affected by common off-the-shelf commercial obfuscators. I'll give an example with the obfuscator we used, which is called Stunnix. You can buy it for about $400, I think; we are not affiliated with it, we just used it because it was the cheapest and most widely used commercial one. In this example, we see how C++ code is obfuscated. Variable names are hashed, and all the spaces and comments are stripped. Any numbers are replaced with a combination of hexadecimal, binary, and decimal representations, and any characters are replaced with hexadecimal escapes; you can choose different settings for the hashing and the combinations. Everything is refactored, but the functionality and the structure of the program remain the same. And as long as the structure is the same, our features are not affected by this obfuscation. As a result, when we tried authorship attribution on obfuscated code versus original code with 25 authors, we got 97% accuracy on both. So our method is impervious to such common off-the-shelf obfuscators; but this holds only for obfuscators like this one, which don't change structure or functionality.

Another case is copyright investigation, and I'd like to give a copyleft example here. Copyleft software is free, but it still has a license: you can modify it and use it, but you have to make sure that you still include the copyleft license it came with. In this example, we'd like to detect that a programmer took copyleft code and then made it copyrighted. There was a very famous case in Northern California a few years ago, Jacobsen v. Katzer. Jacobsen had Java Model Railroad Interface code that he had put under an artistic license, which is less restrictive than a copyleft license.
Then Katzer, who is also interested in model railroads and works as a software developer for railroad hobbyists, took this code, put a copyright on it, and started distributing it commercially; he also filed a patent using Jacobsen's code. This went to court, and some people claimed that since it's just an artistic license, he could do whatever he wanted with it, because it's free code. But that was not the case: even under an artistic license, you still have to make sure that when you modify the code, it keeps the artistic license, so everyone else can use it the way the original author intended. This can be framed as a two-class machine learning problem: in the first class we have the copyleft code from Jacobsen, and in the second class we have the copyrighted code, and we compare them to each other to see if any code was taken from the other. In our experiment of this kind, we had 20 pairs of authors, which means 40 authors, each with nine files, and we tried to identify their files correctly; we had 99% accuracy in doing so.

In the fourth case, we look at authorship verification. Here we're trying to find out whether the person who claims to have written some code is the real programmer, or whether it was written by someone else. This is a two-class problem, but not exactly: the first class is only Mallory. Mallory claims to have written the test code M, and we train on Mallory as the first class. We also train on a second class that is a combination of several other authors, where each file corresponds to the same problem solved by a different author. Once we've trained on these two classes, we test on the code Mallory claims to have written, along with code from a bunch of other random authors. In this test, we reached 93% accuracy across 80 different experimental setups, which means hundreds of different users with thousands of different files.

We also wanted to see whether programming style is consistent across years, because if it is, we can mix and match from different years when constructing our data sets. We found the contestants who participated in both 2012 and 2014, and here is a random example of their code: the same person in 2012 and in 2014. The layout features look extremely similar, the structure is very similar, the for loop appears at the same depth, and we see lexical features such as the variable name tt being very similar, except that in 2014 they decided to capitalize it. As a result, we were able to identify 25 authors who participated in both 2012 and 2014 with 88% accuracy. That 88% might seem low after hearing the previous results of 99% or 93%, but when we took these same 25 authors just within 2012, we were able to identify them with 92% accuracy. So it's just a 4% drop in accuracy, which shows that coding style is persistent over the years to some degree.

We also wanted to gain some insight into coding style, so we looked at how people implement difficult versus easier functionality. We took a set of 62 authors who had answered 14 questions, and split them into the seven easy problems and the seven more difficult problems. We saw that these authors' programming style was more unique when they were implementing harder functionality, as we can see from the 5% increase in accuracy: we were able to identify them with 95% accuracy.
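Returning to the verification setting described a moment ago, here is a minimal sketch of the two-class setup, assuming scikit-learn: class one is Mallory's known files, class two pools files from several other authors. The file contents, names, and the toy feature function are illustrative assumptions, not the real feature set.

```python
from sklearn.ensemble import RandomForestClassifier

def toy_features(source):
    """Stand-in for the real lexical/layout/syntactic feature extraction."""
    return [source.count("\t"), source.count("{"), len(source.splitlines())]

mallory_files = ["int main()\n{\n\treturn 0;\n}",
                 "int f(int x)\n{\n\treturn x;\n}",
                 "int g()\n{\n\treturn 2;\n}"]          # Mallory's known code
other_files = ["int main() { return 0; }",
               "int main(){return 0;}",
               "int  main ()\n{ return 0 ; }"]          # pooled other authors

X = [toy_features(f) for f in mallory_files + other_files]
y = ["mallory"] * len(mallory_files) + ["other"] * len(other_files)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

claimed = "int main()\n{\n\treturn 1;\n}"  # code Mallory claims she wrote
proba = dict(zip(clf.classes_, clf.predict_proba([toy_features(claimed)])[0]))
print(proba)  # a low "mallory" probability would cast doubt on the claim
```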
We also wanted to see the difference between an advanced programmer and a programmer with a smaller skill set, and how that is reflected in their coding style. We saw that advanced programmers have a much more unique coding style compared to coders with a smaller skill set; the difference here is 15%, which is a large and very significant difference in coding style.

In the future, source code authorship attribution can be applied to many different areas. For example, we can use it to find the programmers of malicious code: we can look at open source repositories, find the anonymous people who are contributing malicious code, and try to identify them by comparing them to other GitHub contributors. Or we can identify the styles of coders who have a vulnerability-prone style by looking at the bug counts they have on GitHub. Or companies might use this: say they're interested in a particular coding style; they can train on it and then search for it on GitHub, to recruit employees directly from GitHub. When we compare our work to previous work, we see a huge increase in accuracy, even though our data set is much larger than theirs. The last two lines of the table are our results, with 95% accuracy on 250 authors. This shows that our method, with its syntactic features, is doing a lot better; the previous methods did not use any syntactic feature sets. I would also like to thank our collaborators: Dr. Richard Harang from the United States Army Research Laboratory, Dr. Clare Voss from the United States Army Research Laboratory, Andrew Liu from the University of Maryland, Dr. Arvind Narayanan from Princeton University, and Fabian Yamaguchi from the University of Göttingen. I talked about one particular domain, which was source code. Now Becca is going to talk about other domains and cross-domain stylometry. Thanks.

All right, so as you just saw from Aylin's presentation, we're really good at this, and we're very good at it in a lot of other domains as well. The ones I have up here, for example: source code, of course, but also anything you put on the internet, we as a community have looked at, so emails and chat messages; even things that you don't put on the internet, like books or historical documents, have been studied. In a few slides you'll see just how good we are at these types of things.

This is Rahm Emanuel, and this is his Twitter feed. Rahm Emanuel is an American politician; he's currently the mayor of Chicago. While he was running for his office, a rogue Twitter feed was created to imitate his. This is not Rahm Emanuel's Twitter feed; it was written instead by a man named Dan Sinker. This is a really good example of why we would need to use stylometry in the real world. If we have Twitter feeds, we can test on Twitter feeds, and we do really well. The problem I'm going to discuss today arises if Dan Sinker didn't have a Twitter feed to compare against. He is a writer, so he has a lot of writing. If he didn't have a Twitter feed, what we could hopefully do instead is take a number of suspect authors; during the campaign he was actually named as a possible suspect. So we would have some data on a list of suspects, and even if it wasn't all Twitter feeds, if some of it was blog posts or articles they'd written, hopefully we'd still be able to identify the author of the rogue Twitter feed.
So my main problem here is domain adaptation in stylometry: we're given sample texts in some domain, and we're trying to identify the author of a document that is in a distinct domain. Some of the features we use for this analysis are up here. First is bag of words, which is really popular not just in stylometry but in natural language processing in general; these are, for example, how many times you use the word "the", how many times you use the word "computer", et cetera. Also popular in stylometry and natural language processing are character or word n-grams, and there's an example of character bigrams underneath. Another specific feature type is function words, or stop words; these are non-content words, basically words that don't mean anything on their own, like "for", "to", "the". Part-of-speech tags and part-of-speech n-grams are also important and not context-specific. It's also popular to combine a bunch of features into one feature set. Here's a popular one that works well across a bunch of domains, known as Writeprints. You can see it's broken up into lexical, syntactic, and content features, and at the bottom of the screen should be misspellings as well, but you can add other features; that got cut off.

When we're looking at domain adaptation specifically, where people are writing in different types of places, it's important to look at non-content features, because if you're writing in different places, you're probably writing about different things. The ones on the screen here are some examples of those. The type that's been studied most extensively in this context is function words, that is, stop words, and you can see the accuracies with them are pretty good. The first example up there, with 81% accuracy, had eight people write different texts in different genres; they were asked, for example, to recreate the story of Little Red Riding Hood and then to write an essay on something else, for comparison. This isn't exactly a domain in the way we're discussing it today; that's more genre or topic. Similarly, in the second grouping up here, books were analyzed, divided by genre and topic as well, and all function words were used.

So I said we're really good at this, and we are. You can see that within a bunch of domains, for emails, we get 86% accuracy. The bottom two lines are my own work, getting 98% accuracy, almost 99%, with Twitter feeds, and about 93% accuracy using blog entries. So we do pretty well. The lower accuracies for chat messages and Java forum comments are because they use a smaller amount of text for the testing document, and as Rachel mentioned in the beginning, you want something closer to 500 words for your testing documents.

This is a tweet on your left and a blog on your right. These are from our data set and were written by the same person. The tweet has about, I don't know, three real words in it that aren't misspelled or replaced with something else. But you can see the blog on the other side of the screen is very well constructed: there's correct punctuation, there are no shorthand replacements for words, we don't see any of that. So you can really see the challenge in trying to identify the author of this tweet, or a group of tweets that look like this, from a blog that looks like that. And that's really our challenge.
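As a small illustration of the feature types listed above, here is a sketch assuming scikit-learn that computes bag of words, character bigrams, and counts over a fixed function-word vocabulary; the tiny vocabulary is a toy assumption. For cross-domain work, the content-bearing bag of words is the representation you would drop.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["The fork is to the left of the plate."]

vectorizers = {
    "bag of words":   CountVectorizer(),  # content-bearing
    "char bigrams":   CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    "function words": CountVectorizer(vocabulary=["the", "of", "to", "is"]),
}
for name, vec in vectorizers.items():
    counts = vec.fit_transform(text).toarray()[0]
    print(name, dict(zip(vec.get_feature_names_out(), counts)))
```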
The data we collected for this project consists of 500 tweet-and-blog users, and then 38 Reddit users who also had Twitter feeds. We collected the Twitter and blog users by simply querying Twitter for the phrase "wordpress.com", and we were able to collect tons and tons of data linking those two kinds of accounts. For the Reddit comments, there's a subreddit called /r/Twitter where people post their Twitter handles in order to gain more followers, so that was a very easy way to link them across accounts. However, there wasn't as much data there, so we were only able to get about 38 users for that data set, but it works well to confirm that our methods work across different domains and not simply for blogs.

Possible solutions to this problem: the first is the Writeprints approach I showed in the beginning; kind of throw as many features at it as you can and hope it works. The second is to be very careful about which features we select instead: we use only fixed features that aren't context-specific, so we look at function words, as others have in the past. And the final one is our own method, called doppelganger finder, which I'll get to later.

These are the in-domain results for blogs, then tweets, then Reddit comments, then tweets again; we have two different Twitter data sets because one was collected with the blogs and the other with the Reddit comments. You can see that we do really well. The purple bars are function words, and they don't do quite as well as Writeprints, which are the bluish bars on the screen, but in general we're doing pretty well. The green bars are the cross-domain results, and you can see there's a huge drop in accuracy: if we're training on blogs and then testing on Twitter feeds, or training on Reddit comments and testing on Twitter feeds, we do very poorly. These results are unacceptable using the first two methods, Writeprints and the careful feature selection of function words.

So what do we do about it? Doppelganger finder is an algorithm that was created to link user accounts across cybercriminal forums, and it naturally seems like it would work for our problem, because really what we're trying to do is link accounts across the web. This method works by calculating the probability that each author wrote another author's documents. Then, for each pair of authors, it combines these probabilities, and every combined probability above a certain input threshold is considered to be the same person; below the threshold, they're considered to be different people. For example, we have some author A, and we find the probability that author A wrote author E's documents and that author E wrote author A's documents, and we do this for all pairs. Then whichever probabilities are above a certain threshold, we say they're the same author, and if they're below, we say they're distinct. This code can be found on GitHub at the link at the bottom of the screen; it also appears at the end of the presentation if you miss it.

We were actually able to augment this doppelganger finder algorithm to work better in the domain adaptation case. Over there we had to compare A to E, F, G, all of them; here we don't have to compare A to B, C, and D, because they're all in the same domain, let's say Twitter. If they're all on Twitter, they're all tweets, so we know they're not written by the same people; they're distinct.
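Here is a schematic sketch, in plain Python, of the pairwise linking step just described. The averaging combination rule and the toy probabilities are illustrative assumptions (the published algorithm's combination rule may differ); the argmax branch corresponds to the thresholdless cross-domain variant described next.

```python
from itertools import product

def link_accounts(prob_a_wrote_b, prob_b_wrote_a, authors_a, authors_b,
                  threshold=None):
    """prob_x_wrote_y[(x, y)]: probability that x wrote y's documents."""
    combined = {(a, b): (prob_a_wrote_b[(a, b)] + prob_b_wrote_a[(b, a)]) / 2
                for a, b in product(authors_a, authors_b)}
    if threshold is not None:        # open world: keep pairs above threshold
        return [(a, b) for (a, b), p in combined.items() if p >= threshold]
    # cross-domain variant: every author in A has a match in B, so take
    # the highest combined probability instead of thresholding
    return [(a, max(authors_b, key=lambda b: combined[(a, b)]))
            for a in authors_a]

# Toy probabilities for two blog accounts and two Twitter accounts:
p_ab = {("blogA", "twE"): 0.9, ("blogA", "twF"): 0.2,
        ("blogB", "twE"): 0.3, ("blogB", "twF"): 0.7}
p_ba = {("twE", "blogA"): 0.8, ("twE", "blogB"): 0.4,
        ("twF", "blogA"): 0.1, ("twF", "blogB"): 0.6}
print(link_accounts(p_ab, p_ba, ["blogA", "blogB"], ["twE", "twF"]))
# -> [('blogA', 'twE'), ('blogB', 'twF')]
```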
We get a bit of an advantage in the algorithm here, and we also don't have to use a threshold, which is definitely a huge advantage: we just take the highest of all the probabilities, because we know the accounts are somehow linked. If you're in the open-world case, where you don't know the suspect set, so you're not sure it's one of these people and you're not sure there's a perfect one-to-one pairing between the two sides, then you'd have to threshold, and you have that same issue again.

Here are the cross-domain results for the blog and Twitter data set. The green lines at the bottom were the green bars on the domain adaptation slide, so we do very terribly across domains using those methods; the blue lines are the in-domain results; and the bold red line is the result using our augmented doppelganger finder. You can see we were able to recover the accuracy to almost as high as some of the in-domain accuracies.

Then there are the limitations of doppelganger finder. First of all, you need a lot of text, in both the training and the testing documents; maybe even more than 500 words of testing documents to make this really work. Additionally, it's made for one specific case, account linking, and may not work for cases more specific than that.

The next question that naturally arises is: what if I'm trying to identify the author of a Twitter feed, and I have a bunch of blog data, but I also have some Twitter data? Do I use the Twitter data, or should I try to use domain adaptation with the blog data? The answer really is: if you have Twitter data, you should use the Twitter data. You can see at the first point there on the screen, where 10% of the data is Twitter data and the rest is blogs, using just Writeprints with support vector machines for the machine learning, that we get a huge jump in accuracy from having no tweets to having some tweets. So if you have any Twitter data, you should use it. This is mirrored in other domain adaptation methods in natural language processing as well.

There are open problems left in domain adaptation. The first is looking at other domain adaptation solutions, probably from other natural language processing problems like sentiment classification. Also, looking at how topic affects style: if you are a blogger and you have a Twitter feed, they're probably written on the same things, but if you are a redditor and you have a Twitter feed, they're probably written on different things, so how does that affect it? Or even if you have a Reddit account and you write in different subreddits on different topics, can we still identify you when you're not writing about the same thing? Another thing to look at would be other domains. And finally, is it possible for us to change how a document reads to make it feel more like the other domain? For example, we had that tweet up there that had barely any real words in it; what if we were able to make it look a little more like plain text, to make it look like that blog? Is that changing it too much, or would that work? That's definitely a huge open question that's not very easy to answer right now.

So, anonymity is really hard. Trying to make yourself anonymous, even through a lot of these methods, is difficult, and it's really not only about what you write but also about how you write it.
So even if you're doing things like monitoring the content of what you write to make sure it can't be traced back to you, or hiding your location through things like Tor, we can probably still identify you through your writing style alone. So while stylometry can combat online abuses, it's also a huge anonymity threat. Finally, we're surprisingly good at de-anonymizing texts across many domains, not just within them.

So now, all is lost; what can we do about it? Our lab is currently developing a tool called Anonymouth. This piece of software helps you anonymize your text as you write it, and it uses JStylo in the background to monitor that you don't look like the same author. This is definitely a work in progress; it could use a lot of work, analysis, and feedback. So if anyone's interested in playing with it, contributing to it, or helping with it, the GitHub link is at the bottom, and you can contact us with anything else. Thank you all for listening to all three of us. Special thanks to my contributors, Travis and Sadia Afroz, and we'll take any questions.

Well, thank you very much. We now have about 20 minutes for questions. Feel free to line up at the microphones. We'll start with number three.

Thanks for the talk. I have a question about the cross-domain research. I was wondering whether you ever tried to enrich your feature sets with metadata, like activity patterns or links used, or something like that?

We've done a little bit. We looked at Twitter specifically, because there's just so much metadata associated with Twitter, and we found that we could improve our Twitter results a little bit, but in the cross-domain case it doesn't particularly help. And our Twitter results are already at 98.9%, so any improvement isn't really much of an improvement.

Do you have any idea why that is? My expectation would be that it's a very good fingerprint of someone, like at what time that person writes, or how many links are in the text, or something like that.

Right, so we didn't collect any data on when things were posted for the blogs, so we haven't done that analysis. When we looked at metadata, we're talking about hashtags, tags, and links. The hashtags and tags don't really translate over to blogs, and as far as links go, I just don't think there's enough similarity between the domains to get any real improvement out of it. Thank you.

Number four, please. Is Anonymouth limited to English, or is it independent of the natural language you choose?

I think the current implementation is limited to English, but it wouldn't take a lot of work to extend it, to German in particular, because we have the analysis backend in German; it would just be a question of a couple of tweaks to the interface. To get extensions to further languages, what you basically need to do is augment the analysis engine with function words and a part-of-speech tagger for that language. It may be a little more difficult for, say, Asian languages that require segmentation, so you'd need a segmentation engine for those, but other than that it shouldn't be that hard. Yeah, like I said, you can already use the analysis for some languages, but the front end of Anonymouth doesn't support that currently.

Is there an abstraction layer in the code yet? There is an API, yes.

Well, thanks for the talk. I think it's a fascinating subject.
In the first half of the lecture you were talking about source code analysis, et cetera. I'm trying to understand: since one of your results is that the best features to look at have nothing to do with the surface of the code, indentation and stuff like that, but with the structure of the program itself, why are you limited to source code analysis? I mean, you can dump a binary into IDA Pro, get a flow graph of the program, and analyze that.

Yeah, you're right. This was the first time we tried a syntactic feature set, because it hadn't been tried before, and we first wanted to see that our intuition was really correct and that this would get us somewhere. Right now, with C++, we've seen that it works very well, and as long as we have a parser to get the structure of a program, this will be very helpful. So we're willing to extend our work to different languages, and we have a lot of other things left to do in future work.

Yeah, and we would like to get to the binary case; that is next on the agenda. But we wanted to confirm this first, and the nice thing now is that we can compile these programs and then directly compare the accuracy we get from the source code to the accuracy on the binary, so we can see what the difference is.

I guess you know this, but it's a very realistic problem in malware research in general. Thanks.

Okay, so you've said that you used code from Code Jam to check how your method works. Did you strip out macros? Because I know that people in such programming contests use quite a lot of macros that they add to every one of their files, because it makes programming easier later, and it's about 20 lines of such macros. If you didn't strip that out, you might not be comparing whether it's code by the same author; it might actually be the very same code.

We looked at the macros: we had a layout feature just for macros, and our AST parser works on a function-by-function basis, so most of the time macros were excluded from the structural information; we kept that separately. We did try to find out whether there are similarities like that in the code, and we didn't see too many, but if we investigate further just for that specific thing, we might find more. So that's a very good point; I'll check that.

Okay, and the second question: you found that for more advanced problems the accuracy is much higher. Could that be an artifact of there being fewer solutions for more advanced problems, and therefore fewer authors? I mean, there were more authors writing solutions for easier problems and fewer authors writing solutions for harder problems, and with fewer authors the accuracy should rise.

The data set sizes are always kept the same so that we can compare the results. We grouped the problems into hard and easy and maintained the same size, and it was a completely random selection from hundreds or thousands of users, to make sure it represents the real-world scenario.

Thank you for the talk, it's interesting. Thanks. Number two, please.

Hi there. There's a saying, "lost in translation". Have you tried taking a passage, passing it through Google Translate or some other translation program a few times, and seeing whether, when it comes back, it's still recognizable as being from the original author, perhaps with some corrections to spelling and grammar?
Yeah, a few years ago I had a project on that, where we would take the writing, translate it to German, translate that to Japanese, and translate it back, and we would do that with several different translators, such as Google, Bing, and Language Weaver, and a few others. We saw that in most cases, depending on the quality of the translator and on the particular language, we were able to identify those people with very good accuracy. But again, the quality of the translator for a particular language has a very big effect here; we were able to observe that.

I mean, the longer the path you translate through, like if you go through 12 different intermediate languages, the more unrecognizable you're going to be at the end. Now, if someone were trying to subvert a system like this, could they just do that: take the final product after 10 or 20 translations and then just make simple, not spelling corrections, but simple grammatical phrasing corrections? What sort of length did you test this on; how many translations did you run the passage through before bringing it back to the original language?

Ours was three in total: German, Japanese, and back to English, so two in the middle. But a recent paper did many translations, I think up to 20, and the more translations were done, the more unidentifiable the author became. But at the same time, the text lost its semantics; there was not much context or meaning left in it.

One thing we've experimented with in the Anonymouth program is doing this on a sentence-by-sentence basis: translate a sentence to a whole bunch of different languages and back, just one round trip each, but then rank the translations that were produced from the most anonymity to the least, putting the most anonymous at the top. Then the person can look at them, find ones that give more anonymity but are still close to the original meaning, and bring those back in. So that was one thing we experimented with.

Okay, thank you. Thanks. Number four.

I was wondering whether your work is one-way. By that I mean: how far away are you from producing a quote-unquote genuine letter from Angela Merkel, or a long-lost play from Shakespeare, with all the information you have?

So, text generation is much, much harder than text analysis; it's, I would argue, an NLP-complete problem. So I don't think we're very close. What we would be able to do is probably help somebody create a letter imitating that style: you could have it be a collaboration between the analysis engine and a person, and that would probably work quite well. But doing it automatically would be much harder.

So it could be used to aid impersonation as well? Yes. Thank you.

I have two questions, actually. My first one is: will there be something like an Anonymouth for source code?

Yeah, my analysis code is actually available on GitHub, but I haven't done the licensing yet. If you want to play with it, you can, and I'll fix the documentation and the licensing information as soon as possible.

What she means is her analysis code; we have not written an anonymizer for source code. Yeah, we don't have an anonymizer for source code, which you might call an obfuscator, maybe. You could probably edit Anonymouth to work okay, to some extent, if you're trying to anonymize source code. Yeah, but the suggestions would be bad. Yes. Okay.
And the second one: when you compared different code, did you also compare across different programming languages, or did you always compare source code in the same language?

We always looked at C++ in this case, because our AST parser was for C++ and C.

Sure, but do you think it's possible to find similarities between source code written in different languages?

Yeah, since each programming language has a structure and its own grammar, this should be possible as long as you have the parser, so it can be extended to other languages in the same manner. But that question might be tricky in practice; we'd have to do some experiments.

I actually wanted to ask a question, but my predecessor just asked it, so I only want to say thank you for a great presentation. Thanks.

Okay, I have several questions from IRC. The first one is about the Jacobsen case: you showed the comparison between the free code and the copyrighted code, and they wanted to know how you got the source code of the copyrighted code, whether it was open source, and which license it was under.

No, we didn't actually compare those: the copyrighted code is not publicly available, so we didn't try to access it. That was just an example. Thank you.

Number four, please. Okay, first of all, thanks for a very interesting talk, and also thanks for doing work on the Anonymouth solution, because it would be more concerning if this was only being applied to reduce people's privacy, which in some countries can end quite badly. My question is: if people use this tool to make their language less identifiable, can they then be identified, with high confidence, as having used that tool? Does Anonymouth leave a signature? And what's the size of the set of people who use it? Because you're only as anonymous as the number of people using the tool, if it's identifiable.

So, I don't know how many people use Anonymouth; probably not a huge amount, because if you actually try using it, it's kind of difficult. And I don't know whether using Anonymouth itself would create a signature; my guess is it probably would. What we did experiment with was looking at people we simply told to imitate someone else's style, or to try to hide their own style, without Anonymouth, and we were able to create a classifier that could distinguish people who had tried to do that from people who hadn't, without necessarily being able to identify the original author.

It just seems to me that if the stakes were high, it would be difficult to calculate when it's safer to start using this obfuscation tool versus just saying less, and it would be nice to have more analysis so that people can make that decision on an informed basis.

I agree that it would be nice. Okay, thank you. Number one, please.

Hi, yes. Many development houses use style guides, and they're pretty strict about them; you'll run things like RuboCop that will say remove spaces, use single quotes instead of double quotes, and things like that. Have you taken that into account?

First of all, our contestants had to implement the functionality in a limited time, so they used the things they would naturally use, and they expressed their own style, because they were limited in time.
On the other hand, if their style has to follow a certain format, that would make everyone more similar, and in that scenario the machine learning problem becomes even more difficult, if everyone is following a certain style guide. But there is no way for us to tell, because we don't have ground-truth information from these contestants about how they were implementing the functionality at the time of the competition.

What we can say is that it really depends on the style guide, because we know the features that we use. As in the obfuscation case, if the style guide only addresses spacing, layout, variable names, and so on, it doesn't affect the deeper structure of the code, like the AST depths and things like that, so it wouldn't really be relevant; but if it does affect that, then it would be. So it probably depends on the specific style guide itself. But we don't have any data to suggest either way.

Right, it's just that in development houses, you usually do a pull request, someone criticizes all your code, and you have to change it to make it look like everyone else's. So I was wondering whether you could pick out, out of the 3,000 developers here, who actually wrote a given piece of code, that sort of thing. Okay.

Hello, hello, yeah. I think we are going to take the last three questions and then wrap it up. Okay, so number two, please. And when we're done, we'll go to the cafeteria area and sit down at a table there, and if people want to ask more questions, they can.

Okay, so the next question from IRC is: what about multiple authors, as in open source projects? What happens to the detection of the author in such a case?

Okay, so we haven't done anything on this with source code yet, because that's, I think, a difficult problem that we just haven't looked at. We are currently looking at wikis that are written by multiple authors, which is a similar problem, and getting preliminary results, so look forward to that; we'll have something there, I guess. Do you have anything? Well, I did run a preliminary experiment with Git, but the results weren't very good, because I wasn't using the abstract syntax tree yet.

Okay, number two. My question is quite similar: is it possible to detect whether a text was written by one person or by several people?

Well, I think that's definitely part of that problem; it may be a first step to solving it. Yeah, it's something we're actively working on, but we don't have any results yet.

Does cross-domain also work across natural languages? For example, if I'm on one mailing list in German and on one forum in English, would you be able to match these accounts by style features that are independent of the language I'm using for posting?

You can use the language-independent features, or you can try translating the writing and then doing authorship attribution with an English feature set, and see which works better. Yeah, I think the best way to do that would probably be to translate both of them and then do the analysis in both of the individual languages and see what the results are. That's how I would go about it, because the n-grams and so on will be different for the different languages, so you'd probably want to translate them.

Okay, I think that's it. Thank you so much for coming, and we hope that you'll be back next year. Thank you.