Everyone, thanks for coming. It's so good to see familiar faces again at this time of the year. Today I'm going to give our traditional annual stylometry talk. I'm from Princeton University; my name is Aylin, as he introduced me, and I'm currently a postdoctoral research associate. We have been presenting at CCC for a few years now, mostly about stylometry. Our dear Rachel gave the alternative keynote on the first day, and it was non-stylometric this time, so I'm going to keep the tradition alive and talk about stylometry and machine learning today. So what happened since last year? Last year I talked about de-anonymizing programmers for about 15 minutes. This year, de-anonymization just became easier, which means there are now even more privacy concerns for programmers and open source software developers. We're going to talk about stylometry and machine learning, and at the same time we released our most recent paper on this. It's on arXiv and on my website, and if you want to read a summary of the talk or the paper, you can also check our blog, Freedom to Tinker. Let's start talking about stylistic fingerprints. If you're not familiar with stylometry, I'll give you a brief introduction. Stylometry is the study of individual style. Most of the time it has been researched in writing style, but we can see stylometry in the fine arts as well. For example, artists can be identified by their brush strokes, and in music, musicians can be identified by the tones and rhythms they use. Three years ago we presented that stylometry is also present in unconventional text, and by unconventional text I mean underground forums where cybercriminals and a variety of other people engage with each other. We can identify them as well.
We have also looked at translated text to see if translation can anonymize you, and we saw that even when you take your English writing, translate it to German, then to Japanese, and then back to English, we can still identify you. That sounds like a serious concern for someone who would like to remain anonymous. Now we have started investigating source code, because if you're going to investigate style in language, you can think of source code as another type of language: a programming language. Today I will show you the improvements in our source code authorship attribution method, and at the end of this talk we will see that style expressed in code can be quantified and characterized. That is essentially the answer to our research question. So what happens with supervised stylometry? I say supervised stylometry because I'm going to talk about machine learning today. We can identify style in some type of personal data or writing by using machine learning methods, and I will give you a very common setting. Let's say you have a set of documents with unknown authors and some with known authors, and you would like to find out who these anonymous documents belong to. What you do is take a machine learning classifier and train it on the documents whose authorship is known, creating a model for each person with known documents. After that, you can use your classifier to test and see who an anonymous document was written by. Let's think about a common scenario for this. There's Alice, the anonymous blogger, and Bob, the abusive employer. Alice is blogging about abuses in Bob's company. Bob, being abusive, is going to go ahead and collect the writing of everyone in his company, and then train a classifier so that he can identify who this anonymous blogger Alice is. Bob can do this by using stylometry and machine learning.
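The train-on-known, test-on-anonymous loop just described can be sketched in a few lines. This is only an illustrative toy, not the classifier from the talk: it builds one word-frequency profile per known author and attributes the anonymous document to the author with the most similar profile (the function names and example texts are made up for illustration).

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words relative frequencies: a crude proxy for writing style.
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def cosine(a, b):
    # Cosine similarity between two sparse frequency vectors.
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(anonymous_doc, known_docs):
    # known_docs maps author -> list of documents with known authorship.
    # Build one profile per author, then pick the most similar profile.
    profiles = {a: vectorize(" ".join(ds)) for a, ds in known_docs.items()}
    target = vectorize(anonymous_doc)
    return max(profiles, key=lambda a: cosine(profiles[a], target))

known = {"alice": ["i really truly think that this is wrong wrong wrong"],
         "bob": ["the quarterly numbers look fine to me"]}
print(attribute("this is truly wrong", known))  # alice
```

A real attack like Bob's would use far richer features (character n-grams, function words, syntax) and a proper classifier, but the shape of the pipeline is exactly this.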
I will give some other motivating, or scary, examples. For example, there was this case with a person whose username on Twitter was theconnor. She tweeted: "Cisco just offered me a job! Now I have to weigh the utility of a fatty paycheck against the daily commute to San Jose and hating the work." And then Tim Levad, a channel partner advocate for Cisco, saw this. That wasn't very good, because you can then identify who theconnor is, and her job offer might have been in danger at that moment. So what if Cisco took all the cover letters that were submitted to Cisco, trained a classifier, and tried to find who theconnor is by looking at her tweets? Because you can also identify people from their tweets. But that wasn't even necessary in this case: since you can find cached information online, she was identified as Connor Riley, and unfortunately she lost the job offer after this. So this is one example where such a method might have been applied. You need to understand that when we are talking about machine learning methods that make it possible to de-anonymize people, there can be real dangers associated with this, and you might want to be more aware of how you're sharing your information online, keeping in mind that you can always be re-identified. And what happens with source code? For example, there was this recent tweet that says: "I just heard from an intern at Apple that they disallow her from contributing to open source on her own time." That's illegal, right? It's probably illegal, but Apple could probably find out if someone is contributing to open source code repositories by looking at the code they have at Apple and then comparing any suspicious code to re-identify the contributor. And because of that, we're going to talk about de-anonymizing programmers with code stylometry today. This has been joint work with my great collaborators, and some of them are here with us today.
And why do we want to do source code stylometry? How can we even start doing this? First of all, we know that, like any language, a programming language is learned on an individual basis. As a result, you develop a unique coding style, and that can potentially make you identifiable. So we want to investigate whether that's really possible: do we leave fingerprints in source code that might make us identifiable? And why else would we do this? Maybe we want to gain some software engineering insights. For example, we might want to analyze how coding style changes over the years, or the differences between the coding styles of more advanced and less advanced programmers, or whether your coding style changes when you implement more sophisticated functionality. The main, most motivating goal here would be to identify malicious programmers who are trying to contribute malicious code or backdoors to open source software. Let's think about a common scenario. Alice is analyzing a library and there's malicious code in it, and Bob has a source code collection whose authors he knows. So Bob is going to search his collection with machine learning to find out who Alice's adversary is. And a second scenario, about plagiarism: you're a college student and you got an extension on your programming assignment, and Bob, your professor, wants to know whether you plagiarized. He's going to train a classifier on all the submissions by the other students, and then he can check whether there are extreme similarities between coding styles, similarity that doesn't really match your former coding style. These two examples were security-enhancing examples, but source code stylometry could also be very privacy-infringing, so you have to be very careful when you want to use it. For example, take Saeed Malekpour; maybe some of you remember him from last year's talk.
He was sentenced to death because he was identified as the programmer of a porn site's web software. Unfortunately the Iranian government found out about this and sentenced him to death, though he managed to get out of the entire thing because he was also a resident of another country. This is a dangerous case showing that under oppressive regimes your code might put you in a dangerous situation. So I'll move on to the more technical material and show you how our work improves the state of the art and brings some novel contributions. First of all, here is a comparison to related work, and there is not much related work, as you can see. The main difference between the features previously used to represent coding style and ours is syntactic features: we use structural features to represent your coding style. We don't just use your function names or variable names or the spaces and tabs that you use. And as a classifier we use a random forest, and I'm going to show you why these choices make a very big difference in the results. In the past, the highest accuracy that had been reached was 97%, and that was for de-anonymizing 30 programmers. But we can de-anonymize 250 programmers with 98% accuracy. So we are beating the highest accuracy from the past on a much more difficult machine learning problem, because we have a much larger data set of 250 programmers. The largest data set used in the past was 46 programmers, at 75% accuracy. But after last year's talk we were able to scale our approach to 1,600 programmers, and we get 94% accuracy, correctly identifying 14,400 source code samples of these 1,600 programmers. So this is large-scale authorship attribution; this is large-scale de-anonymization. How do we do this? We have our general machine learning setup. First of all, we need data with ground truth, so we go to Google Code Jam, an international annual programming competition.
We collected a data set in C++, because that was the most commonly used language in the competition, and we had about 100,000 users from different years. So we have our data set. Now what we have to do is find the features and properties that will represent the coding styles of these people. We pre-process the source code and obtain its abstract syntax tree using a fuzzy parser. After that we extract the features that represent coding style, and then we feed these properties into a random forest, a machine learning classifier. Then we do our classification with majority voting across the 300 random forest trees. Why did we use Code Jam? We have data from 2008 to 2014; by now we actually have 2015's data as well. The most important point is that everyone in this competition is implementing solutions to the same programming tasks. They're implementing the same algorithmic functionality, so the most prevalent thing that can differentiate these source code samples is the coding style of the programmers. At the same time, they have to implement this functionality in a very limited time, which means they don't have a chance to go back to the code, improve it, make it nicer, and copy paste some stuff from Stack Overflow. And as contestants complete rounds, the problems get harder, so we have some control over when someone is implementing more sophisticated functionality, and as a result we can infer which programmers are more advanced. As I said, C++ was the most common language, so we went ahead with C++ in our experiments. How do we represent personal coding style? First of all, we have the source code sample; here we see just five lines of code. From that, we can look at lexical features.
For example, in `int n`, n is a lexical feature, because you chose to call it n. The same goes for foo and bar, the function names: you chose those. So those are lexical features that come from personal input. Then we have layout features, like the spaces and tabs and where you put the curly brackets and things like that. But the thing that was able to represent coding style most strongly in our experiments was our use of structural features. We get those by converting the source code sample to an abstract syntax tree, which gives you the grammar and structure of the source code, and from it you can extract a rich set of features. These are also more difficult to change than just renaming foo and n, because they are deeply embedded. So we extract features such as edges, nodes, term frequency inverse document frequency (TF-IDF), or the average depth of a statement node, for example, and then we build our feature set to represent programming style. And why did we use a random forest? First of all, random forests are by nature multi-class classifiers. As opposed to a standard support vector machine, which is a two-class classifier making a binary decision, a random forest is more successful at classifying many classes. Also, since random forests use decision trees and information gain during the training process, they avoid overfitting. We want to make sure we are not overfitting to a bias in the data set or to someone's very peculiar property. So what we do is take our data set, extract all the features, and do k-fold cross-validation. That means, for example, that for each programmer we have nine source code samples; we train on eight samples from each programmer, and then we test on the ninth from all of them and see who it was written by.
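To make "structural features" concrete: the paper's pipeline used a fuzzy C++ parser, but the idea is easy to sketch with Python's built-in `ast` module. Node-type unigrams and parent-child edges, as below, are simplified stand-ins for the AST features the talk describes.

```python
import ast
from collections import Counter

def ast_node_unigrams(source):
    # Frequencies of AST node types: structural features that survive
    # renaming identifiers and reformatting whitespace.
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def ast_edges(source):
    # Parent -> child node-type pairs, i.e. the edges of the tree.
    tree = ast.parse(source)
    edges = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges[(type(parent).__name__, type(child).__name__)] += 1
    return edges

snippet = "def foo(n):\n    return n + 1"
print(ast_node_unigrams(snippet))
```

Two programs that differ only in identifier names produce identical vectors under these functions, which is exactly why structural features are hard to disguise.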
Then we validate our method on a different data set to see if the features we obtained really make sense. Let's talk about the general cases, and here I will talk about how we were able to improve the method. There is this general case: who is this anonymous programmer? This is programmer authorship attribution, programmer de-anonymization, and maybe it could be applied to Satoshi Nakamoto, the founder of Bitcoin, whose identity we don't really know. What happens here? We have 1,600 programmers, each with nine code samples, and we do nine-fold cross-validation. We extract features from their code samples, and once we train our classifier and test on the 14,400 samples, we get 94% accuracy. How could we do this with Satoshi, though? If we had a suspect set for Satoshi, we would take the suspects' previous code samples and train a classifier on them. Then, as test data, we would take Bitcoin's initial Git commit, the first original Bitcoin code, and try to see who in the suspect set it was written by. Many people ask us: so who is Satoshi? The thing is, we have a suspect set, but unfortunately the main suspect in our set doesn't have any former code samples, so we are just leaving this slide as it is. What happens if someone tries to obfuscate code? Why do people obfuscate code in the first place? They do it to make their code unrecognizable. Maybe they're plagiarizing, maybe it's malicious code, or maybe they're trying to be anonymous. But we are going to show that this is not going to make them anonymous: our authorship attribution technique is impervious to off-the-shelf source code obfuscators. Here is one example, a commercial off-the-shelf obfuscator called Stunnix. It's readily available; you can go online and buy it, and it works for many languages.
What it does, if you look here, is take all the lexical features, like the function names, variable names, and all the comments, and also all the spaces, and refactor them. But it makes no difference to the structure of the program. All the spaces are stripped, everything is refactored, characters are replaced with hexadecimal ASCII representations, and the same goes for the numbers. But since the structure of the program remains unchanged, we still get the same accuracy in de-anonymizing programmers after their code is obfuscated with such a common off-the-shelf obfuscator. In this example, we took 20 C++ programmers and obfuscated their code. We are able to de-anonymize them with 99% accuracy from their original code, and we are still able to de-anonymize them with 99% accuracy from their obfuscated code, because we have the structural features, which are very powerful. What happens if we use a more sophisticated obfuscator? Here our example is Tigress. It's a virtualizer, and it lets you apply various kinds of obfuscation methods. So we have code that's maybe 14 lines, and after obfuscation it becomes something like 800 lines, completely unreadable. If you're an open-source software developer, I'm not sure the people in your project would be happy if you contributed code like this. But it does work better at anonymizing your coding style. We took C programmers for this experiment, again 20 programmers. We are able to de-anonymize them with 96% accuracy, but after obfuscating with Tigress, which changes the structure of the program while the functionality remains the same, we get 67% accuracy. So there's almost a 30-point drop in accuracy.
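Stunnix targets C/C++, but the reason lexical obfuscation fails against structural features can be demonstrated with Python's stdlib `ast` module and two hypothetical snippets: renaming every identifier and mangling the layout leaves the syntax tree's shape byte-for-byte identical.

```python
import ast

original = """
def check_password(user_input, stored_hash):
    attempts = 0
    while attempts < 3:
        if hash(user_input) == stored_hash:
            return True
        attempts = attempts + 1
    return False
"""

# The same logic after Stunnix-style lexical obfuscation: identifiers
# replaced, layout mangled -- but the control structure is untouched.
obfuscated = """
def z1(z2,z3):
 z4=0
 while z4<3:
  if hash(z2)==z3:
   return True
  z4=z4+1
 return False
"""

def structure(source):
    # The sequence of AST node types in traversal order: a purely
    # structural view of the program, blind to names and whitespace.
    return [type(n).__name__ for n in ast.walk(ast.parse(source))]

print(structure(original) == structure(obfuscated))  # True
```

A classifier trained on `structure`-style features therefore sees the obfuscated sample as the same program, which is why the 99% accuracy survives Stunnix but drops under Tigress, which rewrites the structure itself.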
But compared to random chance, which is 5%, the chance of correctly identifying someone out of 20 at random, 67% is a very high number. It shows that your code is certainly not completely anonymized once you apply this obfuscator. So this essentially answers the question: obfuscation is not the solution to anonymization in source code. What happens to coding style over the years? We want to see if coding style is consistent, because if it is, we can take someone's code from years ago and test it against code from this year. For this, we took 25 authors from 2012, trained a classifier on them, and then our test data came from 2014. We were able to correctly identify them with 96% accuracy; if we did this within 2014 alone, the accuracy was 98%. So there's only a 2% change in accuracy, which shows that coding style is largely persistent across years. We also wanted to generalize our approach, and quickly, to see how feasible it is, whether it can be applied to other programming languages, and how easy it would be for someone else to use our code for a different programming language. For this, we used only structural features, and we used the AST module that comes with Python to generate the ASTs of Python source code. We were able to de-anonymize 229 programmers from their abstract syntax tree features alone with 54% accuracy. And if we do top-five relaxed classification, which means the classifier returns probabilities, saying this is the most probable person, this is the second most probable person, and so on, and you count a result as correct if the true programmer is within the top five, then we are able to increase the accuracy to 76%.
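Relaxed classification is easy to pin down concretely. This is a minimal sketch, with made-up rankings and author names, of how top-k accuracy is scored: a sample counts as correct if the true author appears anywhere in the classifier's top k candidates.

```python
def topk_accuracy(rankings, true_authors, k):
    # rankings: for each test sample, candidate authors sorted from most
    # to least probable. A sample is correct under relaxed (top-k)
    # classification if the true author appears anywhere in the top k.
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(rankings, true_authors))
    return hits / len(true_authors)

rankings = [["carol", "alice", "bob"], ["bob", "carol", "alice"]]
truth = ["alice", "alice"]
print(topk_accuracy(rankings, truth, 1))  # 0.0 -- strict classification
print(topk_accuracy(rankings, truth, 2))  # 0.5 -- relaxed to top two
```

Strict classification is simply the k=1 case, which is why relaxed numbers are always at least as high.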
If we do this for 23 programmers, we get 88% accuracy, and with top-five relaxed classification we get close to 100% accuracy. Why would you do relaxed classification? Let's say you have a huge data set, and maybe you're willing to do some manual analysis, but first you would like to reduce your suspect set size. You can relax the classification and then look at, say, the top 10 manually to understand things better. In our results, we see that we are introducing a new principled method with a robust syntactic feature set for de-anonymizing programmers. This shows there is a serious concern for anonymity if you're an open source software developer, or just a programmer, and we will soon talk about executable binaries, where there is no source code at all. For future work, we are planning to look at multiple authorship detection, for example in Git repositories: can we find the multiple authors, and can we identify which part was written by whom? We would also like to look into anonymizing source code, because we saw that obfuscation is not the answer. And then, what about stylometry in executable binaries? Executable binaries are compiled code, and it turns out that when you compile code, coding style features still persist into the compiled version. So this is what happens: we have source code, maybe 20 lines (I don't have all of it here), and once you compile it, you get a binary, zeros and ones, thousands of them. I don't think I can personally understand anything from that, and I don't think we can de-anonymize it just by looking at it. So now I'm going to talk about the second part of this talk: what happens when you compile code and try to de-anonymize programmers from their executable binaries? This is the paper that just went public today; if you want, you can look at it on my website. Why would we want to do that? First of all, the research question.
Does coding style exist in binary code, and is there a threat to privacy and anonymity? Can we de-anonymize programmers from compiled code? And maybe, at the end, can we use this for malware family classification? This is the approach taken in related work. Since I've shown you our machine learning workflow, I think you're getting the idea of how machine learning works here: you have your data set, you extract features that represent a class, you feed them to the classifier, and then you test to see which class a sample belongs to. In related work, they took executable binaries and disassembled them with reverse engineering methods, and they also obtained the control flow graphs. They extracted features from the assembly instructions and the control flow graph, used information gain to find the most prevalent stylistic features, and then used a support vector machine. I would like to remind you that we don't favor support vector machines for multi-class classification problems; that's why we use random forests. And then they de-anonymize a programmer. What we do is take our data set, and we use the same data set as them so that we can make a real comparison. We disassemble the binaries with reverse engineering, and we also decompile them. With decompilation, we can get a source code representation of the binary, and we can then apply all the source code feature extraction methods, such as abstract syntax tree generation, to it. We also get the control flow graphs, and we run information gain on all of these to see which features belong to an author instead of just being random properties of the code. Then we do our classification to de-anonymize the programmer. Some features for this would be, for example, our common AST features, structural properties coming from the abstract syntax tree.
These are things like node unigrams in the abstract syntax tree, or AST bigrams, which are edges. We do similar things for the control flow graphs and get their unigrams and bigrams. But remember that since this is decompiled code, the abstract syntax tree and control flow graph are something like 10 or 20 times larger than the originals. So we get a lot of features, and they look very similar to each other, because they have been reverse engineered with the same tools. For example, when we have 100 programmers, we extract features from their 900 binary executable samples and end up with about 200,000 features. Once we run information gain on this, we see that only 426 of them in this particular data set represent coding style, so we focus on those 426 features. And what happens when we try to de-anonymize 100 programmers? We wanted to see how much training data we need: how many binary samples do I need to accurately de-anonymize a programmer? With one binary sample per programmer and 100 programmers, you can still re-identify them with 20% accuracy. Once we used eight training samples, we got to 78% accuracy in de-anonymizing 100 programmers. It seems that with more binary samples we could increase accuracy further, but our data set didn't allow that with 100 programmers. And with relaxed classification, which I mentioned towards the end of the first part: with 100 programmers, when we relax the classification to a set of size 10, we get 95% accuracy in reducing our suspect set size. So let's say you start with 100 programmers and their binaries and you want to focus on 10 of them; with 95% probability, the correct programmer will be within that set of 10 people. We also wanted to see what happens with a smaller data set size.
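The information-gain filtering step, going from roughly 200,000 candidate features down to the 426 that actually carry style, works in principle like this stdlib sketch (a toy over binarized features and author labels, not the implementation used in the paper): a feature's gain is how much knowing its value reduces uncertainty about the author.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label distribution, in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # How much knowing a (binarized) feature value reduces uncertainty
    # about the author label. Zero means the feature is random noise.
    base = entropy(labels)
    cond = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        cond += len(subset) / len(labels) * entropy(subset)
    return base - cond

authors = ["a", "a", "b", "b"]
print(information_gain([1, 1, 0, 0], authors))  # 1.0 -- perfect predictor
print(information_gain([1, 0, 1, 0], authors))  # 0.0 -- pure noise
```

Features scoring near zero are exactly the "random properties of the code" the talk mentions discarding; only features with positive gain are kept for the classifier.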
Here, with relaxed classification, we can get close to 100% accuracy after relaxing to a suspect set size of four, and we get a full 100% accuracy at a suspect set size of eight. We also wanted to see what happens when we use just one training sample for 20 programmers. With 20 programmers, if you have just one sample from each, you train on 20 files to generate 20 classes, and then, once new samples are given to you, you can correctly attribute them with 75% accuracy. And that's kind of scary: if you have just one binary out there that is known to belong to you, and you're in a suspect set of size 20, there is a 75% chance that your anonymous binary will be identified as yours. We then wanted to scale this up, so we went from 100 programmers to 600 programmers. In this case we see that accuracy gradually decreases, and with 600 programmers we get 52% accuracy. Here I should mention that in the previous part we had 1,600 programmers, but since we are using the same data set, we had to compile the source code so that we could use it in a controlled setting with the same compilation options, and we couldn't obtain 1,600 programmers' binaries after that compilation. Some code just didn't compile, and some programmers didn't have enough compiling samples left, so we had to drop all of those programmers. There hasn't been much work done in this area. There is one major paper, published by Rosenblum, and it's a great paper; it uses the workflow I showed you at the beginning of this second section. For example, with 20 programmers, they get 77% accuracy, and they use more training samples than us. It's not entirely clear, but the smallest number of training samples they use is 8, and it goes up to 16. And we know that with more samples, you generally get higher accuracy.
But in our case, with 100 programmers, we get 78% accuracy. When we look at their 100-programmer data set, we see that they get 61% accuracy while we get 78% accuracy, and again, they are using more training samples. In the end, we are able to scale our approach to 600 programmers, whereas their largest data set is almost 200 programmers, and their 200-programmer data set gets the same accuracy as our 600-programmer data set, which is a more difficult machine learning problem. So this is a real improvement in accuracy. What happens if we optimize code? Is that the equivalent of obfuscation for binaries? Will it anonymize the code? For the first time in the literature, we tried compiler optimization and stripping the symbols. We saw that with 100 programmers and no optimizations, simply compiling the code, we can de-anonymize the 100 programmers with 78% accuracy. After compilation, once we strip the symbols from the binaries, we get 66% accuracy. And once we start applying more optimizations, like the common level-one optimization, then level-two optimization, which is cumulative, including the previous optimization levels and making the program more efficient and maybe faster and smaller, we see that the accuracy does not decrease dramatically. With the highest optimization level that we tried, level three, we get 60% accuracy in correctly de-anonymizing these 100 programmers. So compiler optimization is not the solution to anonymizing binaries either. We also wanted to figure out which features remain in binaries and represent your coding style, because binaries are so cryptic that it's difficult to tell what's going on and what kinds of transformations happen during compilation. For this, we came up with a machine learning setting where we have the same code samples, with numeric feature representations of both the original code and the compiled code.
What we tried to do was, given the compiled code, predict the features of the original code, which has not been compiled at all. Once we did that, we generated predictions for the features of the original code, and we wanted to see how similar they were. There is no very simple way to make a direct comparison between these predictions, so we looked at cosine similarity, and we saw that the predictions were 80, 81% similar to the original code features. We also did one more experiment: we took the original code features and the compiled code features and looked at the similarity between those two, and the cosine similarity was 0.35, so about 35% similarity, which is much less. This shows that coding style properties in compiled code are certainly getting transformed, but the transformation does not wipe away all coding style features; somehow they remain embedded in the binary. And this can be a concern for de-anonymization and remaining anonymous. We also wanted to see if we could gain any insights. Again, we looked at the differences between the binaries of more advanced programmers, those able to advance to the more difficult rounds, and we saw that even in the binaries you can see when a programmer is more advanced compared to programmers with a smaller skill set. To do this, we generated two subsets of the data set. The first one was people who were only able to complete the same seven problems, and the second one was people who were able to complete 14 problems, including the seven in the first subset. We used only the seven samples shared between these subsets and checked how well we could de-anonymize these programmers. For the more advanced programmers, we got 88% accuracy in correctly de-anonymizing their binaries; this is a 20-programmer data set.
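The cosine comparison itself is straightforward. Here is a sketch with made-up four-dimensional feature vectors (the real vectors had hundreds of dimensions, and the actual similarities reported in the talk were about 0.81 and 0.35): predicted-from-binary features sit much closer to the original source features than the raw post-compilation features do.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two dense numeric vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical stylistic feature vectors, for illustration only.
original = [3.0, 1.0, 0.0, 2.0]            # features of the source code
predicted_from_binary = [2.5, 1.2, 0.1, 1.8]  # predicted back from the binary
compiled_raw = [0.0, 4.0, 3.0, 0.5]        # raw features of the binary itself

print(cosine_similarity(original, predicted_from_binary))  # high, near 1
print(cosine_similarity(original, compiled_raw))           # much lower
```

The gap between the two numbers is the point of the experiment: compilation transforms the feature space heavily, yet enough style survives that it can be mapped back.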
For the less advanced programmers, we got 80% accuracy. So somehow there is more coding style present in the source code of the more advanced programmers, and it survives compilation better. To validate this, we tried the same setting with a six-problem version: a subset that was only able to complete six problems and a subset that was able to complete 12 problems, including the six in the first subset. Again, we see the same result. The accuracies are lower here because we have fewer training samples, so our machine learning model is less accurate than with more samples: we get 87% accuracy in correctly identifying the more advanced programmers, whereas we get 78% accuracy in de-anonymizing the less advanced programmers. We have been working with Google Code Jam, which is a very controlled environment for running these experiments, for the reasons I explained. And many people ask: is this de-anonymization so successful because of your Google Code Jam data set? We wanted to see what the difference would be if we tried de-anonymization in the wild, so we collected a data set from GitHub. For this, we crawled GitHub and found single-authored repositories. These repositories had to have at least 500 lines of code and at least 10 stars, and their owners had to have several repositories on GitHub. So we had some requirements, and after all of our restrictions we ended up with 49 programmers; you can refer to our paper for the details of these data sets. We had 117 repositories, but unfortunately GitHub code is sometimes very difficult to compile, so after compiling these, we ended up with 12 authors and 50 binaries. On this data set, we are able to de-anonymize the programmers with 62% accuracy.
And we tried to generate the exact same kind of dataset from Google Code Jam, using the exact same number of binaries per programmer, and there we were getting 68% accuracy. So this is a very promising result for running programmer de-anonymization in the wild. For future work, we would like to look at anonymizing executable binaries, because what we are showing you here is a privacy problem: we are able to de-anonymize programmers with very high accuracy in a very simple machine learning setting, and we can do this at large scale. For executable binaries, we showed that compiler optimizations are not the solution to anonymization. We would also like to look at de-anonymizing collaborative binaries that have been written by multiple people. And we would really like to find out if we can extend this to malware family classification, but for that problem we need a dataset with some ground truth. So if anyone in the audience is interested in that and has a ground-truth dataset we can work with, please come and talk to me at the end of the talk; that would be amazing for us.
We also have some tools available from these projects. You can find all of them online, and you can send us emails if you want to run things in different settings; we have announced these in our previous talks as well. We start with the main one: the source code programmer de-anonymization tool is on my GitHub account, so you can just Google for it, find it, and run it. We also have JStylo, an authorship attribution framework where you take some documents with known authors.
And then you can try to identify an anonymous document. It gives you many machine learning options for doing that, and different ways to generate features and so on, so that you get hands-on experience running this machine learning setting and see how de-anonymization works. On top of JStylo, we have built Anonymouth. Anonymouth uses JStylo as its back end, and it is a framework that helps you anonymize your writing style. You give it a suspect set that includes you, and if you want to make sure you are anonymized within that suspect set, Anonymouth will model all the authors in the set and then give you recommendations and suggestions to make your writing more anonymous. This might be a very helpful tool for people in oppressive regimes, for example, who really want to make sure they can write anonymously so that they don't get in trouble. Because, as with the Alice and abusive employer Bob example, even if you are blogging through Tor and think you're anonymous, your writing style can make you identifiable. I would like to thank all my collaborators for this great work; without them it wouldn't have been possible. I have some backup slides, but if you have a lot of questions we can just go to Q&A now. Thanks for coming. It's very exciting. Thank you very much.
We have about 20 minutes for Q&A, so there's plenty of room for questions. The first two questions go to the internet, and people who feel they have to leave, please do so quietly. Okay, the first question is whether your technique also works on shell commands, so that I can see who actually did something in my system.
Can you repeat the question? It also works on what? Can you speak a little louder? Oh, I'm sorry.
Does it also work on shell commands? If you look at a session of someone logged in on a computer, can you analyze who it is?
I'm not sure I understand the question. Are you asking about sessions?
If you have a lot of unique command lines that you see someone has entered somewhere, can you also try to find out who the author of these command lines is?
Oh, just command lines. As long as you find the correct features for that, I believe you can do it. Our current research does not exactly apply to that, but since we can do this on various kinds of textual data, and we have shown we can do it on many other things beyond what's mentioned in this presentation, I believe it might be possible.
Okay, next question from the internet. Do you also look at comments, like the style of writing comments or the position of comments?
For these datasets, to make sure there was no personally identifiable information left that would bias the classifier, we removed the comments, so we don't have comments here. But we have also looked at it with comments, and the de-anonymization accuracy usually just increases.
Before moving to questions from the room, please remember that questions are short sentences ending with a question mark. First question over here.
Hi, what do you think the chances are of having something like a compiler switch that automatically anonymizes code by doing structural refactoring, like the Anonymouth thing you mentioned?
That's a very good point, and it's a similar idea to Anonymouth, because you're trying to convert the features that make you de-anonymizable. That might be a good experiment to try in the future, so I will keep it in mind when we start our anonymization work. Thanks.
Next question over there. Yes, thank you for the talk.
Let's say I am in Iran and I write code for a porn website. Obviously complete anonymization is not possible, but can I maximize my chances of not being identified and executed?
So first of all, with Saeed Malekpour, for example, he was identified because his name was on the code, so that was a direct identification. But as you suggest, if you are very careful about being anonymous, maybe you can try to follow very strict conventions and make sure that everyone in the project follows the same conventions, so that all of you look very similar and cannot be distinguished from each other. That might be one solution so far, or it might at least help you.
Okay, next question over there. My question is, I assume for all of those numbers you had a set of people you trained into the system, and you gave it one sample from a person you knew was in that set. Have you ever tried giving it a completely unrelated code sample? Is the software able to tell that it's someone who is not part of the reference set, and if so, with what probability?
You can look at this slide. This is a verification problem in machine learning, and a one-class or two-class classifier can help you verify whether an anonymous code sample comes from someone in your suspect set, that is, the set of programmers you trained the classifier on. For this setting, what we did was: we have Mallory, and Mallory claims that this code has been written by her, and we want to find out if it was really written by her. So we have two classes, one with only samples from Mallory, and a second with samples from random people, which represents the outside world, anyone but Mallory.
And in this case we can take the code Mallory claims to have written and see if it was really written by her: is it going to be attributed to the outside world, which is not Mallory, or really attributed to Mallory? For this case we get 91% accuracy over 80 repetitions of such a setting. If you want more details about the probabilities and how we threshold the verification, look at our first paper, which has the details, or we can chat later.
But you haven't tried, when you don't know who it could have been, if you say okay, we have a suspect list of 50 people, you haven't tried seeing whether it would be possible to determine that it's not one of those 50 people?
With verification, instead of this two-class setting, think about a 50-class setting with 50 programmers. If a sample is attributed to one programmer below a certain probability, then you might be able to say: this looks like this person, but this is tricky, maybe it's not this person, because it's not a very confident classification. Does that answer? Oh, I think I'll just talk to you in person. Okay.
Let's move to the next question over here. What about coding style guides? Most languages have a default coding style guide, and there are code formatters you can run automatically. Would that help?
Yeah, I think the previous question was similar. If you follow strict conventions, and everyone does, that should normalize your coding style to some degree. But we saw with the compilation case that compilation is also a kind of normalization of your coding style, because it converts everything to a set of rules and everything becomes very similar, and even then we are able to de-anonymize programmers, though with lower accuracy. So it might help you be more anonymous, but I'm not sure it would be the exact solution.
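The confidence-thresholding idea from the verification discussion above can be sketched roughly as follows; this is an illustrative toy with synthetic data and a generic scikit-learn classifier, not the study's actual features or model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 5 "suspect" programmers, 20 samples each, 10 hypothetical
# numeric style features per sample (purely illustrative).
rng = np.random.default_rng(0)
n_authors, n_per_author, n_features = 5, 20, 10
X = np.vstack([rng.normal(loc=3 * a, scale=1.0, size=(n_per_author, n_features))
               for a in range(n_authors)])
y = np.repeat(np.arange(n_authors), n_per_author)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def attribute(sample, threshold):
    # Attribute only when the top-class probability clears the
    # confidence threshold; otherwise abstain (return None).
    probs = clf.predict_proba(np.asarray(sample).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

# A sample sitting right at author 2's feature profile is attributed
# confidently; one midway between two authors yields split votes, and
# with a strict threshold the classifier abstains.
print(attribute(np.full(n_features, 6.0), threshold=0.5))
print(attribute(np.full(n_features, 7.5), threshold=0.99))
```

Abstaining below a probability threshold is only a heuristic for the open-world case, which is why the speaker calls it tricky.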
But no numbers on how much it would change?
The problem with that experimental setting is that we don't have such a dataset to test it on. If someone from industry has a dataset where programmers follow strict conventions, we should be able to get more answers to this question. Thanks.
I think the internet might have a few more questions. Yes: can you use your technique to actually forge code to look like it was written by a certain programmer?
Yes, this is possible with machine learning. For example, with Anonymouth, which anonymizes writing style rather than coding style, you are within a suspect set and you try to de-anonymize yourself in that set, and then you try to bring in features that do not represent your style. If instead you bring in features that belong to someone else specifically, and you can see what those features are, then your style becomes more similar to that person's. Once you have this framework for source code, that should be possible.
Next question over here. Hey, you said there's a difference between advanced and less advanced programmers. So my question is: given a set of binaries, can you tell which ones were written by advanced programmers?
That's a very good question. We haven't tried that, but I think you could come up with an experimental setting to test it; we just haven't done it yet. Okay, thank you.
Hi, two questions. First, have you considered using skip-grams, like skip n-grams with n greater than three, so you get more support? Second, how about unsupervised learning?
Yes, I'll answer both. For the skip-grams, or multi-n-grams, you can do that. We tried it.
It just takes a very long time to extract those features and then run a classifier on them, and we saw that adding more n-grams or skip-grams or a variety of powerful features doesn't really help, especially for source code, because the dimensionality is already so high. We also want to avoid overfitting to the dataset by generating very detailed features that might bias the classifier. At the same time, we are lucky to have an Amazon research grant that lets us run our experiments on EC2, which makes things much faster, but even so I wouldn't want to extract thousands or maybe millions of features by going into such detail.
For the second question: yes, you can do unsupervised learning. You can try to cluster samples based on certain properties found in code, and those properties could be coding style properties. That's possible, but we haven't gone through that setting. Thanks. Yep.
Hello. Have you thought about researching the use of code metadata to classify it, like Git commits, commit messages, workflow, or even the set of libraries used? And of course, how to protect against that classification?
In the beginning, I actually started this project with exactly the kind of dataset you describe, with all the Git commit messages. Yes, it's possible; you can de-anonymize those programmers, and it probably makes the method more powerful. But with GitHub it becomes a multi-author problem, so it relates to the future work where we would like to look at multi-author source code authorship attribution, and the same for binaries; we are working on that currently. Thank you. Hi.
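The unsupervised clustering idea mentioned above could look roughly like this; the "style" features and data are synthetic stand-ins chosen for illustration, with no author labels used at all:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Two hypothetical authors with distinct style profiles:
# 15 code samples each, 6 numeric style features per sample.
author_a = rng.normal(loc=0.0, size=(15, 6))
author_b = rng.normal(loc=5.0, size=(15, 6))
samples = np.vstack([author_a, author_b])

# Cluster the samples without telling the algorithm who wrote what.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(samples)
labels = kmeans.labels_

# With well-separated style profiles, each author's samples should
# land in a single cluster.
print(labels[:15])
print(labels[15:])
```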
Using your approach, do you think it's possible to abstract these models even more, so that you can train on datasets constructed from one programming language and then classify samples from another programming language? Did you do experiments in that direction? Thank you.
That's a very good question. We haven't done experiments on that; again, it's difficult to find ground-truth data for it. Even with Google Code Jam, some programmers write in multiple languages, but that's a very small set, so we haven't looked at it yet. But when you compare, for example, C and C++, yes, you can do cross-training and testing, because the nature of the two languages is very similar when you are coding.
Next question goes to the internet. Yes: would that also work on assembly, where you do not really have something like variable names?
With assembly, let me go back to the slide. Oops, I just closed it, just a second. Okay, here we go. We get assembly features from two different disassemblers to make the feature set stronger and richer. We don't get that many, something close to a hundred, and it's very difficult to tell what they exactly mean, because some of them are just unigrams and it looks a lot like overfitting. But our reconstruction experiments show that the assembly somehow still preserves coding style in it somewhere as well. Does that answer the question? Yeah, online. It's not online. Over there.
Hi, do you think it would be possible to combine your research with, for example, social graph analysis to find collaborations between programmers or groups?
Yes, you can do that.
You can just add that as an extra machine learning feature and improve your classifier. But in this research we are particularly interested in finding and quantifying coding style, so we wanted to exclude any other information, which might otherwise make people more identifiable.
Did you do any research in the direction of different compilers for the same language? For example, does GCC anonymize better than LLVM, or can it help anonymize my code if I use different compilers for different binaries or projects?
A mix and match might help a little. We haven't investigated that question in particular, because when we look at related work, we see that compiler detection is more or less a solved problem. As long as we know the compiler, and it can be detected, we can come up with a setting where the source code has been compiled with that compiler, or try to get rid of the properties that compiler introduces. But as you said, mixing and matching might help you anonymize a little, and that might be a good place to start our anonymization experiments. Thank you.
Hi, I have a question regarding the real-world application you talked about. What is the statistical probability that this is just based on pure luck? I mean, you have 60 or 70% de-anonymization for 12 programmers. How high is the probability that this is just random?
Yes, it's difficult to talk about statistical significance in this case because we have a smaller dataset, but at least we can compare it to the Google Code Jam dataset, where we know there is statistical significance. That's why we think the result is meaningful. It's in our future work to apply this to larger real-world datasets where we can talk about statistical significance for sure. That's a very good question that we have been thinking about. Thank you.
What may be the last question goes to the internet? Well, yes.
So the question is: which variables were the most important in the random forest analysis?
Let me open that slide; we should have a slide for that. The most important ones are bigrams, but do you mean real variables, or are you talking about features? For source code authorship attribution, the most important features for the random forest are, first of all, word unigrams: things like your function names, or your choices between int and double, things that come from your source code. And then we have the bigrams that come from the abstract syntax tree. Even though they make up a smaller share of the feature set than the word unigrams, their information gain is almost equivalent to the entire information gain of the set. So in this case, the AST bigrams are the most important.
Okay, thank you very much. Time is up. Thank you.
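As an illustration of the kind of feature-importance ranking discussed in that last answer, here is a generic scikit-learn sketch with synthetic data; the feature names are hypothetical, and sklearn's impurity-based importances play the role of the information gain ranking, not the study's exact computation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Synthetic stand-ins: two informative "style" features that differ
# between the two authors, plus three noise features with no signal.
n = 200
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=y[:, None] * 3.0, size=(n, 2))
noise = rng.normal(size=(n, 3))
X = np.hstack([informative, noise])

clf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)

# Impurity-based importances sum to 1 over all features; the
# informative style features should dominate the ranking.
names = ["style_a", "style_b", "noise_0", "noise_1", "noise_2"]
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```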