I'm Aylin Caliskan. I used to be Rachel's PhD student, but I just started as a professor, and this is my first talk as a professor, so it's very exciting for me. I'll go over the first half of the slide set, and after that Rachel will continue with the new findings in our research. Today we'll be talking about how we can de-anonymize programmers based on their coding style.

Stylometry is the study of style in language. When we say language we mostly think of natural language, for example the English we speak or our native languages, but there are also artificial languages; programming languages are artificial languages, and with stylometry we wanted to look at all kinds of languages. On the natural language side, we have been looking at English, at English as a second language to identify the native language of a speaker, and at translated text, again to identify the native language, the translator that was used, or the author. We have also been looking at underground forum texts, where forum users engage in business transactions and so on, and we are still able to identify the authors from their messages. On the artificial language side, we wanted to see whether coding style is unique to each programmer, so that it becomes a fingerprint for them. We have been focusing on Python as well as C and C++. We looked at source code and saw very high accuracy in de-anonymizing programmers from source code, so we wanted to see if we can do this with binaries as well.

The tools we have developed and made open source are being used by many researchers and by different agencies, such as the FBI, and by expert witnesses, who can use the scientific findings in court while testifying. European high-tech crime units are using this, for example, to identify suspects on different online platforms. Regarding artificial languages and code, DARPA is interested in this project, as you might imagine, since they are part of the Department of Defense and might want to know the identities of malicious actors, as are expert witnesses and the U.S. Army Research Laboratory, which we have been collaborating with for four years now.
Okay, why would we want to de-anonymize programmers? First of all, out of scientific curiosity: since we learn programming on an individual basis, do we end up developing a unique coding style? Beyond that, this can be used for software forensics or detecting plagiarism, for verification of authorship, and it can aid in copyright investigations. But at the same time, such security-enhancing technologies can be very privacy-infringing, and they can be used for surveillance and to track programmers. Saeed Malekpour is one example of how a security-enhancing technology can at the same time be very privacy-infringing. He is an Iranian citizen who was identified as the web programmer of a porn site; when he went to Iran he was arrested and sentenced to death, and he has been in prison for years now, unable to get out even though he is a Canadian resident, just because he was identified as the programmer of a site that is against the Iranian government's views.

Okay, how can we use source code stylometry, from a machine learning perspective? I'll try not to go into too many details, but to give you the basics of machine learning so that you understand the flow of how we de-anonymize programmers. We are looking at different tasks: multi-class machine learning tasks, such as software forensics; two-class tasks, such as plagiarism detection and copyright investigations in two-party cases; and authorship verification, which is a one-class-versus-open-world machine learning task. For all of these we have the traditional machine learning workflow. First, we need training data that is representative of what we are looking for. From this training data we extract features that are representative of coding style, and we feed these features into a machine learning classifier, so that it learns each author's coding style from the extracted features. After that, we take the test samples and use the classifier to identify who a given source code sample, binary, or text sample belongs to. We are using random forests because they are multi-class classifiers by nature and they don't tend to overfit.

When we use this classic machine learning workflow, we get very high accuracy in de-anonymizing programmers, which shows that if programmers would like to remain anonymous, for example while contributing to open source repositories, this is a serious threat to their anonymity. For example, in a large-scale de-anonymization of 1,600 programmers, each with nine source code samples of on average 70 lines of code, we get 94 percent accuracy in identifying the authors of the 14,400 code samples.
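To make that workflow concrete, here is a minimal sketch of the train/classify loop, assuming scikit-learn and a precomputed feature matrix (one row of stylometric features per code sample, one author label per row). The file names and shapes are illustrative assumptions, not part of our released tooling.

```python
# Minimal sketch of the de-anonymization workflow (assumes scikit-learn).
# features.npy / authors.npy / disputed.npy are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.load("features.npy")   # (n_samples, n_features) stylometric features
y = np.load("authors.npy")    # author label for each sample

# 500 trees, as in the talk: random forests are multi-class by nature
# and don't tend to overfit.
clf = RandomForestClassifier(n_estimators=500, random_state=0)

# Nine-fold cross-validation mirrors the nine samples per author.
print("mean accuracy:", cross_val_score(clf, X, y, cv=9).mean())

# Attribute a new, disputed sample to its most likely author.
clf.fit(X, y)
disputed = np.load("disputed.npy")
print("predicted author:", clf.predict(disputed.reshape(1, -1))[0])
```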
To develop the method, we first need a controlled environment, and for that we chose Google Code Jam as our development data set. Google Code Jam is an annual competition in which contestants from all over the world try to solve algorithmic problems within a limited amount of time. Correct solutions get posted online by Google, and contestants advance to higher rounds where the problems become harder and they have to implement more sophisticated functionality, so we can control for the difficulty of the problem as well as for how advanced the programmer is.

So we have our data set: we collected source code samples from 1,600 programmers. We pre-process the code samples, in particular to get the abstract syntax trees from the source code, and for that we use a fuzzy abstract syntax tree parser, which can parse even incomplete source code. With the abstract syntax tree, which represents the grammatical structure of code, we start extracting features and feed them into a random forest; each of the 500 trees in the random forest votes for one particular programmer as the most likely author of a disputed test sample, and then we do the classification.

When we talk about features, we look at different categories. For example, in a source code sample we can see how function names or variable names are chosen by programmers. These are higher-level features that can be changed more easily, and they are called lexical features; spacing and formatting are also part of these lexical and layout features. But there are also syntactic features: analogous to the grammar of a natural language, these capture the syntax of the programming language, and when we get to the abstract syntax tree we can see how complicated this structure can get. So on the lexical side we extract features such as variable names, function names, and spacing bigrams, while on the abstract syntax tree side we look at more structural features: we extract features over about 50 different abstract syntax tree node types, such as function and statement nodes; node bigrams, meaning two nodes connected to each other by an edge; and the average depth of a given node, for example. These are very identifying features, and they are not as trivial to change quickly.
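As a toy illustration of the syntactic features just described, here is how two of them, node-type bigrams (two nodes connected by an edge) and average node depth, could be computed with Python's built-in ast module. The real pipeline uses a fuzzy C/C++ parser, so this is only a sketch of the idea, not our actual extractor.

```python
# Toy AST feature extraction with Python's own ast module.
import ast
from collections import Counter

def ast_features(source):
    tree = ast.parse(source)
    bigrams = Counter()   # (parent type, child type) pairs, i.e. edges
    depths = []

    def walk(node, depth):
        depths.append(depth)
        for child in ast.iter_child_nodes(node):
            bigrams[(type(node).__name__, type(child).__name__)] += 1
            walk(child, depth + 1)

    walk(tree, 0)
    return bigrams, sum(depths) / len(depths)   # bigram counts, avg depth

code = "def f(x):\n    return [i * x for i in range(10)]\n"
bigrams, avg_depth = ast_features(code)
print(bigrams.most_common(3), avg_depth)
```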
How can we use these features in real-world scenarios? Let's try to replicate one. Say we would like to find out who Satoshi Nakamoto is, and say we have a suspect set of size X. We take source code samples that the suspects wrote in the past, so that we can train a classifier on this training data from the suspect set and extract features; then we take Bitcoin's initial Git code as the test sample and see which programmer is most likely the author of Bitcoin's first Git commit. When we replicate this scenario with 1,600 programmers from Google Code Jam (though this is not a real suspect set), using nine files for each of them, we get 94% accuracy in correctly identifying the authors, with nine-fold cross-validation.

What happens if a programmer wants to stay anonymous and knows that coding style would give them away? Obfuscation is the first thing that comes to mind. There are off-the-shelf obfuscators such as Stunnix; it's available online and many programmers use it. When we use it to obfuscate our code, we can compare the original sample, with its particular spacing and formatting, abstract syntax tree structure, and function and variable names, against the obfuscated version: all the function names and other lexical features are refactored to random representations, the comments are replaced with hexadecimal ASCII representations, spaces are stripped, and so on; everything is refactored. But we see that the de-anonymization accuracy is not affected by such obfuscation at all, because when we look at how the obfuscation happened, we see that the refactoring did not touch the abstract syntax tree; it remains unchanged, so our method is impervious to such off-the-shelf obfuscators.

What happens when we use a more sophisticated obfuscator, such as Tigress? I take about 15 lines of code, obfuscate it with Tigress's function virtualization, and end up with about 500 lines of code. It looks much more cryptic, I cannot easily tell what's going on at a high level, and, most importantly, the abstract syntax tree changes. This affects accuracy significantly: the accuracy in identifying C programmers was 96 percent for 20 programmers, where the random chance of correctly identifying a programmer is 5 percent, and after obfuscating with Tigress the accuracy drops to 67 percent. That is a significant drop, but compared to the 5 percent random chance, 67 percent is still a serious threat to anonymity, even with such a sophisticated obfuscator.

Another real-world case would be authorship verification. For example, someone comes forward and says, "I am Satoshi Nakamoto." In that case, we can ask for their past code samples and use them to train a classifier in which this person, call them Mallory, is one class, and the second class is the open world: random programmers from the open world. Then I take the Bitcoin source code and see whether it belongs to Mallory or to the open world. So, as long as we have training data from the claimant's past, we can check such claims in open-world scenarios.
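A rough sketch of this two-class verification setup, under the same scikit-learn assumptions as before: one class is the claimant's past code, the other is a sample of open-world programmers, and the disputed code is scored against both. File names and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of open-world authorship verification (assumes scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_mallory = np.load("mallory_features.npy")   # hypothetical past samples
X_world = np.load("open_world_features.npy")  # random other programmers

X = np.vstack([X_mallory, X_world])
y = np.array([1] * len(X_mallory) + [0] * len(X_world))

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X, y)

disputed = np.load("bitcoin_commit_features.npy").reshape(1, -1)
p = clf.predict_proba(disputed)[0, 1]   # probability the claimant wrote it
print("claim supported" if p > 0.5 else "likely open world", p)
```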
What about executable binaries, though? When we compile code, it goes through various transformations; does coding style survive in compiled code? Again, we have a few lines of code, and in binary form it looks quite cryptic; we cannot tell much. But thanks to improvements in reverse engineering methods, we can generate rich feature sets even from binaries. In this setting we know that malware authors would like to remain anonymous and to have no identifying information out in public. There was a fun interview with a malware author; it's from September 2016, so not so recent anymore, but when the author is asked "Who are you?" the answer is: "Just some guy who likes programming. I'm not a known security researcher, programmer, or member of any hack crew, so probably the best answer would be: nobody." Malware authors, and other people who would like to remain anonymous, want to be nobodies. But if coding style is embedded in binaries, then that is a fingerprint, identifying information for those users.

Again we have our classical machine learning workflow. We need source code samples in a controlled environment, which we take from Google Code Jam; we compile them, then reverse engineer the binaries: we disassemble them to get assembly features, and we decompile them to get source code from which we can generate the abstract syntax tree as well as the control flow graph. For 100 programmers we are left with about a million features, but with a million features I cannot really tell anything about the style of these programmers, so we apply attribute selection methods to select the features that are most representative of style in binaries, feed those into a random forest of 500 trees, and do the classification to de-anonymize the programmers. The features we are talking about are, for example: once the binary is disassembled, we have assembly features, and we take assembly token bigrams, two consecutive lines, and so on; from the syntactic side, we again take features from the abstract syntax tree, such as node bigrams or the average depth of a certain node; and from the control flow graph we have features similar to the abstract syntax tree ones.

Once we extract all of these, we are dealing with a lot of features, so we apply dimensionality reduction. The first step is the information gain criterion: out of these roughly 700,000 features, we keep the ones that reduce the entropy of the author classes, and we are left with about 2,000 features that keep the accuracy at its highest and are the most representative of coding style in binaries. But if I want to understand where coding style lives in binaries, I can't see that from 2,000 very low-level features that don't mean much at first glance. So I also apply correlation-based feature selection, which keeps the features with the highest intra-class correlation, meaning that a feature is highly correlated within one author's samples but has the lowest correlation with other programmers, making it the most identifying for individual programmers. That leaves about 50 features, and now I can get a better understanding of what might represent coding style in binaries. When I analyze these 50 features, even though they are very low-level, we still see low-level properties that are representative of style and that survive in binaries: things such as arithmetic and logic operations, stack operations, file input operations, and variable declarations and initializations. These are not trivial to refactor or change in order to hide your coding style.
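A simplified stand-in for that first reduction step: ranking features by mutual information (information gain) with the author label and keeping the top ~2,000. The correlation-based feature selection applied afterwards is more involved and is not shown, and at 700,000 raw features this brute-force ranking is slow, so treat it purely as a sketch of the idea.

```python
# Information-gain-style feature reduction (assumes scikit-learn).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.load("binary_features.npy")   # hypothetical raw binary features
y = np.load("authors.npy")

gain = mutual_info_classif(X, y, discrete_features=True)
top = np.argsort(gain)[::-1][:2000]  # keep the most informative features
X_reduced = X[:, top]
print(X_reduced.shape)
```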
Okay, we said that we have a controlled environment: we take code samples from Google Code Jam and compile them ourselves, and the reason is that we can then control for compiler optimizations, which might affect the de-anonymization accuracy, or the anonymity, of these samples. When I take 100 programmers and apply no optimizations when compiling, I again get 96% accuracy, with nine samples each and nine-fold cross-validation. When I apply optimizations, and also strip symbols, the accuracy keeps decreasing. With optimizations it is not affected much, dropping to 89%; stripping symbols affects it more, down to 72%, but the random chance of correctly identifying these programmers is 1%, so even stripping symbols does not anonymize these people.

What kind of obfuscations could I apply to anonymize myself in an automated way? For that I used an open source project, Obfuscator-LLVM, and applied three different types of transformations. First, bogus control flow insertion, where code is added that will never be reached but is still present in the binary, so it looks like a feature. Second, instruction substitution, replacing instructions with equivalent ones, making the code shorter or more complicated. Third, control flow flattening, to mess with the control flow features. We see that these obfuscations decrease the accuracy from 96% to 88% for 100 programmers, which again shows that such obfuscations are not sufficient to hide coding style: even in binaries, coding style survives many transformations, compilations, and obfuscations.

What happens if we increase the class size to 600 programmers? With 20 programmers we have 99% accuracy in correctly de-anonymizing them; with 600 programmers we get 83% accuracy, where the random chance of correctly identifying a programmer is less than 0.2%. So the accuracy degrades gracefully.

Okay, what about real-world cases? Google Code Jam is a very controlled environment: people implement the given functionality in a limited amount of time, in small snippets of code, and so on. So first of all, I parsed GitHub repositories and collected code from hundreds of programmers. Then I compiled those repositories, and many of them did not compile; GitHub repositories are works in progress, so that's okay, but it took me days. I was left with 50 GitHub programmers, and I was able to de-anonymize them with 65% accuracy. That is one real-world scenario. What about malicious programmers? I'm currently actively working on that, but one case study from a published paper involved six malicious programmers and ten samples. Some of these samples came from the Nulled.io hacker forum: once the forum was leaked, I was able to find live links to malicious code that members were selling and providing to their customers, download it, find the samples relevant to my training set, and reverse engineer them to get the features; other malware authors came from security reports and so on. And please, if you have a data set with known authors of malware, or good automated methods for quickly reverse engineering malicious software, including encrypted or packed samples, anything that can help, please come talk to us after the talk. With these six malicious programmers we get 100% accuracy, but I would like to make this experiment much larger in scale, and for that we need help with the data set. Now I will leave it to Rachel, who will talk about more fascinating details of programmer de-anonymization.

So I'm going to dig a bit deeper into programmer de-anonymization on GitHub. When we did this experiment, we got much lower accuracy than on the original Google Code Jam data set, and from the experiments I'm going to show over the next couple of slides, I think a lot of this comes down to the fact that for many of these repos we only had a couple of files per author; I think that's the thing that matters most. There is also a lot of noise, where sometimes people link in other code and so on, but I think one of the biggest issues is that when you only have one or two files to train on, as opposed to the nine files we used in our experiments, it makes a big difference.
Up until now, we have only talked about situations where people write code individually, on their own. Most people probably don't actually code that way in real life most of the time; most code is collaborative. When we started presenting the initial work, we got a couple of tweets about it. Halvar Flake, whom some of you might know, said he would believe that this code stylometry stuff works when it can be shown to work on big-commit GitHub histories instead of the Google Code Jam data sets. And Zooko talked about hearing from an engineer whose employer disallowed her from contributing to open source on her own time. So we were interested in this from both perspectives: privacy, as in, if I want to contribute to something, will this particular commit cause me problems later, and also validating this work more thoroughly in the real world.

In this case we only care about who wrote a small piece of code, or we want to de-anonymize some pseudonymous account on GitHub for which we have several snippets or segments of code, rather than whole, nicely written files. We are using the same feature set as before, trimmed down to about 3,400 features, quite a bit more than Aylin was using earlier, but these are very small segments and snippets, so sometimes we need more features for that. Ultimately, we get about 73% accuracy at identifying the author among 100 programmers for a snippet of code that is about five lines long.

We are also interested in understanding when this works and when it doesn't. What we did was build a calibration curve, which shows, in general, the confidence of the classifier relative to its accuracy on individual samples. We can see that in some cases we have pretty high confidence, and in a lot of cases we don't. So even though we have 73% accuracy overall, given a particular attribution the classifier has produced, the calibration tells us whether we should believe it: if the prediction comes with high confidence, we have much better reason to believe it really points to the programmer we are looking for.
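One way to build such a calibration curve, reusing the fitted classifier from the earlier sketches: reduce the multi-class problem to "was the top guess right?" and compare the classifier's confidence in its top guess against the observed accuracy. Here clf, X_test, and y_test are assumed from a prior train/test split, and this binarization is my construction, not necessarily the paper's exact procedure.

```python
# Sketch of a confidence-vs-accuracy calibration curve (assumes scikit-learn).
from sklearn.calibration import calibration_curve

proba = clf.predict_proba(X_test)              # per-author probabilities
confidence = proba.max(axis=1)                 # confidence in the top guess
correct = (clf.predict(X_test) == y_test).astype(int)  # top guess right?

frac_correct, mean_conf = calibration_curve(correct, confidence, n_bins=10)
for c, f in zip(mean_conf, frac_correct):
    print(f"confidence ~{c:.2f}: accurate {f:.0%} of the time")
```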
We are also interested in how long these snippets need to be, and how many snippets you need to train on, in order to get good results. Here's an interesting curve. Say we have fairly large snippets, about 38 lines of code (with the Google Code Jam data we talked about before, the files were about 70 lines of code), but only four samples per author: this gives us about 54% accuracy on 90 programmers. But if we have smaller samples and many more of them, the results typically go up. Even when we look at single lines of code, and I'm a little nervous about this result because it's really preliminary, if we have about 150 samples to train on, we can usually identify the author about 75% of the time. We're still trying to understand what makes a line of code more or less attributable; certain lines of code are pretty generic, but others are not.

But what happens when we want to identify accounts, not individual commits? This actually works a lot better, because the errors aren't correlated. The way we analyze this is by running git blame on the repositories; that's how we get a repository clipped into snippets. We classify each snippet, and it turns out that the errors of that 73%-accurate classifier are typically not that correlated, so if we classify multiple snippets and then vote, our results get close to a hundred percent, with as few as four snippets; not perfect, but close. You can see this in a heat map where the red area is over 90% accuracy: it tends to happen once you have more than about nine samples to train on and more than about five snippets to test on. So once you have a certain amount of training data, and the account you are wondering about has committed more than four or five times, your results get pretty good. Interestingly, one of the other things we tried, instead of identifying the little snippets individually and then voting, was merging them all into one big sample. It turns out that's better than classifying the individual snippets, but not as good as doing them one at a time and voting, because the errors tend to get compounded in a single merged sample.
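A minimal sketch of that account-level vote: classify each git-blame snippet separately, then take the majority. Because the per-snippet errors are largely uncorrelated, the vote is far more accurate than any single snippet. The function name and inputs are illustrative.

```python
# Sketch of account attribution by majority vote over snippets.
from collections import Counter

def attribute_account(clf, snippet_features):
    """snippet_features: one feature vector per snippet of one account."""
    votes = Counter(clf.predict(snippet_features))
    author, count = votes.most_common(1)[0]
    return author, count / len(snippet_features)   # author + vote share

# author, agreement = attribute_account(clf, account_snippets)
```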
Okay, now for something a little bit different: deep learning, because it's the new hotness. One of the things that is novel about this work is using the abstract syntax tree; in the past, most people used just lexical and layout features. People have been doing this sort of code attribution since the 70s, but it tended to get around 80% accuracy and not scale above 30 or so programmers. Using these AST-type features allowed us to get good results. But the thing is, an AST itself is not a feature; a tree is not a feature, and you can't just feed it into a random forest and have it tell you who wrote the code. We manually chose features, as Aylin mentioned: unigrams, bigrams, and depth, which give us very local features and very global features. What we actually want is the ability to learn more nuanced features than that. So, enter a deep neural net: we are going to try to automatically learn a new feature representation. What we do is first map the AST nodes into vectors, using an embedding layer, then create subtree layers that use LSTMs or bidirectional LSTMs to learn structures of the AST, and then a softmax layer to do the actual classification.

A little background on LSTMs, if you haven't learned much about them. We have a neural net, an RNN, which allows us to handle sequential input and to have some memory, so it can remember information; that's what the little feedback loops are for. An LSTM adds memory cells to this, to have more useful memory, so that we don't just get super-local features. These cells have gates that ask: what should I remember of the information coming in, what should I just ignore, and what should I forget of what I've already learned? Over time this develops a richer representation of the AST.
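A condensed sketch of that architecture, assuming PyTorch: an embedding for AST node types, a bidirectional LSTM over a linearized node sequence, and a linear layer producing per-author scores (the softmax is applied inside the loss). The flat sequence encoding and all hyperparameters are illustrative assumptions; the actual model works over subtree layers.

```python
# Sketch of the embedding + BiLSTM + softmax attributor (assumes PyTorch).
import torch
import torch.nn as nn

class ASTAttributor(nn.Module):
    def __init__(self, n_node_types, n_authors, dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_node_types, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_authors)

    def forward(self, node_ids):              # (batch, seq_len) node-type ids
        x = self.embed(node_ids)              # (batch, seq_len, dim)
        _, (h, _) = self.lstm(x)              # final hidden states
        h = torch.cat([h[-2], h[-1]], dim=1)  # forward + backward directions
        return self.out(h)                    # author logits

model = ASTAttributor(n_node_types=200, n_authors=70)
logits = model(torch.randint(0, 200, (8, 100)))  # 8 toy node sequences
print(logits.shape)                              # torch.Size([8, 70])
```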
In this case we are only using the AST features, none of the layout or lexical features, which is why these random forest results are lower than what Aylin showed earlier. But you can see, and this is using Python and C++, that we get 86 percent accuracy on 25 programmers, and 73 percent accuracy on 70 programmers in Python, just using the AST. So the layout and lexical features matter too, but when we use our new learned AST features we do get a big jump, so this new feature representation does seem helpful. And it's nice to have a sole-AST representation because, as we mentioned before, the AST is much harder to obfuscate, it's easy to port, and so on. This lets us learn better AST representations without manual feature engineering, and it's language independent. In future work we'd like to combine these learned features with the random forest and the fuller feature set, to see if we get better results or whether this just overlaps with what we're already learning from the layout and lexical features.

Okay, so what about other languages? As I mentioned, porting this to a new language basically requires an AST parser, which exists for almost everything, plus lexical and layout features chosen for that language. So far we've done C++, C, Python, and JavaScript, and on the Google Code Jam data set we get similar accuracy with just the AST, though the results tend to vary more, which is kind of interesting. One of the holy-grail applications of this would be to train on one language and test on another, and we don't currently know how much your programming style changes when you change languages. To do this we'd need some sort of universal intermediate AST representation, or some pairwise porting between two languages. There is a project working on this, the Babelfish project, but it doesn't really appear ready yet for this kind of application. It's something we're planning to look into, so if you know about generic AST representations, that's another thing we'd love to get your feedback on.

I'm going to end the talk with a couple of interesting software engineering insights we've gathered as we've done this work, about what makes programming style unique, which I think is kind of fun.

First, attributing groups of people. There is another programming contest, Codeforces, which has a team competition where teams compete on sets of problems. We have very preliminary results: with 118 teams, about 20 submissions each, we get about 67% accuracy. I think this is one of the hardest cases for group attribution, because the way Codeforces works, it gives a team a big group of problems to work on together, and I think people mostly split those up, so it's not actually group coding; I'm kind of surprised it works as well as it does at identifying the team. In the future we'd like to work with more code repositories, where we know and can control for how much collaboration actually went into the code.

Difficult versus easy tasks: it turns out that implementing harder functionality makes programming style more unique. We can control for this because the problems in the programming contest are supposed to get harder as it progresses. If we look at a set of 62 programmers solving seven easy problems, we get 90% accuracy, which is pretty good. But when we look at the same set solving seven harder problems, the accuracy goes up to 95%. So difficulty matters.

Programmer skill matters too: programmers who got further in the contest, which is some measure of skill perhaps, were easier to attribute. For the coders who advanced less far, we got 80% accuracy, again on the easy problems, because that's all they completed; but even on the easy problems, the people who got further in the competition could be classified with 95% accuracy. So it's kind of interesting that as you develop programming skill, your style tends to become more unique.

We're also interested in how coding style changes over time, so we looked at people who competed in both 2012 and 2014. When we train on 2012 and test on 2014, the accuracy goes down from 92% on the 2012 set to 88% on the 2014 set, a small drop. I'd be interested in looking at even larger time scales, or at particularly formative years, like university, and how they affect people's programming style.

Lastly, we're interested in coding style by country, since this contest has contestants from all over the world. When we were porting this to JavaScript, we grabbed a set of JavaScript files, 84 files written by programmers in Canada and by programmers in China, and looked at the binary classification task of telling whether a file had been written by a Canadian or a Chinese programmer. We expected this to be particularly easy, because there's a native-language difference that may show up in things like variable names, and in fact it worked pretty well: 91.9% accuracy for this task. In the future we're planning to look at a much larger set of countries and files, and to see whether this is actually a native-language effect, or maybe an education-system or style-culture effect, and what's going on there. I think it'll be interesting.
For future applications: as we said, we're really interested in whether this actually works to find malicious code authors, and in what anonymous contributors have to worry about when they contribute code online. We're also interested in breaking this stuff, that is, writing better obfuscators; none of the obfuscators we've tried so far have been targeted specifically at the AST, and we think that can happen. There was some research done at the University of Washington, building on our work, showing that people can imitate other people's style to some extent when given that as a task. So there's hope; don't leave thinking you can't ever write anonymous code again, but be careful. In particular, if you're going to contribute to a repository anonymously, you might want to create a new account for each commit, even though that's annoying. To find authors who write vulnerable code, we're interested in looking at source code and understanding what software-engineering stylistic features lead to vulnerabilities. And some people have talked to us about finding out whom to recruit by looking at how unique a person's coding style is and whether that suggests something about their programmer skill.

This was not work done by Aylin and myself alone. We have lots of students and other collaborators at Drexel University, at Princeton, at the Army Research Lab, and at Göttingen in Germany who have worked on various aspects of this project. So thanks to Bander, Edwin, Rich, Andrew, Spiros, Arvind, Frederica, Mosfika, Dennis, Konrad, Greg, Claire, Mike, and Fabian for all of their contributions to this work. This is our contact information, and our code for doing all of this is out there, so if you actually want to try to figure out who Satoshi Nakamoto is, and have an actual suspect set, you're welcome to try; it's not something we're going to do ourselves, as we respect privacy. I think we have about four minutes for questions, so if people have some questions we would love to take them, and after the talk we'll walk out the back, where you can ask any more questions that you have. Thanks. Any questions?

How do we do this? Seriously, no one ever does Q&A. Maybe we should; I intentionally left time.

Question: For the coding styles you were going through, in the Google Code Jam data, you said you were able to look at people who went the furthest in the challenge. Did you see trends along the way that could later help make better coders, using that information?

No, we haven't. (For those of you who are leaving: again, go out through the back door, not the side door.) So, we have not done much analysis of what makes the coding style of people who get further in the programming competition different or more attributable, but I think that would be a really interesting direction and we'd like to look at it. Do you have anything to add? One property was that more advanced programmers tend to write longer code; that's one trend. Yeah, it's tricky because we don't know if that's the causal thing or just a side effect, but in general the code was longer, which helped.

Question: I just have a comment; this is very interesting.
One thing I could see this going toward is cataloging programmer reputation, which is kind of what we've gotten into with penetration testing: instead of going toward a completely open-source ecosystem, we're pushing for statistical testing. This could now be used to look at programmers, see their history with security, and then give a score to their code in that regard. So do you think there's any value in cataloging people in terms of the security of their code, or is that just an underlying ecosystem problem?

Yes, there is ongoing research on automatically understanding the security properties of code, and it works with similar properties. Does that answer your question? All right, so let's give the speakers a hand.