Alright, so we're starting. He's going to be talking about the deduplication of large amounts of code. Thanks. So, hi, my name is Romain. I'm a former intern at source{d}, and I'm going to talk about a project I worked on during my time there. It's an open source project called Gemini, and it basically tries to deduplicate large amounts of code. Before talking about how Gemini works and the results I got while applying it, I'm first going to go over some elements of deduplication. First off, what are code clones? Code clones, at a high level, are snippets that share little to no differences. In natural language, we can usually tell whether two or more sentences are similar just by looking at syntactical features. Here, for example, I highlighted exactly matching words. We can make the problem a bit harder, for instance by using synonyms like this. Today in NLP, Natural Language Processing, we're able to embed tokens in vector spaces that yield similar vectors for words with similar meanings. However, finding similarities between code is a much harder problem, because you have syntactical as well as structural elements that affect the semantics: think of what happens when you compile a few different snippets. There's been extensive research done in this field, which has led to a taxonomy of different kinds of clones.
We usually distinguish four types. Type 1 is basically copy-paste clones, once you remove everything that isn't code from the snippets: comments, whitespace, things like that. Type 2 is structurally similar code; imagine two snippets that differ only in their identifiers. Type 3 is a combination of both, with minor changes like insertions and deletions. Finally, type 4, which is the hardest to detect, is semantic clones: snippets that compute the same thing and output the same result, but do it in different ways. As an example, here are two functions that do exactly the same thing but in completely different ways, with different identifiers. Just looking at the syntax, one would not be able to judge whether these snippets are the same, at least not easily. Before talking about Gemini, I'm going to talk about a baseline approach: the DéjàVu paper, released in 2017, which tried to deduplicate about 480 million files using a three-level granularity scheme. The first thing they did was, as I said before, remove all comments and all whitespace and hash each file, computing a file hash, then check which hashes collide. The second level was, after extracting tokens and creating bags of features, as you can see here, hashing the strings of tokens to produce a second hash. The third level used all these extracted tokens with a tool called SourcererCC, which can basically tell you that two snippets are clones if they share about 80% of their tokens. The result of applying this to about 400 million files was that if you pick a file on GitHub, you have about an 80% chance that a similar file exists somewhere else on GitHub, which is pretty impressive.
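The first, file-level pass can be sketched in a few lines. This is a toy illustration, not the DéjàVu implementation: it strips C-style comments and all whitespace, then hashes what remains, so type-1 clones collide on the same hash.

```python
import hashlib
import re

def normalize(source: str) -> str:
    """Strip comments and whitespace so type-1 clones become byte-identical."""
    # Remove C-style line and block comments (a simplification for the example).
    source = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.S)
    # Remove all whitespace, including newlines.
    return re.sub(r"\s+", "", source)

def file_hash(source: str) -> str:
    """Level-1 fingerprint: hash of the normalized file contents."""
    return hashlib.sha256(normalize(source).encode()).hexdigest()

a = "int add(int a, int b) { return a + b; }  // sum"
b = "int add(int a,int b){return a+b;}"
assert file_hash(a) == file_hash(b)   # type-1 clones collide
```

Two files that differ only in comments and formatting produce the same hash, so grouping files by hash finds all type-1 clones in one pass.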
However, from what I just explained, you can deduce that these methods can't actually do much more than detect type 1 and type 2 clones. Even type 2 might be a problem if too many tokens are altered, and types 3 and 4 are probably out of reach. This is why at source{d} we developed a project called Gemini. I'm first going to outline the main steps of the algorithm, and then I'll get into exactly what we do, how we do it, and why it works. Gemini has four steps. The first step is to extract a certain number of features from the dataset, syntactical as well as structural; I'll get more into what kind of features we use and why. The next step is creating a pairwise similarity graph between all snippets: each node in this graph represents a file, and two nodes are connected if the files are similar enough. Again, I'll explain what that means in more depth. The third step: the most interesting parts of this huge graph are the groups of files that are all connected to each other, either directly or indirectly through hops. In graph theory these are called connected components, and we extract them from the similarity graph we've created. The final step in Gemini is that, on each of these connected components, we apply community detection in order to find the final clone communities; I'll go into why we do that later on. So, let's start with the first step, extracting syntactical and structural features. As you can imagine, we can't extract structural features from plain-text code. To do that, we have to go to a lower-level representation of code, namely abstract syntax trees.
For those of you who don't know this data structure, it's used in compilers for syntactic analysis, and it looks roughly like this. This is clearly a very simplified version of an AST, probably not even correct, but one property that each of these nodes has is an internal type. For instance, here you can see identifiers, statements, declarations and operations, and it's mostly from this data structure that we're going to extract structural features. Before describing each of the features: as you can imagine, different languages have different kinds of ASTs. For those of you who were here earlier, Vadim talked a bit about this. At source{d} we developed a project named Babelfish, which aims at creating a unified abstract syntax tree: given a file in any supported language, you obtain the same data structure, which can then be processed in a unified way. This is pretty useful because we don't have to write code for each language; we can just transform everything into universal ASTs and work on those. First, I'm going to go over the two syntactical features we used: identifiers and literals. Identifiers are variable or function names, and literals are values. For instance, here I highlighted the literals in red and green, and the identifiers are in blue, so I think it's pretty straightforward. For the structural side, we used four kinds of structural features. The two I'm going to talk about are graphlets and children. A graphlet is a given node together with all its children. As you can see here, we don't actually care about the syntactical data: we don't care that the code says x or return, we rely purely on each node's internal type. So here I showed you how we extract all the graphlets.
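To make the graphlet idea concrete, here is a toy version built on Python's own ast module rather than Babelfish UASTs (so "internal type" here is just the Python AST node class name). Each graphlet is a node's type plus the types of its direct children, so two type-2 clones that differ only in identifier names produce identical bags:

```python
import ast
from collections import Counter

def graphlets(tree: ast.AST) -> Counter:
    """Weighted bag of graphlets: each entry is a node's type together with
    the tuple of its direct children's types. Identifier names and literal
    values live in string fields, not node types, so they are ignored."""
    bag = Counter()
    for node in ast.walk(tree):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        bag[(type(node).__name__, children)] += 1
    return bag

# Two type-2 clones: same structure, different identifiers.
g1 = graphlets(ast.parse("def f(x):\n    return x + 1"))
g2 = graphlets(ast.parse("def g(y):\n    return y + 1"))
assert g1 == g2   # identical bags despite the renaming
```

Renaming every identifier leaves the bag untouched, which is exactly why this feature helps beyond type-1 detection.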
The children feature is just each node together with the number of child nodes it has. Once you convert this to a weighted bag of features, it looks something like this: for instance, the identifier node with no children appears three times in this example. As I said, we used four kinds of structural features. The two others are basically concatenations of internal types, obtained by traversing the AST in two different ways: one using a depth-first search, and the other using a random walk. I won't get too much into why we use these features, but we thought they would be able to capture longer-range structural information about the snippets. We don't actually have any metric or rigorous reason other than intuition that these were good features; however, as you'll see a bit later, they yielded some pretty good results. So, to go back: as you can imagine, extracting these features from snippets of code yields a lot of features. In order to reduce that, we used an NLP technique called term frequency-inverse document frequency (TF-IDF), which amounts to calculating a weight for each feature and then thresholding to reduce the number of features. It rejects all rare features, features that only appear once or twice in a single file or snippet, and it also rejects very common features, features that would appear, say, 10 million times: things that are not very discriminative. So, at the end of the day, feature extraction boils down to this: first convert all files to UASTs, then extract the weighted bag of features of each UAST, then apply TF-IDF to reduce the number of features. And as you can see, there is a fourth step.
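The TF-IDF filtering can be sketched like this. It's a simplification of what the pipeline does, with illustrative cutoffs of my own choosing: features appearing in only one file are rejected as too rare, and near-ubiquitous features get an idf close to zero and fall below the weight threshold.

```python
import math
from collections import Counter

def tfidf_filter(docs: list, min_weight: float = 0.1) -> list:
    """Weight each feature by tf * idf; drop features unique to one document
    (too rare) and features whose weight falls below min_weight (too common
    to be discriminative)."""
    n = len(docs)
    df = Counter()                       # document frequency of each feature
    for bag in docs:
        df.update(bag.keys())
    out = []
    for bag in docs:
        kept = {}
        for feat, tf in bag.items():
            if df[feat] < 2:             # appears in a single file: reject
                continue
            w = tf * math.log(n / df[feat])
            if w >= min_weight:          # ubiquitous features weigh ~0: reject
                kept[feat] = w
        out.append(kept)
    return out

docs = [Counter({"common": 3, "rare": 1, "shared": 2}),
        Counter({"common": 1, "shared": 1}),
        Counter({"common": 2})]
weighted = tfidf_filter(docs)
assert "rare" not in weighted[0]     # df = 1: rejected as rare
assert "common" not in weighted[0]   # idf = log(3/3) = 0: rejected as common
assert "shared" in weighted[0]       # kept, with a positive weight
```

What survives is exactly the middle ground the talk describes: features frequent enough to matter, rare enough to discriminate.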
That fourth step allows us to explore biased versions of our algorithm. We could use the weighted bags of features as is, but if we wanted, for instance, to focus more on the structural differences between code, we could give more weight to the structural features, and conversely we could do the same for the syntactical features. So, that was the first step of Gemini. The second step is the hashing step: we're going to hash the features, using an algorithm I'll explain, in order to create the pairwise similarity graph. At this point, you might be wondering why we don't simply choose a distance metric and apply it to each pair of files to compute similarities. The reason is that this scales quadratically with the number of files, so if we try to do it at scale, it basically explodes. In order to introduce the algorithm we use, I have to explain a couple of concepts. The first is weighted Jaccard similarity. Given two sets of features A and B, the weighted Jaccard similarity is the intersection of both feature sets divided by their union. If A and B are equal, you can see it equals 1, and conversely, if they share nothing, it equals 0. Here I displayed it for integer-weighted features, but it also works for real-valued features. We use this because it has a pretty interesting property: if we pick a random permutation over our set of features, apply it to A and B, hash all the permuted elements, and select the smallest hash, the min-hash, then the probability that these min-hash values are the same is equal to the Jaccard similarity of the two snippets.
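Both ideas fit in a short sketch. This is plain Python with names of my own; the MinHash demonstration uses unweighted sets for simplicity, while the real pipeline hashes weighted bags.

```python
import random

def weighted_jaccard(a: dict, b: dict) -> float:
    """Weighted Jaccard: sum of per-feature minima over sum of maxima."""
    feats = set(a) | set(b)
    num = sum(min(a.get(f, 0), b.get(f, 0)) for f in feats)
    den = sum(max(a.get(f, 0), b.get(f, 0)) for f in feats)
    return num / den if den else 1.0

def minhash_agreement(a: set, b: set, k: int = 5000, seed: int = 0) -> float:
    """Fraction of k random permutations under which A and B share the same
    minimum element; this estimates the (unweighted) Jaccard similarity."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(k):
        rank = {x: rng.random() for x in universe}   # one random permutation
        if min(a, key=rank.get) == min(b, key=rank.get):
            hits += 1
    return hits / k

A = {"for", "ident", "return", "binop"}
B = {"for", "ident", "return", "call"}
# |A ∩ B| / |A ∪ B| = 3/5 = 0.6, and the MinHash agreement converges to it.
assert abs(minhash_agreement(A, B) - 0.6) < 0.05
assert weighted_jaccard({"x": 2, "y": 1}, {"x": 1, "z": 1}) == 0.25
```

The agreement rate matches the Jaccard similarity because the minimum over A equals the minimum over B exactly when the smallest-ranked element of the union lies in the intersection.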
What's interesting is that, since it's a random permutation, this still holds if we pick the second, third, or n-th smallest element. So if we take k such values, creating what we call a MinHash signature, and concatenate the signatures into a MinHash signature matrix, then we have the property that, for each pair of columns and each row, the two values will be equal with probability equal to the Jaccard similarity of the files associated with those columns. From this, what we're going to do is cut our matrix into b bands of r rows each, and call a candidate pair any two files which hold the same values in at least one band. We're now going to calculate, quickly, don't worry, the probability of two snippets being a candidate pair. If we call s the similarity between two snippets, then the probability that their signatures agree within one band is s^r. Taking the opposite event, that they differ somewhere in that band, gives 1 - s^r; since we have b bands, the probability that they differ in every band is (1 - s^r)^b; and taking the opposite event once more, the probability that the two snippets are a candidate pair is 1 - (1 - s^r)^b. For those of you who don't do math often, or haven't for a while, this might seem a bit abstract, but regardless of r and b this function has this shape, what we call an S-curve. Now, if we choose r and b appropriately, we get this kind of curve, which is exactly what we want, because it means that if we have a candidate pair, then with high probability it has a Jaccard similarity above a threshold that you can choose beforehand. Here, this is an example with a threshold of, I think, about 0.95 for the similarity.
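The banding math above fits in one function. The r and b values below are illustrative, not the ones Gemini uses; the S-curve's steep jump sits near (1/b)^(1/r).

```python
def candidate_prob(s: float, r: int, b: int) -> float:
    """Probability that two snippets with Jaccard similarity s become a
    candidate pair under LSH with b bands of r rows: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

# With r = b = 20, the curve jumps around s ≈ (1/20)**(1/20) ≈ 0.86:
assert candidate_prob(0.60, 20, 20) < 0.01   # dissimilar pairs rarely collide
assert candidate_prob(0.95, 20, 20) > 0.99   # similar pairs almost always do
```

Choosing r and b therefore amounts to placing that jump at the similarity threshold you care about: pairs below it are almost never candidates, pairs above it almost always are.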
Okay, so at the end of the day, this allows us to compute the similarity graph by simply creating signatures for each snippet, selecting a threshold and deducing the constants r and b from it, then hashing each band of each file's signature into buckets; any snippets that land in at least one common bucket become our candidate pairs, which, in theory, are above the chosen similarity threshold. The next step is pretty straightforward: the extraction of the connected components. It basically looks something like this. Hopefully you can see the edges... yeah, that's not great, you can't really see the edges, but imagine that these are all connected components. Now, I haven't said much about the last step, community detection. Why do we do this? You might imagine that all snippets in one connected component would be clones. In practice, this isn't true: take three snippets A, B and C, and imagine that A and B hashed to the same bucket and B and C hashed to the same bucket, with a threshold of 0.8. In the worst case, even though A and B have similarity 0.8, and B and C as well, you might only have around 0.64 similarity between A and C, and the more hops you take, the further the similarity can drop. This is why we do community detection, and it looks something like this: it puts into the same groups the files that are relatively close, depending on the structure of the edges. So, that was it for the Gemini algorithm. Now I'm going to talk a bit about the results I got while applying it. I used the Public Git Archive. The Public Git Archive is the largest dataset of source code in the world; it's maintained by source{d} and is built from all the repositories on GitHub which have over 50 stars.
It has the neat property, which I didn't use in my case, that it includes the full commit history. As I said, I only took the HEAD, which amounts to 54.5 million files. I was also restricted to files from which I could extract unified ASTs, which meant that, at the time I did this, I could only use files from five languages: Python, Java, JavaScript, Ruby and Go. I decided to apply the pipeline on all five languages at once because I wanted to see whether the algorithm would be able to detect structural differences between languages. When I applied the pipeline, this is what I got. For those of you who know Spark, you know what I'm talking about, and unfortunately these kinds of logs usually hide a certain number of problems. Without going into too much detail, I had to face a number of challenges: corrupt files causing errors in Babelfish, GPUs not responding for no reason at all, and during the TF-IDF step, at some point, basically infinite garbage collection, so my tasks never finished. I also encountered some GitHub repositories made by very clever engineers, designed not to be cloneable. We were able to clone them, but they had pathological structure: imagine a folder with 10 folders inside it, then each subfolder with 10 more, and so on, an exponentially increasing size. These were pretty hard to process, and at the end of the day I was only able to process 7.8 million files. As you can see, most of them were JavaScript and Java files, and I still had a good amount of Python, Ruby and Go files. Out of all of these I was able to extract about 6.2 million distinct features. Looking more closely at the features we extracted, most of them were syntactical features, as you can see.
Identifiers and literals amounted to about 75% of all the features I extracted. However, looking at each file individually, I saw that each file had about a thousand features, and most of those were actually structural features. This was an average over files of all languages, but it was about the same when looking at files of any specific language. Next, I applied the hashing with two different thresholds: 80%, which was also used in DéjàVu, and 95%. As you might expect, there was a much bigger number of connected components for the 95% threshold, but those were mostly connected components of only one file; if you look at the ones with more than one file, there were actually more for the 80% threshold, about 10% more, and the connected components were also a bit larger, on average about one more file per connected component. Looking in more detail at the size of the connected components, as you can see from this log-scale graph, most of them had a relatively small number of files, with at most, for the 80% threshold, a thousand, maybe a couple of thousand files in one connected component. One thing I said earlier is that I applied my pipeline to files of all languages together in order to see whether the languages would be separated by our algorithm. That was mostly the case, apart from one pretty huge connected component which contained files from all five languages. When I looked more into it, I found that it was because all of these files were very short and had very similar syntax. So I reapplied my pipeline with some bias towards the structural features, and this large connected component exploded into monolingual connected components.
Looking at the connected components per language, you can see that for connected components of more than one file, there is a big difference between JavaScript and Go, which have many more files in these kinds of connected components than Java, Ruby or Python. There are fairly intuitive explanations: Java, for example, being a much more object-oriented language, would probably have files that are more distinct from each other, and conversely, in Go there's a practice called vendoring, so it's logical that the same files would appear a lot in these results. Then I applied community detection with the Walktrap algorithm. Looking purely at connected components of more than one file, the count was about the same for both thresholds; however, after community detection there was a huge increase, about 50%, in the number of communities detected when going from the 95% threshold to the 80%. Okay, now I'm going to show you a couple of examples. But before that, you might have noticed that up until now I haven't used any metrics, which is kind of odd for a machine learning problem. That's because this is an unsupervised problem, and as you can imagine, I wasn't going to look at each of the 7.8 million files to check whether this one is 80% similar to that one; that can't exactly be parallelized, so it would have been a bit hard. In order to judge whether this work was relevant, I had to look at communities and connected components individually, and that's how I judged my work. As I was saying before, there were a lot of connected components with a lot of Go files, and one interesting thing was that a certain number of these connected components contained files that all shared a single file name.
For instance, this is a connected component containing only files called text_parser.go. The white nodes here are not files; they are representations of the buckets. As you can see, or maybe you can't from afar, there were actually 4 communities detected, although 2 of them are minuscule, here and here. As you can see, especially for the blue community, there are very strong links between the files. However, this doesn't mean that every connected component was formed only of files sharing one file name: looking at this larger connected component, which contains only Ruby files, you can see that we detected 3 communities, and looking at the file names inside, although they do seem to cluster together, a community isn't necessarily formed of files all sharing the same file name. This is the final example I'm going to talk about, and it was a pretty interesting one, because it was one of the larger connected components I found. Although the files in it stem from 3 projects, only 25 files came from 2 of the projects, and the other roughly 4,775 files all came from the remaining one. As you can see, that's a lot of duplication for a single project, and when you look at the GitHub repository of the Microsoft Azure SDK, you can actually see this very repetitive structure, with a lot of similar files, which is what we were able to detect. I got this visualization by rendering my graph in Gephi. So, that was it for me. I hope you liked this talk; I liked working on this, and I found the results very interesting. If you want to look deeper into the results, I wrote a blog post which goes much more in depth into different aspects of what I found while applying this. If you want to check out the code and possibly use it, you can look at the source{d} Gemini project, and if you
are interested in the PGA dataset, you can also download it for free, although it's a bit big, so you might want to set aside about 3 terabytes of storage. Thanks. So, we have some time for questions. (Audience question, partly inaudible.) So, the question was whether this technique could be used for refactoring code, or whether it was aimed purely at deduplication, and whether the roughness of the similarity estimate would be a problem. I think it would be able to find code that is similar if you only focus on the structural features, so in that sense, if you applied it to, say, a single project in order to find all the structurally similar files, it could be applied to refactoring. I don't really think it would necessarily be impacted by the roughness of the estimate, because if you give enough weight to the features, and if you apply TF-IDF with a relatively low threshold, then you can keep enough features for the similarity to be relevant in your case. So it would probably be interesting to see.