Okay. Well, hello everyone. My name is Andrew Walker, and today we'll be talking about automatic detection of code that's been copied from Stack Overflow into your application. Before we dive into that, I want to talk briefly about who I am and how I wound up here. I'm actually still in college at Baylor University in Texas, in the United States. This past summer, I spent a whole month over here with Red Hat, working with Pavel Tišnovský in the QA division on these tools: one finds clones between two different repositories, and the one I worked on finds code copied from Stack Overflow. So that's how I, all the way from over there, got tied in with Red Hat and wound up over here. Just an outline for the talk: I'm going to go over briefly the types of clones you'll find in your application and the types of code clone detection techniques that exist. I'm then going to talk about the architecture of the tool, Stack Oversight. Then I'll discuss the preliminary results, and finally future directions and considerations, because this is still a working prototype, a work in progress. So, starting off: what is code clone detection? It's the process of finding two pieces of code that are duplicates of each other. Sounds self-explanatory; it's a little more complex than that. You can see a great example here, where this code is essentially the same. Yes, those classes are slightly different from each other, and the variables are named different things, but in essence these two pieces of code are completely duplicated. Why should you care that those pieces of code are duplicated? Well, one reason is bug persistence and tracking.
If you have a bug in a piece of code, and developers have copied that code throughout the entire repository, or even multiple repositories, then congratulations: you now have a bunch of places where you have to go fix it whenever it's found. And when you miss one, you're going to get that same bug report back again the second or third or fourth time it happens. It also makes your repository and application a lot bigger. Duplicated code is something that could be refactored and condensed, but instead it's duplicated in multiple places: larger files, and larger binaries in the end. It's also, quite frankly, sloppy. Copying code is the easy way out, and I know a lot of developers do it. I mean, I'm still in school, so I definitely do it. But you want to do the best you can to make sure your code is refactored and clean, not just "oh, well, it works here, copy and paste it and move on with my day." And lastly, yes, lazy developers. You want to make sure people on your team aren't just copying code and patchworking together stuff that other people have done. Developers are hired to develop. Now we'll get into the different types of code clones you'll see, because they break down into a few different types, not just straight copy and paste. Although straight copy and paste is the first type of clone you'll find: these are exact copies with absolutely nothing changed. Same comments, same spacing, same white space, same line breaks, everything completely identical.
That leads you into type two clones. These are copies of code, but unique identifiers have been changed, meaning instead of variable foo, it's variable bar; little name changes like that. It's the kind of change that makes straight string comparison no longer work, but the code is still the same, which is the example you saw earlier in the slides. Then you have type three, which is still a code clone, still the same thing, but lines can be added, removed, or swapped around. These are a lot more difficult to discover, because the inherent issue is: when does it stop being a code clone? If I have a code block of 40 lines and someone copies it and adds three lines, okay, that's probably still a code clone. But if they add in ten lines, is that still a clone? If they added 15, 20, 30? That's really up to the person finding the clones to figure out: when does it stop being a code clone and become its own thing? Lastly is type four, the semantic clones, and these are quite different from the other three. Typically, code clone detection tools find either types one through three or type four; I don't think there's any tool out there that finds all four. Semantic code clones are not syntactically similar, only semantically similar, so you're not going to find them using the typical techniques you'd use for types one through three, which is why they're typically their own thing. So as an example, you have a type one clone here. Like I said, an exact copy and paste, literally Ctrl+C, Ctrl+V. Type two is a little different: you can see the comments can be removed, the variables can be changed, that sort of thing, but it's still ultimately the same piece of code.
You get into type three when you get into reordering and adding lines; you could also have removed lines, changing it just enough. And then type four. This is the classic example everyone uses, which is factorial. These two pieces of code do the exact same thing: they find the factorial of a number. The difference is that one is iterative and one is recursive, so syntactically they are not similar at all, but semantically they do the exact same thing. So now we'll get into the different types of code clone detection tools. The first is textual, which is the most naive and basic one you can do: straight-up string comparison. That's going to be great for finding your type one clones, but it won't help you at all with the others. Another really popular one is token-based code clone detection, where you use something like a lexical parser to break the code up into a stream of tokens and then do token-wise comparison. This one is helpful because the tokens representing variable names, class names, and function names can be generified, and that way it finds types one and two. Syntactic is another way of doing it; it typically uses abstract syntax trees. Another way is metrics. These can take the form of almost anything, but a lot of the time you'll see things like counting up the number of if statements and variables and judging how similar two pieces of code are using those counts. Semantic detection, even though type four is also called semantic, can be used to find all the different types of clones using semantic comparison: it uses something like a program dependence graph to infer the meaning behind a piece of code.
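The factorial slide being described amounts to something like this; this is my reconstruction in Python, not necessarily the slide's exact code.

```python
# Type 4 (semantic) clones: same behavior, completely different syntax.

def factorial_iterative(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_recursive(n):
    if n <= 1:
        return 1
    return n * factorial_recursive(n - 1)

# No textual, token, or AST comparison will match these two functions,
# yet for every input they compute the same value.
print(all(factorial_iterative(n) == factorial_recursive(n) for n in range(10)))  # True
```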
Of course, in modern computer science you can also use learning techniques: machine learning, data mining, and those sorts of things will also help you find code clones, although they have only recently started to hit the accuracy you'd see with the other forms of code clone detection. Typically, the most common tools you'll see use token-based or syntactic code clone detection techniques. So, I didn't really need this slide in here, but I put it in anyway. This is what code looks like on Stack Overflow. It appears in the questions and the answers, in these nice little gray boxes that do all the syntax highlighting, and you can see people post code snippets and all sorts of things on there. Which brings us to Stack Oversight. That's what we've named our project, and it was that collaboration effort this summer. The goal is that we give it an application, written in Python preferably, and we want it to spit back out, basically, highlighted areas saying: this developer, at these lines, copied this piece of code from this question on Stack Overflow. Right now it's broken into three pieces. The first one we called Oversight, which is the scraping portion. Obviously Stack Overflow is immense, and one of the issues we ran into is that you can't make 10,000 requests on the fly to pull these snippets in order to do your comparison. Stack Overflow does not like that at all. It really will rate limit you and IP-range block you; we accidentally took down Stack Overflow for the whole university. Sorry to all the other students there. So we have this local database, which transfers the snippets from Stack Overflow to the actual code clone detection section, which is corpuscc. This local database is really not a great solution.
It came about because we were developing corpuscc and Oversight separately from each other, and because we hadn't figured out how the whole pipeline was going to work yet. So basically Oversight just stuck all its snippets in a database and now we're pulling from it. That is not a great solution, but we will be looking at others in the future. And then corpuscc, as I said, is the actual code clone detection tool, which does the tokenizing, the parsing, the comparison, that sort of thing. So this is just a diagram: as I said, Oversight, local database, into corpuscc, which is where you feed it the source code of your application for testing. To get into Oversight: in stage one, we did a search for all questions tagged with Python and took the links to all of those questions. It wound up being, as you can see, about 1.3 million questions that we grabbed the links to. Stage two is to scrape every single one of those pages for the code snippets on it. We started officially scraping Python code snippets in mid-December; it's now almost the end of January, and we are almost halfway through, close to it. Audience: You know that BigQuery has a dump of the Stack Overflow data set, so you can download everything from BigQuery and you don't have to scrape; or do you need the processing on Google Cloud? Right. So we're scraping because we wanted to integrate it into a pipeline, which then didn't work out, and now it's just what we've stuck with. There's no inherent need to do it any faster than this, and the on-the-fly scraping and parsing is working just fine now. In the future, once we've gotten through this initial scrape of the entire data set, we'll be able to just check every three or four days, say, "hey, have there been any new questions?", and if so, go ahead and scrape those and add them.
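As a rough sketch of what that scraping step does, assuming snippets sit in `<pre><code>` blocks as on a typical Stack Overflow question page: this is an illustrative stand-in using only the Python standard library, not Oversight's actual code.

```python
# Extract code snippets from a question page's HTML (illustrative sketch).
from html.parser import HTMLParser

class SnippetExtractor(HTMLParser):
    """Collects the text inside every <pre><code> block on a page."""
    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.in_code = False
        self.snippets = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
        elif tag == "code" and self.in_pre:
            self.in_code = True   # only <code> nested in <pre> is a snippet
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "code" and self.in_code:
            self.snippets.append("".join(self._buf))
            self.in_code = False
        elif tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_code:
            self._buf.append(data)

page = "<p>Try this:</p><pre><code>x = [i * i for i in range(10)]\n</code></pre>"
parser = SnippetExtractor()
parser.feed(page)
print(parser.snippets)  # one snippet recovered
```

Inline `<code>` spans outside `<pre>` (variable names highlighted in prose, the kind of thing the talk says gets thrown out anyway) are deliberately skipped.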
A lot of that, as well, is that we want to throw out a lot of the code snippets we get. As I said, we've seen about four snippets per question, which is about 5.2 million total, give or take, but they're not all created equal, and a lot of them have to be thrown out. So here are a couple of examples of the types of code snippets we've seen pop up on Stack Overflow. The first one has some console input in there, and some output. The one in the upper right corner is also all console input and output. This one just leaves out the entire loop body. And the one on the bottom right is embedded in a bunch of other text; people like to highlight the variables to show that the text references back to them. None of these are usable. While we could put in the effort to go through and clean the data and figure out what actually constitutes a valid, compilable Python code snippet, based on the initial tests we've run there will be enough left over after we throw out the invalid ones that we're more than able to just discard them and not waste the computational overhead of trying to clean them and figure out what's going on. As well, even if, say, this one didn't have that missing loop body, it's incredibly short, which doesn't do us much good. I talked about those type three clones, and that's really one of the big issues with this tool: if you have just a single statement, like x = call_some_function(), the number of times that's going to show up in your application is enormous. It'll show up all over the place, you'll get the report back, and it'll say, oh cool, 98% of your code is copied. It's not true.
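A filter like the one described could be sketched with two simple rules: the snippet must parse as valid Python, and it must be long enough to be meaningful. The rules and the threshold here are my illustrative guesses, not the project's actual settings.

```python
# Decide whether a scraped snippet is worth keeping (illustrative sketch).
import ast

MIN_STATEMENTS = 3  # one-liners match everywhere and only produce noise

def is_usable_snippet(snippet):
    try:
        tree = ast.parse(snippet)  # rejects console transcripts, prose, etc.
    except SyntaxError:
        return False
    # Count statements anywhere in the tree, not just at top level.
    statements = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    return len(statements) >= MIN_STATEMENTS

print(is_usable_snippet(">>> print(x)\nTraceback (most recent call last):"))  # False
print(is_usable_snippet("x = call_some_function()"))                          # False
print(is_usable_snippet("total = 0\nfor n in nums:\n    total += n\n"))       # True
```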
So we've taken care to try to pick only snippets that are meaningful enough that they'd actually be something you'd want to look for in your application. We wrote corpuscc separately. I say "dirty" microservices just because two of them do share the same database, which I know is a no-no with microservices, but we went cloud-based with separate microservices because we're focused on future scalability and the ability to spin up more instances of one of the pipeline operations if it's getting too slow and bottlenecking. corpuscc handles all the tokenizing, storage, and code clone detection in the process. It's in four services right now. The first one finds all the files in the application. The next one, cc detection, is what actually kicks off the code clone detection process; that's what you feed your application into. cc tokenizer is what tokenizes, counts, and finds all the metrics about the code snippets. And cc pipeline is what connects to that local database that Stack Oversight is feeding all the data into; it's for loading the database. It's the detector and the pipeline that are tied into the same database. You can see it expanded out there. The database is just a really simple NoSQL store; it's got all the metrics, tokens, and raw snippets, and this is a view of what that looks like. One thing you'll notice over here is that we calculate a lot of metrics besides just storing the source code itself. We're tokenizing it, but we're also counting the unique tokens, trimming it down at various stages, and hashing it to account for type two clones, plus a couple of other metrics for use when we actually do the detection.
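A minimal sketch of those per-snippet metrics might look like this: token counts plus a hash of the identifier-generified token stream, so two type 2 clones hash to the same value. The field names are illustrative, not corpuscc's actual schema, and I'm sketching in Python using the stdlib tokenizer.

```python
# Compute simple clone-detection metrics for one snippet (illustrative sketch).
import hashlib
import io
import keyword
import tokenize

def snippet_metrics(source):
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # ignore layout and comments
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # generify identifiers for type 2 matching
        else:
            tokens.append(tok.string)
    normalized = " ".join(tokens)
    return {
        "raw": source,
        "token_count": len(tokens),
        "unique_tokens": len(set(tokens)),
        "normalized_hash": hashlib.sha1(normalized.encode()).hexdigest(),
    }

m1 = snippet_metrics("foo = foo + 1\n")
m2 = snippet_metrics("bar = bar + 1  # renamed\n")
print(m1["normalized_hash"] == m2["normalized_hash"])  # True: type 2 clones collide
```

Cheap fields like the hash and the token counts let the detector pre-filter database candidates before doing any expensive comparison.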
This allows us to query with higher accuracy and cut down on the number of extraneous snippets we have to compare, because every response we get from the database has to be compared to whatever piece of code in the application we're currently querying for. The way that works right now is a method-wise search: for every method in the target application, we take it, tokenize it, calculate its metrics, find similar snippets within the database, and then do a set similarity comparison against the results from the database to find one that hopefully matches. We use that with a threshold, and if the similarity is above it, it's a code clone. I want to briefly talk about some of the results we have. We actually ran our preliminary results in Java, because we could use a Java library for tokenizing that was pretty simple to get going; now we've switched over to using ANTLR for Python parsing. Basically, we got over five thousand snippets in under an hour, which is pretty sweet. We thought that was big enough to run the initial test, and we ran it against the Spring Framework Java repository, just because Google said it was pretty popular and pretty big. And this is what we got. Wow, that's a big number, right? Six thousand clones in one repository. But what you have to remember is what we said about those type three clones: they're not always relevant. When you actually go in and look at the type threes, a lot of them aren't that relevant, and even among the type twos a lot can be filtered out pretty easily as not important to the overall understanding of what was copied. Really, you're only left with maybe a dozen relevant copy-and-pastes from Stack Overflow. So, some future work.
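The method-wise comparison described above could be sketched as a Jaccard set similarity over token sets with a cutoff. The 0.8 threshold and the names here are illustrative assumptions, not the tool's actual values.

```python
# Compare one method's token set against pre-filtered database candidates
# (illustrative sketch of the set-similarity step).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

THRESHOLD = 0.8  # above this, we call it a clone

def find_clones(method_tokens, candidate_snippets):
    """candidate_snippets: iterable of (snippet_id, token_set) already
    pre-filtered from the database by cheap metrics (length, hashes)."""
    hits = []
    for snippet_id, tokens in candidate_snippets:
        score = jaccard(method_tokens, tokens)
        if score >= THRESHOLD:
            hits.append((snippet_id, score))
    return sorted(hits, key=lambda h: -h[1])

method = {"def", "ID", "(", ")", ":", "=", "for", "in", "+=", "return"}
db = [("q101", {"def", "ID", "(", ")", ":", "=", "for", "in", "+=", "return"}),
      ("q202", {"class", "ID", ":", "pass"})]
print(find_clones(method, db))  # [('q101', 1.0)]
```

Raising or lowering the threshold is one of the user-configurable knobs mentioned later that trades recall against the false positive rate.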
We want to do Python, which, as I said, is currently ongoing, about halfway there. We want Redis or a similar messaging queue to connect the microservices and the scraper, to make it all a more asynchronous pipeline and hopefully make everything a little easier. And, as I said, we want it to be a little more scalable for those requests, because the comparison is pretty time-intensive. It does take a lot to compare an entire application against a repository of millions of snippets, so we want to be able to parallelize that and scale it as needed. This is what we're hoping it'll look like in the end, with the queues. And yeah, we got funding from the NSF, so they require us to put in that little acknowledgement. That's all I have for you; if you have any questions, now is good for Q&A. Yeah, there are a lot of places we could go for finding snippets. We pretty much chose Stack Overflow because we thought it would be a reasonably well-used and pretty huge repository of snippets, but we have also looked at using GitHub, which could be an even larger data set once it's all scraped and done. There are some other common sites people use that we thought might be good to go to, but for now we've just focused on Stack Overflow; yes, we have definitely thought of other places we could go. Audience: Sounds like a fairly high false positive rate. What is your target audience?
So, this was made, oh sorry, he asks what the target audience for this is, given the high false positive rate. We wrote this in conjunction with the QA team, and that was the idea we had going into it: quality assurance. And like I said, there are ways to filter. I chose to show the full-blown number of results we got, but we are able to filter out the extraneous false positives and trim those down based on things like length and accuracy, and there are a lot of metrics in this whole process that can be configured by the user that will definitely lower or raise the false positive rate. Basically, with the QA team in mind, we looked at it more from a high-level point of view: making sure there's not any one piece of code or any one developer over-utilizing Stack Overflow, just as another metric to have during the whole process. Code from Stack Overflow can be good in a lot of cases, if you're using it for "how do I do something" or just getting a general overview. But when you get very specific, like "how do I do this one thing that isn't working for me," and you copy that code without an understanding of how it works, that's where you see the potential for bugs going into future releases and developments. Audience: Could it make sense to apply this approach to finding the duplicates in your own code and try to find ways to extract some information?
Yeah, so he asked: could you apply this to finding duplicates within your own code base, if you had a really huge code base? The answer is yes, although the way we set this up is probably not as applicable. There are a lot of great code clone detection tools out there that look for clones just within a single application, and that's been the main focus of all code clone detection tools up until now. It's only recently that people even thought about finding clones against other repositories or other sources of data. In that regard, the issue with using this kind of approach is that we have this database of snippets that's unchanging, and we've designed it that way; we want that huge data bank of all the Stack Overflow snippets. It would be difficult as you iterate over your own application: suddenly you've deleted a file, and all those snippets you had in the database are probably no longer relevant. But code clone detection in itself is a great tool to use on your application. If you're not using it, there are a lot of great tools out there that can do code clone detection within an application fairly quickly, and you'd probably be shocked by a lot of the results. Audience: You also said "unchanging," but that's not 100% true. Sometimes Stack Overflow answers do get edited, so there is a window between initial posting and the edit during which somebody could copy and paste, and your database would not have the original buggy solution. That's true, and we thought about that. Yes, so because you can edit answers: what happens if somebody posts a buggy answer, later edits it so it's correct, and somebody had copied the buggy answer before then? It is a fairly small window, so it's probably not super relevant, but we also figured that by the point at which people are viewing a question most often, it's already been sorted through and people have changed those answers. If people add answers, we can sort so that we get the most recent updates, and that's the way we've been doing it: once we get this initial huge dump of all the data, we can continue to update as people add answers. As far as editing goes, that's a little more complicated, and we're choosing to just ignore that problem, especially given, as I said, that the data set we're going to have is so huge it may not be that important that we have every code clone or every snippet out there. Thank you, guys.