Okay, cool, so next up. Thank you everyone, thank you for being here. I'm Dario Di Nucci, a research fellow here in Brussels, so I actually didn't have to travel far to get here: I work at the sister university, the Vrije Universiteit Brussel (VUB). If you've been in Brussels for a couple of days, you know some of the stories about Brussels; I myself had never been at the Solbosch campus before. Today I'm going to talk about mining source code: mining idioms, usages, and edits. I will show you that these different aspects actually have a lot in common, and I will try to show you both the limitations of applying these kinds of techniques and their novelty, as well as the advantages we could obtain after a few years of research. Like many researchers in Europe, I have moved around a bit: I'm not Belgian, I'm Italian, from the South of Italy; in recent years I worked in Delft, and after a while I moved to Brussels. So, today I'm going to talk about mining software repositories. We might think that a software repository is just a git repository, or maybe an issue tracker, but when we talk about software repositories we mean not only versioning systems and issue trackers, but also marketplaces, and the communication channels in which you, as developers, store your information. What we would like to do is gather this information: to mine these repositories, extract this data, and create a common history for a software project. Once we have all this information, we can, for example, apply machine learning techniques, and you will hear a lot about this today. Why do we need this? Because you are providing a lot of data to public repositories, and there is something these repositories could do for you in return; that is what we are trying to do. We are trying to gather all
your information, but only to help you: to provide you with information you can use, so that you can reduce the time you spend coding your projects, because no one likes to debug silly mistakes. Last year in Brussels we started a collaboration with a company called Raincode, based in Brussels. Raincode works on compilers: they are very good at modernizing software languages. It may seem a bit strange, but even though we have really modern languages today, a lot of code out there is still legacy code. We are trying to help the Raincode people modernize languages, because this is an important market and there is a growing need for these kinds of techniques. One of the problems is that, today, what the Raincode people do is pretty much manual: they do everything by hand, without any IDE or tool support. Just to give you an idea of the complexity of what they are doing: in some cases they try to replicate the behavior of a compiler without running the compiler. You have some code, you try to understand how this code really works (and we are talking about COBOL, or things even stranger than COBOL), and then you have to modernize this code into something you can run on Azure, or on your own cloud. I mention Azure because Raincode works with Microsoft. The goal of this project is to automate that migration, which requires pattern discovery. To do this, we are mining three different use cases, with three different objectives. Going into the details, I will show you that mining code idioms, mining library usages, and mining systematic edits (I will explain shortly what I mean by these) are actually not very different. Why are we doing this? First of all, we would like to comprehend these programs, because the first issue that you have when
you work with legacy code is simply understanding what the code is doing. Then we would like to do some anomaly detection: if you have a large code base, then, starting from a new piece of code, you can understand whether you are making a mistake. But our final goal is to build a full modernization assistant that is able to tell you which mistakes you are making and to produce a better language without a lot of effort, that is, to transpose your code from one language to another. Our approach is divided into three steps. The first step is simply to import the code, because our idea is that this assistant should be language parametric: what we would like is a common framework that is able to work with different languages, not only COBOL, not only Java, not only Python. To build such a common framework we need a meta-model. This meta-model should be able to gather the information and provide it to our pattern mining algorithms. But first of all we have to work on importing the data, and at this point of our project we have different importers: for example, we are able to import Java code from open source projects, but we are also able to import legacy systems, for example COBOL code. The second step, once we have this common meta-model, is to run our pattern mining algorithms. As I will show you briefly, in theory it seems that everything is easy, that we have everything needed to deal with this kind of issue; in practice, I will show you that it is not so easy, and we may have to work on this for several years. The third part is to present the patterns we were able to mine to developers, so that developers can really understand whether there are some common usages for a library, or whether there are
some code idioms, or, for example, whether some commits are repetitive. First of all, let me give you some information about idioms. This morning Miltos was talking about idioms: an idiom is a syntactic fragment that recurs across software projects and serves a single semantic purpose. This definition is a bit vague, because it is hard to pin down what an idiom really is, so let's look at a simple example. Here we have three different pieces of code. In all of them you have a try, then a condition in which you invoke the method moveToFirst, then a body in which you do something, then the finally for the exception handling, and then you close the object. As you can see, in the end these three pieces of code do the same thing, so you can generalize them into this idiom, and when you have to modernize and move from one version of your system to another, you can use this kind of information. Now, here we are talking about only three pieces of code, and it is pretty straightforward to see that they have something in common; but when we talk about ecosystems, or about huge software repositories, it turns out it is not so easy, and of course a developer could not do something like this by hand. If we wanted to apply a purely textbook approach, the idea would simply be to run a frequent itemset algorithm. In theory this problem is not very hard to solve: from a piece of code you create a tree representation, because from the code you can build an abstract syntax tree, and then you can run your algorithm without any problem. But in practice, when you run it, you discover that most of the idioms you find are not so interesting: in the end, they are boring.
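The recurring fragment described above can be reconstructed roughly as follows. This is a minimal, self-contained sketch of the try / moveToFirst / finally / close shape (it matches the well-known Android Cursor idiom); the Cursor class here is a tiny stand-in written for this example, not the real android.database.Cursor API.

```java
import java.util.List;

public class CursorIdiom {
    // Minimal stand-in for android.database.Cursor, just for this sketch.
    static class Cursor {
        private final List<String> rows;
        private int pos = -1;
        private boolean closed = false;
        Cursor(List<String> rows) { this.rows = rows; }
        boolean moveToFirst() { pos = 0; return !rows.isEmpty(); }
        String get() { return rows.get(pos); }
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    // The idiom: try { if (c.moveToFirst()) { ... } } finally { c.close(); }
    static String firstRowOrDefault(Cursor c, String dflt) {
        try {
            if (c.moveToFirst()) {
                return c.get();   // body: do something with the first row
            }
            return dflt;
        } finally {
            c.close();            // always release the cursor, even on failure
        }
    }

    public static void main(String[] args) {
        Cursor c = new Cursor(List.of("alice", "bob"));
        System.out.println(firstRowOrDefault(c, "none")); // prints "alice"
        System.out.println(c.isClosed());                 // prints "true"
    }
}
```

Each of the three fragments on the slide instantiates this shape with a different body; the idiom is the shared skeleton, with the body left as a hole.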
By boring, we mean the following: if I tell you that a Java class is composed of attributes and methods, you will say that this is not so interesting; it is nothing new, we already knew it. The problem is that this leads to a trade-off: on the one hand, you would like to discover very particular, very tricky idioms; on the other hand, after a while your search space simply explodes, so in the end you are not able to gather patterns at all, and in most cases you just run into out-of-memory issues, so you cannot really do anything. What we are trying to do is to explore novel pattern mining algorithms for source code, and then to incorporate them into our tool. There are various applications of mining code idioms: first, you can discover new syntactic patterns; second, you can discover code that deviates from a pattern; and third, you can propose new actions to developers. This is the overview of our framework. As I was telling you, in the general view we have a source code importer that gathers the information into a common representation; then we have some mining preprocessors that help us clean up the representations; and then we have the pattern miner. It is important to say that for now we are running a pretty simple algorithm, frequent itemset mining, but we are exploring all the possibilities that this kind of algorithm offers. Using the algorithm per se is not very hard, but adapting this kind of algorithm to our case study is not so easy, also because we have to remember that, while for the first year of our project we are working on Java projects, in theory we would like to apply this knowledge to all languages.
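The frequent itemset step just mentioned can be sketched as follows: each code fragment is flattened into a "transaction" of AST node types, and a textbook Apriori-style algorithm looks for sets of node types that co-occur often. This is a toy illustration, not our actual miner; the node labels and the threshold are made up for the example.

```java
import java.util.*;

public class FrequentItemsets {
    // Each "transaction" is the set of AST node types in one code fragment.
    static List<Set<String>> mine(List<Set<String>> txs, int minSupport) {
        // Level 1: count single node types and keep the frequent ones.
        Map<String, Integer> count = new HashMap<>();
        for (Set<String> t : txs)
            for (String item : t) count.merge(item, 1, Integer::sum);
        List<Set<String>> frequent = new ArrayList<>();
        List<Set<String>> level = new ArrayList<>();
        for (Map.Entry<String, Integer> e : count.entrySet())
            if (e.getValue() >= minSupport) level.add(Set.of(e.getKey()));
        // Apriori: grow frequent k-sets into candidate (k+1)-sets.
        while (!level.isEmpty()) {
            frequent.addAll(level);
            List<Set<String>> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i++)
                for (int j = i + 1; j < level.size(); j++) {
                    Set<String> cand = new TreeSet<>(level.get(i));
                    cand.addAll(level.get(j));
                    if (cand.size() != level.get(i).size() + 1
                            || next.contains(cand)) continue;
                    int support = 0;
                    for (Set<String> t : txs) if (t.containsAll(cand)) support++;
                    if (support >= minSupport) next.add(cand);
                }
            level = next;
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> txs = List.of(
            Set.of("Try", "If", "MethodInvocation", "Finally"),
            Set.of("Try", "If", "MethodInvocation", "Finally", "Return"),
            Set.of("Try", "If", "MethodInvocation", "Finally", "Assignment"),
            Set.of("For", "Assignment"));
        List<Set<String>> found = mine(txs, 3);
        System.out.println(found.contains(
            Set.of("Try", "If", "MethodInvocation", "Finally"))); // prints "true"
    }
}
```

Lowering minSupport lets more candidates survive each level, so candidate generation blows up combinatorially; that is exactly the search-space explosion and the out-of-memory behavior mentioned above.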
After you set up your algorithm, for instance, it may work only for your case study, a really small case study, and not generalize at all. After the pattern miner (and we are exploring different kinds of algorithms), the next step is the pattern matcher: after mining a lot of patterns, you apply your matcher to discover whether another piece of code contains these patterns, and only after this can you provide the information to developers, so that the developers are able to improve their code base. The first problem with this kind of application is that it is highly time-consuming, and tweaking the algorithm is not easy. Something that people sometimes fail to consider about machine learning is that every small detail matters: changing the configuration a little means that you are going to spend a lot of execution time, and maybe after spending all this time the results you get are not very useful. Another problem is that in some cases we generate a really large number of patterns, as well as redundant patterns: some patterns are more related to the grammar of the language than to the use of the language. To solve these issues we are trying to add constraints that help reduce the search space; the search space in our case is very, very large, but we are trying to fix this kind of issue. Yet another problem is that setting the constraints is not straightforward: again, every detail matters, and getting these details right is not easy. The last issue, and maybe the most important one, is that evaluating the patterns is not easy at all. One of the issues we had in the first year is that we were able
to mine patterns, but in the end even saying what is boring and what is interesting is not easy. This means that after hours of computation you may have a lot of patterns, but you do not know whether they are useful or not; and if you do not know this, you are not able to tweak your algorithm. In summary: we developed a language-parametric framework to mine code idioms; we are now running this kind of algorithm; we have work in progress on reducing the search space by applying heuristics and constraints; and we are trying to understand the real value of idioms in order to improve the mining process. The second part of my presentation is about mining usages. You know the difference between libraries and frameworks: libraries differ from frameworks in that, to use a library, you just need to understand the APIs and call them, while to use a framework you have to implement or extend some parts of the framework itself. One of the issues with frameworks today is that, from the developer's point of view, you do not know how to use a framework, and how to correctly extend it for a given functionality. What you do today is take your framework and start analyzing the different ways in which it could fit your case study. So, first of all, from the developer's point of view, we would like to understand how you should extend a given framework. Another problem, when you try to mine library usages, is on the side of the framework maintainer: the maintainer does not know anything about how people are using their library. If you are the maintainer of a library and you would like to remove a functionality that maybe no one is using, in the end you do not know this. I
mean, based on comments or on your own opinion you are going to add or remove APIs, but in the end you do not really know what the impact of your change will be. When you would like to extend a framework (and this is a definition from a paper from a few years ago), you have an extension point, which is the part you should extend, and a usage, which defines how you are going to extend your framework. There are different kinds of extension points. A really simple extension point is one in which, in your code base, you are just invoking a method; this is the easiest case, and if you think about it, extending a framework in this way is not very different from using a library. The second kind is when you not only invoke a method, but also customize it before using it. The third kind is when you actually extend a class, providing new functionality. Now, if you have a look at this picture, which shows another piece of work we are doing within INTiMALS, it may seem that this framework is different from the one I showed you before; but if you think about it, it is not very different, because in every case we need an importer, then a miner, then a matcher, and afterwards a visualization tool. This is to tell you that when you try to mine different aspects of source code, in most cases there are not many differences, at least from a very high-level point of view. We performed a case study in which we developed an importer for Scala: we are able to mine the source code and to tag each part of it, and based on this tagging we are able to extract all the data we need. First of all, we clone the projects.
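The three kinds of extension points can be illustrated with a small example. The Pipeline class below is a toy "framework" invented for this sketch, not one of the frameworks we actually mined; it just makes the three shapes concrete.

```java
import java.util.function.Function;

public class ExtensionPoints {
    // A toy "framework" class exposing the three kinds of extension points.
    static class Pipeline {
        Function<String, String> step = s -> s;               // customizable hook
        void setStep(Function<String, String> f) { step = f; }
        String run(String input) { return transform(step.apply(input)); }
        protected String transform(String s) { return s; }    // override point
    }

    public static void main(String[] args) {
        // 1) Plain method invocation: use the framework like a library.
        Pipeline p1 = new Pipeline();
        System.out.println(p1.run("hello"));                  // prints "hello"

        // 2) Customize before invoking: configure a hook, then call.
        Pipeline p2 = new Pipeline();
        p2.setStep(String::trim);
        System.out.println(p2.run("  hello  "));              // prints "hello"

        // 3) Extend a framework class and provide new behaviour.
        Pipeline p3 = new Pipeline() {
            @Override protected String transform(String s) { return s.toUpperCase(); }
        };
        System.out.println(p3.run("hello"));                  // prints "HELLO"
    }
}
```

Usages of kind 1 look just like library calls; kinds 2 and 3 are what makes framework extension harder to mine, because the client code is woven into the framework's control flow.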
Then we compile the projects, we do type resolution, and only after this are we able to identify the extension points. After this part we run our miner; compared with the previous approach, in this case we run a subgraph mining algorithm that is Apriori-based, and then we visualize the data. What is important is that also in this case we tried to evaluate our approach, mining several projects from the Scala ecosystem: we mined five different frameworks, namely Spark, Akka, Mockito, Hadoop, and Play, and we saw something pretty interesting. First of all, we saw that mining this kind of pattern is not very hard, and the accuracy of these patterns is, in the end, very high: in most cases the accuracy was above ninety percent. One thing we were not expecting, for example, is that in most cases, to extend a given class, you do not have many extension points: in most cases a class has at most four extension points. We also discovered that most of the patterns our miner was able to find were pretty simple. From this point of view, you can see the trade-off I mentioned before: on the one hand your patterns may be simple, on the other hand you have the accuracy, and the higher the accuracy, the simpler the patterns. This means that in the future we have to try to capture patterns that are more complicated; our accuracy may then start dropping, but it is also true that as a developer you care about complex patterns, because if the extension is easy you do not need this kind of technology at all. Finally, I will talk about mining edits. Up to this point we have talked about snapshots: you have your snapshot, and you
analyze it, either from a syntactic point of view or from the point of view of a given library or framework. When we talk about systematic edits, instead, we are talking about commits. Let's consider a repository in which we have these changes, then these changes, then these changes: we have three instances of a change, but in the end only one systematic edit. It could be interesting to know about these changes because, if a new change arrives and it differs from the systematic changes, it is probable that we are heading toward a bug. Systematic edits can be tedious, and performing them manually is error-prone: you can easily introduce mistakes. What we did is propose a tool that, given a Java software project, is able to mine the whole history of the project, to represent the source code changes as AST nodes, and to apply a frequent itemset mining algorithm that is a bit different from the one I showed you in the previous slides. The applications are similar to the ones I showed before: first, you would like to detect errors in code; then, you would like to assist developers; and the best thing would be to automatically generate some transformations based on existing instances. In this case the approach is a bit different, because we do not start from a snapshot but from a git repository with different code revisions. We apply a change distiller, so we are able to gather all the edit scripts; then we group these edit scripts, we apply a change equivalence criterion, and in the end we apply our frequent itemset mining algorithm. Now let's consider an example, because it is much easier to understand that way. Let's consider that we have this commit. In this commit we
are adding two lines of code, and these two lines are more or less the same; there is only one really small change. So, first of all we are inserting an if statement; then we are inserting a method invocation; then we are adding this expression, computing equals on the argument; and then we are returning an integer. We can say that we have two different groups of changes, this one and this other one, but in the end the two groups are doing the same thing. Now, what is the grouping criterion? First of all, we group the changes into transactions: we group the changes based on the method in which they occur, so here we get these two groups. Of course there are limitations, and our approach can be a bit naive in some cases, because, for example, we exclude changes that occur across different methods. The second thing is to understand, given two different groups of changes, whether they are the same; otherwise we cannot compare the edit scripts. The change equivalence criterion is based on the change type, the subject, and the context. The change type is easy to understand: in the case before we are adding lines of code, so in the end this is an insert. Given this edit, we know what the subject is, and we anonymize the identifiers, because otherwise it is hard to group the patterns together. Finally we define the context which, if you think about it, is just the place in which the subject occurs. Given these two groups of changes, we transform the edits into this kind of representation. We then evaluated the correctness of the tool, applying it to the code base of our partner company in Belgium.
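The equivalence criterion just described (change type, anonymized subject, context) can be sketched as a normalization step. The representation and the anonymization rule below are simplified assumptions for illustration, not the exact ones our tool implements.

```java
public class ChangeKey {
    // A change is equivalent to another when change type, anonymized
    // subject, and context coincide (field names here are illustrative).
    record Edit(String changeType, String subject, String context) {}

    // Anonymize identifiers so that, e.g., "tcp.equals(arg)" and
    // "udp.equals(other)" collapse onto the same subject. Deliberately
    // rough: every lowercase-initial token, keywords included, becomes ID.
    static Edit normalize(String changeType, String code, String context) {
        String anonymized = code.replaceAll("\\b[a-z][A-Za-z0-9]*\\b", "ID");
        return new Edit(changeType, anonymized, context);
    }

    public static void main(String[] args) {
        Edit a = normalize("INSERT", "return tcp.equals(arg)", "IfStatement");
        Edit b = normalize("INSERT", "return udp.equals(other)", "IfStatement");
        // The two inserted lines normalize to the same key, so they are
        // grouped as instances of one systematic edit.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```

Anonymizing keywords like return alongside real identifiers does no harm for grouping, since both sides of the comparison are normalized the same way.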
We also mined different repositories from Android projects. It is important to say that, to really understand whether our edit scripts were correct or not, we had to evaluate the scripts manually; as I told you before, this is something you always have to do. One of the problems of doing research in academia is that sometimes we do not have real developers available, so it is a bit tricky, but in the end we tried to play the role of the developers ourselves. What we understood from this study is, first of all, that in most cases the majority of systematic edits have only a few instances. That is one of the problems in applying our mining algorithm: having few instances means you have to spend more computational effort, because you cannot prune the search. Another observation is that in most cases you see, on average, three or four AST-level changes per instance, and that larger instance sizes can occur together with a small number of instances. In summary, we have a technique that is able to identify systematic edits; we see that 12.5 percent of commits contain systematic edits; and most of these instances are pretty small, although they can have a large size. As future work, we are exploring different configurations, we are trying to mine for specific types of systematic edits, and we are trying to mine across commits, because one of the problems we have today is that we mine a single commit at a time, so if there are changes that span several commits, we are not able to mine that information. Thank you. So, the question is whether, when we were studying systematic edits, we were able to mine some best practices across them. That is a nice question, and it is actually what we would like:
it is the final goal of this kind of study. The final goal is to understand whether there are some best practices that you can apply everywhere. But what you have to understand is that, in most cases, a systematic edit has a really small number of instances. If you mine a software repository and a systematic edit has, say, three instances (in our case we saw that for 2500 systematic edits we had only three instances), that cannot be a best practice. One of the problems in applying this kind of technique is scalability: what we were doing is going through the history of each project, analyzing the whole history of one project at a time. What you should do, to obtain that kind of information, is mine a complete ecosystem, so that you are able to understand whether there are best practices that different developers are applying; in our case, most repositories do not have many contributors (you certainly do not have thousands of developers), so it is really hard to mine a best practice that you could apply everywhere. Okay, so the next question is whether, when mining different projects, we found similar idioms. That is a nice question, and it depends on what an idiom is for you. For example, say you have a project in Java 8 and you would like to adopt a new feature available in Java 9: you could say that, if I mine the projects that are moving from one version to another, I will find this idiom. If you consider that an idiom, then you will find this kind of idiom across projects. But there are some idioms that are very peculiar, and
they are really specific to the developers implementing the project, so in those cases you will not find them across projects. With that, I will finish the presentation. Thank you.