Okay, we're good. So, let's roll for the last talk of the day. Hello, everyone. It's really amazing to see the room full of people, even though this is the last talk of the last session on the last day of the conference. So, thank you and welcome. This is collaborative work with my colleagues and supervisor at the Athens University of Economics and Business, so all the good parts you are going to see today belong to them. Anything that makes you say "what is that?", you can point to me. Well, let's start.

So, what are we going to talk about? We will discuss smells, code smells, and some earlier speakers have made my job a little easier, so I'll probably go through those parts very quickly. We'll talk about smells, we'll talk about how people detect smells today, and we'll talk about how we tried to detect smells using deep learning. I'll go into a bit of technical detail about what we did, what kind of experimental setup we put together, and what kind of results we got. Sounds good? Great.

Before going into that, let me briefly talk about myself and give a disclaimer that I'm not really a machine learning expert like most of you; I'm a software engineering researcher, completing my PhD hopefully in a couple of weeks. Refactoring, smells, code quality: these are the topics I mostly work on.

So, let's start with the term "smell". Who coined the term smell? Any guesses? Anybody? Not really. The term became well known through Martin Fowler's famous 1999 book, but it was actually coined by Kent Beck, because the chapter about smells in that book was written by Kent Beck. They defined the term very casually: certain structures in the code that suggest, or sometimes scream for, the possibility of refactoring.
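To make that definition concrete, here is a tiny illustration of such structures. The talk's subject systems are C# and Java; this sketch uses Python only as an analogue, and the function and constant names are invented for the example.

```python
# A smelly version: a magic number and a silently swallowed exception,
# two of the "structures that scream for refactoring".
def shipping_cost_smelly(weight_kg):
    try:
        # Magic number smell: 4.75 and 22.5 appear with no explanation.
        return weight_kg * 4.75 + 22.5
    except TypeError:
        pass  # Empty catch block smell: the error disappears silently.

# The refactored version names the literals and surfaces the error.
BASE_FEE = 22.5       # flat handling fee
RATE_PER_KG = 4.75    # carrier rate

def shipping_cost(weight_kg):
    if not isinstance(weight_kg, (int, float)):
        raise ValueError(f"weight must be numeric, got {weight_kg!r}")
    return weight_kg * RATE_PER_KG + BASE_FEE
```

Both versions compute the same cost for valid input; the difference is purely in how loudly the code asks to be refactored.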
Later on, especially in the academic world, many people defined the term much more rigorously, and you can find all those definitions here. If you are interested, I put together a smell catalog of at least the known and common smells; right now it contains 264 smells belonging to different domains and subdomains of software, and you can find the catalog here.

Let me give some examples before I delve into the detection part. When we talk about smells, we can classify them based on granularity, on scope, and on the impact they make. The lowest granularity is implementation smells, which you can detect, or sense, just by looking at, for example, a single method. Magic number, complex method, or an empty catch block: these are the kinds of smells you can spot just by looking at a method. The next granularity is design smells, which sit a little higher; these are the kinds of smells you detect by looking at classes, abstractions, and the relationships among them. Examples are multifaceted abstraction, when a class is realizing more than one responsibility, or insufficient modularization, more commonly known as God class. And when you view a software system as components and the relationships among them, and you detect smells at that granularity, you are detecting architecture smells; examples are God component and feature concentration. Feature concentration is nothing but a component realizing more than one feature, and scattered functionality is a single responsibility scattered across multiple components.

So, how have people been detecting smells so far? If we analyze all the different smell detection algorithms and methods available today, there are basically five different categories.
Metrics based, rules or heuristics based, machine learning based, optimization based, and history based; the most common ones are metrics and heuristics. What people normally do is this: you have source code, and you prepare a source model out of it. A source model could be many things; the simplest example is an AST, an Abstract Syntax Tree. From that source model you compute metrics, and then you apply a threshold to classify whether a smell is present in that particular method, class, or component.

Recently, some attempts have been made to detect smells using machine learning, and this is, again, a very high-level view of what people do there. You have source code and you again prepare some sort of source model. You have a machine learning algorithm, and there are different kinds of machine learning algorithms you may apply. You have some existing, labeled examples that you use to train the algorithm, and once you have a trained model, you use it to infer, for new code, whether a given code snippet or code fragment should be classified as smelly or non-smelly.

Let me briefly cover the existing academic work on machine-learning-based approaches. People have used support vector machines, Bayesian belief networks, logistic regression, and even CNNs. For the first three of these, the input features are metrics, typically object-oriented metrics. And that's basically a double-edged sword.
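The metrics-plus-threshold pipeline just described can be sketched end to end: build a source model (here Python's own AST, as an analogue of the C#/Java tooling), compute a metric, and apply a threshold. The complexity proxy and the threshold value 10 are common conventions chosen for illustration, not the rules any specific tool uses.

```python
import ast

# Decision-point node types used as a rough cyclomatic-complexity proxy.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def cyclomatic_complexity(func_source):
    """Source model (AST) -> metric: 1 entry point + 1 per decision."""
    tree = ast.parse(func_source)
    return 1 + sum(isinstance(n, DECISION_NODES) for n in ast.walk(tree))

def is_complex_method(func_source, threshold=10):
    """Metric -> threshold -> smell classification."""
    return cyclomatic_complexity(func_source) > threshold

simple = "def f(x):\n    return x + 1\n"
branchy = "def g(x):\n" + "".join(
    f"    if x == {i}:\n        return {i}\n" for i in range(12)
)
```

Here `simple` scores 1 and is clean, while `branchy` scores 13 and is flagged as a complex method, which is exactly the heuristic judgment the threshold encodes.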
On one hand, you are helping the machine learning algorithm pick up the classification features quickly, so that it can decide whether something is a smell or not; but on the other hand, you are reducing the search space and limiting the capabilities of the machine learning or deep learning algorithm, because you are giving it only the metrics, so the algorithm can do at best what those metrics can represent. If a metric does not capture the feature or characteristic that is helpful in detecting a smell, then we are not going to get a good classifier.

Another problem with the existing work is validation: either many details are missing, or, where the details are given, the validation is done on balanced samples. This morning we also saw that between a balanced set and an imbalanced set there can be a really drastic change in performance. If you represent a realistic case, in which maybe 1% of classes are smelly and 99% are non-smelly, and you apply some sort of machine learning algorithm, your precision-recall curve will look like this.

That was one of our observations, and with this background we set up two major research questions for ourselves. The first one: is it really feasible to apply deep learning to detect smells, and if it is, which method performs better? We analyzed CNNs and RNNs, and we give inputs in 1D and 2D, so we basically have three models to compare: CNN 1D, CNN 2D, and RNN. In the second research question we explored whether transfer learning is feasible. And what is transfer learning? Transfer learning is nothing but a technique where we exploit the commonalities between different learning tasks.
So what we are going to do, essentially, is learn a model from one programming language and then apply that trained model to another programming language. To do this, starting from the research questions I already talked about, our first step is to download source code, and GitHub is the go-to source. We downloaded C# and Java repositories. We used Designite to detect smells in both kinds of repositories, and we used CodeSplit, again our own tool that we have made available, to generate code fragments from these big projects; I will share all the details on where you can find these tools. I will go through each step individually in more detail; this is just an overview. With the code fragments and the detected smells, we classify the fragments into positive and negative samples. Then we tokenize them. After tokenizing we perform a pre-processing step, which is nothing but removing duplicates, and finally we feed the data to the deep learning models. That is the overall setup we used.

Let me go into a bit more detail, one by one. Out of roughly 2,500 candidates, we selected and downloaded more than 1,000 repositories containing C# code, and we downloaded 100 Java repositories just for validation, since we are using the Java repositories only for validation. How did we select these repositories? We have 8 quality dimensions: architecture, continuous integration, unit testing, and so on. We selected every repository that has favorable numbers in at least 6 out of the 8 dimensions, and a number of stars of at least 5; we discarded everything with fewer than 5 stars. With that we got these numbers, and we downloaded the code.
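The selection criteria above can be sketched as a simple filter. Only three of the eight dimensions are named in the talk (architecture, continuous integration, unit testing); the remaining dimension names below are placeholders, and the sample repositories are made up.

```python
# Eight quality dimensions; only the first three are named in the talk,
# the rest are hypothetical placeholders.
DIMENSIONS = ["architecture", "continuous_integration", "unit_test",
              "documentation", "history", "license", "issues", "community"]

def is_selected(repo):
    """Keep a repo with >= 6 of 8 favorable dimensions and >= 5 stars."""
    favorable = sum(repo["scores"].get(d, 0) > 0 for d in DIMENSIONS)
    return favorable >= 6 and repo["stars"] >= 5

good = {"stars": 120, "scores": {d: 1 for d in DIMENSIONS}}
unpopular = {"stars": 3, "scores": {d: 1 for d in DIMENSIONS}}
low_quality = {"stars": 500, "scores": {d: 1 for d in DIMENSIONS[:4]}}
```

`good` passes both gates, `unpopular` fails the star threshold despite good scores, and `low_quality` fails the 6-of-8 rule despite its popularity.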
The second thing we did, for each project, was to split the big project into individual samples, individual fragments. An individual fragment is a method when we are detecting implementation smells, and a class when we are detecting design smells. Depending on that, we split the code into either methods or classes and put them in separate files. We did the same for Java.

The next step is smell detection. We used Designite, that is, Designite for the C# code and DesigniteJava for the Java code, and detected all the smells that can be found in the code. By the way, Designite is a tool that can detect 19 design smells, 7 architecture smells, and 11 implementation smells, and the Java version is being brought up to the same level. You can download these tools; on the last slide I will share all the links. So now we know what smells a project has, and we know all the code fragments; with these two pieces of information we classify each code fragment as either a positive sample or a negative sample.

After that we use the tokenizer. The tokenizer basically takes a code fragment and converts it into a set of integer tokens, and in that context there are two important things to mention. First, the tokenizer defines specific ranges for specific kinds of tokens: for example, each reserved keyword is always assigned the same token, and user-defined symbols are assigned tokens from a specific range. Second, the tokenizer currently supports six languages, including C# and Java, which were our subject systems. It looks like this: if you take a very small method and tokenize it in 1D, it becomes a 1D vector, and in 2D it becomes a 2D matrix, something like this.
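A toy version of that tokenization idea can be sketched as follows. The concrete ID ranges (keywords from 1, user-defined symbols from 1000) are illustrative assumptions, not the ranges the actual Tokenizer tool uses, and the keyword list is a small excerpt.

```python
import re

# Reserved keywords always map to the same small, fixed IDs.
KEYWORDS = ["public", "int", "return", "if", "else", "class", "void"]
KEYWORD_IDS = {kw: i + 1 for i, kw in enumerate(KEYWORDS)}  # IDs 1..7

# User-defined symbols get IDs from a disjoint range, assigned in order
# of first appearance within a fragment.
IDENTIFIER_BASE = 1000

def tokenize(code):
    symbols = {}
    tokens = []
    for word in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if word in KEYWORD_IDS:
            tokens.append(KEYWORD_IDS[word])
        else:
            symbols.setdefault(word, IDENTIFIER_BASE + len(symbols))
            tokens.append(symbols[word])
    return tokens

# A very small method becomes a 1D vector of integer tokens:
vec = tokenize("public int twice(int x) { return x + x; }")
```

In the resulting vector, `public` and `int` always get their fixed keyword IDs, while `twice` and `x` get fragment-local IDs above 1000; reshaping one such vector per source line would give the 2D matrix form.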
Again, it's important to go into a little more detail and see how we prepared the data and trained the models, so I'm showing you a very specific example for one smell. Here we have approximately 5,000 positive samples for a smell and more than 311,000 negative samples; you can see that the data is highly imbalanced. We split it 70/30, 70% for training and 30% for evaluation: for training we take this positive part and this negative part, and similarly for evaluation. When we train the model we do balanced training, which means we use an equal number of positive and negative samples; we simply discard the surplus in the negative part, so positives and negatives are equal in number, but only for training. For evaluation we keep the ratio as it is, we don't change it, and all the results I'm going to show are from this realistic evaluation.

We selected 4 smells, all of different kinds: 3 implementation smells and 1 design smell. Complex method is a method with high cyclomatic complexity. Magic number is a numeric literal used in an expression without explanation. Empty catch block is a catch block where nothing is written to handle the exception. And multifaceted abstraction is a difficult-to-detect smell, which is exactly why we chose it: it is very different from the other 3 because it has a semantic meaning. You can't detect multifaceted abstraction just by looking at the code; you need to understand what the class does, because what the smell means is that the cohesion of the class is low. These are the 4 smells we chose, and now let's talk about what kind of model architecture we prepared, starting with the CNN.
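Before the architecture details, the sampling scheme just described can be sketched: split 70/30, balance only the training side by discarding surplus negatives, and leave the evaluation side at the original imbalanced ratio. The sample counts mirror the talk's example; everything else is an illustrative stand-in.

```python
import random

def split_balanced_train(positives, negatives, train_frac=0.7, seed=42):
    """70/30 split; training balanced, evaluation left imbalanced."""
    rng = random.Random(seed)
    rng.shuffle(positives)
    rng.shuffle(negatives)
    p_cut = int(len(positives) * train_frac)
    n_cut = int(len(negatives) * train_frac)
    train_pos, eval_pos = positives[:p_cut], positives[p_cut:]
    train_neg, eval_neg = negatives[:n_cut], negatives[n_cut:]
    # Balanced training: keep only as many negatives as positives.
    train_neg = train_neg[:len(train_pos)]
    return (train_pos, train_neg), (eval_pos, eval_neg)

pos = list(range(5_000))      # ~5,000 positive samples, as in the talk
neg = list(range(311_000))    # ~311,000 negative samples
(train_p, train_n), (eval_p, eval_n) = split_balanced_train(pos, neg)
```

After the split, training holds 3,500 positives against 3,500 negatives, while evaluation keeps 1,500 positives against 93,300 negatives, so the reported results reflect the real-world ratio.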
You can see that we have convolution, batch normalization, and max pooling layers, and these layers are repeated. Repeated in the sense that we experimented with the number of deep layers, 1, 2, and 3, and when we say deep layers we refer only to this group: if the number of deep layers equals 3, we repeat the group 3 times. We did a kind of grid search, meaning we chose different hyperparameter values and experimented with all their permutations and combinations, and we used a dynamic batch size depending on the input sample size: if the number of samples is very small, we use a smaller batch size, and if it is very large, we use a bigger one. For regularization we use early stopping with patience 5, so that we don't overfit, along with model checkpointing; you can see some of the other configuration here, for example the dropout layer has a 0.1 dropout rate and the dense layer has 32 output units with ReLU activation, and so on.

Similarly, for the RNN we have an LSTM layer, preceded by an embedding layer, and when we have multiple deep layers we repeat the LSTM layer. Again we have different hyperparameters, a dynamic batch size, and similar callbacks, although here we use patience 2, because training an RNN model is very expensive; if time permits, I'll talk about that in a little more detail.

Essentially, we had two phases. In the first phase we did a grid search over the different parameter configurations, 144 configurations for CNN and 18 for RNN, keeping a 20% validation set aside. In the second phase, once we knew which configuration works best, we ran it again, obviously with validation split 0, and we performed all the experiments on a Greek supercomputing facility, with one GPU and 64 GB of memory.
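The two-phase grid search can be sketched as enumerating the Cartesian product of the hyperparameter values. The specific values below are hypothetical stand-ins chosen only so the grid yields the 144 CNN configurations mentioned in the talk; the real grid used other factors.

```python
from itertools import product

# Hypothetical CNN hyperparameter grid: 3 * 4 * 3 * 4 = 144 configs.
cnn_grid = {
    "deep_layers": [1, 2, 3],          # repetitions of conv/BN/pool group
    "filters": [8, 16, 32, 64],
    "kernel_size": [5, 7, 11],
    "pooling_window": [2, 3, 4, 5],
}

def configurations(grid):
    """Enumerate every combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, vals))
            for vals in product(*(grid[k] for k in keys))]

cnn_configs = configurations(cnn_grid)
# Phase 1: train each config with a 20% validation split; keep the best.
# Phase 2: retrain only the best config with validation split 0.
```

The same helper would enumerate the smaller 18-configuration RNN grid from a grid with fewer factors.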
So, what did we get? Before I go into that: I said initially that we started without much machine learning knowledge, so the first thing I did was compute accuracy. The first time I ran the model I got 99% accuracy, and I was jumping; I told my supervisor, "see, 99%!" He looked at me: "hmm, you have imbalanced data, hugely imbalanced data; obviously you'll get 99% even if your model always predicts just 1 or 0." So we learned that accuracy is not the right measure, and the first thing we added is ROC-AUC. This is the kind of ROC-AUC we got; the area under the curve crosses 80% in some cases, and the lowest is around 50%. However, we realized that this is still not quite the right measure, so we moved to precision, recall, and precision-recall curves; this is what we got, and we summarized precision and recall into F1, the F-measure, to have one comprehensive number. You can clearly see that some of the smells these deep learning architectures can pick up, but some they can't. In general, F1 is low because we have low precision, and we have low precision because we are predicting a lot of false positives. We can optimize in either direction, and you probably know better than I do that precision and recall always trade off: if we want high precision, we have to compromise on recall.

That was the first result. Going into a bit more detail: is CNN better, or RNN, in our case? If you look at the maximum F1, there is hardly any difference, and even plotting all the cases doesn't decide it; we cannot say decisively whether CNN 1D or CNN 2D is better. We also compared CNN against RNN, and there I would not consider one of the smells at all, because in all the
cases, in all three models, the performance was very low, something like 0.004, and it's not really fair to compare 0.004 against 0.006, because tiny absolute differences show up as multi-fold ratios. However, where RNN can detect something, it does very well. In the case of complex method it's not doing that great, probably because it cannot capture the structural property of the method; in the other two cases, RNN is probably picking up the relevant feature, and that's why it does much better than the other two models.

Another experiment we did: is it beneficial to increase the number of deep layers? What we found is that for the second layer we may get some boost, but after that performance starts decaying, or at best nothing changes. So again, when applying deep learning we need to check whether adding layers is actually beneficial, or whether we are just increasing our training time.

This one is even more interesting, so let me repeat what we did here: we trained the model on C#, but we applied the trained model to Java samples, to see how good the same trained model is at classifying Java samples into smelly and non-smelly code. This is what we found. The trend is similar, but if you compare direct training against transfer learning, you will see that transfer learning does much better: the model trained on C# is better at detecting smells in Java code than on its own C# code, which is kind of surprising, and honestly I am still looking into why it happens, but this is what we got.

So, the first conclusion is that it is feasible to make a deep learning model learn to detect smells; obviously the performance varies with the smell being detected, and we can do much more, but that is a separate matter. Transfer learning is also feasible; we have seen that. What this implies is that if you have a very good tool for one programming
language, then you do not have to invent or write the tool again for other programming languages, at least for similar programming languages: you train the model once and apply it to the other languages. And obviously there are many, many possibilities for improvement: we can improve performance, we can add more smells, different kinds of smells. I believe this is just a start, and many things can be built on top of it.

These are some relevant links, as I promised. You can download all the source code and data here; I made it open source this morning. You can download Designite for C# here; for academic use it is free. DesigniteJava is an open source project, so you can do whatever you want with it. CodeSplit: the Java version is open source, you can download it and play with it, and the C# version you can download free of charge. The tokenizer is another tool that we have offered, again an open source project; feel free to use it. Okay, thank you. If you like stickers, I have some fancy stickers, please feel free.

The first question is: what is the exact input to the convolution layer? As you can see in the architecture, before feeding anything to the deep learning models we convert source code into tokens, so yes, integer tokens; we basically have a NumPy array in which each element is an integer token. In 1D, for CNN 1D, each input sample, for example each method, is a single 1D vector; in 2D, for CNN 2D, each input sample is a 2D matrix.

The next question is: how does the tokenizer work, and what does it remove from the source code? I think it doesn't remove anything except white space. For each token, for example "public", if it is a reserved word there is a fixed token assigned to it; if it is a user-defined symbol, then there is a range of
tokens from which one will be assigned, and if it is a numeric literal, again there is a range of tokens from which it will be assigned.

The next question is how this works on framework-dependent code. I don't see any difference between framework-dependent code and other code, but we have not tested it; this is a very first attempt, so maybe we can do that later. I don't think it will impact anything, because whatever code you provide, the scope of the scan is very limited anyway: for an implementation smell it is just one method, so the model is just looking at that method.

The next question is: what are the challenges if we want to extend this work to more involved, more complex smells, for example scattered functionality? It is difficult, because our goal was not to give processed input to the model; we wanted to give the model input as close to the raw source code as possible. We just tokenize it and remove the duplicates; we didn't do anything else, nothing of what is normally referred to as feature engineering. We intentionally avoided that; it was one of our aims, because this is how we wanted to approach the problem. But for more involved smells, as we have seen with multifaceted abstraction, it's not really easy: the whole characteristic of the smell is not captured, and the model cannot capture it even when given a training set of 10,000 samples. So one possibility is that we need to mix in or introduce relevant features to help the model detect better.

The next question: the accuracy of natural language processing is above 95% right now; when do you consider smell detection useful, what accuracy level is the goal for the industry? So, if I summarize the question: at what level of accuracy is it ready for industry, so that we can say
okay, now it's ready for industry. That's difficult to answer, but in one of this morning's sessions we heard that precision is really important for this kind of analysis: if we have low precision, say less than 70-80%, people will reject the tool outright, so it must have decent precision. And because this is not really life-threatening analysis, if you miss something it's not life-threatening, we can compromise on recall, but not on precision, because if precision is low then the number of false positives is very high, which means people will say "out of 10 reports, only one is useful, I don't want to use it." So a decent target would be at least 85%; there is no golden rule, but something like that.
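The accuracy trap and the precision argument above can be made concrete in a few lines: on a 1%/99% split, a model that always predicts "not smelly" scores 99% accuracy while finding zero smells, which is why precision, recall, and F1 are the measures that matter here. The counts are illustrative.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 when both are undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "Always negative" classifier on 100 smelly / 9,900 clean samples:
tp, fp, tn, fn = 0, 0, 9_900, 100
majority_acc = accuracy(tp, fp, tn, fn)  # impressive-looking 0.99
majority_f1 = f1(tp, fp, fn)             # 0.0 — it found nothing
```

The same `f1` helper also shows why low precision sinks the score even with decent recall, which matches the talk's observation that abundant false positives were what kept F1 low.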