usual way. Today I'm going to talk about one of the projects we do at JetBrains Research, and as I think you've all figured out, it will be about Kotlin and Kotlin code anomalies. I'm not here to officially promote Kotlin, we have specially trained people for that, but we do need to understand what we're looking at. So, another show of hands: how many of you have tried Kotlin? Whoa, this is awesome.

Anyway, a brief introduction. Kotlin is a general-purpose, statically typed programming language that combines object-oriented and functional features. It was designed to interoperate fully with Java; it originally targeted the JVM and Android platforms, but now it can also be compiled to JavaScript and even to native code via LLVM. It's open source, and it's relatively young but has an active and continuously growing community. At JetBrains we naturally focus on the tooling around this language.

For most people Kotlin looks like a more streamlined version of Java, with its extension functions, coroutines, properties, nullability analysis and other features, and when people use these features their code generally becomes cleaner and more concise. But there are always people who do things differently. For some reason there exists an implementation of a Forth compiler in Kotlin; you know, that stack-based language from the 70s. Who knows why it was written, but as GitHub states, it's written in Kotlin, 100%, and sometimes it doesn't look like Kotlin at all.

We call such code fragments anomalies in the sense that they're good code: they're syntactically correct and may even work very well, but they don't look like the code other people write in this language. These code fragments are of great interest to language developers, because they can reveal previously unnoticed compiler bugs, highlight compiler performance issues, or even give hints on how to improve the language further. The informal task description we got from the Kotlin developer team was: take all the Kotlin code in the world and bring us some weird-looking programs. As in risk management, it's pretty hard to plan for known unknowns, and it's much harder to plan for unknown unknowns, when you don't know what you're looking for.

Before we clarify what this could mean and how to achieve it, let's see what has already been done in this field. If you google "code anomaly detection", you find something like the papers listed here. The first two are based on static analysis. The first presents the GrouMiner tool, which tries to detect anomalous object interactions: it takes the code, builds a directed acyclic graph of constructor and method calls and their dependencies, and applies graph anomaly detection techniques to find atypical areas in that graph. The second uses a somewhat similar idea, but builds usage models of objects from their sequences of method calls, and also applies graph-based anomaly detection techniques. These approaches are very helpful if you're trying to find bugs in your programs, so they basically target language users, not language developers as we intend. The next papers are based on dynamic analysis. The DIDUCE tool runs your program, stores every value of every expression it encounters, and tries to induce invariant rules; when these rules are violated, for example when some expression gets a value that differs a lot from all its previous values, that's considered a candidate anomaly.
The last one here also runs the program, but collects traces of system calls instead. Nevertheless, we have a huge dataset in mind, projects have all kinds of weird dependencies, and who knows how they are supposed to be run, so I think we should limit ourselves to static analysis only.

OK, getting back to the task at hand: what should we analyze? Originally we targeted regular Kotlin that runs on the JVM, so we have the source code and the bytecode produced from it. We should analyze both, because analyzing source code gives us patterns of incorrect language use, while analyzing bytecode provides us with compiler issues. Best of all, we can combine these analyses: for example, we could search for code fragments that were not anomalous in the source code representation but became anomalous in the bytecode representation, and that's clearly an issue of some kind.

Next, at what level should we look at the code? Obviously, looking at single operators or lines of code doesn't make sense, because they don't capture structures complex enough to form an anomaly. Functions seem like a good choice: they're large enough to contain code that can form an anomaly, but small enough to represent one single operation on a class. Classes also seem like a good choice if you want to look for anomalies in inheritance, in function signatures, or in control flow, for example overly long chains of function calls. Files could be used if you want to search for anomalies in class interaction. Projects seem too large and too domain-specific: you can't really analyze a project without knowing what it was created for.

So, moving on to how. There's data science, and there are anomaly detection techniques. We could set up a standard task of anomaly detection on vectorized data, and then, once we get some anomalies, somehow classify them by type, which is another challenging task. Speaking about code representation, it's a really hot research topic right now, but basically all approaches fall into two categories. The first is explicit features: essentially software metrics, like the height of the AST or cohesion and coupling metrics, plus some natural language processing features like bag of words and its derivatives. These features are very descriptive: you just look at the vector values you've got, and almost always you have a good hypothesis about what's wrong with a piece of code and why it was considered an anomaly, which is very good (I'll show a small sketch of computing such features in a moment). But these features are hard to choose. First of all, software metrics are a very gray area, because they mostly rely on opinions about what good code is, and that's highly subjective. They're also hard to choose when you don't know what you're looking for, because with metrics you have to specify precisely what you are looking for. There's also the path-based representations approach, which basically traverses the syntax tree, collects the node types encountered along paths, and then uses these paths for further analysis.

The other category is implicit features, which mostly means n-grams, various kinds of neural network processing, AST hashing, and different kinds of distributed representations. They obviously lack expressiveness: you get a vector of numbers, and you can't really tell what it means. But they can capture properties that were not obvious beforehand, so they can be very useful too.
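To give you an idea of what explicit features look like in practice, here is a minimal sketch in Python; the `Node` class is hypothetical, standing in for our serialized syntax trees, and the two metrics shown are just illustrative examples of such features:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hypothetical node of a serialized syntax tree."""
    type: str
    children: list = field(default_factory=list)

def tree_height(node: Node) -> int:
    # Height of the syntax tree, one classic explicit feature.
    return 1 + max((tree_height(c) for c in node.children), default=0)

def node_count(node: Node) -> int:
    # Total number of nodes, another simple explicit feature.
    return 1 + sum(node_count(c) for c in node.children)

def feature_vector(function_tree: Node) -> list:
    # Each function becomes a fixed-length vector of such metrics.
    return [tree_height(function_tree), node_count(function_tree)]
```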
Speaking about the classic anomaly detection task: you can actually do a lot when you know things about your data. Our data is all the Kotlin code in the world, and you can hardly make any assumptions about it; you don't even know its distribution type. So we have to be really careful when applying these techniques; here's a brief overview. The first two techniques listed here are very popular outlier detection techniques. They're good because they assign an anomaly score to each object they classify instead of just performing binary classification, and they make no assumptions about the distribution of the data, which suits our case. Some clustering algorithms, including the ones listed here, can also be used for anomaly detection, since they don't require that every element be put into some cluster.

Autoencoder neural networks are really fun too. Basically, an autoencoder is a neural network that has the same data as input and output, and through its hidden layers it tries to learn the identity function. The first part of the autoencoder is the encoder, which reduces the dimensionality of your data into a hidden layer of values, and the second part is the decoder, which tries to reconstruct your data from those reduced dimensions. If you observe a large reconstruction error, meaning that the reconstructed value differs a lot from the actual value, that's an anomaly candidate; it's as simple as that.

There's also one technique here that is not unsupervised. It's semi-supervised, meaning you need some labeled data to train the classifier, but it's also used for outlier detection: basically, once we get some results, we can pass them to this one-class SVM and get more results. So there's a number of algorithms available to solve the task.

Now let's talk about what we have done so far. The workflow is pretty straightforward. We get code from GitHub via the GitHub API. We cloned the compiler and changed it a bit so that it serializes all the syntax trees it builds. We run feature calculation on these trees, run the algorithms on these features, look at the results, and repeat. At some point we take all the anomalies we've got, go to the Kotlin developer team, show them, and see whether we need to change course.

Today I'll talk about three experiments we performed last year. The data is different now; over the last year, for example, the amount of Kotlin code on GitHub has doubled. Anyway, we fetched all repositories that stated Kotlin as their main programming language, were not forks of other repositories, and were created before March. After removing duplicates, that left us with roughly 930,000 files containing about four million functions. In these initial experiments we decided to stick with functions as our analysis level; that was a design choice.

The first experiment was very straightforward. We used explicit features and collected 51 of them, describing different aspects of the code, as you can see here. So for each of the four million functions we get a vector of 51 numbers, which we reduced to 20 using PCA, a dimensionality reduction algorithm; that kept most of the variance we needed while giving a much friendlier computation time. We ran Local Outlier Factor and Isolation Forest on it.
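Before I get to the results, a quick aside: that one-class SVM step could look something like this minimal sketch with scikit-learn, assuming `normal_vectors` holds feature vectors for functions we've already labeled as ordinary code; the random matrices are just placeholders:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder for feature vectors of functions labeled as "normal".
normal_vectors = np.random.rand(1000, 20)

# Train the one-class SVM on normal data only.
clf = OneClassSVM(kernel="rbf", nu=0.01).fit(normal_vectors)

# On unseen functions, -1 marks a candidate anomaly, +1 normal code.
new_vectors = np.random.rand(100, 20)
candidates = new_vectors[clf.predict(new_vectors) == -1]
```

And here is a minimal sketch of the first experiment's pipeline with scikit-learn; the feature matrix is a random placeholder, and the code illustrates the approach rather than reproducing our actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# Placeholder: in reality, one 51-dimensional vector per function.
X = np.random.rand(10_000, 51)

# Reduce 51 features to 20 dimensions, keeping most of the variance.
X_reduced = PCA(n_components=20).fit_transform(X)

# contamination = the assumed proportion of outliers in the dataset,
# here one hundredth of a percent, to keep the output human-sized.
contamination = 0.0001
lof = LocalOutlierFactor(contamination=contamination).fit_predict(X_reduced)
iso = IsolationForest(contamination=contamination).fit_predict(X_reduced)

# Both algorithms mark a candidate anomaly with -1.
candidates = np.where((lof == -1) | (iso == -1))[0]
```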
The contamination parameter here, the one set in the sketch above, is basically the assumed proportion of outliers in the dataset, and choosing it is another design decision, because we have no idea how many anomalies there are. So we just set it to some small number, one hundredth of a percent, to keep the results observable by humans. One hundredth of a percent of four million is about four hundred functions, which we manually reviewed, selecting 322 anomalies that could be of interest to language developers.

The second experiment used implicit features: n-grams, namely unigrams, bigrams and trigrams, over the syntax tree. This simple image shows bigrams and trigrams on a syntax tree; the idea is quite similar to the path-based representations, but simpler. We used an autoencoder neural network to detect outliers here, a very simple architecture with only one hidden layer; we experimented with the compression rate of that hidden layer and got around 360 anomalies.

We brought the anomalies from both experiments together, removed the duplicates, and still had a lot of anomalies that we weren't sure were useful or not. We looked through them and manually labeled them, which gave us 23 types. We created a simple web interface that allowed Kotlin team developers to rank these anomalies one by one, from one to five, and 12 of the 23 types were considered very useful, meaning a rank of four or five. In this table, the E1 column is experiment one, E2 is experiment two, and R is the rank. As you can see, most of the anomalies considered useful were "lots of something". That's not as fancy and weird as we anticipated, but the developers still found these anomalies useful and use them in their tests.

Some examples. This is a function with a when expression that has about 120 case branches; I sincerely hope it was generated automatically. This one is a function with 22 generic type parameters, and it's actually very useful: our recent finding was that the compiler does a lot of work around generics, and in some really complex cases type inference can even reach exponential complexity, so code examples like this are very valuable for tests. This is a weird test function; the Forth compiler I showed before also fell into this strange-code-constructs group. And this one is great, because it allowed us to file a bug against the parser: these obviously incorrect 400-plus lines of code actually break the parser with a stack overflow error. So we filed a bug.

The third experiment features both static source code analysis and bytecode analysis. We still don't want to compile any code ourselves, but we need bytecode, so what do we do? We created a tool that crawls GitHub and tries to find released JAR files for projects. It downloads the files, grabs metadata from them, and tries to find the source code for those packages. That way we managed to collect around 40,000 source files together with their bytecode, so we can compare the anomalies. The approach was the same, using n-grams and the autoencoder network, and we looked for functions that were anomalous in one representation but not in the other. That left us with 38 conditional anomalies; the example I'll show you is a 10-line function that turns into about 4,500 bytecode instructions. As it turns out, this is not actually a compiler bug; it's a framework issue, or rather not a bug but a design feature. I'll explain it in a second, but first let me make experiments two and three a bit more concrete with a couple of sketches.
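First, the n-grams from the second experiment. Here is a minimal sketch that counts node-type n-grams along root-to-leaf paths, reusing the hypothetical `Node` class from the earlier sketch; laying n-grams over vertical paths like this is just one illustrative choice, not necessarily exactly what we did:

```python
from collections import Counter

def tree_ngrams(node, n=2, prefix=()):
    """Count node-type n-grams along root-to-leaf paths of a syntax tree."""
    path = prefix + (node.type,)
    grams = Counter()
    if len(path) >= n:
        grams[path[-n:]] += 1  # the last n node types on this path
    for child in node.children:
        grams += tree_ngrams(child, n, path)
    return grams

# Each function is then represented by its n-gram counts; for example,
# a bigram like ("WhenExpression", "WhenEntry") is counted once per branch.
```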
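Next, the autoencoder. In this minimal sketch I use scikit-learn's MLPRegressor as a stand-in for our actual network: one hidden layer trained to reproduce its own input, with the data being a placeholder:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder for per-function n-gram count vectors.
X = np.random.rand(10_000, 300)

# One hidden layer trained to reproduce its input (the identity function);
# its size is the "compression rate" we experimented with.
autoencoder = MLPRegressor(hidden_layer_sizes=(30,), max_iter=500)
autoencoder.fit(X, X)

# Reconstruction error per function: the higher it is, the worse the
# function compresses, i.e. the more anomalous it looks.
errors = np.mean((autoencoder.predict(X) - X) ** 2, axis=1)
candidates = np.argsort(errors)[-360:]  # the tail with the largest errors
```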
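Finally, the conditional anomalies of the third experiment boil down to a set difference over two such runs; a sketch, assuming `source_candidates` and `bytecode_candidates` are sets of function identifiers produced by a pipeline like the one above on each representation:

```python
# Functions that look ordinary in source code but anomalous in bytecode:
conditional = bytecode_candidates - source_candidates
```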
So, about that example: someone wrote a very complex bind function that was marked inline, and someone else wrote code that applied it nine times in a row, so each application was inlined and it resulted in huge bytecode. That's an interesting example, because looking at the source code by itself you would hardly guess it's weird code of any kind.

Those were our initial experiments. Now we are trying different algorithms for anomaly detection and different code representations. We finally have some labeled data, so we are free to use semi-supervised learning, like active learning or other fancy stuff. A lot of our pipeline is still done by hand, for example the clustering and labeling of the obtained anomalies, and that should definitely be automated. We can look at different structural levels: not functions, but classes. On the other hand, we can look at feature-specific anomalies, for example anomalies in loops or in function signatures. We could look for anomalies in object interactions and try some other ideas presented in the papers I mentioned before. We can go deeper into the compiler and see which optimizations produce anomalies of any kind. And there's always Kotlin for Android, Kotlin/Native and Kotlin/JS, each of which is a completely different world with different anomalies.

To sum up: even our first, very straightforward experiments produced useful results. Our work is open-sourced on GitHub, and here is our research group page; if you are in any way interested in our work, feel free to drop us a message. Thank you very much for listening.

We have a question here: whether we tried other algorithms instead of autoencoders. For implicit features, we haven't. Basically, we had a lot of things to try, and we wanted to see whether any of it works at all: when we got the task description, "just find the anomalies", we weren't sure we would succeed at all, so we tried this and this and this and got something working. Now we are taking a more scientific approach, trying different algorithms on the same data and comparing them, but I'm not sure I can talk about that yet.

Other questions? Yeah, the question is whether anything we found so far has influenced language design decisions. I'm not sure, that's the short answer. The more detailed answer is that you can't really see clearly what makes the people who take those decisions act. We have presented several anomalies that could lead to some decisions, but I'm not sure whether they will be taken into account. We have already had some positive feedback on some of the anomalies; I hope they will be used, but that takes time and some process, and it doesn't happen fast. Thank you. Thank you.