Okay, so, cool. We're ready for the next talk: Vadim, head of the machine learning team at source{d}. He's going to be talking about what they're doing. Let's welcome him.

Thank you. I'm very happy to speak at FOSDEM today; it's my first time. I'm Vadim, the lead of the machine learning team at source{d}, and also a Google Developer Expert in machine learning. I want to talk today about how we apply machine learning to source code, and in particular how we fix problems with source code during code review.

The plan of my talk is as follows. First, I will explain why we decided to build products on top of assisted code review rather than anything else; there are so many ways we could help developers write source code, so why did we choose assisted code review? Second, the platform we are developing to do assisted code review, which is called Lookout. Third, the software development kit: since everything is open source, we want to make creating new source code analyses for code review as simple as possible, and that's impossible without an SDK; with it, everybody can easily create a new analysis and build a bot for code review, which is awesome. The fourth part is a demonstration of how one particular analysis works. We call it style-analyzer, and it tries to fix formatting problems in source code by mining rules from an existing code base; we don't use anything predefined, and you don't have to configure it. The last part is my explanation of how style-analyzer works: how the machine learning works, what the challenges and problems were, and how we are solving them.

Let me start from the beginning, from the origins of assisted code review. I ran a few queries on GitHub, searching for some very typical phrases which I think are relevant to software development, and I looked at how many issues, pull requests, and commits carry these messages.

The first one is "please use single quotes". As you know, some languages do not really distinguish between single and double quotes; you can use both, and it's entirely up to you which type you choose. It's written in your style guide, or not; it's not required by the language itself. Still, more than 1,000 issues and pull requests are devoted to people asking to change the type of quotes, just because the project they contribute to adopts one specific quoting style.

The second query is indentation. As you know, we can indent with spaces or with tabs. Some people think tabs are better, some think spaces are better. Either way, if a project is consistent, say it uses spaces everywhere, and you get a pull request that uses tabs, you ask for spaces just because that's what the project uses. More than 10,000 pull requests; I checked, and in the majority of them people ask for space indentation.

The same goes for list comprehensions, a feature in Python. Instead of writing a for loop, you can do it compactly in square brackets on a single line; the syntactic sugar is nice, so people tend to use it instead of for loops. More than 2,600 times, people asked to use a list comprehension instead of a for loop in Python. I cannot say this really changes the program logic; of course it doesn't, and I think it runs at the same speed, but it's the way you write source code, right? It's your style of writing.
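To make that last example concrete, here is the kind of rewrite those 2,600 comments ask for; the names and numbers are made up for illustration:

```python
numbers = range(10)

# A plain for loop collecting the squares of even numbers...
squares = []
for n in numbers:
    if n % 2 == 0:
        squares.append(n * n)

# ...and the equivalent one-line list comprehension reviewers ask for.
squares = [n * n for n in numbers if n % 2 == 0]
```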
Splitting a function. Functions can be very complex; they can occupy more than 100 or 200 lines, and nobody likes to read them because it's hard. The essential refactoring here is to split the function into several parts and make it more comprehensible. More than 8,000 times people asked to split a function just because they thought it should be done. Again, it doesn't really change the logic much, but it's your style; it's an example of how you enforce good style on a project.

The same for names. You can name things differently, because there are two hard problems in software development: naming things and cache invalidation. Naming things is hard, but if you are the maintainer of an open source project, most likely you have a better idea of how entities should be named than a contributor. So when you see an opportunity to improve naming, you ask for things to be renamed. More than 9,000 times people asked for this.

Adding a comment is also very common: 48,000 times people asked to add a comment, just because they didn't understand how a function works, or they wanted to help other people understand it more quickly.

The last one is my favorite: fixing a typo. We are all human, we all make mistakes; we misspell identifiers, we misspell function names. Can you imagine, 910,000 comments are actually typo corrections. It's a huge number, and we have to do something about it, because seriously, that's just too much.

So, to summarize everything I just queried on GitHub: we spend a lot of effort enforcing boring things. Enforcing typo corrections, well-sized functions, list comprehensions, or indentation is boring. You don't want to spend your working time enforcing these things; you want to work on the algorithm, and you want to spend your code review reviewing the logic, not how the code is written.

So we should do something about it, right? The good thing is that if something is boring, there is an expectation of how it should be fixed; that is, by definition, why it is boring. If it were different every time, it probably wouldn't be boring. So there is an instruction you can follow, a specific manual, a specific set of steps, and you can automate it. The problem is: how are we going to automate this style enforcement in source code? That is quite hard. Also, programming is still an art, and you cannot really write a bot that writes code for you. So "automatable" here does not mean unattended: you still need human supervision, you still have to watch how your automation behaves. This is basically the idea behind assisted code review. You don't try to replace a human and do all the work for her or him; you want to help, to point at some problems, or to suggest fixes for the boring stuff.

Speaking about how we can help: we can do it in an IDE, and this is what Miltos described in his talk this morning. It's a very nice way of helping, because you fix the majority of problems while you are typing in the IDE, by definition. The problem is that there are many IDEs, and if you want to ship a specific analysis, you have to do it for all of them at once, which is a lot of work. The second problem is that people expect immediate feedback.
So, your analysis has to be really fast, and if you want to train at the same time, that's hard: you have milliseconds, at most seconds, you don't have hours to train. So we decided not to follow the IDE approach and to do something else. You can also deploy checks through CI, or even run them periodically, say in Jenkins or in cron; that's possible too, but the problem is that there is no user interface: you cannot easily show people that something is wrong and how to fix it. Now, GitHub recently added a very nice feature called suggested changes, and this was a game changer for us, because we can suggest code fixes during code review in pull requests, and with a single click of a button you can apply the change, which is awesome. So we decided to do assisted code review.

There is a very nice blog post, from Codacy if I remember correctly, and it contains a really nice GIF. The author queried some specific phrases on GitHub, like I did, but more extensively, and he found that 20% of code review comments are actually about the boring stuff: about style and best practices. I think this number is probably too optimistic; the real one is likely even higher, especially when you start working on a project and enter the "norming" stage of team formation, when people tend to argue about code style and the share of such comments grows. But anyway, it's at least 20%, which is huge. We want to cut away this 20% and leave the remaining 80% to humans.

So, moving on to Lookout, our platform for assisted code review. It provides tight integration with GitHub and is completely language agnostic: it doesn't care which language you analyze, and it doesn't care which language you use to write your analysis. It takes care of all the abstraction: dealing with GitHub, parsing source code, talking to the different parts of the system. You don't have to deal with any of that; you just write the algorithm which does the code analysis, and that's it. Comments are posted automatically.

This is the overall architecture. Lookout basically consists of two parts: the first is the server, and the second is your analysis, and there can be several. They register through remote procedure calls and can be written in any language gRPC supports, which means all the popular ones. Whenever something happens on GitHub, a push event or a contributor creating a pull request, Lookout finds out through the GitHub API and triggers specific functions in your analyses, which, by the way, we call analyzers, which is pretty straightforward. When they detect problems, they report them back to the server, and the Lookout server in turn posts them as comments on GitHub through the API.

In a bit more detail: on a push event, the Lookout server gets a notification and sends requests to the analyzers. The analyzers then ask for data, because the initial requests contain only Git metadata: you don't have the contents of the source files, you don't have abstract syntax trees, you don't have any information you can actually use. So data requests go back to the server, which asks Babelfish to parse the source code as needed. Babelfish is another open source project source{d} develops: it parses source code in a uniform way and expresses the ASTs in a universal format. Whether it's C#, Python, or JavaScript doesn't really matter; they all parse into the same universal abstract syntax tree, which is good because you only have to write your code once. Of course, underneath, the trees are different for every language; we cannot do anything about that, but at least you can use the same functions and work with all the languages through the same API.
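For a feel of what that looks like in practice, here is a minimal sketch using the open source Babelfish Python client. It assumes a Babelfish server running locally on the default port, and the client API has changed between versions, so treat this as a sketch rather than the exact current API:

```python
# Minimal sketch: parse one file into a universal AST with Babelfish.
# Assumes `pip install bblfsh` and a Babelfish server on localhost:9432.
import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
response = client.parse("example.js")  # language is detected automatically
print(response.uast)                   # same node schema for every language
```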
Finally, the server also goes to Git and fetches the raw contents of all the files involved. It then sends all this data back to the analyzers; they do something with it, machine learning or not, it's really up to you, they can be purely rule based, why not. Then they report a status back; it's a push event, so we don't expect anything in return, and that's it. For a pull request, the analyzers respond with comments containing fixes, whatever they found problematic in the pull request, and the Lookout server aggregates them and sends the comments back to GitHub. So it's fairly simple, and if you want to read more about how Lookout works, go to docs.sourced.tech/lookout; there is extensive documentation there, which I really recommend studying.

Now, on to the software development kit. These analyzers, as I said, can be written in any language, but if you want to write something low level that just talks to the Lookout server through gRPC, you can indeed use any language, and then you will have to solve many problems on your own. If your analyzer is stateful, meaning it has state: on a push event you do something, you save identifiers, or you train and save your machine learning model, and you have to store it somewhere; if you have many repositories, you have to organize all of that. You have to deal with it yourself, and this is why we have a higher level SDK, lookout-sdk-ml, which is written in Python and takes all these problems away from the developer.

A few words about the SDK. It provides a low level API in Go and in Python, and you can generate bindings for other languages, that's no problem. lookout-sdk-ml supports stateful analyzers and is written in Python. It's also integrated with the rest of the source{d} ML-on-code ecosystem: we have a few projects written in Python for code analysis and for doing ML on code, and it works seamlessly with them, because everything is in Python and everything is integrated. As a rule of thumb: if you know Python, just use lookout-sdk-ml and don't bother with anything else.

Okay, I'm going to skip the high level API because there's not much time left and I want to demonstrate how style-analyzer works. In brief, you implement two functions, train and analyze, and everything works. You implement two functions in Python, and that's it; you don't care about anything else, and your comments get posted to GitHub, which is cool. Behind the scenes, the SDK handles the gRPC communication, pooling, threading, and load balancing; it maintains the database with trained models, it maintains caches with your models, it does logging, metrics collection, and many, many other things which you really don't want to deal with yourself.
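To illustrate that two-function contract, here is a schematic analyzer in plain Python. Plain types stand in for the SDK's real ones; lookout-sdk-ml's actual base classes, signatures, and comment type differ in detail, so every name below is an assumption, not the SDK's verbatim API:

```python
# Schematic only: plain Python standing in for lookout-sdk-ml's types.
from typing import Dict, List, NamedTuple


class Comment(NamedTuple):  # stand-in for the SDK's comment type
    file: str
    line: int
    text: str


class IndentationAnalyzer:
    """Toy analyzer: learns tab vs. space indentation on push events,
    then flags the opposite style in pull requests."""

    @classmethod
    def train(cls, files: Dict[str, str]) -> dict:
        # Push event: mine the repository and return a serializable model.
        lines = [ln for text in files.values() for ln in text.splitlines()]
        tabs = sum(ln.startswith("\t") for ln in lines)
        spaces = sum(ln.startswith(" ") for ln in lines)
        return {"indent": "\t" if tabs > spaces else " "}

    def analyze(self, model: dict, changed: Dict[str, str]) -> List[Comment]:
        # Pull request: compare changed files against the trained model.
        wrong = " " if model["indent"] == "\t" else "\t"
        comments = []
        for path, text in changed.items():
            for i, ln in enumerate(text.splitlines(), 1):
                if ln.startswith(wrong):
                    comments.append(Comment(
                        path, i, "Please match the repository's indentation."))
        return comments
```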
Okay, I'm moving to the demo. style-analyzer, as I said, is our analyzer for Lookout which tries to fix formatting mistakes mined from the code base itself. I'm going to GitHub; I have a personal fork of jQuery. style-analyzer only works with JavaScript for now, so I took a well-known JavaScript project, even if it's a bit old. I change some code here and create a pull request...

Well, I have some problems with my internet; it switched to 2G. We tried three different connections and none of them worked, which is a pity. So approach me later and I will show you how it works eventually. What I wanted to demonstrate is how style-analyzer fixes formatting problems in a jQuery pull request. The thing is, jQuery puts spaces inside the parentheses of function calls, and it does so consistently across all 168 JavaScript files. If you contribute something to my personal jQuery fork with these spaces removed, you get an automatic suggestion, in GitHub's suggested-change format, to insert the whitespace, which is cool.

So, here is how everything would work if I had internet. It works in two stages: the first stage is training a model, and the second is inference with the model you just trained. Regarding training, the plan is as follows. First you parse all JavaScript source files into an intermediate representation. Then you train a forest of decision trees on top of this representation. Then you extract production rules from the forest; the rules predict, at each spot between each pair of tokens, whether anything should be inserted or changed, and so on.

Our representation is very similar to the one Miltos used in his talk just before mine. We also parse into a sequential token stream, and it is augmented with AST information. For example, in "a = b * 2", the token "a" corresponds to the identifier node in the universal abstract syntax tree, and the whitespace corresponds to a higher level node, the assignment expression, and so on. We don't have any data-flow links here: first, we cannot really compile JavaScript, so we don't have that information; and second, we only parse, we don't do deeper analysis, because this has to work across all possible projects.

Once you extract features for each virtual node, you also add features from a few immediate parents, and then you feed everything into a random forest. You do feature selection, you do hyperparameter optimization, and you split the data 80% for training and 20% for validation, which is a very common way of doing it. You end up with many, many decision trees in your random forest, and then you extract production rules. Each rule follows a branch in a tree: each node in the tree contains an attribute comparison, so you join all the attribute comparisons along the branch with logical ANDs, and that is your rule. In this example tree, you would extract four rules. Each leaf corresponds to a probability distribution over the classes you predict, and you simply pick the most probable class.
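The rule extraction step can be sketched with scikit-learn: every root-to-leaf path of a fitted decision tree becomes one AND-joined rule. The features and data below are invented for illustration; style-analyzer's real pipeline is considerably more involved:

```python
# Toy illustration: turn a fitted decision tree into production rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.randint(0, 5, size=(200, 3))   # made-up token features
y = (X[:, 0] > 2).astype(int)                # made-up labels: insert space or not
feature_names = ["token_type", "parent_node", "line_offset"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

def extract_rules(node=0, conditions=()):
    """Every root-to-leaf path becomes one rule: comparisons joined by AND."""
    if tree.children_left[node] == -1:       # leaf: emit the rule and its class
        klass = int(np.argmax(tree.value[node]))
        print(" and ".join(conditions) or "always", "=>", klass)
        return
    name = feature_names[tree.feature[node]]
    thr = tree.threshold[node]
    extract_rules(tree.children_left[node], conditions + (f"{name} <= {thr:.1f}",))
    extract_rules(tree.children_right[node], conditions + (f"{name} > {thr:.1f}",))

extract_rules()
```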
However, if you don't do anything afterwards, you end up with tens of thousands of rules, and your model isn't interpretable: you cannot easily comprehend tens of thousands of rules, and, as Miltos also said, it's very important to explain to people why you make a decision. If my internet worked, you would see that when a rule is triggered you also get a hash, and you can use this hash to visualize the rule and understand why a particular formatting change was suggested.

So we optimize the rules in three steps. The first is merging attribute comparisons which refer to the same variable, the same feature. The second step is removing redundant comparisons, the ones which appear in a rule just by chance: they don't really influence anything, but a decision tree is trained in such a way that some attribute comparisons end up redundant; they always trigger together and add no value. The third step is removing attributes which are related to each other through feature logic, because some features are tightly coupled. For example, a rule can assert that a reserved-keyword value is an equals sign and also assert that it is not a semicolon; of course, if it's an equals sign it cannot be a semicolon at the same time, so the second comparison is redundant and you can throw it away. You end up with relatively short rules, with 50% fewer items, and the number of rules also shrinks, because after this simplification many rules alias to the same one. Finally, you throw away rules which are not confident enough and too noisy, because it's also important to be precise and not introduce noise.
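As an illustration of the first step, here is a toy merge of comparisons that test the same feature, keeping only the tightest bound in each direction; the rule representation here is made up for the sketch:

```python
def merge_rule(conditions):
    """conditions: list of (feature, op, threshold) with op in {'<=', '>'}.
    Keep only the tightest bound per feature and direction."""
    upper, lower = {}, {}                    # '<=' bounds and '>' bounds
    for feat, op, thr in conditions:
        if op == "<=":
            upper[feat] = min(thr, upper.get(feat, thr))
        else:
            lower[feat] = max(thr, lower.get(feat, thr))
    return ([(f, "<=", t) for f, t in upper.items()] +
            [(f, ">", t) for f, t in lower.items()])

rule = [("line_offset", ">", 2), ("line_offset", ">", 4), ("token_type", "<=", 1)]
print(merge_rule(rule))  # [('token_type', '<=', 1), ('line_offset', '>', 4)]
```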
Inference looks easier at first glance: you just apply the rules, and when you see a violation, a token predicted by a rule which doesn't match the actual token in the source code, you generate a code suggestion. But it turns out this part is no less challenging than the training itself. For example, if a rule would fix something in old code which already exists, you should not apply it: developers are not happy when you fix something that is already written and doesn't belong to the pull request, so you have to exclude those. A change can also break the AST: you could remove the whitespace between two identifiers, the identifiers concatenate, the AST explodes, and the code simply doesn't work. You don't want such changes, so you have to filter them out. And for code generation you need to solve many problems too; the most interesting one is indentation. Imagine you predict an indentation change in a code block with several lines: should you change the indentation of all of them, or only one? It really depends, so you have to be smart about indentation for each line in each case. It's a lot of effort to ensure everything works correctly.

And you need to favor precision over recall. This means you should never make mistakes, even at the expense of missing some potential problems and not reporting them, because otherwise nobody will use your tool: it's too noisy, even if it covers all possible fixes. Instead of recall, we actually measure the prediction rate, which is how often we attempt a prediction at all, over all positions in the ground truth. Recall only accounts for true predictions; here we account for all predictions. So this metric indicates your trade-off: how much precision you buy by skipping some potentially important fixes.

This is our evaluation on the validation set: we reach a 95% weighted average on a dataset of 19 repositories. It looks good, apart from a few small JavaScript repositories where it works badly; they are too small, and you cannot infer good rules from such a small amount of source code, which is understandable. The problem with this evaluation is that it doesn't test how users interact with the system, and that is important: validation accuracy has little to do with real usage. So we have another approach to evaluation, where we added 170 handcrafted formatting errors to JavaScript projects and checked how well we fix them. We get 95% precision at a 50% prediction rate: we miss 50% of the problems, but we are very precise at fixing what is left. Again, we don't really test real behavior here, because the formatting changes are artificial. The true way of evaluating a formatting fixer would be, for example, to take users, ask them to work with the system, and measure metrics from their real work. We cannot do that yet, because we are still launching the product. Another idea is to infer formatting fixes from existing commits: there are many commits in JavaScript projects; apply style-analyzer to them and see how it performs. We haven't done that so far, but I think it's the best way to evaluate. We also need to extend the current evaluation: add more handcrafted changes, and there is an idea to introduce random mutations into JavaScript source files and see how much we can fix. It's not realistic, but it's a nice indication of robustness.

So, I'm passing over to the summary. How much time do I have? One minute, perfect. To summarize: Lookout is a very nice platform for assisted code review; as I said, implementing just two functions in Python lets you write a fully functional analyzer for code review, which is really great. style-analyzer is really fun, and it's an ongoing project at source{d}: we are still fixing last-minute bugs, but it already has a first release, so you can try it out. Everything is open source, it's FOSDEM. And the third point: ML on code is really cool, and if you are not doing it, you should consider trying it. That's all. These are the links to GitHub which I mentioned, and to our blog, where we post really nice articles from time to time; you can subscribe to our newsletter and follow us on Twitter. Thank you very much.

Yes? Do you have any estimate of how accurate it would need to be? How do I estimate what, sorry? So, the question is how accurate style-analyzer needs to be so that it is usable in production. There is a very nice paper by Google engineers where they explain how they tried to do assisted code review at Google scale, and the insight was that the false positive rate should be at most 5%, in other words at least 95% precision. Otherwise it's just not usable; it has too many false positives. Thank you.