Well, we have source code, which is an interesting medium, because it communicates to two different audiences. It's bimodal, as they say. One modality communicates to a human audience, the person who is going to read it and write it, and the other communicates to a machine audience, the computer that is going to execute it. And in the scientific literature there is a hypothesis called the naturalness hypothesis, which we were actually verifying, that says that because of this bimodal nature of communication, corpora of source code must exhibit statistical properties similar to corpora of natural language text, and so natural language processing tools and models should apply to them.

Ah, we've got the slides now, great, it's going to be more interesting this way.

So we're going to skip some things, but the idea of the talk is exactly that: the motivation, then the details of one piece of research, the technology stack it took to conduct research like that, and the next research that we did this year.

The motivation is really these data points. You may feel it intuitively, but here are measurements telling you how code bases behave over time. You can see Subversion and Git: although they are not big projects, both are growing in size. And there are GNOME and GTK, Mozilla Firefox and Chromium. These are all plots of the number of lines of code every year, and you can see it growing. You might even say it is growing exponentially, because as you can see it is doubling every five to seven years, and this is just the early stage of that growth. The same goes for operating system kernels, and those are all open source projects. In the closed source world, inside companies, code bases are actually an order of magnitude bigger. That means that all of us, in the open source world and in the closed source world already, are going to hit the problem of how to manage that amount of source code. A code base of two billion lines of code: how would you even understand what it does and what its components are? With current tools it is a very, very laborious process.

So we need better tools, and better tools in all aspects: testing, writing, reading, navigating, discovering that inside your code base you actually have similar libraries doing similar things, which is not trivial. And even non-code aspects, like legal, how to verify that you didn't bring an incompatible license into one of your two billion lines, or hiring, as in the example I gave of the company's previous business model, sourcing candidates based on the number of their open source contributions. We need help with all aspects of that, and that is the motivation for exploring the space of applying machine learning to source code.

By looking at code as data and taking this naturalness hypothesis, we were able to show that it actually makes sense on one specific task: project similarity, or code base similarity. That is the topic model research I was telling you about. The idea is that, in this case, every word is an identifier inside the code base, and the document is the whole repository. That way you can build a topic model for every repository, which gives you a distribution of topics over the document. The assumption is that if two documents are about similar topics, they must be similar, so you recommend the nearest vectors in that topic space.
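To make that concrete, here is a minimal sketch of the setup. This is not our actual pipeline, just an illustration using gensim's LDA implementation, with made-up repository names and identifier lists:

```python
# Sketch only: each repository is a "document" whose "words" are its identifiers.
# The repository names and identifier lists below are invented for illustration.
from gensim import corpora, models, similarities

repos = {
    "net-lib": ["socket", "listen", "accept", "buffer", "packet"],
    "dl-lib": ["tensor", "gradient", "layer", "optimizer", "loss"],
    "proxy": ["connection", "socket", "send", "recv", "packet"],
}

dictionary = corpora.Dictionary(repos.values())
bow = [dictionary.doc2bow(tokens) for tokens in repos.values()]

# num_topics is the fixed hyperparameter mentioned later in the talk,
# which turns out to be awkward to tune.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=10)

# Each repository becomes a vector of topic weights; nearby vectors
# suggest similar repositories.
index = similarities.MatrixSimilarity(lda[bow], num_features=2)
query = lda[dictionary.doc2bow(repos["net-lib"])]
print(sorted(zip(repos, index[query]), key=lambda p: -p[1]))  # "proxy" should rank closest
```

In the real research the vocabulary and the model were much larger, of course, but the mechanics are the same.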
We were able to do that and publish the results; it was peer-reviewed and accepted, and it's a nice paper, you can check it out. But the data was collected ad hoc for it, and it was a very specific model for one particular task. We couldn't reuse it for other tasks, so we would need to train again for a different one. And it has the annoying property of having one hyperparameter, a fixed number of topics, and optimizing for that is hard.

So we decided to fix data collection first and build an open source stack that everyone could use and play with. Before that, if you did any kind of research like this, your process would look like this: you write some ad hoc script and collect the data specific to the task, and it's actually quite small. As you might know, a whole data-driven project, including machine learning, looks more like this: machine learning is just a small part of it, and there are a lot of other things that need to be taken care of to get a successful outcome. We actually maintain a curated list of papers on this subject on GitHub; if you're interested in what other people are doing in academia, that's something to check out.

We wanted to change this into something like that, where the collection pipeline is common infrastructure. Just as operating system kernels have a shared cost of ownership because many people run them on their servers, we wanted to do the same with data collection: have a shared dataset that everyone can use, big enough that you can build specific datasets out of it, and a way to query it.

This is the technology stack with the project names. They are all open source projects on GitHub, under the source{d} organization. I talked about it in more detail at FOSDEM this year, so I'm just going to touch on it very briefly. The idea is to decouple every step and have a separate project covering each of them. The first step is a distributed git fetch, in parallel on many machines, that just stores the pack files; this is the collection part. The next step is a query library that makes a cluster of machines go through those pack files and extract information from them; that's called engine. You can see an example of a query with this library here: you ask for all references inside all the repositories, then their files, detect their language, and then extract something from the abstract syntax trees of those files, whatever you may be interested in, for example just the documentation, or just the function names, and so on. The last part is the distributed parser infrastructure; we call it the Babelfish project. Its input is a source file in one of a number of languages, and its output is an annotated abstract syntax tree with cross-language annotations, like the roles of each node, that you can analyze.

Using that infrastructure, we were able to build a corpus, a two-terabyte dataset of the 200,000 most popular repositories on GitHub, and you can see the distribution of languages inside that dataset. You can play with it, and you don't need to download all the terabytes to interact with it: there is a command line tool, and that's the URL.
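To give a flavor of that parsing layer, here is a rough sketch of querying Babelfish from Python. I'm writing it from memory of the v1 Python client, so treat the exact calls and the role query syntax as an approximation rather than the definitive API:

```python
# Rough sketch of extracting identifiers from one file via Babelfish,
# assuming a bblfshd server on localhost and the v1 Python client;
# the exact filter syntax may differ between client versions.
import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
response = client.parse("example.py")  # language is auto-detected

# Keep only nodes annotated with the cross-language "Identifier" role.
identifiers = bblfsh.filter(response.uast, "//*[@roleIdentifier]")
tokens = [node.token for node in identifiers if node.token]
print(tokens)  # function names, variable names, class names, ...
```

In the real pipeline this runs at scale, driven by the engine over the whole dataset, rather than one file at a time.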
You can go and check that dataset out. Using it, we conducted the next research, on identifier embeddings, which is applying a word2vec-style model to identifiers. That can be used as a building block for many tasks, including the repository similarity that we already did with the topic model.

This is how it looks, very roughly. From the source code we go to the syntax tree annotated with roles, using Babelfish. Here we filter the abstract syntax tree by role, and the role is identifier, so from the source code we get only the identifiers, the things that carry natural language information: function names, variable names, class names, and so on. Then we tokenize them and build a co-occurrence matrix, which is the first step of the particular embedding algorithm we use, called Swivel. It's similar to word2vec in that it's an approximate matrix factorization algorithm, but it's much more scalable. That's an example of a co-occurrence matrix over there, but the real co-occurrence matrix, with tens of millions of rows, looks more like this. Then it gets sharded, embeddings are trained for every shard, and you can do that on multiple GPUs and multiple machines. And that's an example of a visualization of the embedding space using t-SNE, which doesn't tell you much, but you can see that related things get grouped together.

So you can check out the Swivel algorithm that we apply for the embeddings. I assume everyone knows what embeddings are: dense vector representations of words in some continuous space where similarity between vectors makes sense. We apply this algorithm; there is an existing implementation that Google open-sourced, and we have our own fork that is more performant, plus a distributed pre-processing step, so we can run the sharding of the matrix on many machines instead of just one. It's a bit different from word2vec, but the output is very compatible. It's a count-based method: first you count the statistics of the corpus, and then you run the training part. The really great property of it is that, because of the structure of the matrix, it scales linearly with the size of the vocabulary rather than with the size of your data, so you can train on a much bigger dataset.

Okay, so that way we get identifier embeddings, but how do we get from those to document similarity? There is a known algorithm called Word Mover's Distance. It comes from the Earth Mover's Distance in statistics, also called the Wasserstein metric, I think, which is a way to measure the distance between two probability distributions. The idea is that if you can measure the distance between words, then you can define the distance between sentences as the minimum total distance that the words from one document need to travel to reach the words of the other document. That's an example of it from the paper. There is an open source implementation for computing this distance between two documents, but what we needed was a system where you send a query and get nearest neighbors. We did an open source implementation of that system, called Word Mover's Distance Relax, and it uses results from operations research that Google open-sourced. It's an interactive system where you can send a query, for example a repository, and here's the output: its nearest repositories in that space.
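If you just want to play with the idea rather than with our system, gensim also ships a generic Word Mover's Distance. Here is a tiny sketch with made-up identifier lists and throwaway word2vec embeddings trained on the spot instead of real Swivel vectors (the distance itself may require an extra dependency such as POT or pyemd):

```python
# Toy Word Mover's Distance between "documents" of identifiers.
# Uses gensim's generic implementation, not the Word Mover's Distance Relax
# project from the talk, and tiny word2vec embeddings instead of Swivel vectors.
from gensim.models import Word2Vec

corpus = [
    ["tensor", "gradient", "layer", "optimizer"],
    ["socket", "listen", "accept", "packet"],
    ["tensor", "layer", "loss", "backward"],
]
model = Word2Vec(corpus, vector_size=16, min_count=1, epochs=200, seed=1)

repo_a = ["tensor", "gradient", "layer"]
repo_b = ["tensor", "layer", "loss"]     # shares identifiers with repo_a
repo_c = ["socket", "accept", "packet"]  # completely different vocabulary

# Lower distance means more similar; repo_b should come out closer to repo_a.
print(model.wv.wmdistance(repo_a, repo_b))
print(model.wv.wmdistance(repo_a, repo_c))
```

The nearest-neighbor query system we built does essentially this, only over all repositories at once and fast enough to be interactive.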
Coming back to the results on the slide, you can see that they actually make a lot of sense. If you ask for Torch, which is a machine learning library implemented in Lua, one of the results you get is PyTorch, which is a similar library but implemented in Python, or Intel's machine learning library, which, if you read its readme, you will learn is inspired by Torch, so it has similar concepts. The same goes for go-git, which is a Git implementation in the Go language: the machine figured out that it's similar to the libgit2 bindings for Go, which are Go bindings for a native library doing the same thing, and also similar to NodeGit, which is a Node.js binding for Git. It really makes sense, and what we get is that, in a completely unsupervised manner, a computer was able to figure out that those things are similar, just by doing some processing and spending some compute time.

Of course, this is a very early-stage result, a baby step. But I'm very excited about it, because it's similar to what we did with images before. With this kind of open source data collection pipeline and infrastructure, we would be able to build the ImageNet of source code, move on with this research, build more models that exploit these statistical properties, and end up with more useful products for open source maintainers that automate their routine tasks: code review, where a lot of patterns exist; naming and style guides; refactoring opportunities that a human might not even be aware of, but a machine can figure out because some piece of code is similar to another piece of code.

And of course there is more research by other people on this subject; as I said, we maintain a curated list of papers you can check out. We published the datasets, and they are part of the Mining Software Repositories challenge this year, so more people are going to use them at the MSR conference. There are posts on our blog with more details on the code similarity topic modeling and the id2vec research. And there are talks that members of the source{d} team have already given on parts of this infrastructure and on some of the models we trained; you can check those out, and all the work and all the code is open source. This is all the result of working together with smart people for about a year, so by no means is it just me doing this.

And with that, I would like to thank you and answer questions if you have any. Do you have any questions? You can always ask me later if there are none now; I will be here all day. If you're excited about topics like this, please check out the source code and the projects, everything is public, and source{d} is hiring for remote positions as well now, so if you're interested, let me know. Okay, no questions? Then thank you. I think we're almost on time for the next speaker. Thank you.