engineering perspective. Before we go into that, I'll do a short introduction. My name is Abyx, and I'm an infrastructure engineer at source{d}; later you will see why the software engineering perspective is important. source{d} has quite an ambitious mission: it's a startup based in Madrid with the goal, no more and no less, of building an AI that understands source code. If you're like me, you'll ask: what does that even mean? It's actually hard to define what it means to understand. But what we can say is that if you understand something, you should be able to make judgments, like making a call on, for example, a developer's experience based on the artifacts they publish online, such as source code. Or you should be able to build better tools: some kind of assisted programming that is better than what we do now with code completion. If you want more details on particular applications, there's a great blog post; I'm going to publish the slides, so you can read up from there.

Before going into details, we're going to establish some technical common ground for what code understanding means in more detail. First of all, in order to get there, we need to build some kind of common representation of things like code, developers, and projects. This whole talk is based on research that resulted in a paper recently submitted to a conference. It's not the first research in the field, of course; it stands at the intersection of mining software repositories and machine learning, and we're going to present the results of the first large-scale empirical study in that area.

Let's go into the details of what representation means here. At a high level, we have the hypothesis that source code is written in different languages, but each language has just a predefined set of tokens. Other than that, it has names and identifiers, which are basically natural language, and those could be a source of information describing each project. The goal is to build some kind of model or representation, which is what we have over here, that exhibits some notion of similarity between those things. Using that representation we can then do interesting things, for example clustering repositories by area or topic. I'm pretty sure everybody is familiar with clustering, but at a high level it's a task within machine learning. It's unsupervised, meaning that we don't really need labeled examples in order to learn it, and there are many algorithms for doing it. I'm going to focus mostly on one particular tool we found useful, which is topic modeling. The general task of topic modeling is described with this example: it concerns itself with a set of documents, each of which consists of words, and all those documents are about some topics. The goal is to recover those topics. There are multiple machine learning algorithms for doing that. There are classic ones, which are described very well and have open source implementations, like LDA (I think the original paper was by Blei, Ng, and Jordan in 2003) and probabilistic latent semantic analysis.
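Just to make the topic modeling task concrete, here is a minimal sketch of recovering topics from a toy corpus with scikit-learn's LDA implementation; the documents, topic count, and library choice are illustrative only, not the pipeline described in this talk.

```python
# Minimal topic-modeling sketch with scikit-learn's LDA (illustration only,
# not the ARTM-based pipeline described in the talk).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "repositories" described only by their identifier-like words.
docs = [
    "render texture shader sprite frame game",
    "wallet transaction block hash miner bitcoin",
    "tensor gradient layer epoch loss train",
]

# Bag-of-words counts, then fit a 2-topic LDA model.
vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words per recovered topic (needs scikit-learn >= 1.0).
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```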
Those first two, I think, are mostly based on linear algebra, things like matrix decomposition. But there's something newer in that field called additive regularization of topic models (ARTM), developed by a Russian professor quite recently. That's what we're going to use, because it has strong theoretical foundations as well as a memory-efficient and parallelizable implementation done by the professor's graduate students, in Moscow I believe. And as far as machine learning goes here, that's basically it. The reason is that there are many parts to building a product based on machine learning. Here's a picture from a recent paper published by Google analyzing their machine learning workflows: the black box in the middle is where machine learning itself, the math, lives, and everything else around it is engineering work: how to gather the data, filter it, represent it, serve the results, things like that.

I mentioned a large-scale experiment, meaning that our source will be the whole of GitHub; that's the scale we're talking about. We're going to do a number of steps before getting to machine learning and topic modeling; here are the steps. Basically, for every repository out there on GitHub, we're going to keep only the files that contain source code, not documentation or something else. Then we're going to parse them and keep only the natural language part, the names and identifiers, and build what is known as a bag-of-words model out of it. Very simple, but very effective. And then it's not enough to just train a machine learning algorithm on that model, because on GitHub itself we have a lot of things like forks, which would skew the statistics for the topic model. We have to filter those out, at this scale. Then we can do things like train the topic model and cluster. I'm going to go through these steps one by one very briefly and describe which open source tools we used and which we built ourselves.

The first thing is that we fetch everything, and for every repository we classify the source code files using the tool open-sourced by GitHub, linguist. It's the same tool that's used to build those language bars GitHub shows for each repository. Internally our own implementation is more efficient, but we used the Ruby one here. It supports about 400 languages, so we are able to identify the files containing source code and the language they are written in. From there we parse those files with an existing parser, which is actually the syntax highlighter Pygments. It's also an open source project, famous in the Python ecosystem. It supports 400+ languages as well, but they're different from linguist's; the intersection is about 200 common languages that both linguist and Pygments support. What it does is that for every token inside a source code file it assigns a class: whether it's a name, some kind of built-in, a keyword, and so on. We are particularly interested in the token type Token.Name. After doing that we can go and build the bag-of-words model, which is just the set of tokens inside every repository. Before doing that, we break identifiers up by naming convention: every language has some kind of naming convention, and we want to be able to extract words out of identifiers. Here's an example. After that, across the whole of GitHub we have about 19 million unique tokens, which is quite a lot.
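As a rough illustration of this step, here is a small sketch that uses the Pygments library to keep only Token.Name tokens and then splits identifiers by naming convention; the regexes and the file name are simplified stand-ins of my own, not source{d}'s actual implementation.

```python
# Sketch: extract identifiers from a source file with Pygments and split them
# by naming convention (snake_case / camelCase). Illustrative only.
import re
from pygments import lex
from pygments.lexers import get_lexer_for_filename
from pygments.token import Name

def bag_of_words(path):
    with open(path) as f:
        code = f.read()
    lexer = get_lexer_for_filename(path)
    words = []
    for tok_type, value in lex(code, lexer):
        if tok_type in Name:                       # only Token.Name.* tokens
            # split snake_case first, then camelCase / PascalCase parts
            for part in re.split(r"[_\W]+", value):
                words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words if w]

# e.g. "getHTTPResponseCode" -> ["get", "http", "response", "code"]
print(bag_of_words("example.py"))  # hypothetical input file
```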
Then, applying further natural language processing like stemming for tokens longer than six characters, we were able to bring that number down to about 16 million unique tokens. From there the last thing is to count term frequencies, so we end up with every repository mapped to a bag of words with its frequencies.

The remaining thing before doing machine learning on top of that representation is to filter out forks. This was done on a dump of public GitHub from December 2016, which had about 70 million repositories. The first way we can filter forks is by those explicitly marked as forks on GitHub; that accounts for 19 million repositories. But there are repositories that are effectively forks without going through GitHub: for example, if you clone something and then push it without clicking Fork, it's exactly the same thing. Then we can try filtering by colliding commit hashes, which takes out some more repositories, but that's still not enough, because there are a lot of cases like, for example, the whole Linux kernel with one additional driver and without commit history; that's effectively a fork, but we can't tell from the metadata. So we want to filter out those forks as well, and that results in the removal of about 2 million more repositories. Here's how we do that.

Basically, for every set of tokens representing a repository, we need to find close or similar sets and filter them out. The naive approach would just use pairwise similarity, comparing every pair, but that's far too slow for this number of repositories. So we went for something more efficient. We open-sourced an implementation of the weighted MinHash algorithm: it computes a kind of signature for every set that estimates similarity between sets, so that signatures are the same with high probability when the sets are very similar. On top of that we use something called locality-sensitive hashing, and with that we were able to get an almost linear-time algorithm to filter out the repositories that are effectively forks. Locality-sensitive hashing belongs to the field of probabilistic data structures. There are many interesting probabilistic data structures out there, but this particular one works a bit like a hash table. It's different from a hash table, though, because its hash function maximizes the probability of collision for similar items, so after building it we get a number of buckets, and each bucket holds all the similar items. If you're interested, I encourage you to check out things like HyperLogLog, Bloom filters, or MinHash, which are examples of the same idea. In the visualization up there you can see the effect of HyperLogLog for cardinality estimation of a set: the initial set was about 40 megabytes, and while preserving the cardinality information we could reduce that to 2 kilobytes. We do something similar, and we do it on multiple GPUs, which scales quite well; that results in 14 million unique tokens.

Then we can do the topic modeling part. Topic modeling by itself can be looked at from a linear algebra perspective: there is one vector space where each repository is represented as a vector, and the dimensionality of that space is about 40 million right now, which is quite big. What we want to do is project that vector into a lower-dimensional space of topics, and in our example we chose 256 as the number of topics.
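To give a feel for how MinHash signatures plus locality-sensitive hashing replace pairwise comparison, here is a toy sketch using the datasketch library with plain (unweighted) MinHash; the actual pipeline uses a weighted variant on GPUs, and the repositories, token sets, and threshold here are made up.

```python
# Sketch: near-duplicate detection with MinHash + locality-sensitive hashing,
# using the `datasketch` library. The real pipeline uses Weighted MinHash on
# GPUs; this unweighted, toy-sized version just illustrates the idea.
from datasketch import MinHash, MinHashLSH

repos = {
    "linux":       {"sched", "mutex", "irq", "driver", "kernel"},
    "linux-fork":  {"sched", "mutex", "irq", "driver", "kernel", "mydriver"},
    "game-engine": {"sprite", "render", "shader", "frame"},
}

def signature(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

# Index all repositories; LSH buckets group likely-similar signatures.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
sigs = {name: signature(tokens) for name, tokens in repos.items()}
for name, sig in sigs.items():
    lsh.insert(name, sig)

# Querying with the fork returns both the fork and its origin.
print(lsh.query(sigs["linux-fork"]))  # e.g. ['linux', 'linux-fork']
```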
Those dense vectors, usually called embeddings, become the common ground, the representation of each repository I mentioned before. To compute them we use the open source implementation of the additively regularized topic model, called BigARTM. It's quite interesting, it has a lot of regularizers implemented, and while it takes some time to train, for the whole-GitHub dataset we were able to do it on a single machine in less than a day. After getting those representations, the last thing to do is to cluster those vectors, and that can be done efficiently on GPU as well.

Let's look at the infrastructure we used for this research. First there's the pre-processing part, which is I/O bound. It was done on a cluster of 64 Apache Spark machines on Google Cloud Dataproc. The input was about 100 terabytes representing all the repositories, and in less than one day we were able to pre-process it down to the bag-of-words model, which is small compared to the original data. The interesting thing is that after that we had to do the filtering of forks, and we were actually able to do that on a single machine with multiple GPUs in less than 5 minutes, at least to calculate the hashes. That's impressive, because we tried doing the same on CPUs: it's a compute-bound workload, and that one machine is faster than a 400-core CPU cluster. Then the rest, which is the machine learning part, was done on a single multi-core machine as well, using enough memory and a good, parallelized implementation.

So we did all that; so what? There are two interesting kinds of results. Here's the list of topics. Although the training we use is unsupervised, there is still manual labor required to analyze what those topics represent; it was done in a few days, with the topics labeled by a human. The interesting thing that emerged is that we were able to see some general topics extracted: things like human body parts, nature, science, or even design patterns in software. Also, communities like games and the gaming industry, or Bitcoin, were represented as separate topics, presumably because they have a unique vocabulary for the related narrative. So we were able to cluster repositories. One thing we learned is that picking 256 as the number of topics was not really optimal, because some topics were dual, representing multiple concepts, and some were just repeated multiple times, so more work is needed in that area. Here's an example of one particular topic with the keywords that rank high in it, along with examples of some repositories.

That's all good and interesting, so these are meaningful results, but what was more interesting for me as a software engineer is that we were able to do this efficiently, and not by just throwing a lot of machines at the problem, which some companies can afford to do. With careful optimizations, and with compute-bound workloads done on multiple GPUs on a single machine, we could go very far. That's quite interesting, and there's still space to grow: we can do more using the same architecture I described, with some careful optimizations. I hope all this persuades you of the usefulness of that approach. There are other directions to explore as well. They would include better preprocessing, like experimenting with stemming and with natural languages other than English.
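For the clustering step, here is a toy sketch that groups repository topic vectors with k-means from scikit-learn; the vectors are random stand-ins for the 256-dimensional ARTM embeddings, and the cluster count is arbitrary, since the actual run used a GPU implementation.

```python
# Sketch: cluster repository topic vectors (e.g. the 256-dim ARTM embeddings)
# with k-means. The talk's pipeline produces these vectors with BigARTM and
# clusters on GPU; this CPU version on random data is illustrative only.
import numpy as np
from sklearn.cluster import KMeans

n_repos, n_topics = 10_000, 256            # toy scale; GitHub-scale is millions
theta = np.random.dirichlet(np.ones(n_topics) * 0.1, size=n_repos)

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(theta)
labels = kmeans.labels_                    # cluster id per repository

# Inspect one cluster: which topics dominate its centroid.
centroid = kmeans.cluster_centers_[0]
print("top topics for cluster 0:", np.argsort(centroid)[::-1][:5])
```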
There's also the number of dimensions I mentioned; it wasn't optimal, so we can play with that. Or we can add more information: we were clustering based only on the bag-of-words model, but we could experiment with comparing against clustering based on the social graph, or with adding more features to the model. Something we found even more fruitful was trying different models: bag-of-words, as you may know, is a very simplistic model, because it loses information about sequences; for example, "foo bar" is exactly the same as "bar foo". Moving to more modern natural language models like word2vec, with a parallel implementation, is something that's going on right now, and we will publish results on that as well (a small word2vec sketch follows at the end of this transcript).

Since this is an open source conference, I wanted to highlight a number of open source tools that we used or implemented, so you can do something similar. The first of them is a Git implementation written in the Go language; that's what allows us to build scalable infrastructure and actually have a GitHub mirror to analyze first-hand. Then there is the efficient hashing implementation we used, in case you have set-similarity problems of your own; that's something to look at. There are two other projects that were already out there and that we used: BigARTM for topic modeling, and Apache Spark, which as you may know is pretty famous these days. Beyond that, we went a bit further: to simplify reproducing these results, we built Docker images, so with one docker command you can have the pipeline reproduced and running on your machine. We also published some data, which is actually all the extracted names, so you can run the machine learning part on the same data, or you can just take the trained model and use it for your own clustering. With that, I want to thank you. If you're interested in things like this, source{d} is hiring, so talk to me later. I think we have time for questions.

As I mentioned, this is research first of all. At the beginning I said that as a company we have the goal of building better tools, in a different way. This particular research was a kind of feasibility study: is it even possible? The hypothesis I mentioned before is that source code is quite similar to natural language because it has natural language embedded in its names; whether that's true or not, whether we can extract a meaningful representation based on that information, that was the goal. It actually looks like we can. Using more advanced techniques we would get more advanced results than just a taxonomy, as you say, which is not very useful at this number of topics anyway. But it was mostly feasibility research; we were able to get interesting results, we're pushing further with more complex models, and hopefully that will lead to more sophisticated products that we as a company can build. There are many things in that domain you can try and do. Well, if there are no other questions, thank you.

This research used only names, so it never went beyond very simple features extracted at scale, but there are definitely open source tools available to extract that kind of information.
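As a footnote on the word2vec direction mentioned above, here is a minimal sketch of training word2vec on ordered identifier streams with gensim (4.x API assumed); the corpus is a toy one and does not reflect the follow-up work the talk refers to.

```python
# Sketch: a sequence-aware alternative to bag-of-words, training word2vec on
# identifier streams (gensim 4.x API assumed; toy corpus for illustration).
from gensim.models import Word2Vec

# Each "sentence" is the ordered stream of split identifiers from one file.
corpus = [
    ["http", "request", "get", "response", "status", "code"],
    ["http", "client", "send", "request", "parse", "response"],
    ["sprite", "render", "frame", "texture", "shader"],
]

model = Word2Vec(sentences=corpus, vector_size=64, window=3,
                 min_count=1, sg=1, workers=2, epochs=50)

# Unlike bag-of-words, tokens that co-occur in sequence get similar vectors.
print(model.wv.most_similar("request", topn=3))
```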