So, continuing this series of lightning talks where we talk about Apache Software Foundation projects, next up is Alexander Bezzubov, who will be talking about visualization of big data.

Hey, hey, can you hear me okay? So, I think this is a bit different from what Raman has announced. I used to work on visualization for big data, but this particular talk is about tools for large-scale collection and analysis of source code repositories. That's what I've been working on for the last half year, and it's exciting, maybe even more exciting than the visualization part.

My name is Alex. I'm an engineer at source{d}, and also a committer and PMC member of Apache Zeppelin. source{d} is a startup in Madrid that I joined recently. It's very cool, and all the things we work on there are open source. I'm going to talk about some tools that I and my colleagues built during the day job.

So, we collect a lot of source code. But why? It's two-fold: one, it's research material for academia, and two, it's the fuel for data-driven products on top of source code, which is a rapidly evolving area of building better tooling for writing programs. So it's quite exciting. But first you need to get the data, and that's what I'm going to talk about: an open-source pipeline that you can run on-premises to collect a lot of git repositories, because git is the most popular version control system and the source of truth about source code. The collection pipeline is pretty standard: a crawler, distributed storage, and parallel processing.
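Since the talk only names the stages, here is a minimal sketch of that crawl, clone, and store shape. Everything in it (the URLs, the function names, the in-memory dict standing in for distributed storage) is an illustrative stand-in, not the actual source{d} tools:

```python
"""Toy sketch of a crawl -> clone -> store pipeline.

All names here are illustrative stand-ins, not the real tools.
"""

def crawl():
    # A real crawler would discover repository URLs from code-hosting
    # APIs; here we just yield a fixed list.
    yield from [
        "https://example.org/alice/project.git",
        "https://example.org/bob/project.git",
    ]

def clone(url):
    # A real fetcher would run a git clone; we fake the result.
    return {"url": url, "refs": {"HEAD": "abc123"}}

def store(repo, storage):
    # Distributed-storage stand-in: an in-memory dict instead of HDFS.
    storage[repo["url"]] = repo

def run_pipeline():
    storage = {}
    for url in crawl():          # stage 1: discover URLs
        repo = clone(url)        # stage 2: fetch the repository
        store(repo, storage)     # stage 3: persist it
    return storage

archive = run_pipeline()
print(len(archive))  # 2
```

The real pipeline parallelizes each stage across machines, but the data flow between stages is the same.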
After you store all that, you most probably want to go through it and figure something out. So we'll briefly go through the tech stack. The takeaway of this talk is that if you're interested in large-scale data collection, there are some existing open-source tools, and there are some new ones that I wanted to share today. The things with black or gray boxes around them on the slide are the things we built at source{d}, but we're going to go through the whole stack, one layer at a time.

To run the software, you need some hardware. On the infrastructure side we have a dedicated cluster, built in what I think is called the "immutable infrastructure" style these days: machines that from boot are provisioned with CoreOS and eventually become part of a Kubernetes cluster that you can schedule your applications on. It's very nice and automated. There's going to be a detailed talk about that at Config Management Camp in Ghent on Tuesday, if somebody is up for learning more about the infrastructure part.

On the collection part: we've got the machines, and now we want to get some git repositories. That consists of two steps: getting the URLs of the repositories, and then actually cloning them. We focus on git because it's the most popular system, and to do this you need to speak the git language. So we wrote a custom implementation of the git protocol and storage format called go-git; there was a talk about it last year in the Go language devroom. It's a pure Go implementation, one of the "big five" implementations of git, and it's very extensible and interesting.
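go-git itself is in Go, but the extensibility point, that the repository logic only talks to a storage interface it doesn't own, can be sketched in a few lines. These classes are hypothetical illustrations of the design idea, not go-git's actual API:

```python
"""Toy illustration of pluggable storage: the repository logic is
written against a storage interface, so memory, disk, or a database
backend can be swapped in. Not go-git's real API."""

class MemoryStorage:
    # One possible backend; a disk- or database-backed class with the
    # same put/get methods could replace it without touching Repository.
    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects[key]

class Repository:
    # The repository only calls storage.put/storage.get, so swapping
    # backends requires no changes here.
    def __init__(self, storage):
        self.storage = storage

    def commit(self, sha, message):
        self.storage.put(sha, message)

    def log(self, sha):
        return self.storage.get(sha)

repo = Repository(MemoryStorage())
repo.commit("abc123", "initial commit")
```

This separation is what makes tricks like custom protocols or database-backed object stores possible.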
You can do a lot of things with it, like store things in memory when you want to, add custom protocols, or store things in a database. That's what we use for the cloning part. Then there are two separate programs: one that finds the URLs of git repositories and stores them in a database, and a second one that schedules their cloning. These are the rovers and borges parts.

Well, there are hundreds of millions of git repositories, so we want to be space-efficient, and there are some nice tricks you can do, like storing forks together in a single git repository, which is possible because you have an extensible git library. That's the concept of the "rooted repository" depicted in the middle of the slide: repositories that share history, starting from the same initial commit hash, get stored in a single big repository, which works really nicely.

While you collect the URLs and the repositories, you most probably want to store them somewhere, and it's going to be distributed. For the URLs it's just a PostgreSQL database; we wrote a custom ORM implementation in Go, type-safe and quite nice, called kallax. For the repository side it's HDFS, which works very well but scales linearly with the number of files in it, so we want to minimize the number of files. We do that with a custom archive format, so that every rooted repository ends up being a single file in HDFS. This custom archive format is called siva.
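To make "seekable, appendable, and indexed" concrete, here is a toy append-only archive with a fixed-size footer pointing at a JSON index. This only illustrates the idea; siva's real on-disk layout (including how its indexes cover earlier blocks) is different:

```python
"""Toy append-only archive with a per-append index footer.

Only illustrates the seekable/appendable/indexed idea, not the real
siva layout."""
import io
import json
import struct

def append_block(buf, files):
    # files: dict of name -> bytes. Write the contents, then a JSON
    # index recording each entry's (offset, length), then the index
    # size as a fixed 4-byte footer, so a reader can seek to the end
    # and find the latest index without scanning the whole file.
    index = {}
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    raw = json.dumps(index).encode()
    buf.write(raw)
    buf.write(struct.pack(">I", len(raw)))

def read_latest_index(buf):
    buf.seek(-4, io.SEEK_END)
    (size,) = struct.unpack(">I", buf.read(4))
    buf.seek(-4 - size, io.SEEK_END)
    return json.loads(buf.read(size))

buf = io.BytesIO()
append_block(buf, {"refs": b"HEAD abc123\n"})
append_block(buf, {"objects": b"\x00" * 16})  # a later fetch, appended
index = read_latest_index(buf)  # index written by the last append
```

Appending never rewrites earlier bytes, which is what makes the format friendly to HDFS-style storage, where files are cheap to append to but expensive to rewrite.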
Siva is a seekable, appendable, and indexed format. It's block-based, so you can fetch a repository once and then append to it after new clones or fetches happen, and it's all stored in HDFS.

Once the data is stored, you want to process it somehow, and Apache Spark is a good way to do batch processing on a cluster of machines; Spark in Scala is something people understand and know how to use. We built a custom library called engine to expose those git repositories at the Spark API level. It exposes references, commits, files, and so on in Spark terms, be that Python or Scala, which is super nice. And it talks over gRPC interfaces to the more advanced stages of source code analysis, if you want to go there.

Here's an example of using that library; I'm not sure if you can see it. With a simple pipeline like this you express the extraction of references, take the HEAD reference, get all the files, and for every file do something like detecting its language. It's quite concise, and it's really great that both engineers and machine learning people can use it, because there are Python and Scala APIs.

After just iterating through files, you most probably want to do some more advanced stuff, and there are two projects on this side that we've built. One is enry, a programming language detector. It's essentially a rewrite of parts of github/linguist, the Ruby library GitHub uses to show you the distribution of languages across your repositories. Ours is in Go, it's compatible with linguist, and it's faster: from four to twenty times faster in our measurements than the original Ruby one.

The other one is the Babelfish project, which is a bit different. First, it's a scalable parser infrastructure: it wraps native parsers and their runtimes inside containers that you can schedule across the cluster, and it exposes a uniform gRPC API.
That way you can extract a lot of information from source code in a very uniform fashion, and it has drivers for many different languages. Second, there is something called the universal abstract syntax tree (UAST): the native syntax tree of the language, annotated with language-independent information that you might be interested in at later analysis stages.

So that's the high-level overview, and there are some future things we want to get done. On the Kubernetes side, having a bare-metal cluster with persistent storage is not really an easy thing to do. On the collection side, there's the concept of staged event-driven architecture (SEDA), from a paper about how to make a scalable system that dynamically saturates its resources by putting queues between the stages, which is very interesting if you want to clone a hundred million repositories. On the processing side, we're looking into adding distributed indexes to speed up the Apache Spark queries that we run. And on the analysis part there are more advanced questions, like how you diff abstract syntax trees, or how you extract cross-language information from abstract syntax trees, which we are looking at in the Babelfish project.

That's it. Any questions?

Right, so the question was: do we look at the READMEs? The thing is, the collection pipeline I described is generic, so you get everything: you get the full repositories. The language detection does not do anything beyond what GitHub is already doing; it does the same job. But there is some cool research in that area, coming from the natural language processing and machine learning side: how do you tell which language a file is in, basically classifying the language from the source code itself, and you get pretty good results. It's not inside enry yet.
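The content-based classification just mentioned can be caricatured with a keyword-scoring toy. Real research approaches use proper machine-learning models over token streams; this hypothetical sketch only shows what it means to ignore the filename and look at the code itself:

```python
"""Naive sketch of content-based language classification: score each
language by how many of its characteristic keywords appear in the
source. Purely illustrative; real classifiers use ML models."""
KEYWORDS = {
    "Go":     {"func", "package", "defer", "chan"},
    "Python": {"def", "import", "self", "lambda"},
    "Ruby":   {"def", "end", "module", "require"},
}

def classify_source(code):
    # Whitespace tokenization is deliberately crude; ties between
    # languages are broken by dict order.
    tokens = set(code.split())
    scores = {lang: len(tokens & kws) for lang, kws in KEYWORDS.items()}
    return max(scores, key=scores.get)

sample = "package main\nfunc main() { defer close(done) }"
lang = classify_source(sample)  # "Go"
```

Note that no file extension is consulted at all, which is exactly what distinguishes this research direction from the extension- and heuristics-based detection that linguist and enry do today.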
It's still more in the research stage, but I think it has a lot of potential, hopefully even to be merged upstream into the GitHub implementation so it does a better job. But yeah, that's a good question. Any other questions? Okay, thank you so much.