All right. Hi, everyone. Good afternoon. I hope you've settled in. Felipe is going to talk about all the stuff that you have done on GitHub. I hope everyone here has done at least something on GitHub. So now he's going to tell you what you've done, why you've done it, and why it was wrong, because he has the data. So everyone say hi to Felipe.

Hello, everyone. Thank you very much. Thank you for being here. I'm Felipe Hoffa. I live in San Francisco. I have jet lag, but I'm pretty happy to be here. We're going to analyze data. We're going to analyze a lot of data on GitHub. Who has a GitHub account? Good, good. So I'm going to measure you now: what have you been doing, what are your favorite projects, and why should you be doing this? If you have any questions, you may interrupt me. Don't interrupt me too much, but I'd be very happy to run interactive queries and go wherever you want to go.

What do you see here? What is this? A license, code. Yes, it's code. But if you go deeper you can start seeing other things: yes, it has a license, it's doing some imports, there are some modules, there are imports from __future__. It's Python code. If you step back and look at the big picture, you can see the number of lines, the number of stars, the number of forks, how many people have contributed. So what I want to say is that what I'm looking at here is data. And with a big font, because it's big data. We have a lot of data here that we can analyze.

So who wants to analyze GitHub? For example, project maintainers. Who here has a popular project? Everyone has a GitHub account, but who has a popular project? I know some people do. As a project maintainer you might want to know: How popular is your project? Who is starring it? How do you manage change? How many people would you break if you changed your API? Is your project healthy? Are you closing issues fast enough?
If you are a project user, you might want to know the same things, but also how to request features, how to be more effective when asking for new features, or which projects you should be following. Are there more popular projects? And before you become a user, you are a project chooser. You want to choose the healthiest project. You want to see if there are more popular projects. If you're looking for a JavaScript library, there are like a thousand of them, but you can use data to choose the one that's closest to your needs. And if you just love data, we have a lot of interesting data here. That's why we keep doing this. That's why Alan is doing Bitcoin. That's why I love doing GitHub: we have lots of interesting data that we can analyze.

We are going to look at three main data sets, and all of them are stored in BigQuery. The first one is GitHub Archive. It has all the events that are happening on GitHub, hour by hour. You can see every row, every event, and start querying it. It's updated hourly. The second data set is GHTorrent, which looks at the graph: it goes beyond events and adds more of the metadata that people have annotated GitHub with. And then we also have GitHub's code: we copied most of the open source projects into BigQuery so you can analyze code as data.

Since I started working with this I've written a lot of interesting blog posts. My favorite one, or the one that a lot of people love, is: what are the top companies contributing to open source on GitHub? Do you know what the top companies contributing to open source are? Oh, you read that blog post. Yes, there was a blog post two years ago that said Microsoft was the top company contributing to GitHub. But the thing is, when you are analyzing data there are so many ways to measure things, and no way is absolutely correct.
You have to make a lot of assumptions, and I made some assumptions in that case. That blog post said that Microsoft was the company contributing the most users to GitHub. But you can count in different ways. For example, here I have it in two dimensions. One is the number of users identified from each company on GitHub, and you can see that, yes, Microsoft is the one furthest to the right, with the most users. But if you look at the dimension of how many repositories people are contributing to, Google is on top. Both companies are up there, with way more users and way more projects than these other companies, which are still pretty cool. The size of each circle counts the number of stars these projects are getting, so basically we're looking at impact: how much people love these projects. So even though Microsoft is doing really well, and I love how much they've changed in the last few years, I can still say that Google has more projects and more stars. And I love that there are a lot of other companies there. They're pretty cool. From some I expect more; some are making huge contributions while being small companies.

So these are pretty interesting ways to measure. Still, I had to make assumptions, and one of the main messages of this talk is that I want you to challenge my assumptions. If you think I'm wrong, please tell me. You have access to all of this data. You can go deeper, you can count things in different ways, and we can refine the results. To get to these results, this is the query that I ran. It's pretty complicated. It has a lot of assumptions. So for this talk it's better if we start at a simpler place.

Let's go back to 2012, when my teammate at Google, Ilya Grigorik, started collecting all of these events. GitHub has an API. He connected to the API.
He started downloading all of the events, and he left them as files that you can download. If you want to get one hour of GitHub Archive, of GitHub events, you do a wget and you'll get a file in less than a second: five megabytes of compressed data. If you decompress it, that's like 40 megabytes of data. That's nothing. Anyone can download any of these files and start analyzing them on their computer at any time. But that's only one hour of data. If you want to analyze seven years of data at this rate, we are talking about two terabytes of data, more than a billion rows, and that's way more than we normally handle. If you have to handle one billion events, two terabytes of data, where do you do this? What's your tool of choice? What would you use? OK, that's one answer. Yes, there are options.

The option I'm using is called BigQuery. When Ilya started collecting these events in 2012 it was one of our new products, and he put all of these files there. Why do we want to use BigQuery? It has some nice advantages. One, it's fast: we will be able to analyze terabytes in seconds. It's simple: you only need to know SQL. It's scalable: you can go from bytes up, and it doesn't matter how much data you have, it will store all of it. And it's always on. Compared to other solutions, here you have nothing to turn on. It's just always there. You load as much data as you have and it will always be running, because you don't pay for hours and you don't pay for RAM. You just pay for how much data you are storing and how much data you are querying. And everyone gets a free terabyte every month. So if you want to run any of the queries I'm going to run now, you can just open your computer and create an account with Google. You will get a free terabyte and you will be able to repeat all of these.

Even better, with BigQuery you can share data. Say you load data into BigQuery.
It's your data, for your eyes only. You can have health data, private data, and it will be stored securely. But if you want to share it with third parties, with other companies, within your company, or with the whole world, you can do it. That's how Ilya was able to share all of these events with all of you, and you get this free terabyte every month to run queries. So that's BigQuery. I will demo it, we'll run some live queries.

But let's start by looking at the stars. Which stars am I talking about? GitHub stars, of course. So what were the projects that got the most stars last year? TensorFlow, that's a good guess. Any other guess? What other projects got a lot of stars? OK, let's count. Let's run this query before we run out of guesses.

So here I have BigQuery, and I have a simple query. You can connect Python, or whatever your favorite data analysis tool is, to BigQuery, but it also has this web UI. I have this table from 2017, with all of the events from that year. This table has one terabyte of data and 400 million rows. If I want to count the number of stars, I can run a count of all the watch events. Every time someone stars something it produces a watch event, and last year we saw 30 million stars. If you want to know which projects got them: select the repo name, group by the first column, order by the second one in descending order, and give me the top 20. What were the top 20 projects last year? You can see it's pretty fast to just start writing queries and getting results.

The second project with the most stars was TensorFlow. Good guess. The first project was freeCodeCamp. Does anyone here know freeCodeCamp? They are a great resource for people who want to learn how to code, and the first step in their program is to create a GitHub account. The second step is to star their project. And that's how you get to the number one spot. It's smart, but yeah, they deleted that step.
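A minimal sketch of what that star-counting query computes, written in Python rather than the SQL shown on stage. The sample events, the field names, and the `top_starred` helper are made up for illustration and are not taken from the talk: keep only the watch events, group them by repository name, count them, sort in descending order, and keep the top N.

```python
from collections import Counter

# Made-up sample events (an assumption for illustration only);
# the real 2017 table is described as having about 400 million rows.
events = [
    {"type": "WatchEvent", "repo": "freeCodeCamp/freeCodeCamp"},
    {"type": "WatchEvent", "repo": "freeCodeCamp/freeCodeCamp"},
    {"type": "WatchEvent", "repo": "freeCodeCamp/freeCodeCamp"},
    {"type": "WatchEvent", "repo": "tensorflow/tensorflow"},
    {"type": "WatchEvent", "repo": "tensorflow/tensorflow"},
    {"type": "PushEvent", "repo": "tensorflow/tensorflow"},  # not a star
    {"type": "WatchEvent", "repo": "example/other-repo"},
]


def top_starred(events, n=20):
    """Count WatchEvents per repository and return the n most starred.

    Mirrors the spoken query: filter on watch events, group by repo name,
    order by the count in descending order, limit to n rows.
    """
    counts = Counter(e["repo"] for e in events if e["type"] == "WatchEvent")
    return counts.most_common(n)


if __name__ == "__main__":
    for repo, stars in top_starred(events, n=2):
        print(repo, stars)
```

Every watch event counts once here, which is exactly what a plain row count does.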
So these 90,000 stars are only from the first half of 2017. Then you have Vue.js, Facebook React, Every Programmer Should Know, Developer Roadmap, et cetera. So that's how we start analyzing GitHub.

Now, whenever you run a query, whenever you're looking at data, you have to be a little suspicious. Are these the real results? Can we trust any ranking that tells us these were the top projects last year? Let me show you. This is FOSSASIA. Is anyone here a fan of FOSSASIA? I'm a fan of FOSSASIA. This is a project that has almost a thousand stars, and I will give it another star. Then if I remove this star, I can give it one star again, and I can remove it, and I can give it one star again. In this way I can create a series of events that gives them a lot of stars when I'm just counting them like this.

So instead of doing a plain count, I can look at how many different people have starred these projects. I count the distinct actor logins, and now we are going to order by that column instead. Let's look at the real number of stars that these projects got. You will notice that some projects are more inflated than others. Again, we're querying one whole year of data, and it's taking 15 seconds to process all of it. And yes, you can see that freeCodeCamp didn't get 90,000 stars.
It was 85,000. There are like 5,000 duplicate stars. Some projects have more duplicate stars than others. We can call them fake stars if you want, but this is happening, and when you're building rankings, watch out for projects that might be inflating their number of stars for any reason.

Let me go back to the slides. These were the top projects in 2016. In 2016 freeCodeCamp got double the number of stars, because they still had that step then. And this is how the rankings changed from 2016 to 2017: TensorFlow went from the number five position to number two.

As I was telling you, not all stars are equal. A star that I give a project is different from a star that you give a project, because everyone has a different background. Every star comes from a different individual, with different experience and different interests. Some people are interested in machine learning, some people are interested in PHP, some people are interested in living in Singapore. Each star carries all of that background, and we can start querying that kind of data.

Where do I have a cool query? Yeah. So for example, let's compare TensorFlow with freeCodeCamp. They both got a lot of stars. In this case I'm comparing the stars they got in 2017, and I'm looking at different dimensions. For example, the age of the users starring each project. I don't know the age of people on GitHub, but I know how long they've been around: did they create their accounts five years ago or one year ago? It turns out the account age of people who starred freeCodeCamp is one year, while for TensorFlow it's two years. People starring TensorFlow have more experience. They have starred more repositories, like eight times more repositories. They have written more issues, more comments.
They've made more pull requests. This way you can find that projects are made up of different kinds of people. If you are creating a project for people who are learning how to code, you probably want to get newbies. If you have an advanced project, you will count things in a different way. It all depends on what you are counting. You can define your own rankings. You can be number one in any list that you define.

In this case, if we count only the stars from people who have written more than 20 comments, freeCodeCamp doesn't have 40,000 stars, but almost the same number as TensorFlow, because in this ranking we care only about experienced people. Maybe that's what I care about. If you care about something different, it's up to you to define your ranking.

Again looking at 2016, if we look at users with a lot of comments, who have been active on GitHub, the ranking changes. freeCodeCamp is no longer the number one project; Yarn is. Facebook React is not in the top 10, but facebookincubator/create-react-app is on top. So it all depends on who our users are.

I was also looking at the stars for FOSSASIA. FOSSASIA has some 200 projects, 211 repositories. I wanted to know what the top projects for FOSSASIA are. You can run a query like this, and interestingly enough you get results that look like this: many FOSSASIA projects have around 280 stars. What's happening here is that we have a lot of FOSSASIA fans who star every FOSSASIA project. Which is nice, but it makes it harder to know what the top projects are.
That is, the ones that people are really interested in. It's cool if everyone stars FOSSASIA projects, but we may need to look at things in a different way, and we have the raw data to do so. For example, if I look for fans, people who have starred mostly only FOSSASIA projects, and I remove them, I get a completely different ranking and a different distribution. This looks more normal. Now I'm not looking at fans; I'm looking at general interest in each project.

When you get stars you can also ask: what else have these people starred? In this case, for example, I'm looking at people who starred TensorFlow. What else did they star? With a query like this I can start finding other machine learning projects, and it works pretty well. Now, this is a naive ranking, just by counting. If I do it by probability I get a different ranking, and I get that, yes, people who starred TensorFlow also starred Theano, Torch, Caffe, scikit-learn, Keras, and MXNet. You can start navigating, creating a graph for any project. Wherever your project is, you can navigate from there: people who are starring this repository are also starring these others.

You can do a time-lapse of stars, because some projects get the same number of stars every week and some other projects just have these huge spikes. These are the top Apache projects. I created this with Data Studio, our free visualization tool, and I have a whole interactive report. If we have time later we can come back here. But for example, here I chose two different Apache projects, Arrow versus Flink. Flink gets a stable number of stars each week. Meanwhile Arrow gets a lot of stars every time Wes McKinney writes a blog post about it. Boom, everyone stars this project. This was in February one year ago. Then in September he wrote another blog post.
They had another huge jump. So if you want to get stars, if you want to get attention for your projects, there are things that you can do, like for example getting your project on Hacker News. Let's look at these projects and all the spikes they got in the number of stars. See those annotations? They show when these projects appeared on the Hacker News front page. And there's a huge correlation: if you show up on Hacker News, people will give you a lot of stars. What's super interesting here is that I didn't create these annotations manually. In BigQuery I not only have GitHub data, I also have, for example, all of the Hacker News stories. Every day we update that data set, and you can run a join between both data sets and start looking for things that show up on Hacker News and count the number of stars that produces.

Now, the number of stars is not the only important metric. For example, we may want to look at the health of projects. What are the projects that have the most comments on issues? In this month of 2016, Kubernetes had a huge number of comments, 17,000. Spark had a lot of comments, and OpenShift too, and a project called southern demo. Does anyone know that project? No. So again, whenever we run a query, whenever we get results, we have to be a little suspicious of them. It turns out that for southern demo almost 5,000 comments were written by one account. That's allowed; you can have robots writing comments. So you might want to filter out that kind of stuff.

In this case the ranking is counting not only the number of comments but also the number of people writing comments, the authors. So we have the number of comments, we have the number of people writing comments, and then I'm calculating how many comments each author wrote. You can see that for Kubernetes each author wrote about 18 comments. 500 people writing 18 comments each: that shows you that you have a super healthy community. Compare that with the project Font Awesome.
Yes, it has more people writing comments, but each one left fewer than two comments on average. So be suspicious of the results you get. You can see here that even though projects can get the same number of comments, there is a different measure of healthiness.

For FOSSASIA projects I ran the same query, removing the comments that look too similar, and these are the top projects for FOSSASIA: Open Event Android, SUSI Server, SUSI Android. You see that there's a healthy number of people commenting and a different rate of comments per project. As a project maintainer that's super interesting data. As a project user it's interesting data.

You can do text analysis too. For example, how do people start issues on GitHub? What's the most common way to start an issue? If you want to request something, are people nice or are they not nice? These are the results I got for the first four words when someone starts an issue. They start in a nice way, like "It would be nice", "Is it possible to", and "I'm trying to". People are really nice. There are a lot of people who start issues this way. But what's most interesting here is the third column, which asks how many of these issues get closed. It turns out that if you start an issue with "It would be nice", you get a 56% closure rate, while "Is it possible to", which is more concrete, gets 73%. The best one I got here is when you start an issue saying "I get the following". Being concrete, showing what your problem is and what you want, gets you much better results than being theoretical.

You can look at the countries people are coming from. Do you have a guess for the top countries? These are the top countries. The first one is null, because most people don't put their location on their profile. But then you have the United States, India, China, Great Britain, Germany, which might be interesting. This is the same by number of pushers around the world, and of course the highest concentration is in the USA,
China. But it's more interesting if we look at results per capita. Now we see a huge concentration in Northern Europe, and a huge concentration in Australia. These are the numbers I got: the top countries by concentration of programmers are Iceland, Sweden, Norway, New Zealand, Denmark. Basically cold countries, or that's what I think when I see those names.

Now, instead of stopping here and thinking about cold countries, I can go and run an analysis on it. I can find out where coders prefer to live. In BigQuery I also have worldwide weather, day by day, station by station. You can get this data. This is the average temperature for each station around the world, grouped by country, and Singapore is one of the hottest countries. I can confirm that. Here are the coldest countries on this side. We can join both data sets and get a chart like this. In this chart we can see that, yes, the coldest countries have the highest concentration of programmers. There is a correlation. Now, among the hottest countries, that dot on the top right is Singapore. Yes, exactly: people don't want to go outside. So there's a huge concentration of programmers here. And somehow in Asia you have the opposite correlation, where the hottest countries have the highest concentration of coders.

Counting developers: again, there are many ways to count them. Looking at them per country in Asia, you can see that of course the country with the most users is China, followed by India, Japan, Indonesia. This is on GitHub. Now, each country behaves in a different way. China gives a lot more stars per user than India, 34 stars per user versus ten. Singapore is there at number nine. You give on average 17 stars to projects. You could do more.

And then I have other data sets in BigQuery. I have Stack Overflow. So what are the top countries on Stack Overflow? Is it the same ranking or a different one?
India is the top one, followed by Pakistan. Somehow Pakistan uses Stack Overflow a lot, producing answers, producing questions. Then China, and here Singapore goes down to number 11. You could use Stack Overflow more. You could produce more content, ask questions, answer them, and these numbers could go up. You can also look at growth, in this case for Stack Overflow: how many more users you're getting per year. I was really surprised here by the Philippines, Indonesia, and Malaysia. They had huge growth in the number of people participating on Stack Overflow. Singapore had 45% instead. Yes? Reputation distribution. Oh, I would love to do that. OK, that's my homework, or you can help me. Anyone can run queries here. It would be really interesting to see which countries produce the most useful answers. Stay tuned.

We also have the GitHub code, the open source projects. We have a copy in BigQuery so you can analyze code. I released this two years ago. When I took this screenshot the table was almost two terabytes; now it's over two terabytes, with more than 200 million unique files. It's a table that has the content of each file and its size. Each unique file is shown here only once, so we are deduplicating. If you want to get the total number of bytes, you can multiply the size by the number of copies, and you get that we have more than 46 terabytes of code in this table.

Remember some rules before querying this table. It's really important: don't just go and query this table. Extract the data that you want first. Everyone has a free terabyte of analysis every month, and querying a two-terabyte table will take away your free terabyte pretty fast. But I have left for you tables with all of the Java code and all of the Python code, and if you want to extract anything special, you can also ask for my help and I will leave that table public for you. I also left a sample table that's way smaller.
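The size arithmetic for the deduplicated contents table can be sketched in Python. This is only an illustration under stated assumptions: the rows are made up, and the field names `size` and `copies` are hypothetical, not the actual column names of the table. Each unique file is stored once, so the stored size is the sum of the sizes, while the total amount of code is the sum of size times number of copies.

```python
# Made-up rows (assumption): one row per unique file, with its size in
# bytes and the number of copies found across repositories.
unique_files = [
    {"id": "a1", "size": 1200, "copies": 3},
    {"id": "b2", "size": 500, "copies": 1},
    {"id": "c3", "size": 80, "copies": 40},
]


def stored_bytes(files):
    """Bytes actually stored: each unique file is counted once."""
    return sum(f["size"] for f in files)


def total_bytes(files):
    """Bytes of code represented: size multiplied by number of copies."""
    return sum(f["size"] * f["copies"] for f in files)


if __name__ == "__main__":
    print("stored:", stored_bytes(unique_files))
    print("total:", total_bytes(unique_files))
```

The gap between the two numbers is the effect of deduplication: small files that are copied many times can contribute more to the total than large files that appear once.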
Remember, we only have open source projects, according to the license they have. If we cannot determine that the license is open source, we don't copy it. And now you can start looking at the real code. You can run regular expressions. You can see, for example, how the top imports in Java changed between these years. People are doing way more injection now, people are using more Mockito, et cetera.

And then you can start looking at other things. Why are people linking to Stack Overflow within their code? With a query like this, I'm using a regular expression to look for anything that looks like a link to Stack Overflow, and I can join it with my Stack Overflow data set. For example, in JavaScript code, these are the top linked questions: is there a RegExp.escape function in JavaScript? I also have the number of views that these questions are getting. One of the questions with the highest number of views is how to create a UUID in JavaScript. It's linked from 600 files on GitHub and has 600,000 views on Stack Overflow.

Same with Python. This is how I extracted all of the Python code: anything that ends with .py, or Python notebooks. Gigabytes of Python code, gigabytes of Python notebooks. This is how I search for the top imports within Python: just look at the lines that start this way and extract what's there. And these are the top referenced Stack Overflow questions from Python code.

And then you can look at the opposite question, as Sebastian did here. How many people are copying code from Stack Overflow into GitHub? Anyone here? Anyone want to admit it? So Sebastian was asking that question, and he picked one of the most popular answers: how to convert a byte size into a human-readable format in Java. That's the top answer.
And then if you want to look for this, it's not that easy, because people change the names of the variables and the indentation. So he transformed this answer into a regular expression, and with BigQuery you can search for regular expressions in a reasonable time. He found at least 400 files that match this answer. Only 27% gave credit, and they all look like a copy of the Stack Overflow answer. So please, when you copy code from Stack Overflow, credit it, or Sebastian will find you.

OK, and how do you request features using data? Someone in Go found himself having to write expiration.Sub(time.Now()). They wanted to write time.Until instead, which is nicer. But they didn't add any data. So my teammate at that time, Francesc, who is still working with Go but not with me (he moved to a different company), still does this kind of analysis. He looked for all the projects that would benefit from this, and he found at least 2,000 repositories that would benefit from this change. And the Go team implemented this feature. Someone else was asking to rename the TLS config in http, to standardize it between two different modules, and Francesc found that 700 repositories would break if they normalized this. So they didn't.

The important message here for you is that if you put your code on GitHub, if you open source your code, your code counts as votes. You don't need to do anything. Just put your code there, and people who are interested in analyzing it will find it and will implement the features that are most useful for you. You don't need to do anything other than open source it.

You can go beyond regular expressions. You can write user-defined functions in JavaScript, for example to do static code analysis.
I downloaded a JavaScript library from GitHub called JSHint, and now I can run it inside BigQuery. I just import the JavaScript library and things run. Some people are also running arbitrary C code inside BigQuery, because you can compile C code to WebAssembly, and I have some people here who are doing exactly that. So yes, you can run that kind of thing. And this is static code analysis of JavaScript code: what are the top warnings?

I have two minutes left, so I will hurry up. Spaces versus tabs. These were the rules for how I analyzed this. Everyone wanted to know which is more popular. Spaces are way more popular. If you like tabs you can go to Go; that's where people just use tabs.

People have used this repository to fix vulnerabilities. There was a team of 50 Googlers that went all around GitHub fixing the Mad Gadget vulnerability. That was pretty cool.

Also, when you write SQL, where do you put your commas? Would you rather put them at the end of the line or at the start of the line? Does anyone like them at the start of the line? Well, I like them at the start of the line. I know it's ugly, but I wanted to demonstrate to everyone that it was better. It turns out that, yes, way more people put them at the end. But then the question is which projects are more successful. And how do you measure success? Stars, stars per year, number of contributors, activity. With a query like this you can look at the projects that put the comma at the start, as I like them. Those projects are twice as successful as the other projects. So I guess I will keep doing it until someone proves that I ran the wrong query. And you can go and find out, because all the raw data is there. So please challenge these results. Go deep, find the things you want to find, tell me where I'm wrong, and change these results.

You can be more active on GitHub.
You can be more active on Stack Overflow. Tweet about what you're doing, blog about it. People are measuring and looking at what you're doing. So who wants to analyze GitHub? Even GitHub does it with BigQuery. I have a video with Alison, one of the data scientists at GitHub, and I hope you get pretty interested in doing this. There's way more. I don't have time to talk about all of these blog posts in the last 40 seconds, but you can go deeper. You can publish; I'll be very happy to see it. Besides our $300 in credits, we also have the startup program, with more than $3,000 in credits. Talk to us about it. You can find me on GitHub, on Reddit, on Stack Overflow, and you can give me feedback, because I love feedback. So please leave it there. Thank you very much.

Does anyone have a question in 10 seconds?