Hello, everybody. Our next speaker will be Felipe Hoffa, telling us about analyzing millions, no, trillions, of lines of open source code with one query. So give him a big hand. Thank you very much. Thank you for being here. Yes, I'm Felipe Hoffa. I joined Google six years ago as a software engineer. Then I became a developer advocate. That's basically a software engineer with a license to speak. And today, we're going to analyze a lot of data, a lot of data from GitHub that we have shared so you can do this analysis too as quickly as possible, maybe in the five minutes after the talk. So what do you see here when you look at this? Sounds good, great. But there is way more here. There is data. Let me zoom in here. We have a year. We have a license. We have, oh, this is Python code that is importing certain features from __future__. We have certain imports that we are doing. And then if we look at the big picture, we also see metadata, like the number of stars, the number of forks, pull requests, how many people have contributed to this piece of source code. So what I see here is a lot of data. And I wrote it with a big font because this is big data. (Thank you, Boris.) So who wants to analyze GitHub? Project maintainers. If you own a project, you want to know how popular your project is. Who is following your project? It's not only about the number of stars. You want to do change management. Should I add new features? Will I break people that are using my project? Is my project healthy? How is the community behaving? How am I doing on issue closing? And project users want this kind of data too. They want to know what similar projects to follow. They want to know how to request features, how to better express themselves, how to use data when asking for changes. And even before becoming a project user, you have to convince yourself, or you have to convince other people: should I use this project or not? And having data to do so is great.
And data lovers. If you love analyzing data just for the sake of it, that's me, and hopefully you too. And there are a lot of other people who are helping us use this data and get better metrics to drive open source. We are using three main data sets that I will talk about today. GitHub Archive was the first one that I saw. Ilya Grigorik started this one in 2012, and it has, so far, 8.7 billion events, which are the events published by GitHub hour by hour. And we are getting updates every hour. Recently, I added to my list of data sets GHTorrent, which has been a separate project for a long time, but now we have it on BigQuery too. It takes the same events that GitHub Archive has, but it goes way deeper: it links in more metadata, and builds a graph of what everything means, who each person is, et cetera. And then the data set we added last year is the source code taken from GitHub. We put it on BigQuery, so you can analyze even the source code and what's happening there. And that's a lot of code. I mentioned BigQuery. Who knows BigQuery? Excellent, so many people know it. When I started three years ago, not so many people knew it. But just quickly, for anyone that doesn't know what it is: it's a service from Google that lets you analyze a lot of data in very few seconds without setting up anything. It's just there. You can load its web page, or you can use the REST API. And even better, I can share data with you. Any of these data sets, any data set that you load here, if you want to share it, you can share it instantly. And everyone has one free terabyte of analysis every month. You don't even need a credit card. So for example, let me show you a query quickly, in case you have never seen one. I have pretty big fonts here. But basically, you can load a web page, you can write a query, and you can run your query here. And in very few seconds, it will analyze terabytes of data.
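A first query to try might look like this, a minimal sketch in BigQuery Standard SQL; the `githubarchive.month.YYYYMM` table naming follows the public GitHub Archive dataset's convention, and the specific month here is just an example:

```sql
-- Count GitHub events by type for one month of GitHub Archive.
-- The month table chosen (201606) is an arbitrary example.
SELECT
  type,
  COUNT(*) AS events
FROM `githubarchive.month.201606`
GROUP BY type
ORDER BY events DESC
```

A query like this scans well under the free monthly terabyte, so it is a cheap way to get a feel for the event types before writing anything bigger.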
In this case, I'm analyzing more like 1.7 gigabytes of data, and this query takes six seconds or so. And I got results; let me explain what I'm doing. So for example, top projects by stars in 2016. That's a simple query: let's look at all the WatchEvents during 2016. We run it, and we get that the top project by stars was freeCodeCamp, followed by Google Interview University and Vue.js. But things are not so simple. Is this the real number of stars that each project got? It turns out people can star, unstar, and star a project again many times, and we are counting all of those events. So sometimes you want to think about your query a little more. You want to, for example, deduplicate the number of stars: let's count the distinct number of users instead. And yeah, then each project has fewer stars, but a more real count. And what I want to highlight here is that you have the freedom to count things as you wish. Don't stop with the first number. Always try to go a little deeper. Once you have stars, you might also want to know more about the people starring you. 1,000 stars on this project are not the same as 1,000 stars on a different project, because people have different interests. And you can start doing queries like this one, where, for example, I'm taking everyone that starred TensorFlow during 2016, and I'm counting all of the other projects they starred. And I'm taking out anyone that also starred freeCodeCamp, just because it's so popular that it makes my results a little muddier. So I'm free to add and subtract and change the way I'm counting things. And yeah, it's pretty cool, because I start from TensorFlow, and then I start building a graph where people also starred TensorFlow models, Caffe, Keras, Google Interview University, scikit-learn, CNTK. And the results make sense. If you want to run it for your project, I'm happy to show you how later, or you can find it on the slides.
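A sketch of that "people who starred X also starred Y" query in BigQuery Standard SQL. The exact repo names and the year table are assumptions for illustration; counting DISTINCT logins is the deduplication step mentioned above:

```sql
-- Repos also starred in 2016 by users who starred tensorflow/tensorflow,
-- excluding anyone who also starred freeCodeCamp (so popular it muddies results).
WITH stars AS (
  SELECT actor.login AS login, repo.name AS repo
  FROM `githubarchive.year.2016`
  WHERE type = 'WatchEvent'
)
SELECT
  repo,
  COUNT(DISTINCT login) AS users  -- distinct users, not raw events
FROM stars
WHERE login IN (SELECT login FROM stars WHERE repo = 'tensorflow/tensorflow')
  AND login NOT IN (SELECT login FROM stars WHERE repo = 'FreeCodeCamp/FreeCodeCamp')
  AND repo != 'tensorflow/tensorflow'
GROUP BY repo
ORDER BY users DESC
LIMIT 10
```

Swapping in your own repo name for `tensorflow/tensorflow` gives you the same "related projects" graph for your own project.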
And then you start wondering, if you're thinking about stars: how did people arrive here? Where did my 3,000 stars come from? And it turns out, if you look at the flow of stars, they are mostly discrete events. You get a lot of stars in two or three days, and then everyone disappears. And these events have a lot to do, for example, with being shown on the Hacker News front page. So here, the small annotations that you might not be able to read show every time these projects showed up on Hacker News. And the fun thing here is that I didn't need to add these annotations manually, because in BigQuery I'm also storing Hacker News posts and comments, and I'm able to write a simple query, or not so simple, but one SQL query, and join social media with the number of stars. I can start identifying how social media affects my projects. Health: let me go quickly through this. Let's find projects, how many issues they have, what the rate of closure of issues is, how the engagement of the community is, what the best ways of getting my issues actioned are. Simple query: these are the projects that had the most issue comments during this month in 2016. And Kubernetes has a lot of comments, and Spark, you might know, or OpenShift, you might know. But then we also have Sauron demo, and no one knows this project. But it got so many comments, so maybe we are counting things wrong. What if we counted the number of people instead? How many people were really commenting? And then we can start looking at metrics like how many comments each author wrote. So in this chart I have, from big projects with more than 400 people commenting on issues, the projects that had the most comments per author. And Kubernetes shows a high degree of community involvement: people that comment there write at least 18 comments per author. And I'm removing robots, because that's another important thing to remove. And you can see other projects that also have a lot of engagement.
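That comments-per-author metric could be sketched like this in BigQuery Standard SQL. The month table, the 400-author threshold, and especially the robot filter are assumptions; in practice the bot filter needs tuning to the login patterns you actually see:

```sql
-- Issue-comment engagement: comments per distinct author, per repo,
-- for repos with more than 400 commenting authors in one month.
SELECT
  repo.name AS repo,
  COUNT(*) AS comments,
  COUNT(DISTINCT actor.login) AS authors,
  ROUND(COUNT(*) / COUNT(DISTINCT actor.login), 1) AS comments_per_author
FROM `githubarchive.month.201606`
WHERE type = 'IssueCommentEvent'
  -- Crude robot filter (illustrative, not exhaustive):
  AND NOT REGEXP_CONTAINS(actor.login, r'(?i)(-bot$|\[bot\]$)')
GROUP BY repo
HAVING authors > 400
ORDER BY comments_per_author DESC
```

Dividing by distinct authors instead of sorting by raw comment counts is what separates genuinely engaged communities from repos inflated by a handful of very loud accounts.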
But it differs a lot project by project. And you can go even deeper; it's more than numbers. You can do text analysis. Here is a simple query to look at the top four-word phrases used to open an issue. The top way to start an issue on GitHub during this month was "it would be nice", followed by "is it possible to", and you can see the rest. Yeah, people are really nice when they ask questions. But issues get a different closure rate depending on what kind of issue I file. "It would be nice" gets a 56% closure rate, versus "is it possible to", which is at 74%. And if you start looking at patterns, there are better ways of asking things. I'll come back to questions later if I have time. (Audience: Is number seven real?) Is number seven real? Yes, it's real. There are some places where I have to deduplicate. Yeah, I have some minutes left. Let's talk about the code, the data set that we released last year. I had the pleasure of announcing this, but there are a lot more people working here, some Googlers, and people at GitHub, who allowed us to make this data set real. We are taking the code from GitHub, replicated on BigQuery. You can find the table online. In the last screenshot I took, this is 1.79 terabytes of data, 200 million rows, and each one represents a lot of files. If you look at the main table, you can see that files have been deduplicated. The id is basically a hash of the content of the file. We have the size of each file, and the content, so you can see the source code, whether it's a binary or non-binary file, and the number of copies. So when I say we have 46 terabytes of code, I'm basically multiplying the number of copies that each file has by its size, and that's how I arrive at 46 terabytes. There are some rules when you want to work with this data set. It only has text files that are less than 1 megabyte, and only one copy of each file. If you want to know all of the paths that each file has, you can join it with the files table.
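A sketch of that join in BigQuery Standard SQL, pulling one language's files rather than everything (filtering to an extension before pulling contents keeps the result small; the table names are from the public `github_repos` dataset, and `.go` is just an example extension):

```sql
-- Attach repo names and paths to the deduplicated file contents,
-- extracting only non-binary .go files.
SELECT
  f.repo_name,
  f.path,
  c.content
FROM `bigquery-public-data.github_repos.contents` AS c
JOIN `bigquery-public-data.github_repos.files` AS f
  ON c.id = f.id  -- id is the hash of the file's content
WHERE f.path LIKE '%.go'
  AND NOT c.binary
LIMIT 100
```

The `files` table carries one row per path, so a file that exists in many repos joins back to all of its locations, which is exactly what the next caveat is about.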
But don't join it just to get all of the contents un-deduplicated, because you end up with 46 terabytes of code, and that's not really what you want. Duplicated files are not really what you want as a result. Whenever you want to analyze this, try to extract just what you're interested in: all of the Java files, all of the Go files, all Java files created in 2016, et cetera. And I left a table with a way smaller sample, 10% of the contents of the top projects. That's a faster way to start, and it won't use up all your free quota on the first day. This data set only has open source projects from GitHub. How can we tell that something is open source or not, that it has a valid license? We use GitHub's license API, which is based on the licensee project. So if you want to see your project replicated here, if you want your project to count, make sure that GitHub's API can detect its license. Sometimes people modify licenses in ways that the API cannot detect. Please look into that, to make sure your project is being detected as open source, and that a robot can tell that your project is open source. Some examples I did analyzing code: take the top Java imports in 2013, take the ones from 2016, and let's see which had the biggest growth. The results I got here are that we got a lot more Android imports, a lot more injection imports, and a lot of Mockito tests. We can go deeper on how this happened, but thank you for testing your code. Or, I love this example, how to request a feature. Someone was asking, hey, I would love it if the time package had a time.Until method. And then the Go team analyzed: should we add this or not? And Francesc, who is giving a talk right now in a different room, if he's looking at the livestream, Francesc just wrote a query and saw that at least 2,000 projects written in Go would benefit from adding this method.
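The imports analysis could be sketched like this over the smaller sample table, again in BigQuery Standard SQL. The regex and the choice of `sample_contents` (which holds roughly 10% of the contents of the top projects, with its own `sample_path` column) are assumptions for illustration:

```sql
-- Top import lines across sampled Java files.
-- Splitting content into lines and extracting the imported package name.
SELECT
  REGEXP_EXTRACT(line, r'^import\s+([\w.]+)') AS imported,
  COUNT(*) AS uses
FROM `bigquery-public-data.github_repos.sample_contents`,
  UNNEST(SPLIT(content, '\n')) AS line
WHERE sample_path LIKE '%.java'
  AND line LIKE 'import %'
GROUP BY imported
HAVING imported IS NOT NULL
ORDER BY uses DESC
LIMIT 20
```

Running the same query over snapshots from two different years and diffing the counts is the "biggest growth" comparison described above.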
And thus, the method was implemented, and it's going live next month. This is a really good way of asking for features: use the data behind it. Am I out of time? Yeah, sorry, Felipe, that's all the time we have. Thank you very much for your talk. So give him a big hand. The rest of the slides are online if you need them. Felipe, where can people find you if they have questions? Yes, please find me, find the talk; I have links and everything else there. Sorry for taking a little long. Thank you very much.