Okay, I'll just start now. Welcome to this talk. I have no one to introduce me, but I'm Daniel and I work for Apache and Quenda, which is a company that does all sorts of open source, whatever we like. I'm going to talk about Snoot, which is an interactive activity aggregator. What it does is pull in everything that the ASF, or any other organization, does: code development, community discussions, interactions, whether it's on email or mailing lists or GitHub or issues in JIRA or Bugzilla or whatever. It pulls in all the people that are contributing to the ASF and displays that in pretty graphs and some text.
It started as a simple lines-of-code counter. We got a request from a large organization that said, we would like to count all the lines of code we have in our repositories, can you do that? So we started out counting all the lines. Then we thought we wanted to see some graphs of how stuff has evolved over time, so we started making some charts. And then we kind of said to ourselves, why stop at code? Why not go in and pick out bugs and email and online messaging and changes in communities and so on? And then it kind of evolved.
And this is, if any of this works, the ASF over time: the lines of code we have in our repositories, all 838 repositories we have. As you can see, we just hit 150 million lines of code this March, so congrats to us. Below, you can see the evolution of each different language at the ASF. This is obviously Java, no surprise there. As you can see, that's almost half the entire ASF code base.
So we added more and more charts, and we made an API so you can go in and get whatever you want for yourself. That also enabled us to make a dynamic web page where you can interactively get all the stuff you want, whenever you want it, with live data. It updates on the fly. So it's pretty much a gigantic, clustered cron job that runs every night at 2 a.m. my time.
And it goes and says to the ASF, what changed? So it picks up all the commits, it picks up all the opened or closed tickets, the new email, the online messaging, whether it's IRC or Slack or whatever. It even goes in and checks the downloads of ASF code bases and page visits on GitHub and everything that you can actually count from the ASF. Then it puts it in a database, and it does on-the-fly data crunching whenever you request a chart or a statistic or whatever.
We have about 340,000 people in the database, and that is people who have sent an email or opened a ticket or committed code or done anything related to the ASF. So this is the number of people that have interacted with the ASF in the past 23 years. To some it may seem like a lot and to some it seems like not a lot, but that's the figure we have. We have around 20 million emails in the database as well, and two and a half million commits to Git or SVN. We mostly use the Git mirrors because, as much as I like SVN, it's a bit faster getting the log of changes from Git. So we pull in the Git mirrors and then we do a git log and go through everything, and we do that every day.
We have nearly a million tickets in JIRA and Bugzilla that we have also pulled in. That took a few days because JIRA is not the fastest of things and we didn't want to overload it. But we have all that now, and what we do every night is just go in and ask what's changed. So we just pull in like a thousand tickets and then parse that. So, yeah, 155 million entries crammed into seven machines, two big ones and five small ones.
And it's open to all committers. So if you don't have an account on Snoot yet, you can go to www.snoot.io and use your apache.org email address and this specific invite code. It's important to use that code if you want to get into the ASF organization. Otherwise, you'll just get into a demo organization and we have to invite you, and that's troublesome. You can do that right now on the Wi-Fi.
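As a rough illustration of the nightly "what changed" step on the Git mirrors, here is a minimal Python sketch. It is not Snoot's actual code: the record layout, the function names, and the choice of a unit-separator byte between fields are all assumptions made for the example. It only relies on standard `git log` placeholders (`%H` hash, `%ae` author email, `%ce` committer email, `%ct` committer timestamp).

```python
import subprocess
from dataclasses import dataclass

FIELD_SEP = "\x1f"                       # unit separator, unlikely to appear in emails
LOG_FORMAT = "%H%x1f%ae%x1f%ce%x1f%ct"   # hash, author email, committer email, commit time


@dataclass(frozen=True)
class Commit:
    sha: str
    author: str
    committer: str
    timestamp: int


def parse_git_log(text):
    """Turn `git log --pretty=format:LOG_FORMAT` output into Commit records."""
    commits = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        sha, author, committer, ts = line.split(FIELD_SEP)
        commits.append(Commit(sha, author.lower(), committer.lower(), int(ts)))
    return commits


def read_log(repo_path, since_date):
    """Run git log on a local mirror (hypothetical helper; needs git installed)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since_date}",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return parse_git_log(result.stdout)


def changed_since(commits, last_run):
    """Keep only commits newer than the previous nightly run (Unix time)."""
    return [c for c in commits if c.timestamp > last_run]


def authors_and_committers(commits):
    """Git records both roles per commit; return the two sets of people."""
    return ({c.author for c in commits}, {c.committer for c in commits})
```

Keeping the parser separate from the `git` call means the same function can be fed canned text, which is handy when the mirrors are large and a full scan is slow.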
I took a lot of screenshots because sometimes, if we have like 20 people using the ASF instance on Snoot, it can get a bit slow, because it is live data crunching. For those who are not committers, we'll hopefully show a live demo later on with some of the stuff that you can see and some of the stuff you can't see. And since Sally is not here, I'll show you some of the stuff that she doesn't want you to see.
So as I said, it is interactive. You can go and pick any time span, like May 2016 to April 2017, or a specific project or a specific repository or a specific bug tracker or any specific contributors, and you can see the stats for them, or the project or the repo, and see what's going on in this specific time span. So you can go and see what am I doing, what's Jim doing, and how does that relate to how the project is doing.
So this is me. You can see the commits I've made. You can see I was pretty active at some point and then kind of lost interest, and then I got active again. You can see the lines I've changed, which is kind of a fun chart to make. You can see the emails I've written over time. You can see I was very active when I started. Oh, that's 2015, that's not when I started. But I was active then and I'm sort of active now. You can see the tickets I've made or closed. I've closed quite a few tickets in this period and not really made a lot of tickets. And you can see the projects that I'm contributing to. So you can see I love Pony Mail, and I work on infrastructure so I also commit to that and the other repos.
And you can all go in and see your own profile, and you can see when you started at the ASF. You can see I started in 2012, so I have my fifth anniversary at the ASF now. And you can see when I sent my first email and when I contributed my first line of code. You can do that for every single person at the ASF.
Which is also kind of cool, because you can go in and see when someone is having an anniversary at the ASF.
It's publishable, which means you can go in and get a widget, and you can publish every single chart or piece of data from Snoot on your website. It will then go in and pick up the live data and show that on your website. We actually have that on projects.apache.org. So if you go to that website, you can see all the statistics for the entire ASF at the moment. It's got a public API. I think the API documentation is like 120 pages long, so if you really want to spend time on that, you're welcome. You can go in and check everything that is on Snoot, and you can grab and display it publicly if you have a token.
So let's just get started with how you interact with Snoot. The first thing you want to do is create a view. A view is a collection of sources. Everything in Snoot is a source: a Git repo is a source, a mailing list is a source, a bug tracker is a source, a Slack channel or an IRC channel is a source. We have around 5,000 different sources at the ASF. So what you do is you say, I want to watch this and this and this source, and I want to call it CloudStack or whatever you're watching. Then you create a view, and then you find a time frame for when you want to watch the activity for that specific cluster of repositories. And then you have fun with it. And then you can publish it and use it in your board reports and so on.
So as I said, it's a collection of sources, and it's personal. You can create your own view and that's just yours. And some people, like Perley and Sally and so on, can make public views that every single ASF committer can go in and use. I think we have 10 of those that everyone can use. These are my views. I have quite a few of them because I like to toy around with the Apache projects. And you create a view by clicking the create view button, the teal one at the top.
And then you type in, for example, flex. You use a regular expression to say, I want to pick stuff called flex. It picks all the sources that have some sort of relation to Flex, and then you save it, and then you have a view. You can use this view to filter out all the other things that the ASF does and just show what Flex is doing at the moment.
So this is a view. This is the default page for a view. You get the code composition, like what sort of language Flex, or whatever this is, is written in. You get how many lines of code there are, how many comments there are, how many blanks. This is kind of what Ohloh, or Open Hub, does. You can see the commit activity. You can see how many people are committing. In SVN it's the same, but in Git you can see how many people are authoring code and how many people are committing code, which gives you an idea of how many contributors there are versus how many committers. And do we maybe need to invite someone who's contributing or authoring code into becoming a committer? You can see the lines changed. Again, this is just for fun. And you can see the evolution of the code over time. You can also click and see each separate language inside the code. What's the evolution of that? Has JavaScript or Java overtaken C at some point in this project? And why?
So what do we need Snoot for? Most of all, for the ASF, it's for board reports and community outreach. It's for finding out what is going on in the project. What are the trends? Are we in a lull? Are we steaming ahead at full speed? Or what's going on? We also need it to see if the community is growing, if the community is healthy, if we are inviting new committers in, if we have someone we should invite in. And yeah, making sure the project is not dead, because nobody likes that project, except Henry maybe. And also you can go in and find other projects that are working on something related to what we do.
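To make the idea of a view concrete, meaning a named set of sources picked with a regular expression, here is a small Python sketch. The `Source` record, the `build_view` function, and the example source names are invented for illustration; the talk does not show how Snoot stores sources or evaluates the pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    kind: str   # e.g. "git", "mail", "issues", "irc" (illustrative labels)
    name: str   # repository URL, list address, tracker key, channel name


def build_view(pattern, sources):
    """Return every source whose name matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [s for s in sources if regex.search(s.name)]


def group_by_kind(view):
    """Split a view into repositories, lists, trackers and so on."""
    grouped = {}
    for source in view:
        grouped.setdefault(source.kind, []).append(source.name)
    return grouped
```

With a hypothetical source list containing `flex-sdk.git`, `dev@flex.apache.org` and `httpd.git`, calling `build_view("flex", sources)` would keep the first two and drop the third. A case-insensitive search is one plausible choice here, since tracker keys are often upper case while repository names are not.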
Are there maybe committers that are on two or three different projects that we could work together with? I'll show you later what I mean by that.
So yeah, board reports are used by the PMC, the project management committee, to inform the board of directors: how are we doing? Are we still alive? What's going on? The items in blue are the stuff you can get from Snoot; the items in black you should just write yourself. And most people don't bother writing that. They should. So normally you have: what does this project do? Are there any issues that require board attention? Like, do we have a bad vibe somewhere between some people, or is something not being maintained properly? Then a short activity summary: what's going on? What are the major issues? What's the major activity in the project? Community health: how is the community doing? The project, the software, doesn't really matter to the board. What matters is the community. Is the community healthy? Is it not healthy? And then you've got some pretty standard procedural things. Have we invited new people in as committers or to the PMC? Have we lost someone? And then software releases: did we release anything? Did we not? Are we going to release anything? And then the final thing is code, mailing lists, issues. What have we been doing that we have some figures on that we can present to the board? Is it slowing down? Is it speeding up? What's going on?
So yeah, I kind of have a tendency to talk about my next slide and my previous slide, so I'll just skip this one as I already showed you. In Snoot you can get a summary of all the various things at the top of each page. This is just an example for, I don't know what project, Flex maybe. You can see how many lines have been changed and how many committers have been active within, let's say, three months. That's the usual report cycle. So you can see it's going, yeah, okay. It's okay, but it's going down.
So some people who were committing three months ago are not committing now. You can see the commits have really gone up, so that's hopefully positive. You can report to the board that there's a lot of activity going on, even though fewer people are participating in it. You can see the number of tickets closed and opened, how many people closed them, how many opened them, which is also a good indicator of whether the problems surrounding the code are being solved or not. Or are we just opening 5,500 tickets every week and no one's actually working on them? Finally, the mailing lists: how many people are sending emails, how many different topics have been discussed, and how many emails in total. And again, it's just going up or down compared to the previous cycle. You can pick any cycle. This is a standard six-month cycle, I think it defaults to. You can pick three months. You can pick 20 years and see what's been going on in this time span compared to the previous 20 years.
So community health matters to the board because, as Sam also said in the keynote, community is more important than code. The board doesn't care about whether you've released some fancy new feature or not. It cares about whether the community is healthy, because if the community is healthy, the code will also be healthy. Like in Wayne's World: if you book them, they will come. If you've got a community, the code will be written at some point. So that's what the board really looks for.
And it should also matter to you, because deep down we are good people. We want people to get along, and we also want diverse minds and diverse friends. And I don't mean by gender or nationality or anything. I mean, a diverse set of opinions in a project is really important. It allows you to find things that you wouldn't otherwise see. From my personal experience, I have a tendency to scope out an entire project in my mind, and then I just code it the way I think is best.
And then you've got someone else coming in and saying, but hang on, to me this doesn't make sense, or this is the wrong approach. And I'll be like, wow, I didn't think of that, because I'm just me. If I have 10 Daniels next to me, we'll get a lot of coding done, but it will all be the same mindset, and thus we are going to miss some things. So it's important to have a diverse group of people in the project.
So are we gaining or are we losing people? You can see this on Snoot. How many people commit? How many author code? In Git there's this difference between those two. What is the trend? Are we getting more committers? Are we getting more authors? We can also see, on the user list and on the dev list, is the user list more active than the dev list? That is usually a good sign that you have a lot of users. If you only have activity on the dev list, you're probably making a lot of code, but you're not providing it to anyone because there are no users of it. So that's a good thing to go in and measure, if you have a user list. Some projects only have a dev list.
And like I said before, are tickets being closed? Can we keep up? Is everything okay? Do we need more people helping out? It's a good indicator of whether we need to do some outreach to the community and find people to help us fix these issues. We have a talk tomorrow about "software is easy, community is hard" or something like that, which is also going to talk about some of the issues of bringing in new contributors, because it's not difficult to bring in new contributors. If you see that you need them, you just need to reach out. On Stack Overflow, for example, for the Apache web server, you have thousands of people that want to contribute and say, this is how you solve a problem. And we're sitting over in this corner saying, oh, we have tons of tickets, no one's closing them. What can we do?
Well, what we can do is reach out to all those thousands of people and say, come on over and help us close these tickets. But we kind of need to know first that there's a problem, and then we also need to know that there are people willing to help.
And finally, conversions. Are we converting people from, for example, Stack Overflow or from the user list into active contributors, solving issues, filing bugs, solving bugs, maybe contributing code or patches? And is this a positive trend? Are we getting more and more new people, or is it just the same old people contributing all the time?
So this is pretty much the same thing. This is the repository activity. We've been over that. This is the mailing list activity. You can go and see who's actually writing to the mailing list at the moment, who's the busiest author. You can see how many topics there are, how many authors there are. Is there a trend? Are we getting more and more emails on a specific list or is it slowing down? Is the dev list overtaking the user list or is it the other way around? And then you can report back to the board saying, we are getting more and more user traffic, which hopefully means we get more and more users. And then the board will say, great, you're doing a great job at outreach to the users.
Then we've got issues. This is, I think, all the issue trackers. Yeah, it's all the issue trackers at the ASF, so it's got a big number. You can see how many people are closing and how many people are opening issues. You can see the top people that are really good at opening issues, and the people that are really good at closing the tickets. You can see the busiest tickets. So if you want a really long conversation to read, you can go and find those tickets and read them. And you can also see here how many tickets are unresolved over time.
And this is usually like an exponential thing, because newer tickets tend to be closed and older tickets tend to just hang around. I forget where I was going with that. It's usually an exponential curve and there's a reason for it, but I forgot it. So sorry about that. Yeah, I'll go back.
Is there like a three-person overlap in the top five openers and the top five closers?
Overlap? No, I think not, not that I can see. Oh, well, some people can be both top openers and top closers. That's not uncommon, because people will open issues, assign them to themselves, and then close them again. So that's completely normal. And also, you can be a top closer in one project and a top opener in another project. This is the entire ASF.
So this is a fun chart. When you report to the board, you usually say, are we getting new people in? And this shows that, especially for Git, not so much for SVN due to the nature of commits. For Git, you can see how many people have not contributed before, the red area, the newcomers, but are contributing in this period of time. So the red are people that have never been seen before writing code for this project. The blue are people that usually write code and still write code for the project. And the green, the tiny green area up there, is people that have been writing code and then went away for 10 years or whatever, and now they're back. Which is also a good thing, getting the greybeards back, because we like them. And you can see who is actually contributing now that wasn't contributing before, both in code and in issues. So who's opening tickets now who wasn't opening tickets before?
And this is the conversion chart. You can see who was an issue contributor before and is now contributing code. And how long did it take them from opening their first ticket to submitting the first patch?
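The red, blue and green areas of the newcomer chart can be approximated with a few lines of Python. This is a sketch under stated assumptions, not Snoot's implementation: the talk does not say how long someone must be absent before counting as "back", so the one-year `gap` default, the function name, and the input shape (a mapping from person to activity dates) are all made up for the example.

```python
from datetime import date, timedelta


def classify_contributors(activity, start, end, gap=timedelta(days=365)):
    """Split people active in [start, end] into newcomers, regulars, returning.

    activity maps a person to the dates on which they committed (or opened
    tickets). `gap` is an assumed threshold for "went away and came back".
    """
    groups = {"newcomer": set(), "regular": set(), "returning": set()}
    for person, dates in activity.items():
        if not any(start <= d <= end for d in dates):
            continue  # not active in the period being charted
        earlier = [d for d in dates if d < start]
        if not earlier:
            groups["newcomer"].add(person)     # red: never seen before
        elif start - max(earlier) > gap:
            groups["returning"].add(person)    # green: long absence, now back
        else:
            groups["regular"].add(person)      # blue: was active, still active
    return groups
```

The same function works for issues or mailing lists by feeding it ticket or email dates instead of commit dates, which is roughly what the "both in code and in issues" comparison needs.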
So you can see how fast we convert people from being users, to being someone who needs to scratch an itch, to being someone that contributes code to the project. And we can also use that to see if someone is authoring code but is not a committer yet; maybe we should invite them in. And then again, we've got people who sent an email and are now contributing code, and you can see how fast that conversion is. So this fellow here, pretty fast conversion.
Now I'll show you something new that we haven't had at the ASF for 20 years, but we have now, which is downloads. You can actually go into Snoot and see how many people have been downloading Apache source code, and you can see where they are from. So this is CloudStack. You can see it's very popular in China, also in the US. And it's pretty much all of America and Europe and Asia and not so much in Africa, which is pretty standard. If you go look at the download charts for the entire ASF, you will see that every single nation in the world in the past three months has downloaded software from the ASF. Every single nation. Yeah?
Is that per download or is it also counting the number of bytes? So is a larger download counted more heavily?
No, it's just the number of downloads. We can't track the downloads themselves because that goes to the mirrors and we don't have stats for the mirrors. So this is when you click to get a link for a mirror: it goes in and it checks where you're from and it logs that. And that's publicly available, so we just go in and grab that.
But that's all for the board reports. I'm going to show you something a bit messy: cross-project relationships. This is something that doesn't make sense at first, but it's pretty crucial to the further existence of the ASF.
And what it shows is, for the past two years, what projects have shared contributors. For example, take Jim or Shane or anyone who's working on one project, then they move on, they work on another project, then they go back, work on the web server, and then they also work on Whimsy. That makes those two projects linked. They share a contributor. And this shows how big a project is in terms of how many commits. All the lines show that this project is sharing a committer with this one, and the bigger the line is, the more committers are shared. I'll show you a less messy chart in a second.
But this can reveal possible joint ventures, projects going together because they share people. If you share a committer, you usually have some sort of segment in the project that is similar to the other project, which opens up for a joint venture where people and projects can come together and work on something. Because if I'm interested in the web server, but I'm also interested in STeVe, they're probably connected. And that means that other Apache web server developers could also go in and work on STeVe, or the other way around. So there is an opportunity there to find people that you can bring into the project that already know about the Apache way, and thus you can more quickly gain a bigger community.
So this is big data at Apache. You can see that some of them are connected and some are not. A lot of these projects have the same committers working on each different project, so it's more of a family for them. And then you've got the outer rims. You've got the projects that are not really connected. So you've got to ask yourself, why are they not connected to the other projects? Is there a specific reason for it? Do they have a silo culture where they just do their own project and they don't really care about Apache? Or is there a different reason for that?
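The linking rule just described, where two projects are connected when the same person has committed to both and the line gets thicker as more people are shared, can be sketched in Python roughly as follows. The function names and the `(project, contributor)` input shape are assumptions for illustration; how Snoot actually computes or draws the graph is not covered in the talk.

```python
from collections import Counter
from itertools import combinations


def shared_contributor_links(commits):
    """Count shared contributors for every pair of projects.

    commits is an iterable of (project, contributor) pairs. The result maps
    an alphabetically ordered project pair to the number of people who
    committed to both, i.e. the thickness of the line between them.
    """
    projects_by_person = {}
    for project, person in commits:
        projects_by_person.setdefault(person, set()).add(project)

    links = Counter()
    for projects in projects_by_person.values():
        for a, b in combinations(sorted(projects), 2):
            links[(a, b)] += 1
    return links


def unconnected_projects(commits):
    """Projects on the 'outer rim': no contributor shared with anyone else."""
    links = shared_contributor_links(commits)
    linked = {p for pair in links for p in pair}
    return {project for project, _ in commits} - linked
```

Counting per person rather than per commit keeps one very active committer from inflating a link; project size in commits would be tracked separately, as the node size is in the chart.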
And this is interesting both for the project and perhaps also for the board, to see what's going on. Is there a specific reason for that, or is that just the way it is?
And this is the same thing, but for bug trackers or issue trackers. So you can see there's a bit more of a connection here, which could be users of the projects that actually want to use this project and that project. But then the developers are not developing those two projects together. So what's going on? Snoot can't really tell you what's going on. It can just tell you that something is going on. What it can tell you is that people have an interest in developing our project as well as Project Foo, and then perhaps we could get more people from Project Foo. And also, as I said, we have people using two different projects. Maybe there's a common denominator in those projects that we could utilize more.
The next thing is a bonus slide, which is something called radar maps. This tells you what is going on activity-wise in a project compared to another project, or perhaps to the entire foundation. In this one, we have ActiveMQ, we have Spark, and we have CouchDB. What you can see is that Spark is way more active in filing bugs or solving bugs than, for example, CouchDB is. But CouchDB is way more active in interacting with users than Spark is. And each step is a factor of five. So this is one and this is 1,000 times as active as the other projects. So you can go and see what my project is doing compared to the others. If you see that these two are really, really good at interacting with users and we're not that great, maybe we'll go ask them and see how we can learn from them. What are they doing that is so successful? Or do they just have more users? So that is a fun chart for projects. And you can also compare to the entire foundation and see how we are doing compared to what the ASF in general does.
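One way to read "each step is a factor of five" is as a logarithmic axis with base five. Assuming that reading, a radar position could be computed as in this Python sketch. The function names, the axis labels, and the idea of dividing by a baseline value are illustrative assumptions, not something shown in the talk.

```python
import math


def radar_ring(activity, baseline, factor=5.0):
    """How many factor-of-`factor` steps `activity` sits above `baseline`."""
    if activity <= 0 or baseline <= 0:
        raise ValueError("activity and baseline must be positive")
    return math.log(activity / baseline) / math.log(factor)


def radar_profile(project, baseline, factor=5.0):
    """Map each axis (commits, tickets, emails, ...) to its ring on the chart."""
    return {
        axis: radar_ring(value, baseline[axis], factor)
        for axis, value in project.items()
    }
```

A logarithmic axis like this is what lets a project with a handful of emails and a project with thousands share one chart; on a linear axis the smaller one would collapse into the center.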
So as I said before, you can publish all the bits of data on Snoot. I'll show you later how that's done, hopefully. So you can write up a blog post or an email or a board report, or you can just put all those charts on your website or your blog post or whatever you want. And that is the presentation. Then we'll go to the live demo. So yeah, the presentation is there, and you can poke me on Twitter or email or IRC.
So, for those that are not committers, we'll try a live demo. I think, yeah, this is it, if my mouse will work. So we can see here that Sling, for example, has that many commits in the past six months and it's connected to four different projects. We can also see the specific line, the big one here. You can see that Arrow and Parquet, two Apache projects, have that many contributors in common. So you can go in and see where people are really doing cross-project pollination and where not. Got to do something to get this. We've got the radar map.
This is something that you will not get access to, but I'll show you anyway. This is the affiliation of companies, the affiliation of people that are committing to the ASF. So you can actually see all the, this is not even 20% of all commits to the ASF. It's huge, the number of companies that have developers that contribute to the ASF. It's really, really, really a lot of people. So you can go in and see, for each company, when did they start? How active are they in the community? Or did they stop? Or did someone change jobs? Which is kind of fun. So you can see here, this is the top 15 or so, and that is not even 17% of the entire code base at the ASF. It's generally a really, really diverse community at the ASF. But you won't see this because Sally said no.
So this is the API documentation. You can see it goes on for a while, and so on and so on. So if you want to grab some data from Snoot, you can do that there. Code relationships.
Is there anything that anyone wants to see, if anyone's not a committer or doesn't have an account? No? Yeah. Do we have a microphone for you? Or can you speak up a bit?
We do that mostly by email, but also, I think the top 500 most prominent committers have been sort of profiled, so we know where they work. And you can go in and, let's just pick a random person, let's pick Mark Thomas. So you can see that Mark is a part of Pivotal because he's been tagged by someone at Apache as being part of Pivotal. That means we can go in unofficially and say Pivotal has contributed this much, which is not true, because it's Mark Thomas contributing. That's why it's not being shown. And you can also see when he first committed, as you can see. Oops, that's everything about Mark Thomas. You can see he's been really active and he's solving a lot of tickets and he's mostly doing Tomcat. And you can do that with all the 340,000 people at the ASF. So that's how we do it. It's mostly automatic, but there's quite a bit of manual labor involved in it as well.
You had a question? Yeah. We are almost finished supporting private brokers that you can install on your personal network, and it will just push data to some server somewhere. So you're still logging on to snoot.io, but we won't have access to the repository or anything, just the aggregate data. So that is possible, yeah. Any other questions?
Well, what else can I do? I can show you the latest thing we have, which is GitHub. We can go in and see how many page views. Let's just pick, so we can see that people don't really visit GitHub on the weekends. But we can also see that people are really interested in cloning or forking the Apache repos, which is kind of cool. We have this many clones. I did not expect the figure to be that big. So that is some cool stuff. We can also visit downloads, yeah. That's why that's one. This is the entire ASF downloads.
So you can see the United States has downloaded 500,000 copies of ASF source code within the past three months, or I think it's just two, because we started capturing that two months ago. And as you can see, every single country except for, oh yeah, the Central African Republic. They probably don't have internet, apparently, or they don't pack the ASF, so they haven't downloaded anything. But every single other country has downloaded from the ASF in the past two months, which I think is very awesome. We can even go in and say, httpd. So you can see that Russia really likes httpd for some reason, and China as well. And so that is something you can go in and look at, and if you're a committer, you get access to all of this.
And my computer is dying now, so I'll just finish this talk. Thank you for attending. If you have any questions, you can just poke me and I'll try to answer them. So thank you.