Today we are going to visualize some of the data that Drupal.org makes available via its API. This data is somewhat indicative of the community behind Drupal, and I hope these visualizations give a picture of those numbers.
I am Hussain Abbas. This is actually the second time I've been given the last session of the day, and it's always a challenge. Back in Barcelona, I was the only person between everyone and the beach. Over here, it's everyone and whatever is around here. I have been working with PHP for about 15 years and with Drupal for about six years. I'm now a technical architect at Axelerant.
When I came to New Orleans, just this Monday when the conference started, there was a haunted history tour organized by Amazee Labs. Throughout the tour, the guide who walked us through paid special attention to all the specifics and the numbers of everything that happened. Of course, all these haunted stories at these various houses are just stories. For example, this house over here, I don't think I can pronounce this. But even though these are just stories, it's really about the numbers. They give a richness to these stories.
And I like to collect these as data. I've always been fascinated with collecting numbers and data about my own personal life in an organized manner, and I don't just mean Excel. I religiously check in everywhere. I use TripIt, I use NuCache, I use all these apps that collect data in a format that I can use and process however I want later.
That brings me to Drupal contributions. I've been a contributor to Drupal core and modules for about two or three years. Until recently, I had no real way to quantify that. The data was actually available. I just didn't know about it.
You might have heard of DrupalCores.com, which counts the commit credits you have on Drupal 8. That was basically the only metric I had. Last year in Los Angeles, David Hernandez and I gave a talk on improving community contributions: identifying what is causing community contributions to stay in a long tail. If you have looked at those contribution graphs, you would know that around 70 to 80% of people have just one or two commit mentions, and we were looking at how we can pull that up. Of course we looked at data from Drupal.org, and we got that data from the Drupal Association at the time.
I always wanted to get access to the raw data, and obviously I didn't want to scrape Drupal.org. Eventually I found the API. I found it this year at DrupalCon Asia, which was held last February in Mumbai, where a company called Azri Solutions organized a contest to build a visualization for whatever data you collect from Drupal.org. I participated. I didn't really expect to win. I've never won things in life. But I was hoping that whatever libraries I built for this contest, I could use for my own projects, a lot of which I had shelved for quite a long time.
Drupal.org is not a small site. Just getting the users took about a week on my instance. Anyway, I actually won the contest, and I'm hoping to take it further today. In the process, I want to share some numbers and some of the things I found while building all these visualizations.
So a little bit about the technology of this website. It's called drustats.com, by the way, spelled D-R-U-S-T-A-T-S dot com.
I'm hoping for more contributions to it, not just in spirit, but as pull requests as well. It is built with Laravel 5.2. It uses MongoDB to store all the data. That's convenient, because MongoDB stores all the JSON returned by the Drupal.org API as it is. I don't need to worry about the schema or whatever differences I would need to verify. It also has a great aggregation pipeline, which helps in processing all the numbers, and for a single instance it's pretty fast. I use Beanstalk for queuing. Every 24 hours, all the objects get queued up, so all the different types of nodes (forums, issues, modules, and all that) get updated every 24 hours.
On the front end, the charts are drawn using D3, the Data-Driven Documents library. And I use Bootstrap. It's a very simple thing, there's not much to it; Bootstrap is just for getting it up and ready. A quick note on D3: if you've used it, you'd know it's very low level. There are various reusable libraries on top of it. I did not really have time to evaluate each one of them, so suggestions there are welcome.
Data collection is an important part of data science, but it's actually a relatively simple one. What is more important is asking the right question. All the data in the world are just numbers until someone asks the question that transforms that data into the information you're interested in. My main goal with this presentation is not just to present the data I've collected, but also to identify what questions we can ask of this data.
So let's jump in. Projects are an important part of the community, and it's useful to know how they're growing. So this chart shows the growth of all the projects.
This chart shows the growth of all the projects since the beginning, which you can see is from around 2004. As of right now, we have roughly 34,000-plus modules, 2,000 themes, 1,000 distributions, and also 200 core projects. Core projects are basically Drupal core projects that have been forked for testing and playing. And there are some 20 or so theme engines. On the website you can filter this data, but in this chart you can't really see the rest because most of it is taken up by the modules. You can see the difference between modules and themes: there are almost 20 times as many modules as themes.
A few things of note here. Around 2011, you can see a jump in how the projects have increased. The curve, which was growing slowly, suddenly accelerates. You can see that clearly over here. This graph is slightly inaccurate because it's more like a histogram and I should have used a bar chart, but for the most part it's accurate. You can see that around 2011 there were suddenly 60-plus projects created. I'm guessing that's when Drupal 8 development started and a lot of projects got moved out of core or something. I've not really evaluated this further, but it's an interesting thing.
What's even more interesting is that after that, we consistently see more projects being added. Until 2011, on average, around five projects got added. After that, it almost doubles, or almost triples in some cases. The graph is kind of fuzzy. That's because there is a weekly cycle, and the chart is just not resolved to that resolution. It just shows all the jumps.
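The counts behind a growth chart like the one described above can be sketched in a few lines. This is a minimal sketch, not the code that drustats.com runs (that site is built with Laravel and MongoDB); the function names are hypothetical, and it assumes each project record carries a Unix creation timestamp.

```python
from collections import Counter
from datetime import datetime, timezone
from itertools import accumulate


def daily_growth(created_timestamps):
    """Return (days, new_per_day, cumulative) from Unix creation timestamps.

    Days on which nothing was created are omitted from the result.
    """
    per_day = Counter(
        datetime.fromtimestamp(int(ts), tz=timezone.utc).date()
        for ts in created_timestamps
    )
    days = sorted(per_day)
    new_per_day = [per_day[day] for day in days]
    cumulative = list(accumulate(new_per_day))  # running total of projects
    return days, new_per_day, cumulative


def weekday_totals(created_timestamps):
    """Count creations per weekday name, to check for a weekly cycle."""
    return Counter(
        datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%A")
        for ts in created_timestamps
    )
```

Plotting `new_per_day` as bars shows the jumps, while `cumulative` gives the smooth growth curve; `weekday_totals` is one way to check whether some weekdays really do see more new projects than others.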
From back-of-the-hand calculations, I saw that more projects were created on Fridays, if I remember correctly, compared to other days. That's where you see the peaks. On the website, you can zoom in to see the dates of these peaks. We can look at that a bit later.
This is just a bonus graph. It's a bubble chart of all the modules, sized by the number of downloads of each module. There's no surprise: Views is the highest, followed by CTools and Token. There is a similar chart for all the projects, not just modules, but that's mostly filled with modules too. Obviously, Drupal core is the most downloaded project, and the most downloaded theme is Zen. Apart from that, it follows the same structure.
Another way to look at projects is the number of issues. This chart shows all the projects on d.o by number of issues. You can see Drupal core, for example, has 74,000-plus issues. Webmasters, which is the queue for content on Drupal.org, has around 22,000 issues, and so on. This could be a way to see the active projects on d.o. It's not so much a sign of the code quality of these projects. It's more about how many people are involved with a project and how many issues were created.
The next chart is slightly different. It shows the issues that are open right now, per project, and this is something you could correlate with the quality of code. Again, no surprises here. Drupal core has around 14,000-plus open issues at the moment. Open means everything that is not closed: active, fixed, and so on. Issues get closed two weeks after they're fixed.
So in those two weeks, they're still counted as open. RTBC, needs work, needs review: all these issues are counted as open. There are 14,000 such issues on core, 4,000 on Views, and so on. What can be seen here is that it's a long tail. Drupal core has significantly more open issues than any other project, and it quickly falls off: from 14,000 to 4,000 to 2,000 to 1,500, and then it kind of stabilizes. That is the long tail in this graph, though the tail itself is not shown here.
That's another interesting thing about how projects get developed. Drupal core is a highly organized effort. Every commit that happens, happens through an issue. This is usually not the case with many of the contrib modules. Some contrib modules are very particular about this; there is always an issue for whatever change goes into Views, for example. Others are not so much. If there's a problem, you just fix it. You don't create an issue on Drupal.org for it.
We can also look at the overall issue count on Drupal.org. As of this morning, it was almost 900,000 issues. That's all issues: open, closed, everything. This pie chart breaks the issues up by type, such as bug reports, and further by the priority and the status of those issues. It's interactive: you move your mouse over each of the slices and it shows you how many issues there are under that slice. And you can filter per module. Like I said earlier, it's about asking the right question. If you are looking at one of the modules we saw earlier, like OG or Views, how do the issues look in those modules? You can go to the website and filter it down to just the issues on the Views module or the OG module.
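The definition of "open" used above (everything that is not closed, including fixed issues during their two-week window) can be sketched as a small counting routine. This is a minimal sketch under stated assumptions: the numeric status codes and the field names `field_issue_status` and `field_project` are assumptions about how Drupal.org issue records look, and the sample records are made up for illustration.

```python
from collections import Counter

# Assumed mapping of Drupal.org issue status codes that count as "open".
# Closed variants (duplicate, won't fix, closed (fixed), ...) are left out.
OPEN_STATUSES = {
    1: "active",
    2: "fixed",  # still open until it auto-closes after two weeks
    4: "postponed",
    8: "needs review",
    13: "needs work",
    14: "reviewed & tested by the community",
    15: "patch (to be ported)",
    16: "postponed (maintainer needs more info)",
}


def open_issues_per_project(issues):
    """Return (project_id, open_count) pairs, largest count first."""
    counts = Counter()
    for issue in issues:
        if int(issue["field_issue_status"]) in OPEN_STATUSES:
            counts[issue["field_project"]["id"]] += 1
    return counts.most_common()


# Hypothetical sample records, not real Drupal.org data.
sample = [
    {"field_issue_status": "1", "field_project": {"id": "core"}},
    {"field_issue_status": "8", "field_project": {"id": "core"}},
    {"field_issue_status": "7", "field_project": {"id": "core"}},  # closed
    {"field_issue_status": "2", "field_project": {"id": "views"}},
]
print(open_issues_per_project(sample))
```

Sorting by count, as `most_common()` does, is what makes the long tail visible: a few projects at the top, then a quick drop.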
Another thing is that most of those big maroon slices you see are closed and fixed issues. There is a legend over there, but I don't think it's visible on the screen right now. Basically, in each slice, about 75% of issues are closed. That roughly correlates with the earlier graphs we saw, where about 20% of issues remain open compared to the overall number of issues.
Then let's look at the users, and of course the users are the community. Drupal.org, again as of this morning, has about 1.9 million users, and this is how they split up across countries. It's no surprise: the USA has about three and a half times the users of India, which has twice the users of Great Britain. It's listed as Great Britain internally; that's the United Kingdom. That in turn is about three times Canada and Indonesia. Then the list goes on; I believe Germany is next. And there are about 757,000 users who have no country set. The field is basically null. We'll come back to the countries. There's another graph of how the users have grown in each country.
But first, some other things. Languages spoken by users on d.o: again, there's no surprise. English is the highest, with 66,000 users saying they speak English. Around 8,000 each say Spanish, French, German and Russian. So English is almost eight times any other language. But these numbers are very small compared to the number of users who have not specified a language. 1.8 million users have not specified a language, which makes this sample too small to really discern anything. It's just a number for now. The next visualization shows that even further. The expertise field is of even less value.
99% of users have not put in any expertise. Still, from what we can see, developers, site builders and themers are about the same, 2,000 or 3,000 users each, with 1,500 project managers and 300 for site building. Site builder and site building are obviously the same semantically, but this is a free-text field, so users can put in whatever they want. Like I said, the sample size is too small to really make out anything. Compared to almost 2 million users, 3,000 users saying they are developers is not an indication of anything.
This is how the users have grown on d.o over the past few years. Let's assume the UIDs have always been assigned sequentially. Right now, I think the latest UID is around 3.5 million. So we can guess that around 1.6 million users have been blocked or removed. I think they must be spam.
We see small jumps in this data. Around mid-2012, you see there is a jump. There are other small ones that are not really significant, but I think the most significant jump is in 2012. This next chart makes it a little clearer where the jumps are. Ignore the first two peaks. Again, like I said, this should not really be a line graph. It should be a bar chart. This is just a quick chart I constructed. From the beginning, it seems like there were 2,000 users migrated, and until mid-2003 there were no new users. Then in 2004, there were again around 2,500 users migrated. By the way, I checked this. The dates are actually the same for all these users, so it's not a bug in the code. After that, you can see it's pretty accurate. Again, there is a peak in mid-2012, like we saw, and we can see in one of the upcoming charts what that could be.
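The back-of-the-envelope estimate of blocked or removed accounts mentioned above is simple arithmetic, shown here as a small sketch. It uses only the approximate figures quoted in the talk (latest UID around 3.5 million, about 1.9 million users) and rests on the stated assumption that UIDs were always assigned sequentially; the function name is hypothetical.

```python
def estimate_missing_accounts(latest_uid, visible_users):
    """Estimate how many accounts no longer appear, assuming sequential UIDs."""
    if visible_users > latest_uid:
        raise ValueError("more visible users than UIDs; assumption violated")
    missing = latest_uid - visible_users
    return missing, missing / latest_uid


# Approximate figures quoted in the talk.
missing, share = estimate_missing_accounts(3_500_000, 1_900_000)
print(f"{missing:,} accounts (~{share:.0%} of all UIDs) are unaccounted for")
```

The result is only as good as the sequential-UID assumption; any gaps in UID assignment would inflate the estimate.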
This is, again, a small thing. Gender ratio has always been something I look at. In all the camps and meetups I organize, this is one thing I always measure. And on Drupal.org, the gender ratio works out to about, what was it? I actually calculated it, but I forget the percentage right now. Around 74%, I think. As of this morning, you have around 135,000 users identifying as male, 28,000 as female, around 1,000 saying other or not specified, and 200 as transgender. About 91% of users have not filled this in at all. So this sample is still kind of small; 10% of the population is not significant, but at least we know it matches up with what we observe. And again, we see the same jumps that we saw earlier.
This is a similar chart, but split by country. This was a heavy chart to render. It crashed Firefox, and in Chrome it stopped playing all the animations. It's basically splitting up all the users per country. The big gray one at the bottom is users who have not specified a country at all. The next one, in orange, is the United States. After that, the red one is India. If you visualize this without the bottom gray one, the users with no country specified, all the other countries are pretty much linear. You can see the gray is the part of the graph that is responsible for the jump. It increasingly seems like a lot of spam users got in on that day. If you take that out, we still see that the United States and India have been growing faster than other regions. But they're still linear. All the countries are pretty much linear.
Which is what is expected, I think. Our growth has not accelerated. The growth is there, but it has been mostly constant.
All these visualizations are on drustats.com. Some of them are not listed in the menu, but it's the same data and the same code. I'd appreciate feedback and contributions on that. Just create issues there, send a pull request, or let me know on Twitter. I'm looking to keep building this website. So what do you think? Any questions? And of course, please leave feedback for the session. Yeah, please.
So what do you need to make it better?
Well, like I said in the beginning, it's important that we ask the right question. All this data is there and all the visualizations are there, but if you're not deriving anything of value from it, it's useless. So it's important to ask the right question, and we can always build on that.
Josh Mitchell, I'm the CTO of the Drupal Association. My team works on Drupal.org and can make improvements to Drupal.org. We can also help others get improvements onto Drupal.org. One of the things I would love to see is some contribution around helping define what should be in the API, what are the things that would help identify better data so that your visualizations can be more accurate. Any help we can get there, I think, would be helpful.
True, yeah. So I'm assuming all the code that drives this API is on d.o, is that right? I've actually not looked into it. I meant to look into it after DrupalCon Asia, but...
The documentation could probably use a little bit of help, but the basics are all on drupal.org/drupalorg/api.
Yeah, those are the instructions. What I'm asking about is the code that drives this API. I believe it's RestWS.
It's all under the Drupal.org project. Yeah, that's right. The main customizations are there, but it is using RestWS, the module.
We've only implemented what is out of the box with RestWS, but I think we would need to extend it. The trick would be doing that in a performant way, because some of those queries could be...
Yes. What I found was that there were many API calls that could be a lot better. Internally, users refer to a field collection, and the field collection then points to, let's say, an organization. This means I have to make two API calls to get the organization for a user, because just getting the field collection ID is not enough. I have to see what organization is stored in the field collection. The same thing happens with projects. Projects have releases, but the releases themselves are under another field collection. So simplifying the API, returning the actual data instead of just IDs, would speed up the indexing.
Another thing I noticed, and it's why the user graphs are not really that accurate, is that users don't have an updated timestamp, which means I can never get updates to users. And there are more users than any other entity here, so it's not easy to scan through all two million users to find updates. I can do that with nodes because they have a changed timestamp. I can sort on changed and retrieve just the nodes that changed within the last 24 hours. I can't do that for users. I can only get the new users. I can't get updates to existing users. So these are a few things, and I'm open to contributing to make the API better here. I just have to get into it.
Yeah. That's part of why I was asking, because I love the visualizations.
Oh, thank you.
Very worthy of a motorcycle. I'm gonna show you. But I want to figure out how we can get more of that and make it more accurate.
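The incremental update described above (sort nodes on `changed`, newest first, and stop once the results are older than the last run) can be sketched as follows. This is a minimal sketch, not the indexer that drustats.com uses: the endpoint URL and the `type`, `sort`, `direction` and `page` query parameters reflect my understanding of the Drupal.org REST API, and the `list` and `changed` keys in the response are assumptions to verify against the API documentation. `changed_since` and `fetch_page` are hypothetical names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for node listings on Drupal.org.
API_BASE = "https://www.drupal.org/api-d7/node.json"


def fetch_page(node_type, page):
    """Fetch one page of nodes of a given type, most recently changed first."""
    query = urlencode({
        "type": node_type,
        "sort": "changed",
        "direction": "DESC",
        "page": page,
    })
    with urlopen(f"{API_BASE}?{query}") as response:
        return json.load(response)


def changed_since(node_type, cutoff, fetch=fetch_page, max_pages=100):
    """Collect nodes changed at or after `cutoff` (Unix time), newest first.

    `fetch` is injectable so the paging logic can be tried without network
    access. Paging stops at the first node older than the cutoff, which is
    only valid because results are sorted by `changed` descending.
    """
    recent = []
    for page in range(max_pages):
        nodes = fetch(node_type, page).get("list", [])
        if not nodes:
            break
        for node in nodes:
            if int(node["changed"]) < cutoff:
                return recent  # everything after this is older
            recent.append(node)
    return recent
```

A daily job would call something like `changed_since("project_issue", cutoff)` with the time of the previous run as the cutoff. The same approach does not work for users, as noted above, because user records carry no comparable timestamp to sort on.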
Because I know from querying a lot of the Drupal.org data that, for instance, 1.9 million users is not accurate. We know that a lot of those are spammer accounts, which muddies the data and makes it harder to see.
Right. So the API returns only the active users. The spam users must be blocked, right?
No.
No?
They're not returned by the API. But I'm going to say that of the 1.9 million accounts, some are spam accounts that were created but never used to create spam. We know that to be the case. So figuring out what the real number is probably means something more like querying all the confirmed users, or...
Exactly.
Querying all confirmed users plus users that have been created after the date our spam measures were put in place, those sorts of things.
Yeah, true. For example, whether they have any comments, or whether they have created any forum posts or any kind of nodes. That would be an indication.
Yeah. True. I'd love to see contribution in that space. So thank you for pulling all that together.
You're welcome. Thank you.
So are there any plans... I mean, you might be having or have had conversations with the DA folks, or whoever is managing Drupal.org. Are there any plans to embed any of these charts back into Drupal.org somewhere? The home page might be a great place, by the way.
I think Josh can answer that, but there are no plans as far as I know. I'd love to help in any way. As for the visualizations, I'm happy to keep them running separately if that's what is needed. I'm hosting this on Digital Ocean, and I'm happy to continue doing that. Of course, if the Drupal Association prefers to keep it on Drupal.org for some reason, I'm open to contributing over there as well.
Josh: probably not the home page. We used to have a map on the home page.
I actually think the home page isn't the best place for visualizations, because it's already competing for so much attention. But as we get the API better, get that refined, and figure out what are the things that best describe the community, I would love to have those visualizations on Drupal.org and drive traffic to them, because obviously you get more traffic by being on the home of the community than by being on a separate site. But I'm happy to link to it as well. It's just a matter of getting them more accurate so that they really tell the story. Some of them are awesome as they are right now, and some of them, for instance the user ones, need work. But I can see that a lot of the issue and project charts are something we can link to right now. It is really valuable.
Motivating, I think. Hussain, you've got a long way to go and a good few milestones out there. Let's get this onto Drupal.org, and ride the motorcycle. Thanks.
And by the way, seeing Indonesia in the top five was something new to me. Had it just been a list of numbers, I would not have noticed it, and that's where visualization really makes a difference. That was a surprise to me. Germany was just after it. I could only fit six numbers on the slide.
My name is Chara Scruzz. I'm from Ukraine, and our guys from the Drupal Ukraine community did a similar project. We analyzed Drupal.org stats, but our main goal was to compare countries. We compared Eastern European countries by developer activity on Drupal.org. You did some other visualizations, and your visualizations are very good.
Thank you.
We released our project on Drupal.org. It's available as a module. It's called Drupal.org stats. Maybe you can find some ideas there and implement them in a better way. We can work together and make it even better.
I'll be very happy to. Thank you. You can always reach me on Twitter.
I go by the name hussainweb pretty much everywhere: Twitter, d.o. So, yeah.
Thank you for sharing this data. I can understand why the folks at the Drupal Association would be interested in this. I was curious why you're interested in this particular topic.
Like I explained, data has always been something I've been fascinated by. I collect each and every thing in my personal life as well. I religiously check in wherever I go, on Foursquare. I note down each and every dollar or rupee I spend. We have Indian rupees as currency, right? Even if it's five rupees, I note down that I spent five rupees. Maybe it's close to an obsession, I guess. But I try not just to collect data, but to have it in a form I can process later. So of course I've been interested in many open data stores as well, for example open data for trips and open data for flights. I'm a bit of a procrastinator as well, so I've not really done as much with the data as I want to, but I'll get there someday. This is something I just got around to.
You mentioned a number of times the importance of interpreting the data rather than just collecting it. I was curious whether you have thought of potential directions you might take this data in. For example, maybe you could pull in data from other communities, compare it with Drupal data, and display it in a way that emphasizes how Drupal does things in certain ways. Maybe contributions by gender, something like that. Have you looked at pulling in data from other communities that might be available?
That thought has crossed my mind, but I've not really built up any ideas around it, because it's a lot of effort to collect this data.
This website took me about 15 whole days to build. I spent nights building it for the contest to get it ready in time. So it's not at the top of my list, but it's something I really want to do. More important than that is the gender thing you mentioned. This is something I've been actively following. Like I said, I organize meetups and camps back home, and the gender ratio is something I always try to measure. For DrupalCons, we get the number in the closing session. We'll get this conference's number tomorrow. In camps back home, I always look at this number and see how it holds up against the industry trend, which is around 20%. That matches what we see here, by the way, and it matches the camps I organize back home and the camps I have been to in India. This is a number I personally would like to see grow. So that's one of the other things.
I wasn't necessarily suggesting you do more work. I was just curious whether you had looked into whether this data, or similar data, might be available from other groups or other communities.
Well, if it's there, that's great. I am not aware of any such place. I've not really made any effort in that direction. That's the best thing about open source, right? I'm hoping somebody will say, like you just did, that there is this data available over here, and let's see if we can build a comparison. That makes things easier for me as well.
Thank you.
Thank you.
So I think, Mike, yeah. And it did the visualization. So let's talk about Royal Enfield, the motorcycle. It's one of the best motorcycles in India. Since 1955, I do not think there has been any other bike. So that perhaps was one more good reason.
Yeah, it was a reason to get it done finally.
Add to that, I personally am very interested in how the community works and how certain decisions come to be made. I do think it is helpful to have data to compare ideas and then maybe test theories against it. So visualizing it, I think, is helpful.
Yeah, I agree. Okay, if there are no other questions, I think we're done. I don't have my Twitter handle on the slide right now, but like I said, it's hussainweb pretty much everywhere. You can tweet me there. I have a Twitter handle for the website as well, dru_stats, so you can tweet there too. And that's the GitHub repo, hussainweb/drustats. You can create issues or pull requests, or just look at the code. There are complete instructions in the repository on how to set it up. If you have any ideas on setting it up, I would love to collaborate. There's no reason you have to go and scan the whole dataset yourself.
By the way, in the dataset that gets scanned, I strip out all the description fields. My primary motivation for that was to not use more disk space than necessary. And I did not have any ideas on what to do with the descriptions anyway. I have some ideas, like NLP processing that could be done on the descriptions, but it's too early for that. So for now, in the interest of disk space, I did not store the descriptions. What matters is the relationships between the different entities, and those are there. I think that's more than enough to keep asking the right questions and keep going ahead.
Thank you, everyone. Thank you. You can always look at all of this at drustats.com, except for some charts that are not in the menus. Most of these are there. For issues, you have a box over there so you can filter by project.
But I'm pleased and delighted to take it in front of you all, so I'll be next time. Send it to you over, let me see the file. Oh, they're not got the link. Yeah, okay. I was one of the bobers. I know, I know.