So shall we start? Great. Welcome, everyone. Thank you very much for coming so early on a Sunday; you probably had some party last night. Well, my name is Daniel Edgardo. I'm one of the founders of Bitergia. I would like to talk about the gender diversity gap. I've been doing this analysis across several communities: the OpenStack Foundation (good morning), the Linux kernel, and some others. I also did similar work for the Python interpreter a couple of years ago, at PyCon Spain. So I said, well, what if I extend the analysis to all of the repositories that we have in github.com slash Python? That is basically what this is. In general, part of the presentation is about numbers. But I would say, forget about the numbers, because we all know the issue we are talking about: it's about how to become more inclusive and how to create more diverse communities. So this all started in Tokyo. Why in Tokyo? It was during an OpenStack Summit in 2015 or 2016. The OpenStack Foundation tweeted something like, hey, 13% of the people attending the conference are women. And that's a great number, by the way, if you compare it with others and with how the OpenStack Foundation has progressed. But there was another tweet asking, how many of them are doing technical contributions? That was basically the problem we wanted to solve: can we measure this? Is this possible somehow? So, part of the motivation I have already mentioned. But another reason is that there is a lack of existing data, at least to the best of my understanding. I'm not aware of any data set, from a quantitative perspective, measuring commits and so on in the Python community, that tells us what percentage of people in the Python community belong to an underrepresented group. If you are aware of one, please let me know.
Of course, this is a matter of transparency: if we are transparent enough to publish the source code and everything else, why not this aspect of the community? It is also a matter of learning from the best communities, the best companies, and the lessons learned in the industry: how can we improve our own community in terms of diversity and inclusion? Of course, there is an ethical and a business perspective. Ethical, from the point of view that we as a community are producing things, and we are producing them for the world. And the world is basically, let me simplify all of this, 50% women and 50% men. Of course gender is not something binary; this is another limitation of the study. But if we go with this, we are producers and we have consumers, right? And most of the producers are men, so what's going on here? From a business perspective, the more diverse a community is, and there are studies by HBR, Harvard Business Review, and some other magazines, the more innovation increases, by a certain percentage, 30-something as far as I remember, and that's a lot. So basically, the more different people we bring in to solve a problem, the more different points of view we'll have on that same problem, right? That's the point here. So I went to the Python site and asked, well, what's the definition of diversity according to the Python people? And it covers all of this. If you see the part in bold there, this talk is just about gender diversity. But we could talk about age, national origin, geographical location, and a lot of other things, politics, and so on. So this is how the Python Software Foundation defines it; well, this is our diversity statement. Some more context here about what diversity is: now I'm bringing examples from the industry. The top left is Pinterest. They are claiming that women make up around 19% of their technical workforce.
Okay, this is the number. In terms of Google, the top right, they are claiming a similar percentage: 18% of the technical workforce within Google are women. Then in the bottom left we have Facebook, where, across all of the employees, they have 32%, while in the technical workforce it is 16%. And then we have Pivotal on the bottom right, where they claim that they have 25% in general, okay? So we have numbers between 15 and 25%. Have this in mind. When I started this, these were the numbers; this is the population of women participating in some communities. This is the OpenStack Foundation: we have 11%. Women participating in the Linux kernel, around 10%. The Hadoop ecosystem, so not only Hadoop but other projects as well, is around 8%. The CPython interpreter in this case is 4%. But again, forget about the numbers. It's not a matter of comparing whether we have higher or lower numbers than other communities; the point is that these are the numbers. And if you check them against the previous numbers, there's a difference, there's a gap. The same companies claim that within the organization they have 20%, 25%, 15%, but when we go to open source communities, we only find something like 10% or 8%. So why this gap? Well, first, I don't know. If you have any clue about this, any idea or comment, please let me know as well. So there's this gap. And of course we have the other gap: women are still underrepresented in technology, even in the high-tech companies, okay? So how are we doing this from a technical perspective? The project is GrimoireLab. That's the logo on the top left; I should have a link in the next slides. The first thing we are using is Perceval. Perceval is basically the tool that retrieves everything. This is what it supports right now, and of course all of this is open source. What you get at the very end is a JSON document.
So you don't have to think about APIs, logs, whatever; you get a JSON document. That's all. Then another important tool here is Sorting Hat. Sorting Hat is basically the tool that manages all of the identities. If you think, for instance, about the OpenStack Foundation, they've had something like seven or eight thousand different people committing to the source code. That's a big amount of information, right? So how do we manage all of the identities, the affiliations, people moving from company to company, the country, the gender? All of this is managed by Sorting Hat. You have a command line interface. There is an API, and we are now migrating it to GraphQL, so there's no need for a REST API and so on. But it's there. And then we have Kibiter. Kibiter is a soft fork of Kibana with certain plugins that we are using here. Let's say that, of the previous tools, everything from Elasticsearch to the left is ours. So in terms of GrimoireLab, the project, we are taking advantage of Elasticsearch and Kibana, and Kibiter is the fork in this case. Kibiter has specific plugins and widgets that we produce, again, as open source. You are more than welcome to play with this. Just an example of how Perceval works; it's as simple as this. You take Perceval and you specify the data source that you want to analyze. In this case, we want to analyze Git. And then you give the URL of the Git repository. That's it: you start getting JSON documents. It's as easy as that. And what do they look like? This is an example of a JSON document, a reduced version of one. If you think of a commit, what kind of information do we have there? We have the authors. We have the committers. We have the author date. We have the commit date. We can have the reviewers of the code, for instance, who might be the committers at some point. We have the files that were modified. And for those files, we have the code chunk.
So, specifically, the lines that were added, removed, or modified. We have the number of added and removed lines and all of this. What we are doing in this analysis is basically enriching this information with other types of information. In this case, we add the gender. How are we analyzing the gender? If you think about the problem, and this is the first question I had on the table when I was in Tokyo, if you remember, it was: how can we measure technical contributions? We tried to approach this in the simplest way possible. That was: can we guess the gender of someone, somehow, from the first name? We said, maybe. So it happens that there are some, oh, let me go to genderize. This is an API that we are using. It is not ours; it is an existing service out there. Basically you ask, can you give me the gender of Peter? And they say, yes, Peter is male with a certain probability. And you say, OK, that may work. Then it happens, luckily, that most of the developers in open source communities come from the Western side of the world: the US, Canada, perhaps South America, though not that many, then Europe, and then we have some in Asia. So typically, or at least in the numbers we have, depending on the project, something like 80% of the total activity comes from Western countries. So we are dealing with Western names. That's great, because in this case the data produced by genderize.io is quite reliable. So this is the way we are using it. And Sorting Hat, which I mentioned before: basically we have a set of identities, and it happens that each of us, if you are participating in Git, uses an email address, which is probably quite different from the nickname you have on GitHub, and probably different from the nickname you have on other platforms, Twitter, and so on. So there is a process to merge all of the identities, and so on.
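The identity-merging step just described can be illustrated with a small sketch. This is not Sorting Hat's actual algorithm or API; it is a minimal stand-in that assumes two identities belong to the same person when they share an email address or an identical full name. The sample identities, the field names, and the `merge_identities` helper are all invented for illustration.

```python
from collections import defaultdict

# Invented identities as they might appear in different data sources.
IDENTITIES = [
    {"source": "git", "name": "Jane Doe", "email": "jane@example.com", "username": None},
    {"source": "github", "name": "Jane Doe", "email": None, "username": "jdoe"},
    {"source": "git", "name": "J. Doe", "email": "jane@example.com", "username": None},
    {"source": "git", "name": "Peter Example", "email": "peter@example.org", "username": None},
]


def merge_identities(identities):
    """Group identities that share an email address or an identical full name."""
    parent = list(range(len(identities)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(j)] = find(i)

    seen = {}  # matching key -> index of the first identity that had it
    for index, identity in enumerate(identities):
        for field in ("email", "name"):
            value = identity.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in seen:
                union(seen[key], index)
            else:
                seen[key] = index

    groups = defaultdict(list)
    for index, identity in enumerate(identities):
        groups[find(index)].append(identity)
    return list(groups.values())


people = merge_identities(IDENTITIES)
for person in people:
    print(sorted({identity["name"] for identity in person}))
```

Matching on an identical name is a deliberately naive choice here: it can merge two different people who happen to share a name, which is one reason a manual review step remains useful.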
So this is done by Sorting Hat. And, well, affiliations and so on, and the same for gender. What I wanted to show you now are some numbers, so I'll go for the live demo. I produced a dashboard, so let's go for this. This is Kibana; well, Kibiter in this case. We have produced certain things, like the top menu on the right and certain widgets that we have. Let's focus first on this chart here. Basically this is telling us, hey, this is the whole life of the community, so we have the last 25 years, if you click there. Again, this is github.com slash Python; this is what we are analyzing in this case. So we have 130-something thousand commits and so on, and then we have the commits by gender and the population by gender, right? Of course, there is a certain percentage of people who are unknown, in the sense that what I was doing was this: I have a list of developers, I start with the main developers, basically, and then I go through them, checking. So there is a lot of manual effort, a lot of coffee, and so on involved here. Luis, indeed, was helping me a lot with this. So this needs some data curation process. It takes time. But the important thing is that all of the work we've done is already there in the Sorting Hat database. So if we want to keep improving things, it's just a matter of saying, OK, let's go back to the database. So we need some policy around data retention, data storage, and how to deal with this. And of course, we are talking about the privacy of the data, so that's something to take into account. If you go for the last 25 years, the population of women we have is the 3.4% that you have there, from 74 people. If we click here, this gets updated, and we have this number of commits and this number of repositories. So basically, we go from 130-something thousand to 1,000. This is the activity we have for women in github.com slash Python.
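The two figures shown on the dashboard, population by gender and commits by gender, can be sketched as follows. This is only an illustration under stated assumptions: `NAME_TABLE` is an invented stand-in for the answers of a name-to-gender service such as genderize.io (a gender plus a probability), the commits are made-up sample data, and the 0.9 probability threshold is an arbitrary choice, not a value taken from the talk.

```python
from collections import Counter

# Stand-in for answers from a name-to-gender service such as genderize.io
# (gender plus a probability). The values are invented for illustration.
NAME_TABLE = {
    "peter": ("male", 0.99),
    "maria": ("female", 0.98),
    "alex": ("male", 0.60),
}

# Invented commits: (canonical author id, author name) after identity merging.
COMMITS = [
    ("a1", "Peter Example"), ("a1", "Peter Example"), ("a1", "Peter Example"),
    ("a2", "Maria Sample"),
    ("a3", "Alex Placeholder"), ("a3", "Alex Placeholder"),
    ("a4", "Peter Other"), ("a4", "Peter Other"),
]


def guess_gender(full_name, min_probability=0.9):
    """Guess from the first name; weak or missing answers stay 'unknown'."""
    parts = full_name.split()
    first = parts[0].lower() if parts else ""
    gender, probability = NAME_TABLE.get(first, ("unknown", 0.0))
    return gender if probability >= min_probability else "unknown"


def percentages(counts):
    """Turn a Counter into percentages rounded to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


def shares_by_gender(commits):
    """Return (population share, commit share) per gender."""
    commit_counts = Counter(guess_gender(name) for _, name in commits)
    # dict() keeps one name per canonical author, so each person counts once.
    authors = dict(commits)
    author_counts = Counter(guess_gender(name) for name in authors.values())
    return percentages(author_counts), percentages(commit_counts)


population, activity = shares_by_gender(COMMITS)
print("population %:", population)
print("commits %:", activity)
```

Counting people and counting commits give different pictures, which is why both appear side by side. The "unknown" bucket is where the manual curation described above comes in.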
Another question I have for you is, basically, what other communities in Python should I include in the study? Because there is a huge number of them. So I said, let's start small, and then let's extend this to the rest of the community somehow. What we have now at the top is a filter, so this is filtering the data. If I disable it, then what we have is basically the previous information. Let's focus for a while on the time zones. This chart here is displaying time zone information. On the x-axis we have certain numbers, which are basically the time zones. This is mainly Europe, right in the middle. Then on the right side we would have Asia: China, Japan, and so on. And on the left side, this should be mainly the US, so we have Eastern Time, Central Time, and Pacific Time. OK? If we click here on women, oh yeah, we have the filter here. Sorry. So here we are with women. What are the most diverse areas from a geographical point of view, in terms of the data? In this case it's basically the US and Canada, and mainly the West Coast, right? So this is something to take into account. If you remember, at the very beginning the question was: what can we learn from the most diverse places, projects, repositories, organizations, geographical areas? So what's going on in the West Coast that makes us say, hey, we have more diversity there? Perhaps they are more inclusive for certain reasons? I don't know. But the point is that they have a much larger population, and they are producing many more commits than other areas. So it's something to study. We can point to these places, right? Don't think of the data as the result, but as a tool that can help to improve things, right? Then, in terms of activity, the table we have at the bottom shows basically the most diverse repositories, in this case by activity. And this is quite similar to what we see if we filter by male or look at the whole organization.
So basically that means that the most interesting projects for the community are also the most interesting projects specifically for women. Based on the data I have, there doesn't seem to be any difference in where women participate in the community. In other communities, for instance, women tend to be more focused on documentation. That doesn't happen, for instance, in github.com slash Python. That's an interesting result. So, more things. Yeah, this is another chart we produce. This is basically the demographics of the community. If you check here, in the attracted developers, what we have is basically activity over time. The top chart is, again, activity over time. Each of the peaks here in the bottom chart marks when each of the women did their first commit, so when they came to the community, when they were attracted, right? And on the top we have basically the last commit of each of them. So let's imagine that we have these three developers here. We click there, and then we know that they produced their first commit in 2015 and that, basically, they left the community in August 2015. So we have these numbers. How can this be useful for the community? If we have a certain number of people coming to the community, and we know exactly when they left, we can say, hey, we miss you. Thank you very much for contributing. You have certain knowledge that is really important for us. So what if we talk again and try to have you back in the community, right? This might be useful for community managers, in terms of understanding who's around and trying to retain developers. It is also interesting for an analysis of the retention rate. Because if we ask, how many of the people from 2015 do we have still contributing nowadays? Zero. So our retention rate for 2015 is zero. So what's going on? More things. This is another chart we produced.
Basically, this is the evolution over time, but comparing women and men. As you can see at the very end, the green stuff is men, the yellow line is basically unknown commits, and the red bars are women. And we have almost no activity by women until 2017. Why is this? This is an open question for you. I don't have a clue. I don't have any idea, but probably you know, or you have certain ideas. Of course, there are some small peaks around here, but, well, this is happening. So why is this? I don't know. And then, as a final chart, I wanted to show you a network analysis we were producing. Think of each of the dots as a developer. So basically, this is a developer, right? Then each of the rectangles is a repository, right? So we go here to the big one. This is CPython, right? Then we have some others, such as mypy. That should be, whoops. Oh, this is typeshed. The point is that we can understand visually where people are working, how they are contributing, and the areas of the code that they are around, right? So this is for the whole community. But what's the network for women? As I mentioned before, one of the conclusions I had is that women seem to be participating in basically the same projects as men. And if we check here, we have CPython as one of the main projects. Then we have typeshed, and mypy should be around here. So this should be, well, it's around there, I don't know. But the point is that CPython, mypy, and typeshed seem to be projects that are interesting for both women and men. OK, let's go back to the slides. So these are the numbers presented: some numbers for the whole 25 years of the community, and for the last year. For instance, we have 800 authors in the whole community, and 38 women participating in the last year. This is the community evolution over time, and the geographical distribution. For those who are coming in late, thank you very much for coming.
We are talking about gender diversity. This is the representation of women for the last year. So we are talking, again, mainly about the West Coast. Then the comparison of the charts, newcomers, and so on. Now, comments and limitations, because we only have five minutes. As I mentioned, gender is not a binary thing, so this is a limitation of the study. But at least the data we have is good for bringing the problem, the issue of the community, to the table. And this is happening in all of the communities. This study is focused on github.com slash Python, so hearing about any other community you are interested in would be great. I would like to make the dashboard public. But the point is how to deal with the privacy of the data, because we are basically producing private data. So how to deal with this? If you have any ideas, great; we have certain ideas about how to do it, but it's a bit hard. This is a small fraction of what diversity really is. If you remember the diversity statement by the foundation, this is just gender diversity. And it is focused only on quantitative data. Again, this talk was about numbers. Now forget about the numbers, because the problem is that we have this problem in the industry. There is a kind of glass ceiling of 10%. And diversity and inclusion is a challenge, so it's something we need to look at. My opinion is that data might be useful to be aware of what's going on. Because if not, we only have perceptions. And with opinions alone, we don't get anywhere. So the point is how to have this data. Well, we have data now. And keep in mind that data and technology are only tools to achieve the goal, because the goal is to be more inclusive and to have more diverse communities. We also produced a specific report on OpenStack. If you look for OpenStack, you'll find a report about this, sponsored by Intel and by Bitergia. In producing it, we were analyzing things like mentorship activities and so on.
So there are some results and some conclusions. Some recommendations from the OpenStack Foundation are to enforce the code of conduct, to keep supporting gender groups (in this case, the Women of OpenStack), to collaborate with tech leaders, and so on. Basically, if you go to the tech leaders and present the data, some of them will be interested in it and say, what can I do with this? How can I improve diversity? All of this is done under the umbrella of CHAOSS. CHAOSS is a project from the Linux Foundation where all of the source code is stored, so you can go over there and check. It stands for Community Health Analytics Open Source Software. We currently have two working groups: GMD, which is growth, maturity, and decline, a more software-engineering-related thing, and the diversity and inclusion working group. We believe that diversity is a key factor in having healthy communities. These are some examples of what we are doing in CHAOSS. We have the focus areas of event diversity, contributor community diversity, communication inclusivity, recognition of good work, and so on. For each of these focus areas we have a set of questions, and then we have defined a set of metrics. If you are interested, I would like to welcome you to our meetings. They are on Mondays at 7:30 Pacific Time, 4:30 Central European Time. So you are more than welcome; look for this. And, well, this is what this presentation is all about: it's not about perception. We need to have data, right? So that's all. Don't think, my opinion is this. No, no: the data says this, right? It's something we need in order to be really sure about what we are saying, because we may think that, I don't know, the diversity of our community is increasing, but this is not true, right? So these are the numbers. Indeed, it's something to discuss. This analysis will be continued at some point in the future, so we are working on it. This is all.
So I don't know if you have any questions. Questions, yes. So the question is whether it's possible to access the dashboard. Yes: if you ask for it, you can be given access. The point is that we are dealing with private data, so it's not public in the sense that anyone can just go there. Our idea is to make public, let's say, the raw data, in terms of a basic dashboard for Python, and then you can play with that. The privacy of the data is something that we probably have to discuss with the foundation somehow. So if there's someone from the foundation here, or someone who is interested in this, that would be great. So time's up, so no more questions. I'm around. Thank you very much for your time.