Welcome to the July Community Meeting for Wikimedia Australia. This month we have Dr. John Luca Di Martini, an Associate Professor in Data Science at the University of Queensland, whose research has included Wikipedia and Wikidata. He recently received funding from the Wikimedia Foundation for research titled Measuring the Gender Gap, and I believe he presented at yesterday's Understanding Wikimedia as a Digital Media Platform event in Sydney. So I'd like to thank him for joining us this evening, and now I'll hand over to you, John Luca.

Thanks, James, for the introduction, and good evening, everyone. As James said, the research my group has been doing relates to Wikipedia, and I myself have been working with Wikipedia for a long time. The plan for today is for me to tell you a bit about what we have recently been doing for Wikipedia, and what we have started to do now thanks to this grant from the Wikimedia Foundation. Let me share my screen.

Okay, so the first thing I want to mention is that I am a computer scientist, and that's my first disclaimer. This overwhelming slide is just for me to say that I started my research almost 20 years ago, and I have used Wikipedia extensively in that research. What I was doing in the early days of my research career was building search engines that leverage Wikipedia as a background collection. Back in the day, search engines like Google searched the web and returned links to web pages. But what we have today is not only the links to web pages; typically, on the right side of the search engine result page, you also get what is called the knowledge panel or entity card, which looks very much like a Wikipedia info box: it contains some pictures, some factual information, and maybe some links to other things. So if you're looking up an actor, you would see links to the movies the actor is starring in, and so on.
In most commercial search engines, all of that is powered by Wikipedia. That was the focus of my research during my PhD, back when we really looked at how to build search engines that make use of Wikipedia: how to get information to search engine users and how to present it in the best way. So those were the early days, when my research was using Wikipedia; more recently we have started to do research for Wikipedia, and I'll tell you about those projects today.

One of the first things we started to look at is Wikidata, which is a structured version of Wikipedia where we get factual statements about entities like persons, locations, and organizations. As you may know, this database is then used to power what you see in the info boxes of Wikipedia articles. So we can just modify an entry in Wikidata, and all the different Wikipedia language editions can make use of the same data source to populate their info boxes.

One thing we did, before really trying to help the community of editors, was to understand how that community does what it does. Given our focus on data science and computer science, what we did is a study of the contribution behavior of the community. We took five years' worth of Wikidata edit history, consisting of 250 million edits, the majority of which were automatically created by bots, algorithms, and scripts. We set aside those 200-plus million automatically generated edits and focused on the 35 million contributions made by humans, and we started to look at these over time. Our main question was to understand why certain editors contribute to the project for a long period of time, while others, actually the majority, contribute once or twice and then never again.
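As a rough sketch of this kind of analysis (the record format, the names, and the bot heuristic below are all invented for illustration; the actual study used Wikidata's own bot flags and far richer features), one could separate bot edits from human edits and compute simple per-editor signals from the edit history:

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical edit records: (username, day of edit, type of entity edited).
edits = [
    ("alice", 1, "city"), ("alice", 8, "city"), ("alice", 15, "mayor"),
    ("bob", 3, "painting"), ("bob", 4, "painting"),
    ("ImportBot", 1, "city"), ("ImportBot", 2, "city"),
]

def is_bot(user):
    # Crude name-based heuristic, purely for this sketch; Wikidata
    # flags bot accounts explicitly.
    return user.lower().endswith("bot")

human_edits = [e for e in edits if not is_bot(e[0])]

days = defaultdict(list)    # per-editor edit days
types = defaultdict(set)    # per-editor distinct entity types edited
for user, day, etype in human_edits:
    days[user].append(day)
    types[user].add(etype)

for user, d in days.items():
    gaps = [b - a for a, b in zip(d, d[1:])]
    # A low spread of inter-edit gaps suggests a regular contributor;
    # more distinct entity types means more diverse contributions.
    regularity = pstdev(gaps) if gaps else float("inf")
    print(user, "gap spread:", regularity, "diversity:", len(types[user]))
```

In this toy data, alice edits every seven days and touches two entity types, the profile of the regular, diversifying editors described in the talk.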
So our question was really: what is the difference between those editors who contribute for a long period of time and those who stop after a few contributions? We did a lot of analysis, and a few of the observations and conclusions we reached are as follows. Those who contribute for a long period of time are, in a way, very systematic: they contribute regularly. For example, some will edit every evening, some every week, some once a month, but they manage to contribute very regularly over time, and this makes them stick with the project for a long period.

Another aspect that differentiates these two groups of editors is the diversity of their contributions. Those who stay for a long period of time tend to start with a specific type of edit. For example, they might add all the capital cities of countries in Europe, and once they are done, they move along the graph by, say, adding all the mayors of those capital cities. So they traverse the information in Wikidata and, in a way, change the type of contributions they make over time, as compared to the others, who do not diversify and stop once they have, in a way, finished covering their area of interest.

These observations led us to ask questions about what data there is in Wikidata. We know that over time it grows: there is more information about entities, there are more entities, and there are more relations between them. So we started to ask questions about the type of data in there. If we look at Wikidata as a graph, you have entities as nodes and edges that connect one node to another, representing a relation between them. That's exactly what you get in Wikidata.
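A minimal sketch of this graph view of Wikidata: entities as nodes, (property, target) pairs as labelled edges. The entity and property names here are simplified stand-ins for Wikidata's real Q-ids and P-ids, chosen just to show the kind of traversal an editor might follow (country to capital to its head of government):

```python
# Toy knowledge graph: entity -> list of (property, target entity) edges.
graph = {
    "Germany": [("capital", "Berlin"), ("instance of", "country")],
    "Berlin": [("head of government", "Kai Wegner"), ("instance of", "city")],
    "Kai Wegner": [("instance of", "human")],
}

def neighbours(entity, prop=None):
    """Follow edges out of an entity, optionally filtering by property."""
    return [t for p, t in graph.get(entity, []) if prop is None or p == prop]

# Traverse: country -> its capital -> that city's head of government,
# the kind of path a long-term editor follows when diversifying edits.
capital = neighbours("Germany", "capital")[0]
mayor = neighbours(capital, "head of government")[0]
print(capital, mayor)  # Berlin Kai Wegner
```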
An entry in Wikidata might represent, for example, a monument, and then it has attributes and values, relations to other entities, a type, and so on. This creates a graph, and you can, in a way, traverse this graph, following edges from one place to another. As we know, this graph has been growing over time: more nodes and more edges are being added by the community.

When we think about whether the data in there is good in terms of quality, there are a number of things that could happen. We could have wrong information, we could have missing connections, we could have missing entities in the graph, and over time the quality has been improving. So the next question we wanted to ask is: at a given point in time, how are we doing in terms of quality? One type of quality we can look at, as an example, is completeness.

The example is the following. In Wikidata we have entries about cities in Germany, and we have a certain number of them. What we don't know is how close we are to having all of the cities in Germany, which today we probably do have; but if we want to measure, for a specific type of entity, how close we are to having them all, we need to know the actual number of cities in Germany in real life. If we knew that, then we could compute the number, right? We could say we are 80% complete because we have 80 cities in Wikidata and we know there are 100 in Germany. It's easy to count the 80 cities we have in Wikidata; what is difficult is knowing that there are 100 cities in Germany in general, and the same goes for any specific type of entity that appears in Wikidata. So here is the problem: how can we estimate the number of entities of a specific type? For US states, for example, it's easy to know how many there are, but for other types of entities it may be difficult to know how many there should be in the database.
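The completeness computation itself is the easy half of the problem; this one-liner just makes the 80-out-of-100 arithmetic from the example explicit (the hard half, estimating the true count, comes next):

```python
# Completeness of a class in the knowledge graph, assuming the true
# population size is known. Estimating that true count is the hard part.
def completeness(count_in_wikidata, true_count):
    return count_in_wikidata / true_count

print(completeness(80, 100))  # 0.8 -> "we are 80% complete"
```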
This graph shows how Wikidata has been growing over time; this is 2015, 16, 17, and 18. It shows that the graph is becoming more and more connected: we have more relations and more nodes in the graph over time. Still, it would be important to know how many entities we are missing for a specific entity type. So what we did in this work is try to estimate how many instances of a specific type there should be.

The way we did it is by using a method from computational ecology which allows you to estimate, for example, how many lions there are in the savannah. If you want to know how many lions there are in the savannah, you go there, you capture some lions, you tag them by attaching something to them, and then you let them go. After a while, you capture another group of lions and you check how many have the tag, which means how many you had captured previously. You do this a few times, and this allows you to estimate the population size: how many lions there are overall, even if you haven't seen them all, even if you have only seen these samples.

This is a well-established method, which we took and applied to Wikidata. We look at the edit history over time as our way of capturing lions. In this example, we observe monuments: the Eiffel Tower is a monument we observe once and then observe again, so that is a recapture event; this is a new monument that we observe for the first time; and so on. Over time, we observe and re-observe entities of a specific type, for example monuments, and this allows us to estimate how many monuments there should be in Wikidata. Here is the estimation made by different methods. You can see that the more observations we have, the more accurate the estimates get. This dotted line is the correct number, which we are assuming we don't know.
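The simplest member of this family of capture-recapture methods is the classic two-sample Lincoln-Petersen estimator; the study itself used more sophisticated multi-occasion estimators over the edit history, but the idea is the same, and the monument names and window sizes below are invented for illustration. The overlap between two samples tells you about the population size:

```python
# Two-sample Lincoln-Petersen capture-recapture estimator.
def lincoln_petersen(first_sample, second_sample):
    n1 = len(first_sample)                            # lions tagged on trip 1
    n2 = len(second_sample)                           # lions caught on trip 2
    m = len(set(first_sample) & set(second_sample))   # recaptures (tagged)
    if m == 0:
        raise ValueError("no recaptures: cannot estimate population size")
    return n1 * n2 / m                                # estimated total population

# Entities (e.g. monuments) observed in two windows of the edit history:
window_1 = {"Eiffel Tower", "Big Ben", "Colosseum", "Taj Mahal"}
window_2 = {"Eiffel Tower", "Colosseum", "Statue of Liberty", "Sagrada Familia"}
print(lincoln_petersen(window_1, window_2))  # 4 * 4 / 2 = 8.0
```

Intuitively, if half of your second sample was already tagged, you have probably seen about half of the population.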
There are about 850 paintings by Vincent van Gogh out there, and we are trying to estimate, by looking at paintings by Van Gogh in Wikidata and the edits on them, how many there should be. You can see that when we have observed just a few, we are underestimating or overestimating the number, but after a certain number of observations we have a very good estimate of the correct number. This allows us to know how many instances of a specific class there should be, and therefore to compute how complete the knowledge graph is: we might have 600 currently in Wikidata, so we know how far we are from having the complete number. Again, we did a lot of experiments, so we know that this worked well on Wikidata.

Then we thought: how can this be useful to the editors in deciding what to focus on and what to contribute next? That's what the new project that is starting now is really looking at. Here we are looking at the same idea I just talked about, but with a few differences. One is that we focus on Wikipedia, and the other is that we focus on a subset of the problem. Rather than estimating how many monuments there should be, as before, we look at the subset of a type of entity with a specific attribute value. Specifically, we look at the gender attribute and its possible values.

The problem is the following; it's the same problem as before, just split by gender. Imagine we want to know how complete Wikipedia is in terms of astronauts. We know how many pages, articles about astronauts, we have, but we don't know how many we should have. So the problem becomes: we want to know how many astronauts there should be, and we do this for the different genders. This then allows us to compute an estimate of how complete Wikipedia is, at a given point in time, for a specific type of person and for the different genders.
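A sketch of the per-gender comparison this enables. All the counts here are invented; in the project, the estimated true counts would come from capture-recapture over the edit history, split by the gender attribute:

```python
# Hypothetical numbers for articles about astronauts, by gender.
articles_in_wikipedia = {"male": 80, "female": 5}
estimated_true_count = {"male": 100, "female": 25}  # assumed estimator output

# Per-gender completeness: articles we have / articles there should be.
completeness = {
    gender: articles_in_wikipedia[gender] / estimated_true_count[gender]
    for gender in articles_in_wikipedia
}
print(completeness)  # {'male': 0.8, 'female': 0.2}
```

Note that the raw counts alone (80 vs. 5) say nothing about relative completeness; only dividing by the estimated true population makes the two genders comparable.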
So that's the goal of this project: to develop methods and tools to generate these numbers, to estimate how complete Wikipedia is for a specific type of person and for the different genders. What this allows us to do is then measure whether the completeness for one gender is higher than the completeness for another. And it's not obvious, right? We could have 80 male astronauts and five female astronauts, but this doesn't mean that the male astronauts are more complete, because we don't really know how many there are out in the world. The problem is to estimate that number. Once we know it, we could find that female astronauts are 20% complete and male astronauts are 80% complete, and therefore the community may decide to focus on adding new articles about female astronauts, for example.

Anyway, I want to stress, also through conversations with the Wikimedia Foundation, that we really don't want to tell editors what to do. Our goal is to generate these numbers, to estimate how complete a certain category of Wikipedia articles is for different genders, and then we stop. We give this data, possibly through a dashboard, to the community, and then it's up to the community to decide how to use it, right? How to use this data to inform their decision-making processes on what to focus on and what to do next. That's our aim: really just to provide data and evidence to the community. And indeed one part of this project is talking to editors through interviews, through, in a way, a co-design process for the solution, to see how these generated estimates may be most useful for editors. We are talking to a few of them to understand how they may benefit from this in the best possible way, so that we don't just do this for the sake of doing it, but so it can also be used to improve Wikipedia over time.
I'll just spend one more minute discussing specific challenges that I expect we will face. One has to do with time: as we mentioned, things change over time, and the gender balance of representation in Wikipedia changes over time too. So our solution has to consider that change and, in a way, make sure that the estimates are not biased by what we have seen in the past, but focus more on what has been seen recently in Wikidata, for example.

The second challenge I expect to face comes from the notability criteria. A good example is that of mathematicians in Russia. If you look at Wikipedia, the majority of Russian mathematicians there are male, and this represents well what happens in the mathematics community; but it happens because there is a gender bias issue in that specific society, which in a way pushes male figures into leadership roles. That's why the most notable mathematicians are male. On the other hand, if you look at, say, students in STEM disciplines in Russia, the gender balance is very, very good. So there are many non-male mathematicians in Russia, but the famous ones are male, and this could, in a way, be a challenging detail for our estimates, right? I mean, we really have to pay attention not to reinforce stereotypes in society, but to try to be as objective as possible, to allow the community of Wikipedia editors to then make their own decisions on how to use the information we provide.

All right, so that's our aim for this new project. I'll probably stop here, and if there are any questions I'm happy to discuss further details. Thank you.

Thank you so much, John Luca, and thanks to everyone who joined us this evening. We'll now turn off the recording to have a bit of a Q&A, and I hope you all join us next time as well.