Alright. Okay, so welcome to the State of Wikimedia Research, the 2018 edition. I'm Mako, and there's a whole team of other people who you'll get to hear from, who've helped create this presentation and who will also be helping present. If you've been to previous Wikimanias, there's a good chance you've seen a previous version of this. I started this in 2008 and, along with a growing group of people, have been doing a version of this presentation almost every single year since. It originally began as an excuse for me (I'm an academic, a professor now at the University of Washington) to make sure that I was up to date on research about Wikimedia, because a lot of my own research is about Wikimedia. In 2008, I provided a description to the program that went something like this: I said in my submission that I would try to provide a quick tour of the last year's academic landscape, geared at non-academic Wikipedia editors and readers, and I claimed that I was going to try to categorize and describe the academic landscape as it was shaping up around Wikimedia. I hoped to read every paper that had been published about Wikimedia in the previous year and then talk about them to everyone. Two weeks before the presentation, I did the Google Scholar search and realized that there had been 800 papers published about Wikipedia in the previous year. I tried to import the list of papers into my bibliographic management software and got banned from Google Scholar, because they felt there was no way a human could realistically consume that much material published about Wikipedia in a year. I had a 45-minute talk, which worked out to about three and a half seconds per paper, and this year our talk is half the length.
There's a huge amount of research done about Wikipedia. More than 500 papers were published about Wikipedia in the last year, and although we've moved past peak Wikipedia research, that's still an enormous amount. This graph actually tracks the growth of editing on Wikipedia remarkably well, except that, as always, academics are about five years behind the rest of the world. As of yesterday, a quick search of one of the largest databases of scholarly papers suggests that there have been 8,000 papers published about Wikipedia in total. There's a newsletter, which we'll talk a little bit about at the end, run by a number of the people who helped put this presentation together along with many others, which just in the last year has presented summaries of more than 100 articles written about Wikipedia, and there are many hundreds more on the to-do list. This is my disclaimer slide, which lays out the goals for this presentation: as I've suggested, we're not going to be able to provide a complete view of the state of Wikipedia research. What we can do is highlight a set of themes. So what we've done is picked five themes, which we're each going to talk about for just a few minutes, providing a little research postcard for each. We selected those themes with these criteria in mind: we tried to identify what we think are important themes in the body of research being published about Wikipedia, we tried to select things we think will be of interest to you all, and we tried not to include any research from people who are here at Wikimania. Sometimes we make a mistake; yesterday we met someone, and it was very nice to meet him, but we had to remove his paper from the presentation as a result. It's a strict rule. And we've tried to maintain a bias towards peer-reviewed publications.
So I'm going to give the first research postcard, and then I'll turn it over to everyone else, who will each do one as well. The first postcard is about images and media. There's been an enormous amount of research on Wikipedia, but almost all of it has looked at the text of the encyclopedia. This year we saw a number of really interesting papers which looked at not just the text but also the media. This is a paper written by a team at the University of Michigan and Northwestern University which looks at image use, specifically what they call image diversity. They use image diversity to describe the way in which different images are used to represent the same content or concept across different language Wikipedias. They took the 25 largest language Wikipedias, made a list of all the images used in each language's article about a given concept, and then checked how often the same images versus different images were used. For example, this is a graph from a tool they built, which you can try out, that visualizes this information. It shows the images used in different languages to represent the concept of happiness, across the different language editions' articles about happiness. You can see that some images, like the picture of the gorilla, are used in just one language: the German Wikipedia uses a gorilla to illustrate its article on happiness. In fact, all of the little nodes that are not connected to anything else are examples of images used in only one place. But sometimes images are used in multiple languages as well. If the same set of images is used across languages to illustrate a concept, that would be considered less diverse; if different images are used, that would be considered more diverse.
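To make that measure concrete, here's a minimal sketch of the idea. This is not the authors' actual code, and the image names are invented; it just counts what fraction of distinct images appear in exactly one language edition's article for a concept.

```python
# Sketch of the image-diversity idea (not the authors' actual method):
# given the set of images each language edition's article uses for one
# concept, count what fraction of distinct images appear in exactly
# one language.
from collections import Counter

def image_diversity(images_by_language):
    """images_by_language: dict mapping language code -> set of image names."""
    counts = Counter()
    for images in images_by_language.values():
        counts.update(images)
    unique = sum(1 for n in counts.values() if n == 1)
    return unique / len(counts) if counts else 0.0

# Hypothetical data for the "happiness" example: the gorilla image
# appears only on German Wikipedia; one smiley is shared everywhere.
happiness = {
    "en": {"Smiley.svg", "Children_playing.jpg"},
    "de": {"Smiley.svg", "Gorilla.jpg"},
    "fr": {"Smiley.svg"},
    "es": {"Smiley.svg", "Beach.jpg"},
}
print(image_diversity(happiness))  # 3 of 4 distinct images are single-language
```

A higher value means the language editions illustrate the concept with more different images, i.e. more diversity.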
These are the illustrations for four different concepts across these language editions. You can see that the article on Wiki uses largely the same images across all these different languages, whereas the article on Science uses an almost completely unique set of images in each. Overall, they found a relatively high amount of image diversity: 67% of the images used in any of these languages are used in exactly one language. But there's a lot of variation. There's been a bunch of previous work, including by these same authors, looking at text diversity as well, which basically asks how many of the same concepts are used to explain a particular topic across languages. Without going into detail about how they measure that, their expectation was that they would see less diversity in images than in text. After all, most images don't need to be localized, and it's relatively easy to include a common image across many different languages. They actually found quite the opposite. Each of these dots represents a language pair from those 25 editions, and the red zero line in the middle represents a situation where image and text diversity are equal. Everything below the line is a pair in which there's more image diversity than text or concept diversity, and they find that in almost all of these pairs, image use is actually more diverse than text use. Alright, so that's the first postcard, and I'm going to pass it off to Rin to do the second theme. Thank you, Mako. Oh, thank you. Okay, so other research has been done on talk pages. We have this paper, which is basically about who wins on talk pages: whenever there's a discussion with many people weighing in, whose perspective makes it in? Whose opinion prevails?
So the paper analyzed more than 53,000 instances of interaction on talk pages, paired with edit actions. Basically, they looked at who talked about what, and how and when this affected the article itself. That's their concept of winning: winning or losing in a discussion on a talk page related to an article. It depends on several factors that were examined in the study. Some of them are language: are you inviting someone to do something? Are you requesting something? And so on. How many times you talk: of course, if you talk a lot, it matters. Who starts the discussion, and who has the last word. Your style: do you use a lot of question marks? Do you love exclamation marks? How authoritative are you: are you saying things like "according to so-and-so, such-and-such is the case"? How much expertise do you project? How do you frame yourself? And of course how emotional your language is: are you really fervent about what you're saying, or are you kind of like, yeah, okay, whatever, I'm just saying. The paper found that you're most likely to win if you talk in detail about the content itself, discuss the facts in it, and say things like "I read in this reference so-and-so, and that reference says so-and-so." So if you project expertise, you're more likely to win. Also when you give examples to support your argument, and if you cite sources, of course. And if you do some word work, suggesting spelling corrections or proposing word choices, that also makes you more likely to win on the talk page. You're most likely to lose if you talk about policies; nobody will listen to you. It's kind of like being the person who talks about stuff nobody really wants to hear about. Or if you're moderating the discussion.
Like, "people, maybe we should not do it like this, maybe we should take the neutral point of view into consideration," and so on. People will not actually pay much attention to you. Also if you talk about page formatting: "you know what, this page should be merged with another page," "this page should be reformatted and rewritten in such-and-such a way." All of this means you're not going to be heard that much. And that's about it. Now we're moving to another postcard with Tilman. Thanks. So the theme of this Wikimania is knowledge gaps, and not coincidentally, there has been a lot of research in this area in the past year. We're focusing on one paper that compares gaps across multiple language Wikipedias. I should say there are several other talks here at Wikimania on this: for example, yesterday's keynote where Martin talked about geographical differences; the Cultural Diversity Observatory, which I really encourage you to check out, an actual tool financed by a Wikimedia Foundation grant where you can look at such gaps directly; and this morning, my colleagues from the research team at the Foundation presented various work they're doing, including new tools for measuring what they call knowledge gaps. So the paper we're looking at here is by three researchers from Poland, from the University of Poznań, and they compared quality and popularity of articles across several different Wikipedias. One challenge they had is that the comparison spans 44 different languages, which is a really large number of Wikipedias, and they had to construct a quality metric: how do you compare the quality of articles across that many languages? It's been done many times within individual Wikipedias, where you have, for example, the featured article rating, good article, et cetera. What they did is use some pretty simple numbers as indicators.
So: the article length, the number of references, the number of images, the number of sections, and the ratio of references to article length, basically the citation density. Then they subtracted points for quality-flaw templates like NPOV or unsourced templates. They also compared popularity based on pageviews. I should say there are more sophisticated measures; this has been an active area of research for the last few years, like the Foundation's ORES tool, which is based on machine learning, but it wasn't yet available in that many languages, so they resorted to these simpler metrics. And here you can see how they justify this. The different colors are the different quality ratings assigned manually by editors, so you can see the stubs in red. The left column is article length: very short articles are almost always stubs, and if an article is longer than 250 kilobytes, at the very top, then half of them are featured articles. Similarly, more references means a higher article rating is more likely, and the same goes for the number of images, the number of sections, and the citation density. Then there's the comparison across topics, which is another challenge in itself. Within one Wikipedia you can look at categories, which give you, for example, biographies, films, universities; but it takes some work to compare this across all these different languages, so they used infoboxes and interwiki links. And they constructed a really interesting online tool which you can go and try out right now, where you can see the overlap of articles between different Wikipedias. This is an example with articles about universities in the English, German and French Wikipedias. You can see that, not surprisingly, English has the most, and, for example, this little light blue patch is the articles that are covered only in the German Wikipedia.
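As a rough sketch of what a composite metric like this might look like: each indicator is normalized, averaged, and flaw templates subtract points. The caps, weights, and penalty below are invented for illustration; they are not taken from the paper.

```python
# Illustrative composite quality score in the spirit of the indicators
# described above; the normalization caps and penalty weights are invented.
def quality_score(length_bytes, n_refs, n_images, n_sections,
                  n_flaw_templates=0):
    citation_density = n_refs / max(length_bytes, 1)  # refs per byte
    indicators = [
        min(length_bytes / 250_000, 1.0),    # length, capped at 250 kB
        min(n_refs / 100, 1.0),              # references
        min(n_images / 20, 1.0),             # images
        min(n_sections / 20, 1.0),           # sections
        min(citation_density * 1_000, 1.0),  # citation density (refs/kB)
    ]
    score = 100 * sum(indicators) / len(indicators)
    # Subtract a penalty for cleanup templates such as {{NPOV}} or {{unsourced}}
    return max(score - 10 * n_flaw_templates, 0.0)

print(quality_score(50_000, n_refs=40, n_images=5, n_sections=12,
                    n_flaw_templates=1))
```

The point is just that several cheap, language-independent counts can be combined into one comparable number per article.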
So that's a pretty direct visualization of knowledge gaps, and again, you can use it as a live tool at that URL. Now, what results did they find? Basically they have a very complicated table of quality and popularity across 12 topics and 44 different languages. I'm just picking and choosing here a little bit; I'm from the German Wikipedia myself, coincidentally. It turns out that German Wikipedia articles about albums and video games have the highest average quality across all 44 languages. I should caution that this doesn't mean that the German Wikipedia has the best experts on albums and video games; it's much more likely that the coverage on the German Wikipedia is much more limited because of its stricter notability criteria. You can actually verify with this Venn diagram tool that the German Wikipedia has far fewer articles about albums and video games than English, French or Spanish, for example. The authors wisely refrain from comparing Wikipedia quality overall; again, their comparison is only for certain topic areas, et cetera. I myself was not as prudent; I was just curious. So I actually took their numbers (the paper is fully open access, and they published all their data) and took the average of all the averages. So, just between us here, right? Don't tell them, but the German Wikipedia actually comes out on top of all these 44, very closely followed by the English, Greek, Hindi and Chinese Wikipedias. So make of that what you will. And lastly, there's the whole question of how quality relates to popularity. There is some earlier research finding a bit of misalignment: sometimes articles that are very popular with readers don't get a lot of attention from editors. This paper finds the two largely correlate, although more strongly for some topics, for example companies, and most weakly for settlements, which includes all cities, et cetera.
So apparently there are a lot of city articles that get a lot of pageviews but little editor attention. That's another gap, if you will. Cool. So we have the next topic: this is Masli, on non-participation. Hello. An important theme this year related to knowledge equity is: why do internet users from different social groups contribute differently to Wikipedia? This paper explored the factors and processes that influence these participation gaps. Analyzing survey data collected from 1,500 adults in the U.S. in 2016, the authors used logistic regression to model the activity of online knowledge production as a step-by-step process, a process that internet users who contribute to Wikipedia must go through. They conceptualized a pipeline that anticipates leaks at the different stages of the knowledge production process, so that, starting from a cohort of internet users, fewer contributors remain at each subsequent step. Most work on the participation gap has focused on the final stage, whether or not people actually contribute to Wikipedia. The authors of this paper showed that there are gaps at many earlier stages, such as whether people know that Wikipedia is editable, whether they have been on the site, or whether they even know it exists to begin with. The results showed that at all stages of the pipeline I just showed you on the previous slide, education levels, internet literacy skills and age significantly influence levels of activity. Based on this, the authors recommend supporting interventions that reduce technical and knowledge-based entry barriers as a means to increase participation at all levels of knowledge production. At the early stages of the pipeline, income, employment and race are significant factors that influence levels of activity.
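The staged pipeline analysis can be sketched like this (with simulated survey data and illustrative stage names, since the real data and model specification aren't reproduced here): one logistic regression per stage, each fit only on the respondents who cleared every earlier stage, so each model explains one "leak".

```python
# Sketch of a staged ("leaky pipeline") logistic regression analysis.
# The survey data here is simulated; stage and variable names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1500
# Three respondent covariates, e.g. education, internet skills, age
X = rng.normal(size=(n, 3))
stages = ["visits_wikipedia", "knows_editable", "has_contributed"]

# Simulate outcomes: each stage can only be reached via the previous one
passed = np.ones(n, dtype=bool)
outcomes = {}
for i, stage in enumerate(stages):
    p = 1 / (1 + np.exp(-(X[:, i] - i)))      # toy propensity per stage
    outcomes[stage] = passed & (rng.random(n) < p)
    passed = outcomes[stage]

# Fit one model per stage, restricted to those who passed the prior stage
models = {}
prev = np.ones(n, dtype=bool)
for stage in stages:
    models[stage] = LogisticRegression().fit(X[prev], outcomes[stage][prev])
    prev = outcomes[stage]
    print(f"{stage}: {prev.sum()} of {n} remain")
```

Conditioning each model on the previous stage is what lets you see which covariates matter early in the pipeline versus at the final contribution step.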
So the recommendation here from the authors is the need for interventions addressing early participation gaps for minorities and lower income groups by reducing obstacles related to internet experience and autonomy. At the later stages, gender played a crucial role: compared with males, fewer people who identify as female know that Wikipedia is editable, and fewer go beyond that awareness to actually contribute to Wikipedia. The results therefore suggest two things: first, the need to create awareness, among women particularly, that Wikipedia is a crowd-sourced project that anyone can edit; and second, the need to continue supporting gender gap campaigns and initiatives that seek to recruit more women as contributors. All right. And so, me again, and I'll finish this out. The one topic that we've covered every year, or almost every year, because there are just always loads and loads of these, is papers that are not really about Wikipedia per se but use Wikipedia as a data source to study other things. Once again, we saw a big crop of these. This one is from a team led by Mohamed Mehdi at Concordia University, which used a set of papers that used Wikipedia as a source of data as, wait for it, a source of data. This paper was a systematic review, meaning that it didn't present new work; it presented a summary of all these other papers. They looked at 132 papers which used Wikipedia as a data source, and in addition to summarizing them, they broke them down systematically along a whole bunch of different dimensions.
They found, for example, that most of the papers that use Wikipedia as a data source are about information retrieval, which is to say they're focused on helping people get good answers to questions. But there has also been a huge amount of work in natural language processing, which uses Wikipedia as a means to understand language, for example through computational linguistics or by trying to understand the way people structure language. They have 10 tables that break this down along different dimensions; this is just one, highlighting which language Wikipedias these different studies used. What you find is that although some of these 132 papers looked at multiple languages, for example to help with machine translation, the large majority looked only at English Wikipedia, so there are lots of opportunities there. This paper also described data sets that have been created from Wikipedia data, tools that have been built to study Wikipedia data, and it provided its own data set as well. And that brings me to my final point. This is obviously a very short presentation that can really only give you a taste of what we see as some big themes from the year. There's tons and tons of work being done, and the good news is that you no longer have to wait until Wikimania to hear about it. You can hear about it every month in the research newsletter, to which everyone who presented here, and lots of people who aren't here, have contributed. That's a really good resource for staying up to date with this stuff. There are now conferences about Wikipedia research: there's OpenSym, formerly WikiSym, which will be happening in Paris next month, and there's the Wiki Workshop at the Web Conference, which is useful.
There's also the Wikimedia Research Showcase. I don't think any of these papers were presented there, but roughly every month there are presentations of work, and it's a fantastic thing to pay attention to. From each one of these venues there are links to places where you can learn more, and all of our notes, the papers, and links to the papers are there as well. So thanks again. This has been fun as always, and we look forward to seeing you next year.