Okay, Martin, good luck. Thank you. And we're live. All right, hello everybody out there in Wiki Research Land. Welcome to the November 2019 Wikimedia Research Showcase. I'm Jonathan Morgan, I'll be your host for today. We have two great presenters today. We're going to have junior professor Martin Potthast from Leipzig University presenting on reuse of Wikimedia content, followed by research scientist Isaac Johnson from the Wikimedia research team presenting on a demographic survey of Wikipedia readers. Before we start, two quick announcements. First, we now have a location, dates, and website for the 2020 Wiki Workshop. Leila, do you actually want to say anything about this? Sure, I can say a couple of words. Wiki Workshop 2020 will take place as part of The Web Conference 2020 in Taipei, Taiwan. The date is going to be either April 20 or 21; the workshop chairs are working on finalizing it. The website is up, and you can get more information there: wikiworkshop.org/2020. We hope to receive your submissions and see you at the workshop. Thank you. The second announcement is that we've been circulating a feedback survey. We're thinking about what kind of changes we can make in 2020 to improve the research showcase, so if you have seen a link to the survey on any of the mailing lists, please do participate. If you look at the update email from Janna Layton that says the showcase is starting in 30 minutes, there's a link to the feedback survey there. I just posted the link in IRC, there's also a link at the top of the research showcase page on MediaWiki, and just for the heck of it, I'm going to see if YouTube will let me post it in the comments section of the YouTube stream. The long and short of it is that we'd really love your feedback on the research showcase so that we can think about ways to improve it. And again, the Wiki Workshop website is wikiworkshop.org/2020. So without further ado, I'd like to kick it over to Martin Potthast. Thank you, Martin, take it away. Thank you. I'm going to start the presentation directly and share the window, just a sec. You should all be seeing the presentation window now? Yep, we see the full browser window including tabs. That's all right, it'll stay that way; I won't go into full screen. All right, then I will start. This presentation is about text reuse, the reuse of text that has already been written elsewhere, and in particular how text reuse applies to Wikipedia. Let me start by saying that this is not just my work; it is joint work with many other people, some of them listed here. They are my co-authors on the paper we submitted to ECIR with the title "Wikipedia Text Reuse: Within and Without." We all work together as a group across four universities; the group is called Webis, and everything you see here you can also find on webis.de. So let's jump right in. First some terminology: what is text reuse? It is basically an umbrella term that groups a number of other, more well-known concepts. Everyone will have heard of quotations, boilerplate (a synonym might be template text), translations, and summarization.
Something that may look a little odd in this taxonomy is that translation further divides into metaphrase and paraphrase. This derives from a distinction the ancient Greeks made: a metaphrase is a literal, word-for-word translation, i.e., without changing the word order, while a paraphrase is a reproduction in one's own words. Today we also know paraphrasing as something that can be done within a language, hence the double connection in the taxonomy. There's another thing that may seem a little odd here, namely how plagiarism enters. Plagiarism has actually been a subject of my research for more than a decade now, but we now prefer to call our research text reuse research, because plagiarism is mostly a judgment about text reuse, namely whether the reuse is legitimate or not. For example, in an academic setting, text reused without acknowledgement would constitute plagiarism. Another aspect, which does not necessarily coincide with plagiarism, is copyright infringement. So, just so that everyone knows what text reuse is really all about and what different kinds of text reuse we might study, this is an overview. I would like to start by showing you an example of how Wikipedia text is being reused today. On the left of this slide, you see the Wikipedia article on the broadcast flag; a broadcast flag is a digital rights management tool. On the right-hand part of the slide, you see a webpage on fandom.com that also talks about the broadcast flag. As you can see from reading the opening lines of both, the text is very, very similar. In fact, I suspect this text has been taken from an earlier version of the same Wikipedia article and not updated since. I didn't double-check which earlier version of the Wikipedia article it is, but the similarities are so obvious that this is very likely the case. So, is this legitimate text reuse or not? The Wikipedia license is very open and lenient: it allows all kinds and forms of reuse without payment, but there's one small thing reusers should do, namely refer back to Wikipedia as the original source. And as far as I can tell, on this webpage that has not happened; neither visibly nor in the source code is there any mention of the term Wikipedia, let alone a reference. To summarize this example, and I think this is well known to most who search the web a lot and find both Wikipedia articles and other web pages: very often one finds commercial web pages, which also show advertisements, that simply reuse Wikipedia text; this reuse often lacks attribution; and this reuse is not updated along with the original. However, to the best of my knowledge, there has been no real study that tried to measure the extent to which Wikipedia text is reused on the web, and this is one of the contributions of the research we have done. Another example of Wikipedia text reuse is text reuse within Wikipedia. Here we see excerpts of two closely related Wikipedia articles: on the left, the article Albania, and on the right, the article on the people of Albania, the Albanians. The editors of both articles thought it right to have a section on the occupation of Albania by the Ottoman Empire.
As you can see, the opening lines of these short passages are a little bit different. However, if you read on the left, for example, "with the fall of Constantinople, the Ottoman Empire continued an extended period of conquest and expansion, with its borders going deep into Southeast Europe," you will find that halfway down on the right side a very similar text appears: "however, after the fall of Constantinople, the Ottoman Empire continued an extended period of conquest and expansion, with its borders going deep into Southeast Europe." This is obviously also a form of text reuse, although I cannot tell which version was the original, whether someone actually copied the text from one article to the other, or whether there's even a common ancestor in another article. And if you read further, the second sentence on the left is missing in the version on the right, whereas the third sentence appears in a different form on the right, even mentioning different facts. The question now is: is this a useful form of reuse? Is it wanted on Wikipedia to have different versions of passages basically saying the same thing, but not quite unified? Especially if an editor decides to copy a piece of text from one article to another, there is, as far as I know, no technical support for this in the MediaWiki software until now. All of this happens manually, and if someone reuses text, the reuse is not tracked. This means the two articles may then develop independently, get edited independently, and especially the reused passages are changed independently, which may lead to inconsistency. Again, as far as I know, with the one exception of Jimmy Lin's research, there has been no study as of yet on Wikipedia text reuse and its extent. We want to fill this gap with our research. So, what did we do? We developed a text reuse detection pipeline, which was supposed to identify and extract all passages of text that have been reused, and that may have been changed after the reuse, both within Wikipedia and between Wikipedia and the web. For this, in the first step, we collected datasets, namely Wikipedia itself: we focused on a dump of the English Wikipedia from May 2016, comprising 4.2 million articles and a total of 11.4 million paragraphs. And we downloaded a version of the Common Crawl released in April 2017 and sampled 10% of it at random; this sample contains 1.4 million websites, which in turn contain 591 million web pages and 187 million paragraphs of text. The goal now was to compare all 11.4 million paragraphs of Wikipedia to all other Wikipedia paragraphs and, in a second processing step, to compare all 11.4 million Wikipedia paragraphs to all the Common Crawl paragraphs. So you can see this is quite a big job, and for that we needed cluster computing technology, which we operate at Webis. Here you see the four clusters that are currently in operation. We used the Betaweb cluster, which has 135 standard Dell rack nodes, currently 4.1 petabytes of disk space, 1,700 cores, and 28 terabytes of RAM. On that cluster runs a Hadoop setup and an Apache Spark setup, and this is what we used for cluster computing. The text reuse detection pipeline built on top of this cluster computing framework consisted of two steps: source retrieval and text alignment.
Source retrieval is basically a pre-retrieval step, which allows us to rank, for a given document, all other documents in a large collection with respect to their likelihood of containing a passage of text that may have been reused in the given document. Ideally, we want the top ranks to hold all documents that actually contain at least one sentence of reused text, but maybe even longer chunks of reused text. In order to do such a ranking heuristically, we came up with a ranking function. Sorry, now there's a formula; you can ask questions about it now or later, and I can also skip it, no problem, but I will explain it very quickly. We set up a ranking function ρ, as you can see here, which compares two documents d1 and d2. The comparison is basically a similarity computation, here denoted φ, where φ is a tf-idf-weighted cosine similarity under a vector space model representation of two chunks of text extracted from d1 and d2. As the likelihood of containing text reuse, we use the maximum similarity between any two chunks of text in the two documents: ρ(d1, d2) = max over all chunk pairs (c1, c2) from d1 and d2 of φ(c1, c2). Since it is intractable for us, even on our cluster, to make a complete, exhaustive pairwise comparison of all pairs of documents within Wikipedia or between Wikipedia and the Common Crawl, we applied prior to that a step of search pruning, based on a so-called locality-sensitive hash function, which produces a fingerprint of all chunks, the chunks being our paragraphs. The fingerprint consists of a set of hash values constructed so that the more similar two texts are, the more likely they share at least one hash value. So if two paragraphs of text, here denoted as c, share at least one matching hash value as per the locality-sensitive hash function, then the two documents containing those chunks were subjected to the more in-depth comparison using the similarity function. This way we pruned the search space by two orders of magnitude. Still, this took quite a long time, about one month of processing on the cluster. A second month of processing time on the cluster was invested into the second step, text alignment.
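To make the source retrieval step concrete, here is a minimal sketch in Python, assuming a MinHash-style locality-sensitive fingerprint; the paper's actual hashing scheme, chunking, and parameters may differ. Documents whose paragraphs share at least one fingerprint value become candidate pairs, and candidates are then scored with the ranking function ρ, the maximum tf-idf cosine similarity over any pair of paragraphs.

```python
from collections import defaultdict
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

NUM_HASHES = 16  # illustrative fingerprint size, not a value from the paper


def fingerprint(paragraph):
    """MinHash-like fingerprint: one minimum hash value per hash function.
    Similar word sets are likely to agree on at least one value."""
    words = {w.lower() for w in paragraph.split()}
    return {min(hash((i, w)) for w in words) for i in range(NUM_HASHES)}


def candidate_pairs(docs):
    """docs: {doc_id: [paragraph, ...]}. Return document pairs whose
    paragraphs share at least one fingerprint value (the pruning step)."""
    buckets = defaultdict(set)
    for doc_id, paragraphs in docs.items():
        for p in paragraphs:
            for h in fingerprint(p):
                buckets[h].add(doc_id)
    return {pair for ids in buckets.values()
            for pair in combinations(sorted(ids), 2)}


def rho(paragraphs1, paragraphs2):
    """The ranking function from the talk: the maximum tf-idf-weighted
    cosine similarity between any paragraph of one document and any
    paragraph of the other."""
    vectorizer = TfidfVectorizer().fit(paragraphs1 + paragraphs2)
    return cosine_similarity(vectorizer.transform(paragraphs1),
                             vectorizer.transform(paragraphs2)).max()
```

Ranking all candidate partners of a given document by ρ in descending order yields the source retrieval ranking; only the top-ranked pairs then proceed to the more expensive text alignment step.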
So, all the documents are now ranked, and we simply go down the ranking for every given document, building pairs of documents. For every pair of documents to be compared, we then want to extract the pieces of text that may have been reused. Recall that the texts may differ from one another: they are not verbatim copies but may have changed significantly over time. So we still want the entire passages and not just small bits and pieces, and this is what happens in the text alignment step. Here we apply not the aforementioned hashing technologies, but a dot-plot-based technique that I'm going to explain visually now. We can visualize the comparison as a matrix where each axis represents one of the texts, or more specifically the word 3-grams of the text, which are simply three consecutive words found in the document. I'm going to zoom in here so that I can show you. This is the first thing we see in our text alignment processing: the two texts apparently share some piece of text, and every dot you see here corresponds to one word 3-gram shared between the two documents. The shared 3-grams are again identified using a hash function, but a standard hash function this time, not the locality-sensitive one. This already produces kind of a diagonal line, which indicates that some alignment of the n-grams should take place here. However, these are just small indications, and if you try to simply group them into consecutive pieces of reused text longer than three words, you get something like this: some pieces fit together very well, other pieces don't fit together because there are breaks, and as you can also see, there are some dots in between that have been rearranged completely. So this alone is not sufficient to extract the entire reused passage. On these blue rectangles we therefore apply a clustering algorithm, DBSCAN to be precise, in order to group them together and re-identify the originally reused passage even after it has been changed. This is basically what happens in text alignment: the two texts are compared with a bit more effort than in source retrieval, but the result is of much better quality in terms of what we learn about the kinds of text reuse in the texts.
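Here is a rough sketch of the dot-plot alignment just described, under simplifying assumptions: shared word 3-grams become dots (i, j), DBSCAN clusters the dots, and each cluster's bounding box is reported as a pair of aligned, possibly modified, passages. The eps and min_samples values are illustrative guesses, not the tuned parameters from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def word_ngrams(text, n=3):
    """All runs of n consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def align(text1, text2, eps=10, min_samples=3):
    """Return [(span_in_text1, span_in_text2), ...] as word-offset intervals."""
    grams1, grams2 = word_ngrams(text1), word_ngrams(text2)
    positions = {}
    for j, g in enumerate(grams2):
        positions.setdefault(g, []).append(j)
    # One dot per shared 3-gram occurrence; the matching here stands in for
    # the "standard hash function" step (a Python dict lookup).
    dots = np.array([(i, j) for i, g in enumerate(grams1)
                     for j in positions.get(g, [])])
    if dots.size == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dots)
    passages = []
    for label in set(labels) - {-1}:  # DBSCAN labels noise points as -1
        cluster = dots[labels == label]
        (i0, j0), (i1, j1) = cluster.min(axis=0), cluster.max(axis=0)
        # The cluster's bounding box delimits one reused passage per text.
        passages.append(((i0, i1 + 3), (j0, j1 + 3)))
    return passages
```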
And now we have a lot of data, a lot of cases of text reuse. What we found, here on the right in the table, is that within Wikipedia we found 110 million text reuse cases, which is quite a high number. Between Wikipedia and the web, we found only about 1.6 million such cases. The number of different articles and pages involved is also listed, along with some statistics on length; I will not go into detail here. Having such a huge amount of data presents you with a new problem: you need tools to explore the data in order to understand it, and this is our current and ongoing work. The first outcome of this research was the Webis Wikipedia text reuse corpus, which you can download from the webpage shown here. I will now present a pilot analysis that we did in order to learn at least something about this data, and perhaps one or two tools that we have started developing. So, how is Wikipedia reused on the web? We found about 5,000 websites out of the 1.4 million that contain at least one page reusing text from Wikipedia. The top three reusing domains are wikia.com with 563 pages, rediff.com with 55 pages, and un.org with 28 pages. As far as we can tell, 94% of all the reusing pages violate Wikipedia's license, because on those pages the term Wikipedia does not occur in any way, shape, or form. Nearly all pages reuse text exclusively, meaning there's no additional text on the page other than maybe some navigation, so the page is completely redundant compared to Wikipedia. From that, questions immediately arise. Why should there be another webpage with the same content as the Wikipedia page? Could this negatively affect the search ranking of Wikipedia's own page? Maybe, maybe not. These pages will certainly become outdated quickly, since they will likely not be updated alongside Wikipedia. But there is maybe also a positive aspect to people reusing Wikipedia instead of, for example, generating text these days using GPT-2: better to reuse Wikipedia than to generate something. On nearly all of these web pages we found display ads, except, for example, on the un.org pages. Based on this observation, we tried to estimate, very conservatively in order not to overclaim, what the monthly ad revenue of these pages might be. We made the simplifying assumption that every page contains just one display ad, which is definitely an underestimate; the average is higher, but we did not have any technology to tell us which links actually come from advertisers. We estimated the revenue per 1,000 views to be 1.40 US dollars, based on a source from the internet that is cited in the paper, and we estimated that these web pages get monthly views amounting to 10% of the page views that the corresponding Wikipedia pages get. From that, we derived 45,000 US dollars of monthly ad revenue. If we now project this to the entire web, which as per Netcraft had 180 million sites at the time of writing, translating to roughly 600,000 reusing sites, this would yield more or less 5.5 million US dollars of monthly ad revenue, or 72% of Wikipedia's annual fundraising, at least when compared to the fiscal year 2016/17. So this is a lot of money that people earn based on Wikipedia content. One might be outraged about this, or one might question whether it is okay or not; but as I said, there may also be positive aspects, because people don't just make stuff up at random but reuse probably higher-quality content from Wikipedia. So this is something to be discussed at any rate. Wikipedia apparently has a high influence on the web and on what people put online.
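As a back-of-the-envelope check, the scale-up arithmetic can be reconstructed from the talk's own numbers; the absolute Wikipedia view counts behind the 45,000 dollar figure are not given in the talk, so that figure is taken as the starting point here.

```python
# Reconstructing the projection from the sample to the whole web.
sampled_revenue = 45_000   # USD/month across the ~5,000 reusing sites found
sites_found = 5_000        # reusing sites in the 1.4M-site sample
sample_size = 1_400_000
web_size = 180_000_000     # total websites, per Netcraft

reusing_sites = web_size * sites_found / sample_size        # ~643,000 sites
projected = sampled_revenue * reusing_sites / sites_found   # linear scale-up
print(f"~{reusing_sites:,.0f} reusing sites, ~${projected / 1e6:.1f}M per month")
# Prints ~642,857 sites and ~$5.8M per month; the talk rounds down to
# ~600,000 sites, giving the quoted ~$5.5M per month, about 72% of the
# 2016/17 annual fundraising when annualized.
```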
Perhaps more interesting to the Wikipedia community at large is text reuse within Wikipedia. We find two different kinds of reuse: the first is structure reuse, and the second is content reuse. Structure reuse pertains to articles that are most often in the same category, for example articles about cities, mostly small cities. Berlin and Leipzig may not be good examples because those articles are very long, but if you look at articles about very small cities, they are usually very, very similar. Templates have emerged over the time in which these cities have been described; people reuse these templates and try to describe the cities in a very homogeneous way. In fact, the vast majority of the reuse cases, according to our estimation, are of this kind. But distinguishing them from the other kind of reuse that I showed you earlier in the example, content reuse, turned out to be quite difficult; our heuristics are not perfect. Content reuse can perhaps best be understood by looking at relations between Wikipedia categories. For example, the article on spiritual opportunism is related to the article on opportunism, and both articles share some pieces of content; similarly for the history articles that I showed earlier. In these instances it might be interesting to focus on the content reuse and try to identify most of these cases, because this is where inconsistencies mostly arise. As I said, it is difficult to distinguish the two, and occasionally content reuse becomes part of structure reuse, because structure reuse pertains to larger chunks of text. So I want to show you a demo. (Sorry to interrupt, we have about three minutes, just letting you know.) That's all right, this is enough time, I think. So, I want to quickly show you a demo, which you can also access under this link; it will stay online later as well. I have brought it up and already prepared search results. It's basically a search engine: you can search all these 110 million within-Wikipedia reuse cases, and you get all the cases that match the search term. Here you see, for example, the two articles theoretical computer science and quantum computing, which apparently share a certain piece of content that has been rearranged quite a bit, as can be seen from the coloring. So there are differences. I want to highlight one particular difference, namely the attribution of who invented quantum computing. On the right: the field of quantum computing was initiated by the work of Paul Benioff and Yuri Manin in 1980, Richard Feynman in 1982, and David Deutsch in 1985. On the left, the sentence is much shorter. This is already something that may concern people, because depending on which article you read, you get different, sometimes very different, information. So this may be something one wants to unify. I could show more examples, or in the Q&A session we can look at other things you may have found in the meantime, but I will move on now. Together with our colleagues from the VR group and the information visualization group, Patrick Riehmann and Bernd Fröhlich, we are also working on other tools for visual analytics, in order to better explore Wikipedia article similarities and the category graph. There's not enough time to explain this in detail, but this is a large-screen visualization of Wikipedia article similarities. Every dot you see here is basically a cluster of articles that share a lot of similarities; they even form very nice visual groups, but most of them are structure reuse and hardly any are content reuse, so we still have some work to do to separate the two properly. So, to conclude with the takeaway messages: I think it's not news, but it needs to be said once more that text reuse is second nature to Wikipedia, and I think this is a good thing. Text should be reused: encyclopedia editors should reuse text from other articles as much as possible if someone else has already written about a concept. However, there should be tool support for this, and the reused passages should be tracked and kept unified, in order to avoid the situations we saw before. Whether reuse of Wikipedia outside of Wikipedia is a good thing or a bad thing is, I guess, still to be discussed. Dario Taraborelli talked about the paradox of reuse, where everyone profits from Wikipedia while reuse takes away readership and visitors, thereby diminishing the number of potential editors; versus, as I just said, better to reuse Wikipedia than to generate text with language models. At any rate, measuring text reuse is definitely a way to measure the influence of Wikipedia at large. Our future work will be on categorizing reuse, content versus structure reuse. Perhaps we could study the induction of article templates that could then be reused and fed with information from Wikidata. We also want to try to scale the whole process further up, to perhaps the entire Common Crawl or even bigger corpora. And of course we will continue our work on visual analytics tools. Everything you saw today, paper, code, data, and the demo, can be found on webis.de; just go there and you will find it. And that's it, thank you. Wonderfully done, Martin, thank you. And also, everybody: the link to the paper draft and the demo are on webis.de, and they're also linked from the research showcase page.
So we actually have a lot of questions, but we're going to save them until after our second presenter's presentation today; we will have time to get to your questions then. Miriam on IRC is recording them and saving them for later, so your question will be answered. Sorry for running long. No, no, you presented a bunch of interesting stuff; there's nothing to apologize for. With that, I'm going to turn it over to my colleague Isaac Johnson, who is going to be presenting some really exciting and, in some ways, concerning research, I think, on the gender demographics of Wikipedia readers. That's the theme for today, huh? Exciting and concerning; yeah, that's science for you. Thank you, Jonathan, and thank you, Martin, that was also fantastic. So I'm going to be talking today about a project that we put under the title Characterizing Reader Behavior in Wikipedia. You might have seen some earlier research from this project presented at this showcase before, but I'm going to be talking about our most recent round of surveys, which focused on reader demographics on Wikipedia. This ties back to some other research; it's work that I did with a team of collaborators who helped translate and field these surveys, so it's a big team effort, with a lot of help from volunteers as well. Just an overview of what I'm going to talk about today: I'll cover the motivation and narrative for this particular round of surveys; for those who aren't familiar with the tool we're using, I'll discuss how that works; and then I'll give results. I'm going to focus on three demographics that we looked at, age, gender, and language, but there's a lot more data that I can talk about; I'll hint at what some of that other data is if you have questions. And then I'm going to try to give some of the takeaways, some of the things I think are important that we're learning from this data. All right, motivation and narrative. There's a lot of different motivation coming into this work, and rather than try to narrow it down to one, I figured I'd mention the three big ones that led us to run these surveys. The first is some work done by Aaron Shaw and Eszter Hargittai called The Pipeline of Online Participation Inequalities, where they argued that if you want to understand, for instance, the gender gap among the editor population of Wikipedia, we shouldn't just be focusing on editors; we should also be focusing on all the prerequisites to becoming an editor: being an internet user, having heard of Wikipedia, having visited Wikipedia, and so on. That work really motivates this focus on readers and not just editors. The second was the surveys we ran in 2017 under the tagline Why the World Reads Wikipedia, which asked readers about things like their motivation for reading Wikipedia and their particular information need, whether they're looking for an overview, an in-depth read, or a specific fact. Out of that work we found some similarities, and we also found some interesting differences, where in certain language editions you were seeing really different motivations or needs from the readers.
And it raised the question of whether those differences were perhaps related to culture or language, or whether they were maybe just representative of the population of readers present in that language: maybe certain languages had a much broader population of readers, while others had a narrower one. And the third motivation, I would say, is around gaps. We talk a lot about, for instance, gender gaps, and we talk about them with respect to content (what types of biographies are present on Wikipedia) and with respect to editors (the ability to attract a diverse base of editors), but we don't really talk about them much with respect to readers, in part because the data hasn't existed. This survey is in some ways hoping to fill that gap, so that we can also talk about it from the reader standpoint and see whether these same gaps appear in the reader populations. There are a number of narratives I could tell with this work; I'm going to give you one example of the kind of narrative we're thinking about as far as conclusions go, and I'm going to do it specifically around gender. The way this goes: what we're finding from the data is that men and women have relatively similar high-level behavior and needs. Men and women are both coming to Wikipedia for facts; they're both coming to Wikipedia for an overview, and things like that. And I say men and women here because those are the identities we have enough data for. The idea being that if you look at a given Wikipedia reader, you're not going to be able to tell their gender based on the articles they're reading; you can't profile readers based upon what they're reading. But we do see different interests in specific topics, and I'll share a little of this data later: men tend to read certain types of articles, and women tend to read certain other types of articles. There are these differences in the types of content being read. This kind of data raises a number of interesting hypotheses about the relationship between readership gaps, meaning the gaps in the reader population we see, and the content and contributor gaps that we know about on Wikipedia. That's the relationship I'm going to be talking about. We're really only at the correlation stage; I can't make causal statements yet, but I'm hoping that this data pushes us towards thinking a little more deeply about some of these relationships. All right, so the survey. If you're not familiar with it, this is the way it works. If you're reading an article on Wikipedia, and this could be either mobile web or desktop but not the app, and we're running these surveys and you get randomly sampled in, what happens is that a little widget is inserted into all the articles you read in that browser. The widget says: take a short survey, help us improve Wikipedia. You can either click "visit survey," say "no thanks," or even ignore the widget, and it links to the privacy policy. If you click "visit survey," it takes you to a Google Form with a number of questions, which I'll explain in a second. But this is how we give the survey to readers, the point being that readers get the survey while reading a Wikipedia article.
So we can ask: why are you reading that specific Wikipedia article? And we can also link the survey responses to a little bit of reading data associated with that reader, namely the session of page views around the time they took the survey. This particular survey ran for a week at the end of June this year, in 13 different language editions: we wrote the survey in English and then translated it into all these languages. I bolded English and French here because, for most of these languages, we just randomly sampled some proportion of readers; for English and French we did that, but we also upsampled readers from the continent of Africa in order to get more data about those reading populations specifically. If people clicked through, the survey had three motivation questions and five demographic questions. The three motivation questions came from those 2017 surveys I mentioned, and the five demographic questions came from a variety of past work and surveys; they covered age, gender, education, locale (that is, are you living in a city or in the countryside), and the reader's native languages. Across all these languages, we got about 64,000 responses. Because of the upsampling, that included a good amount of data from Africa through the English, French, and Arabic surveys, a population we hadn't had much data on before. Something I want to clarify before I get to the results, and something we've struggled with ourselves, is what I mean by readers. I've been saying reader surveys, for instance, but there are lots of ways you could define a reader. Somebody who reads Wikipedia daily is pretty obviously a reader. Someone who reads Wikipedia once a week, yeah, probably a reader. Monthly readers, people who've visited Wikipedia at some point in their life: it starts to get dicey, or more subjective, I guess, whether those people are readers. I bring this up because this definition of readers affects how we interpret the results. I'm going to give an example from a 2010 survey that Pew Research did in the United States. They were phone-calling people, and they asked first this question: do you read Wikipedia? When they asked that question, they found that 53% of men said yes, they read Wikipedia, and 47% of women said yes. So you had roughly gender parity among readers of Wikipedia. But if you instead asked the question, and this is what they did, did you read Wikipedia yesterday? Here, only about a third of the people who said they read Wikipedia in general said yes. And if you then took those results and asked what they tell us about our reader population, there you saw 60% men and 40% women. So by asking a question that was more about frequency, so that more frequent readers were more likely to say yes, you saw the population of readers skew from relative gender parity to a more imbalanced 60/40 split. We don't know where our population of survey respondents falls, whether they're more "do you read Wikipedia" or "did you read Wikipedia yesterday." We suspect, based upon the data, that it's more like "did you read Wikipedia yesterday."
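A quick back-of-the-envelope calculation shows how a modest per-gender difference in reading frequency produces that skew. The derived daily-reading rates below follow from the quoted Pew figures; they are my arithmetic, not numbers Pew reported.

```python
men_share, women_share = 0.53, 0.47  # among people who say they read Wikipedia
read_yesterday = 1 / 3               # share of those readers who read yesterday
men_among_yesterday = 0.60           # gender split among yesterday's readers

men_daily = read_yesterday * men_among_yesterday / men_share
women_daily = read_yesterday * (1 - men_among_yesterday) / women_share
print(f"male readers who read yesterday:   {men_daily:.0%}")    # ~38%
print(f"female readers who read yesterday: {women_daily:.0%}")  # ~28%
# A roughly ten-point gap in reading frequency is enough to turn near-parity
# among self-reported readers into a 60/40 split among daily readers, and
# hence among page views.
```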
But this is just something I would ask you to keep in mind as you interpret these results: you could run these surveys in different ways, and you might get different data. The last thing I'm going to say before I get to the results is that we try to correct for some of the bias in who actually takes these surveys on Wikipedia. Here's data from Arabic Wikipedia, from the survey we ran. If you take 1,000 random readers on Arabic Wikipedia, that's the first column of numbers, and compare them to 1,000 of the people who took the survey on Arabic Wikipedia, you see some notable differences. Readers who were logged in were 10 times as likely to take the survey. Readers who only read a single page and then left Wikipedia were much less likely to take the survey. Readers who read two to four pages were slightly more likely to take the survey, and so on. Why this matters is that if you look, from the survey, at the percentage of people in each of these categories who also identified as women, you see some differences. Specifically, the populations that were more likely to take the survey tended to be a little more skewed towards men, and the populations less likely to take the survey were a little more balanced with respect to men and women. So from the survey results, if we take a raw average, we get that 24.2% of readers on Arabic Wikipedia are women; but if we correct for this, that is, we try to make the survey population look more like the random, representative population of Arabic Wikipedia readers, we see that number jump up a little bit, in this case to 27.1%, and those are the results I'll be presenting today. In most cases it only moved by a few percentage points, so it's not going to dramatically change the conclusions we draw, but it is a kind of post hoc correction that we make.
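A minimal sketch of this post hoc correction: weight each survey stratum (logged-in status, session length, and so on) by its prevalence among random readers rather than among respondents, then average. The strata and numbers below are made up for illustration and are not the actual survey data.

```python
def raw_and_corrected(strata):
    """Each stratum holds per-1,000 counts among random readers and among
    survey respondents, plus the share of women among its respondents."""
    raw = (sum(s["respondents"] * s["p_women"] for s in strata)
           / sum(s["respondents"] for s in strata))
    corrected = (sum(s["readers"] * s["p_women"] for s in strata)
                 / sum(s["readers"] for s in strata))
    return raw, corrected


# Hypothetical strata loosely echoing the Arabic Wikipedia example: logged-in
# readers heavily overrepresented among respondents, single-page readers
# underrepresented.
strata = [
    {"readers": 10,  "respondents": 100, "p_women": 0.22},  # logged in
    {"readers": 600, "respondents": 400, "p_women": 0.29},  # single-page visit
    {"readers": 390, "respondents": 500, "p_women": 0.25},  # 2+ page session
]
raw, corrected = raw_and_corrected(strata)
print(f"raw {raw:.1%} -> corrected {corrected:.1%}")  # raw 26.3% -> corrected 27.4%
```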
With that, I'm going to talk about a couple of the results. Like I said, I'm going to skip motivation, information need, and prior familiarity, though if you have questions about those, feel free to ask. I'm going to focus on age, gender, and language; I'll skip education and locale as well, but again, feel free to ask, and I've also included a wiki link for the meta page that has the full set of results, so you can check there. All right, starting with gender. The question we asked on the survey was: what is your gender? People could identify as women or men, say "prefer not to say," or use an open text response to describe their gender in their own words; those responses we coded as self-described, with some cleaning up there too. What I'm going to show you is a series of bar graphs; I'm giving you a smaller version right now so I can explain what it's showing. In these bar graphs, each language, each survey essentially, has a set of results associated with it. The y-axis is the proportion of respondents who identified in a particular category. In this case, you see that close to 70% of respondents from Arabic Wikipedia identified as men, somewhere around 30% (27%, I think, from the past slide) identified as women, and a very small proportion of respondents self-described their gender. You're going to see these for each language, like I said, and the little black bars are 99% confidence intervals. And here it is for the entire set of languages that we ran the survey in. A few things I want to call out about these results. The majority of respondents in almost every language identified as men, the blue bars. The exception is Romanian, towards the right of the screen, which was much closer to gender parity. Some things I'll say that aren't necessarily depicted on this bar graph: for many languages, younger readers are more balanced in gender, which is to say that the youth population of readers on Wikipedia looks a lot more balanced, and that is maybe a silver lining to these numbers. I'll also say there's a lot of variance across countries. And finally, a small proportion of respondents in many of the languages did choose to self-describe. The next set of bar graphs I'm going to show you looks at the relationship between the gender of readers, people identified as men and women, in this case where we had enough data, correlated with their reading behavior. Specifically, we're able to look at the gender of the subjects of the biography articles they are reading. In this example for Arabic, the blue bar on the left is the proportion of page views where men read about men; in that case, something like 27% of page views from men were to biographies of men. The orange bar is women reading about men, so you can see that men read more about men than women do. And the green and the red bars are men reading about women and women reading about women. You see from this that men are more likely to read biographies of men than women are, but women are more likely to read biographies of women than men are. We looked across a lot of the languages, and this data has more variance to it, but we do see some pretty consistent patterns: in general, men read relatively more biographies of men than women do, and women read relatively more biographies of women than men do. And we see this across a variety of different demographics, this kind of self-focused reading of content that relates to the reader's own demographics. For age, for instance, we see that younger readers tend to read biographies of younger people, and so on. The third thing I want to talk about is topic versus gender. So, out of the 15 surveys (15 because there were 13 languages plus the upsampling of Africa in both English and French), we looked at the types of topics that readers are reading and correlated this with gender. From this we're able to say what sorts of topics seem to attract relatively more men or relatively more women. For the men readers, what we see is, for instance, sports: in 12 of the 15 surveys we saw a skew towards sports articles being more likely to be read by men, and these are things like the 2019 Africa Cup of Nations or the Copa América. Also technology, so articles about things like YouTube and WhatsApp (I'm giving the most-read articles in each topic space here), and military and warfare, so articles like the Cold War or the American Civil War; those tend to attract more men readers than women readers. For women, we see medicine articles: in nine out of the 15 surveys we ran, we saw a significant trend towards women being more likely to read these articles, for example schizophrenia or Asperger syndrome. Same with broadcasting, which tends to be mostly TV series and things like that, as well as biology. But there's also a good number of topics with a much more balanced readership. For philosophy and religion, out of the 15 surveys we saw no significant trend towards men or women.
Same with many of the geography articles, in this case the geography of the Americas: you see just as many men readers as women readers, relatively speaking. Same with biographies, where 12 of the 15 surveys showed no significant skew, and for the ones that did, I think in two surveys biographies were more likely to be read by women and in one survey more likely to be read by men. So yeah, we do begin to see some of these topic differences, and I'll get back to this at the end. All right, moving on to age. We asked: what is your age? We gave various age brackets. You'll see that under 18 is not represented here. Unfortunately, because we were asking about age and we couldn't get parental consent (it's not easy to get parental consent via an online survey like this), we could not allow people under the age of 18 to take the survey. Actually, I should have said this before, but all the data you're seeing is for readers above the age of 18. Again, you're going to see this set of bar graphs. In this case, the bar on the left reflects readers who dropped out of the survey at this question, so we were able to say how many people dropped out: you can see that about 20% of readers of Arabic Wikipedia are under the age of 18, about 40% were aged 18 to 24, and so on. And again, here it is for all of the surveys that we did. The trends we see: the majority of respondents in almost every language were less than 25 years of age, a very youthful reading population. There are exceptions to this general trend: in both German and Norwegian, we saw many more readers in the older age categories than in the other languages. And you'll notice in Spanish and Hebrew there's a large proportion of readers under the age of 18. This is, I think, a good point to call out: some of these differences are certainly very real, but in the case of Spanish and Hebrew there are seasonal effects. Many of the Spanish-speaking countries are in the Southern Hemisphere, so school was still in session, and for Hebrew, I believe it was right around the end of the school year in Israel when the survey went out. So there it makes sense in many ways that you'd see many more readers under the age of 18. And finally: what is your native language? Here we asked readers to identify their native language, and we also gave them a chance to add as many languages as they considered native. For that second option, I hand-coded the responses. I'll give an example: for Arabic, we have three categories into which we coded the responses. If the reader took the Arabic survey and said that they spoke Arabic as a native language, and only Arabic, then they fit into the blue column, monolingual native speaker. If they said, I speak Arabic, and I also speak, say, English as a native language, then we coded them as a multilingual native speaker. And if they listed languages, none of which were Arabic, then we would say they're a non-native speaker on that language edition.
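To make that coding rule concrete, here is a small sketch; the function and its labels are an illustration of the scheme as described, not the actual analysis code.

```python
def code_native_speaker(survey_language, native_languages):
    """Code a respondent on a given language edition per the scheme above."""
    langs = {lang.strip().lower() for lang in native_languages}
    if survey_language.lower() not in langs:
        return "non-native speaker"
    return ("monolingual native speaker" if len(langs) == 1
            else "multilingual native speaker")


assert code_native_speaker("Arabic", ["Arabic"]) == "monolingual native speaker"
assert code_native_speaker("Arabic", ["Arabic", "English"]) == "multilingual native speaker"
assert code_native_speaker("Arabic", ["French"]) == "non-native speaker"
```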
And when you look across all the surveys, I think two distinct trends come out. Wikipedias like Hungarian and Norwegian have readers who are predominantly native speakers; it varies whether they're monolingual or multilingual, but for Hungarian, for instance, something around 98% of readers identify as native Hungarian speakers. However, Wikipedias like English and French are associated with much higher proportions of non-native readers. On English Wikipedia, almost half of the readers did not list English as one of their native languages. For some this will not come as a surprise, but for others I think it's very important to point out that many of the readers are not native speakers, and it's something to keep in mind when thinking about editing and about the population of people who are reading these articles. Finally, I'll wrap up. I'm not going to speak much about this, but I do want to highlight that there have been a number of past efforts to survey readers; I included a couple of links, and I'm happy to talk about any of those as well. But yeah, takeaways. I think there are four big takeaways for me from this. The first is around reader diversity: these Wikipedia communities differ greatly, I think most notably in whether their readers are monolingual, multilingual, or native and secondary speakers of the language, and it's good to remember that we have this huge diversity of readers visiting Wikipedia. The second is around the pipeline of online participation idea. In many regions we're seeing large gaps in the reader population, maybe not as large as in the editor population, but certainly large with respect to gender and age. And I didn't talk much about education, but we have much more highly educated readers than would be expected from a given country. So I want to highlight that we cannot address the gaps in our editor population without also addressing the gaps in our reader population. The next one is readers versus page views, getting back to my earlier point about how you ask these questions and what we mean by readers. While men and women may be equally aware of Wikipedia in many countries, which I think is what you get from a lot of the outside surveys that have been done, this survey indicates that metrics like page views likely represent a more skewed population. If you're a very frequent reader of Wikipedia, you contribute a lot to page views; if you're not a frequent reader, you don't contribute much. And from this data, we're seeing that frequent readers do tend to skew more towards men, towards more youthful readers, and towards more highly educated readers as well. And finally, representation matters. Readers tend to read articles about people of similar demographics; I showed you data for gender, but this holds for age as well, suggesting a relationship between content gaps and reader gaps, and this is work that we really hope to follow up on as we move forward, to better understand this relationship. And with that, I'll say thank you and take questions. Again, if you didn't see a graph or you want to see more of these graphs, there's a quick link that will take you to the meta page, and thank you again to all of my collaborators. Thank you very much, Isaac. So I believe we have up to about 15 minutes that we can spend on Q&A today, and I want to start with questions for Martin's talk first, because he's been waiting longest. I know we have at least several in the queue, so I'll hand it off to Miriam. Yeah, so both presentations generated a lot of discussion on IRC. For Martin, we have one question from YouTube about evaluation.
So the question is: are these two methods for source retrieval and text alignment the best according to performance measures, or are there other methods as well, depending on the task? Martin? So, I'll answer the second part first. There are definitely a lot of different methods. As I said, we have studied text alignment and source retrieval in the form of research competitions, or shared tasks, for years. We tried to extract the best state-of-the-art technology out of these competitions and apply it. However, the demands of scaling this technology and getting results in a reasonable amount of time also meant that we had to step back a little. In the paper, you will find a table, table one in the middle, with the different source retrieval alternatives that we tried and evaluated in terms of precision and recall. In source retrieval we aimed for a very high recall, so that we do not lose many cases, and mostly neglected precision. For text alignment, we applied a technology that we also developed for our plagiarism detection tool, where it works quite reasonably well. So I would say it is state of the art, perhaps a little bit subpar in terms of performance because of the scale demands. Great, thank you. The second question for you comes from IRC; it's a more general question about copyright violation. The question is: I wonder what to make of the huge number of copyright violations in the reused Wikipedia pages. How can we understand that? Is it too complicated to use CC licenses properly? Should we adapt our license accordingly, or should we rather try to make people better comply with CC? It's a good question; I'm actually not sure. First of all, I'm not an expert on these matters; this is important to say first. But my thinking is that hunting down these copyright violators is not the right way to go, because in the end it is good that Wikipedia is used. Perhaps making them aware of the fact could be something, but I think the possible returns from investing the work to make these web authors comply are probably not high enough, really. The second thought I had, which is not in the paper and which I developed only today while preparing the presentation, is this: given that fake content and generated text, thanks to GPT-2 and other such efforts, will appear much more frequently on the web in the very near future, perhaps reusing Wikipedia should be made easier, regardless of who does it and why. Wow, that's a really interesting insight; thank you for sharing that. I wanted to jump in with a quick follow-up question, more of a clarifying question than anything else: did I see that the United Nations is among the top reusers of Wikipedia content without attribution? No, I think they attribute it, but I would have to double-check. It could be that they only reuse small passages; I think they are not among the ones who reuse larger parts exclusively on one page. So I think the United Nations was one of the few who were doing it right. Well, I'm glad to hear the United Nations is doing it right; it seems like that's a good thing for everybody. Yes, definitely. Yeah, and the third question for you is about extending to other languages. There is a question from Morten, I believe.
In our 2012 paper, In Search of the Ur-Wikipedia, we made a first attempt at looking at the extent of translations between languages based on templates, meaning a user explicitly marked an article as a translation of one in another language. Based on the results, we think this misses a lot of translations. What are your thoughts about whether your approach for identifying text reuse can work across languages to identify translations? So, I've worked on cross-language similarity models. They don't scale as well as the monolingual ones, so additional work would be necessary. Perhaps the tools developed in the deep learning community for language-independent representations will help in this respect; I think there's a distinct chance of at least getting the source retrieval step to work in a language-independent way. But text alignment across languages will be a different matter, because at some point one has to actually compare whether a certain shorter piece of text or sentence corresponds to a piece of text in another language. I believe there is research on doing this cross-language, for example in cross-language parallel corpus construction. However, we have to deal with text that may have been heavily edited, and this complicates things a lot. So perhaps source retrieval, yes, but text alignment not quite so soon. Great. And lastly, I think we have some live remarks or questions from Leila over voice. Thank you very much. Martin, thanks for your presentation. A few points that I thought were worth sharing following your presentation. (I think the room in the office is muted. Now unmuted. Great, thanks.) One is that you may want to know that we have gently started spending more time and resources on the problem of reuse. So if you're interested, please talk with Isaac Johnson about it. As part of the Medium-Term Plan for the Wikimedia Foundation, we have an organization-wide target to at least measure reuse and get a better sense of it. So just be aware that we're starting, and we can talk more about that with you. One of the things related to the problem of reuse that may be good for you and others to be aware of is that, in the long term, we are interested in being able to answer the question: what is the economic value of Wikipedia? You started touching on that by looking at the advertising aspects of the websites that are reusing content. Generally, being able to more accurately answer the question of the economic value of Wikipedia can have a lot of positive implications for the negotiations that the Foundation or the community does when partnering or changing policies. So just be aware that we're interested in that problem. May I comment now? Please. I did not say this in the talk, but we are actually thinking about the very same thing: trying to measure what Wikipedia would be worth if we had to build it from scratch, so how much we would have to pay. And this is also in light of the recent EU laws, like Germany's Leistungsschutzrecht (ancillary copyright is the English term, I think), which has now been passed at the EU level and which allows at least news publishers to collect royalties on even the smallest reused snippets of text. This has been in place in Germany for a couple of years now, giving Google a headache, and a similar law even caused Google to shut down Google News in Spain entirely.
So there's a lot going on around this, and we're actually trying to develop technology in this direction, so we could collaborate on that too. Yeah, so let's touch base, maybe after this talk. And then I'll quickly make a couple more points, and then we should switch to Isaac. Jonathan, how much time do I have? Quickly, I think; one minute. So: there are community bots that may take advantage of the technology that you're building, Martin, like the ones mentioned on IRC, EranBot and CopyPatrol, so we will send you some links, and you may want to follow up with the community members. Also, we have a white paper on knowledge integrity where we talk about the reuse of biased content and the propagation of bias and misinformation, which you may want to specifically focus on as part of your research. And the last point I would make is about this question of whether it is worth trying to bring people to Wikipedia or not. As you mentioned, we have kind of mixed signals around this, right? On the one hand, we want Wikipedia content to be used across the globe, whether on Wikipedia or not. On the other hand, we do believe that people coming to Wikipedia as a platform can help them have a more neutral view of the content they're reading: while they can read snippets of content outside of Wikipedia, the neutrality comes from the sum of all the content on Wikipedia. The platform also provides them with an opportunity for serendipity and for increasing their curiosity. So we are generally interested in having a platform that people come to, even if the content is being reused on other platforms. Thanks. Thanks for the comment. There are many points of connection, and I would be happy to follow up on all of them. Yeah, I suspect we'll have opportunities to talk more as a team, Martin. That's wonderful. Miriam, back to you; questions for Isaac. Yeah, Isaac, your presentation generated a lot of joy and sadness on IRC, so you can see the feedback there. There are also comments from Seddon about having similar insights from fundraising data, so maybe you should coordinate with them. And there is a question on YouTube, which is from Martin; so maybe, Martin, you want to ask it. Yeah, good, I can read it. I wanted to know: you presented a gender distribution at the very beginning of your talk that you measured. Perhaps this gender distribution is normal when compared to the gender distribution of search engine users, or to the gender distribution of the general web population, and the problem may not be with Wikipedia but upstream. Did you do this comparison? Yeah, no, it's a very good point. Two things I want to say to that. One: it's something we've thought about, but I can't give you a great response right now. If you look at, for instance, Alexa, the web ranking platform, it will give estimates of how representative a given site's reader population is of the web. There's a lot of proprietary stuff there, and I don't really know what they're using, but they claim that Wikipedia looks very similar to the rest of the web. So in that case you want to say, oh, that's pretty good. But on the other hand, if you look at sites like Instagram, there's a variety of social media sites where survey data would suggest that women actually use them more frequently or spend more time on them than men do.
So we are seeing sites where you don't have this skew, which at least (I'm forgetting the right word) is evidence that we shouldn't necessarily just expect a skew towards men over women in use of the internet. Jonathan and I have been thinking about a lot more hypotheses around what explains these numbers and whether they're surprising or not. So there's that. And the other thing I would say is that, regardless of whether the numbers reflect an additional filter that Wikipedia is somehow putting on, for instance, general search traffic, we talk a lot about knowledge equity and this idea of getting Wikipedia content to everyone; it's kind of baked into our mission. If there is a barrier, if the reason we're not seeing gender parity is some sort of barrier related to Wikipedia, that's something we want to figure out and hopefully be able to do something about. You may be looking at something that is perhaps even more general in information-seeking behavior: that hunting for information is something that men do while women ask the men around them to do it. But anyway, it's really interesting, and this comparison might actually help to put things in context. Yeah, and I'll say too that, based upon gender differences in internet access and the like in various regions of the world, we did expect some lack of parity in some regions. But we were also surprised by some of the areas where there has been broad internet connectivity and high education levels for a long time, and we're still not seeing gender parity there. So I think the results are still more surprising than expected to us at this stage. All right, thank you, Isaac. Apart from the many observations on IRC that I'll let you read, we don't have any further questions for now, unless someone from the room wants to ask something. I would just add one comment; Isaac, I think, said everything that I would want to say. I would just underline that what we are seeing in terms of the gender distribution of readership has basically opened up a new avenue for our team, the Research team at the Wikimedia Foundation, to focus on this space. I would consider this the beginning of the journey to understand why we're seeing the gender distribution the way we're seeing it right now. It's important to understand the why of it, exactly as Isaac was saying. During the 2030 strategic direction discussions, and eventually in the direction that the movement chose, we decided on knowledge equity as one of the aspects of the work we want to do towards 2030, and it's important for us to understand why we are seeing the gender distribution the way we are seeing it. Specifically, I would say for a language like Norwegian: we included Norwegian as a language in which we would expect to see very close to gender parity in the responses. So we may learn that what is happening is outside of our control, or we may learn that there are things we need to do in this space. But I would like to say that we are committed as a team to spending more resources on this space, and as Isaac said, Jonathan has started looking into the state of the literature and coming up with a series of proposals and hypotheses for the team about why this is happening.
We also reran the surveys for a period of a month, to give more of a chance to readers who come to Wikipedia less frequently, and Isaac is analyzing the results of those surveys. So expect more from us in this space over at least the next nine to 12 months. Excellent. Thank you, Leila. I think we'll leave it at that. Thank you, Miriam on IRC. Thank you, Emerald. Thank you, Janna. And thank you, Martin and Isaac, our speakers for today. Thank you out there in the audience. Please, if you have not yet filled out the feedback survey, do so; you'll find a link to it in the latest email update and at the top of the research showcase page on mediawiki.org. We will be back next month with two more action-packed presentations. Have a wonderful rest of your day. Thanks. Thank you. All right, thanks. Thank you all. Yes, thank you very much, and thanks for having me; it was really great. Great. So, if the opportunity comes up. I imagine there will be more conversations. Yeah, sure. Especially since what Leila said sounds very interesting. I'd be very happy to...