Okay, good morning, everyone. This is Dario from the Wikimedia Foundation's research team, and I'm thrilled to welcome you to the September edition of our monthly research showcase. This month I'm joined by two great guest speakers. We have Lucie-Aimée Kaffee from the University of Southampton, who will be talking about multilingualism in Wikidata. And our second speaker is Neil Thompson from the MIT Sloan School of Management, who will be presenting results from a controlled experiment on English Wikipedia showing how coverage of scientific topics shapes scientific communication. As usual, each talk will be approximately 25 minutes, followed by a short Q&A, and we'll have more time at the end of the showcase for people who want to stick around for additional questions. You're also welcome to hop on our Wikimedia research channel on Freenode, our IRC channel, where Jonathan Morgan will be your host, relaying questions to the speakers. So, without further ado, I'd like to invite Lucy to give her presentation. Thank you.

Thank you, Dario. I'll share my slides, one second; that should have worked. So, as Dario introduced already, I'm going to talk about multilinguality in Wikidata. I worked on this together with two colleagues of mine at the University of Southampton, Alessandro and Pavlos, under the supervision of Elena Simperl and Leslie Carr, and together with Lydia Pintscher from Wikimedia Deutschland we looked into the state of multilinguality in Wikidata as of now. I assume most of you will be familiar with Wikidata in general already, but I'll give a short recap anyway. This is what an item more or less looks like on the website. We have statements for each item, in this case Douglas Adams: Douglas Adams points at Jane Belson via the property "spouse", and each item has a unique identifier, in this case Q42. What we looked at in particular are the labels. Unique identifiers help us a lot to make a knowledge base completely multilingual, because we can refer to a concept without natural language. "Douglas Adams", in this case, is the English label for the item Q42, and we have descriptions and aliases in natural language as well, but we're mainly looking into labels at the moment.

The dataset we worked with is a Wikidata Turtle dump from March 2017. It contains all items, so in total roughly 26 million entities and about 3,000 properties. The interesting part for us were the labels: there are 134 million labels, which sounds a bit odd next to 26 million entities, but you have to consider that each entity can have one label per language, and we have around 290 languages in Wikipedia and Wikidata in total. So it's actually quite a low number relative to the number of entities. For the people who are more familiar with the Semantic Web: what we use to identify labels is the property rdfs:label, which comes from one of the standard ontologies. Wikidata is a bit redundant when it comes to labels, because it exposes them through several ontologies, for example through skos:prefLabel and schema:name as well. Here's an example, the item Q12345, which is Count von Count. It has the label "Count von Count", and a label always carries a language tag, so we know which language we're actually looking at. In English it's "Count von Count", in German it's "Graf Zahl", and in Russian it's something like "Graf fon Znak", I think. So we can compare those and have them all in one central place with the item.
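To make the label structure concrete, here is a minimal sketch that pulls the labels of Q42 in a few languages from the public Wikidata SPARQL endpoint. The endpoint usage, the choice of languages, and the script itself are illustrative assumptions added here, not part of the study's pipeline.

```python
import requests

# Minimal sketch: fetch the rdfs:label values of Q42 (Douglas Adams) in a few
# example languages. The wd: and rdfs: prefixes are predefined by the endpoint.
QUERY = """
SELECT ?lang ?label WHERE {
  wd:Q42 rdfs:label ?label .
  BIND(LANG(?label) AS ?lang)
  FILTER(?lang IN ("en", "de", "ru"))
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "label-example/0.1 (research showcase demo)"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["lang"]["value"], "->", row["label"]["value"])
```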
Why do we even care about multilinguality in Wikidata? The thing is that labels are the access point for humans to understand the data at hand. Q42 is great for machines to understand that this is a concept about a person and so on, but humans actually need to interact with the data in the end, so labels are extremely important. They give language communities access to the existing knowledge in Wikidata: as soon as we translate the data, more people have access to it. Wikidata could also be a central store of translations of those labels, especially for under-resourced languages, given how big the Wikipedia communities already are. And of course those labels can support natural language processing and generation in different tools and domains, such as translation, question answering, and chatbots, and all the possible applications there. Given all that, we thought about what we want to find out from the data at hand when we look at Wikidata. First of all, what is the state of Wikidata with regard to multilinguality? We also wanted to look at the difference between the multilinguality of the ontology and the overall multilinguality of the knowledge base; the ontology, in our case, we simply define as the properties, since Wikidata doesn't have a formal class structure. And finally, we wanted to see how Wikidata's label distribution relates to the real world, to first-language speakers, and how it compares to Wikipedia's language distribution. Is there a connection between the two knowledge resources?

Looking at the first question, the state of Wikidata with regard to multilinguality: to give some context, we looked at the web. On the web, over 50% of all content is in English, even though only about 25% of the world population speaks English. The second biggest language in terms of users on the web is Chinese; however, only 2% of web content is in Chinese. So you can already see there's a huge gap between actual language speakers and content on the web, and we thought maybe knowledge bases could help decrease that gap. First we looked at the percentages of all languages among Wikidata labels, and you can already see that English, at only about 11.4%, has a much smaller share, so the distribution is much better towards languages other than English. Still, only 11 languages hold over 50% of all content in Wikidata in terms of labels. The second question we looked at is whether there's a difference in the multilinguality of the ontology compared to the overall multilinguality of the knowledge base. So we made the same chart for properties only, and as you can see, compared to the knowledge base overall, the distribution is even better, which is very interesting for us: the labels of items are often imported from Wikipedia article titles, while that can't happen with properties, because there is no connection from properties to Wikipedia pages, so properties have to be translated within Wikidata itself. Coming to the third research question, how Wikidata's label distribution relates to the real world and to Wikipedia's language distribution, we see there is a lot of maldistribution, similar to the web in general. So, the first thing...

Lucy, it sounds like we've lost your audio. I don't know if you can hear me.
Yeah, I lost it as well. Okay, hopefully Lucy can come back; let's see what's going on. Hey, Lucy, welcome back.

Hi, sorry, I lost you. Where did I lose you? You were presenting the chart with the distribution of languages, right after the pie charts. Okay, which one, the properties? Yeah, the one right after the pie charts, the items. Right, great. All right, I'll share it. Is it that one? Here we go. That's right. Perfect, thank you very much. Of course.

Okay, so coming back to how the label distribution relates to the real world; I got a bit lost, sorry. Going back to Chinese: Chinese is very small in terms of label information in Wikidata but has a lot of native speakers. What I forgot to mention, and I'm not sure if that part was recorded, is that Wikipedia is censored in China itself, so it's mainly expats who actually work on the Chinese Wikipedia and therefore on Wikidata as well. Another interesting case is Swedish and Cebuano, as there's an editor running Lsjbot, I think it's called, so they have a lot of stub articles imported by that bot, making Cebuano one of the biggest Wikipedias in terms of articles, actually. He started on Swedish and then went over to Cebuano, and this already indicates how closely connected Wikipedia and Wikidata are in terms of languages. German and Dutch are quite interesting as well, because Dutch especially is a fairly small language in terms of native speakers, German so-so, but they have a lot of information. I'll show later that this is not mainly due to Wikipedia articles, although those Wikipedias are quite big as well, but due to the fact that they have very active communities on Wikidata; I'll get into that in a bit. We also looked at what the biggest Wikipedias are in comparison to the languages in Wikidata and in Wikidata properties. As I mentioned before, German is quite well represented in Wikidata, and as we said, Wikidata properties have to be translated by the community and can't be imported from Wikipedia. So we can see the pattern here: German is well represented on Wikipedia, in Wikidata, and in the properties as well, which gives us an indication that the community is also German-speaking. The same goes for Dutch. Swedish, however, and that's where it gets interesting, is very big when it comes to the number of articles on Wikipedia, and it is also quite well represented in Wikidata overall, but in Wikidata properties it is a lot less present, which gives us an indication that the Swedish community is not as active as in the previous languages. And with Cebuano it becomes even more extreme: it is the second biggest Wikipedia in terms of articles, and it is quite high in the number of labels on Wikidata as well, which are probably imported from Wikipedia, but it's not even among the 25 biggest languages for Wikidata properties. And then I got really fascinated by this topic: how does it all hang together, how do they influence each other, and who are the users in Wikidata and what are they doing? This is not part of the paper, but I thought I'd give you a short outlook; it's not something I have great conclusions on yet, just an outlook on what I'm working on at the moment. The first idea was to look at the user language setting in Wikidata and Wikipedia in general.
The user language setting is something each user can change in their preferences, and in Wikidata it actually has quite a big impact, because all the labels get changed: on Wikipedia, for example, it's just your interface buttons that change, but on Wikidata all the entities and properties are shown in the language you prefer to read in. However, the default language in Wikidata is English, which is why we assume English has such a big share. So we looked at the languages excluding English, and then it becomes more interesting. French, German, and Spanish, being Western European languages, are very well represented; German, as we saw before, is covered pretty well in Wikidata in all regards; and Russian similarly has a pretty big share. However, we figured that this actually tells us more about how people read the data than about how they edit it, because this is how they look at Wikidata, how they browse. It also doesn't take into consideration that people might speak more than one language, might feel comfortable in more than one language, and therefore might edit in more than one language as well. So we thought about looking at the Babel box; this, for example, is my Babel box. With the Babel box, and again most of you probably know this but I'll give a short recap anyway, you can self-assess the languages you speak, from native language down to zero. In my case, as you can probably hear from my accent, I'm a German native speaker; I know English pretty well, I'd say, so I can give it a four; and I just started learning Arabic, so I can barely read the letters, which is a zero for me. This is quite interesting because it can enable me to do things on Wikidata, such as translating labels or reviewing labels, in more than just the two languages I know best and more than the language I have the Wikidata interface set to. So we looked into the Babel box information from users, and we see a sudden switch. The problem is that we have only under 4,000 people on Wikidata who actually have a Babel box, so the sample is quite small, but it already becomes very interesting, because German suddenly overtakes English as one of the most commonly declared languages in the Babel boxes. This might be due to a bias in our sample towards the German community being more likely to set up a Babel box, but that's something we want to investigate further, along with how we can use this multilinguality and reflect it in Wikidata's labels. So in conclusion, we can say that even languages spoken by a large part of the world are not well covered in Wikidata at the moment, and for minority languages it becomes even worse. But as we have seen, the community can have a big impact on language coverage, as for example with the properties. Also, more variety is possible given the languages the community actually knows, and the question for future work is how to encourage people to edit in all the languages they know. The importing of labels by bots can also have a big impact on the distribution of languages, as we've seen with Swedish and Cebuano. But of course, there's still a long way to a truly multilingual Semantic Web. So yeah, that's it from my side, thank you very much. There's the link to the paper for the first part, and of course the Wikidata entity ID for the paper as well. Thank you very much.

Terrific, thank you Lucy.
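As a rough sketch of how per-language label statistics like the ones in the talk could be derived, the snippet below scans a decompressed Wikidata Turtle dump for rdfs:label triples and tallies the language tags. The filename and the regular expression are illustrative assumptions, not the authors' actual pipeline.

```python
import re
from collections import Counter

# Sketch: tally rdfs:label language tags in a (decompressed) Wikidata Turtle
# dump and print each language's share of all labels. "wikidata-dump.ttl" is a
# placeholder filename; a real dump is far larger and would normally be
# streamed from its compressed form.
LABEL_RE = re.compile(r'rdfs:label\s+".*?"@([A-Za-z][A-Za-z0-9-]*)')

counts = Counter()
with open("wikidata-dump.ttl", encoding="utf-8") as dump:
    for line in dump:
        match = LABEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1

total = sum(counts.values())
for lang, n in counts.most_common(15):
    print(f"{lang:>7}: {n:>12,} labels ({100 * n / total:.2f}%)")
```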
And yeah, so I want to ask Jonathan, are there any questions from the channel? No questions so far. Okay, I have a couple myself then, Lucy, so I'm going to ask you directly. Lucy, this is fascinating work, and it actually aligns quite closely with something we're doing at the Foundation at the moment. We're studying ways to bridge the language gap, primarily on Wikipedia; we're focused on Wikipedia, but there's a strong interest in figuring out ways to accelerate the creation of good content in local languages that in many cases are spoken by large populations but are underrepresented on Wikipedia. So I have two questions for you. The first one is: you've been looking at language preferences, and at the relation between Wikipedia coverage and label coverage. What would be your recommendation, based on these results, for finding the right population of contributors we could pitch tasks to for creating or translating labels? And the second question: you mentioned the use of bots for seeding or generating labels or descriptions, and I was wondering if you have any evidence of the extent to which a first label created by a bot is then modified or curated by humans. Do you have a sense of the prevalence of humans curating bot-generated labels and descriptions, as opposed to bot-generated descriptions that basically stay like that forever?

So, going from the back to the start: I had a brief look at edits of labels in Wikidata, and I have to say that besides English there are few edits on the labels; usually once they're imported, they're there, and there are only minor edits on them. So I haven't seen much on how bot imports get edited afterwards by the community. That's a very interesting topic, I think. The other part, about recommendations for people editing the languages: my assumption is that there are two points that are very important here. One is the interface, making it easier for people who don't necessarily understand the structure of Wikidata and so on to just translate information. And then there's work on crowdsourcing with small language communities, which has usually shown that people take a certain pride in their language, especially if it's a minority language. And I think this is somewhere Wikimedia could look for new editors: basically not getting the traditional Wikipedia editors to also edit Wikidata, but rather seeing if we can get people from the small language communities to just translate a word here and there, just translate a concept. Because I think with the right usage of Wikidata in, for example, small-language Wikipedias, we'd gain a lot of newly accessible content.

Yeah, that makes a lot of sense. Thanks for the two answers. I guess we have a question from... yeah, a question from Computermacgyver. I'll read it out so everyone can hear it. Was the Babel box analysis only of users who had it on their Wikidata profile? If so, it might be interesting to check the linked Wikipedia user pages or look at the language editions that users edit. Yeah, that's exactly what I was thinking as well. I'm not entirely sure, to be honest, whether it pulls the user pages from, I think it's Meta, into the dump or not. Why was I not sure? Oh, because I thought I had seen one person whose user page was included directly from Meta, but that's certainly something I want to look into further.
It's just very preliminary right now. Awesome. Computermacgyver asked that question and also followed it up with: great work and many thanks to Lucy for sharing. Thank you very much. That's all the questions from IRC.

I have a question, if I may. I'm wondering whether you can infer some things from the data you have about how well people are able to use their second or third languages to access Wikipedia. Implicitly, if you see someone who is, say, a native German speaker but there's a lower propensity to translate, that might be telling you something about how German speakers can a little more easily understand the English Wikipedia. And so the data you have, maybe combined with those country surveys about how many speakers know secondary languages, might give you something that would help Wikipedia understand which key languages, if translated into, would suddenly let a whole group of people understand the content, beyond just the number of native speakers of that particular language.

I've looked a bit at the editing data already, also very preliminary, but what I've seen so far is pretty much three kinds of users among the 4,000 users with Babel boxes. Either they don't add language information at all; or they add information very much in their first two languages; or, the third kind of editor, which I think is very specific to knowledge bases, edits in literally any language. They don't care; they have a lot of edits, and they just look for the specific Wikipedia article in a different language, independently of their own languages. But yeah, you're right. I'm very interested in going further in this direction and also in understanding how it combines with specific languages.

And I think you could skip over needing to know someone's Babel box information by just using their base language, if you knew it, and properties of the country, or the IP address or something. Just a thought. Yeah, exactly, that sounds very good. The main problem with that is that geographic information and IP addresses for registered users are private information, right? So that's not something widely accessible to researchers. But we've thought about whether we can infer from people's edits what languages they speak, in order to suggest languages for them to edit, for example. Yeah, yeah. Hey, we've spotted your Italian. Exactly. Fantastic.

All right. If there are no other questions, Jonathan, from IRC? Okay. Thanks a lot, Lucy. You're going to stick around, so we may have some final discussion at the end of the showcase. And we're going to move to our second talk, by Neil. Neil, I'm very glad to have you here. Neil reached out a couple of years ago already to first talk about these studies, so I'm very excited to hear about the results today.

Oh, thank you very much for having me. Let me share my screen here. Are you able to see my slides now? Yep, that's working great. Excellent. All right. So thank you, Dario, for having me here. I'm very pleased to be able to talk about how science is shaped by Wikipedia. I also want to acknowledge my co-author, Douglas Hanley, who has done some excellent work on this as well, so I really appreciate his contributions.
I want to start by talking about something that I think Wikipedians think a lot about, which is public repositories of scientific knowledge. One of the first ones that comes to my mind is the Human Genome Project. An enormous amount of work was done by scientists, and the sharing of information became a real part of how they were doing the science, to the great benefit of very many people: all of a sudden scientists could come in and get that information. And I think it's undoubtedly true that people have used that information for far more than the people who did the original work would have guessed, but the fact that it was freely available allowed them to do some pretty amazing things with it. Now, the challenge we have from an economics point of view with public repositories of scientific knowledge like this is that it's been known for a long time that it's hard to get market incentives to provide them. And so you end up with situations where, for example, academic journals end up being closed, charging a price to access the scientific information. The good part of that is that at least there is some stream of revenue that can be used, and historically that's been an incredibly important thing. But on the other hand, it also excludes a lot of people. And that's not just a first-order effect on the people who can't afford it; one of the problems with knowledge is that you often don't know whether it's going to be useful to you until you've read it. So there are going to be lots of people who don't even realize that they're missing out on important information. So here we have two kinds of repositories of scientific knowledge, and what I want to talk about today, obviously, is Wikipedia, which is in some ways a different type of repository of scientific knowledge. It clearly has a lot of scientific information, and I'll talk about that in a few minutes, but it also has a bit of a different flavor; it's a little more accessible than the others. And I want to think about that in the context of something Charles Darwin said: "I sometimes think that general and popular treatises are almost as important for the progress of science as original work." Now, this is a man who could really say that, because he both wrote a world-changing book and wrote a book that is, to this day, incredibly accessible; if you pick it up, it's actually very fun to read. So he really was combining those two things. For the sake of Wikipedia, though, what we want to think about is this: Wikipedia is one of the biggest sources of general and popular treatises on science, and I'll show you that. And I want to think about the question of how that influences the progress of science. So if we think about the traditional view of this, we would say, well, maybe there's a scientific breakthrough that happens; and here I'm going to use the Journal of the American Chemical Society, because our experiment is going to be in chemistry. You have the scientific breakthrough, and that's going to lead to some follow-on research; people write some follow-on articles.
And I think we would all not be surprised to say that Wikipedia is also reflecting that scientific breakthrough: someone comes up with something new, and an article gets written on Wikipedia. That's great. But what Darwin was proposing was that these general treatises on science are actually important for the progress of science, and that's more about this other arrow, Wikipedia feeding back into science and helping to shape it. That's what we're going to talk about today, and that's going to be the central research question. Now, I want to start, as we dig into this, by comparing some things which are almost incomparable, and that's actually the point I want to make. If you think historically about Britannica, you have about 65,000 articles and a website somewhere around the 2,000 most visited in the world. And I think the fact that it's called an encyclopedia, as is Wikipedia, creates an equivalency in people's minds that is not at all supported by the actual numbers: Wikipedia has 5.3 million articles, roughly 100 times as many as Britannica, and is the fifth most visited website in the world. What that's telling you is that there is just a huge difference in the volume of information that's on there. And typically, I would bet that the Wikipedia article is also longer, so we might be talking about 100 to 1,000 times more information in Wikipedia. And if you remember that 65,000 number for just a second, it's worth thinking about what it means if we narrow down not to all of Wikipedia, but just to science in Wikipedia. We've done some analysis, and these are rough numbers, but it's about half a million to one million science articles in Wikipedia. So that's, again, at least an order of magnitude bigger than all of Britannica. This is just an enormous repository of science. Now, that might lead us to say that the right comparison is to the scientific literature. Web of Science is a big repository that catalogs a lot of the information about publications, and it has something like 90 million records in it, the implication being that there is about one Wikipedia article for every 120 scientific journal articles. And that itself I find very interesting. Something that we often point our students to are review articles, which summarize what some of the literature has done. What this is telling you is that if there is one Wikipedia article for every 120 scientific journal articles, one way to think about at least some of these Wikipedia pages is that each is effectively a review article that gets updated all the time. And that's an incredibly valuable thing for a field. Okay, so hopefully I've convinced you that there is a lot of scientific information on Wikipedia by volume. But you might want to know not just about the breadth of it, but about the depth: how much detail is actually covered? For this, we did a little review ourselves. We took a sample of chemistry topics from university syllabi; we went to Harvard, MIT, Cambridge, and a number of others, and we got syllabi at the undergraduate level and the graduate level.
We went through and identified the topics on them: 646 undergraduate topics and 136 graduate topics. We then had PhD students in chemistry go to Wikipedia, search for those things, and rate how good the coverage of those undergraduate and graduate topics was. And what do we find? We find that coverage at the undergraduate level is incredibly good, something like 92 or 93%. At the graduate level it's a little less, about 47%. So this is telling you where the frontier of knowledge is on Wikipedia: there is very good coverage at the undergraduate level, and the people who are contributing are probably doing it more at the graduate level now. Okay, so that's what exists today. It's also worth mentioning how this is changing over time. Here what we have is a look at the monthly article additions: on the left-hand side, articles; on the right-hand side, words. The top blue line is all of Wikipedia, and the green line is chemistry, which is what we're focusing on. And you can see that there have pretty consistently been about 100 articles per month added to Wikipedia in chemistry for many, many years. I'll also make one other note here, just in terms of full disclosure. The red line there represents econometrics, which is the branch of statistics that economists use. We originally planned to do a whole second arm of this experiment in econometrics, and unfortunately, as you can see from that bottom line, it turns out that there is neither a lot of article activity in econometrics nor, as we saw in some of the user data when we tried to do this, much interest in those pages. It just turns out that, as researchers, we're a little more excited about this than the general public is. So we focused on chemistry, which was more popular. Okay, so I've now shown you that there's a lot of science content on Wikipedia, and in some cases it's fairly advanced; there's a lot of supply of scientific information. Let's now turn to the demand side: are people actually using this? For this, I want to reference another study, by Hughes et al., which looked at junior physicians and asked them how much they used Wikipedia. What they found was that this group of junior physicians looked on Wikipedia for medical information in 26% of all the cases they dealt with. An enormous number, right? Very, very highly used. And over a typical week, about 70% of the junior physicians used Wikipedia for at least some medical information. Okay, so that suggests that professionals are using this pretty intensely for scientific and professional information. So it seems there's also a lot of demand for the scientific content on Wikipedia. If you put those two things together, lots of supply and lots of demand, you would think there's going to be a strong effect of Wikipedia on these people. So if you're an innovation scholar like me, you say, great, let's take it to the data, and then you run into a problem. That problem is that the way we would naturally look for this is to look at the citations in academic papers: we'd look at the citations and say, oh, look, they referenced Wikipedia eight times.
That would be a really good indication. But it turns out that even though scientists use Wikipedia, they really do not cite it. There was work back in 2016 by Tomaszewski and MacDonald showing that only 0.01% of all scientific articles cite Wikipedia. Okay, so that's going to be a problem. You might ask why that's the case. Well, one answer is probably embarrassment. Some of you may remember Robert Kelly, the analyst in South Korea whose kids walked into the room during his BBC interview. There is this sense that there are certain things academics do, and citing Wikipedia still has the flavor of citing an encyclopedia, which is something academics would shy away from. The second thing is that they may also feel they just don't need it. Princeton's academic integrity statement, for example, says that if a fact or piece of information is generally known or accepted, for example that Woodrow Wilson served as president of both Princeton and the United States, or that Avogadro's number is 6.02 times 10 to the 23rd, you do not need to cite a source. And indeed, people might think that if something is in an encyclopedia, that is an indication that it is widely known and accepted, and therefore that it doesn't need a reference. In either case, we still have this problem: how are we actually going to measure this? The approach we use in our study is to look at the creation of new articles on Wikipedia as a shock, and then use lexical similarity measures to look at the influence of the Wikipedia article. We go to the words in the Wikipedia article and look at how those words get repeated in the scientific literature. All right, so what data are we going to use to do this? For Wikipedia, we used the full edit history back to 2001. I'm sure this is something you deal with all the time, but even at MIT, proposing to analyze a dataset of that size was something that got people's attention. And within that we're going to focus on chemistry. Then I also have data on the Elsevier journals: the full text of journal articles from 2,061 journal titles, so hundreds of thousands of journal articles. We focus on chemistry again, and look at the top 50 journals in chemistry and the articles within those. All right. I'm about to move to how we analyze this data, but I want to take one little side note, which I'm sure many of the Wikipedians will be much more familiar with than I am, but which will be important for where we end up: the fact that many new articles are created as stubs. This is the initial article that existed for magnesium sulfate, and as you can see, it was just one sentence. Over time, it gets filled in. This is going to be important when we think about the timeline for when we say a new article has been created. Okay, so with that little caveat, let me turn to how we're actually going to try to measure this impact of Wikipedia on the literature. We think about a timeline that starts with a before period, where we have some scientific journal articles talking about some things.
Then there is a six-month period afterwards, where again scientific journals are talking about things, and in between we have the Wikipedia article being created. We're interested in asking: is Wikipedia influencing that later literature? If it is, we'd expect that later literature to move a little closer to it. To calculate that, we compute a similarity: the similarity between the Wikipedia article and the journal articles that came before, and between the Wikipedia article and the journal articles that come afterwards. For those of you interested in the technical details, we treat each document as a bag of words, weight words by how informative they are using the term frequency-inverse document frequency (TF-IDF) measure, and then calculate the cosine similarity between the documents. We give a three-month window for the Wikipedia article to be created, because of that stub phenomenon. And we look at this across 27,000 chemistry articles that have been created in Wikipedia. What do we in fact see? We definitely see that afterwards the literature is closer to Wikipedia than the literature beforehand was. So we have the correlation. That's great. But there is a problem: it is only a correlation, and what might we worry about? In particular, we might worry about this: if you have that breakthrough science happening on the left, you have a common cause. The words in that breakthrough get reflected in Wikipedia, and the same words get reflected in the scientific literature that comes after it. And if it was an important breakthrough, that may make it both more likely that the Wikipedia article gets created and more likely that later articles talk about it. That exact same thing will create a correlation between the Wikipedia article and the literature that comes afterwards. So we have a correlation, but it might not at all be indicative of causation, and of course what we really want here is causation. So how are we going to get causation? Well, it's a paper all about how science is used, so let's be good scientists and run an experiment. We created 43 new chemistry articles. Remember earlier I told you about the graduate topics we identified that were missing from Wikipedia? We identified those, and we had PhD students in chemistry write those 43 articles. Then we stratify them: we divide them up, and we're very careful to divide them evenly along a couple of dimensions. We do that by article author; some of our authors were, for example, native English speakers and some were not, so we make sure we have the same number of each in both groups. We make sure that's also true for things like topic type: some chemistry articles are on specific chemical reactions, some are on broad chemical principles, and we want those evenly spread as well. And then, within those groups, we simply randomize. That leaves us with 22 treatment articles and 21 control articles.
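To make the similarity measure just described concrete, here is a minimal sketch of the bag-of-words comparison using TF-IDF weighting and cosine similarity. The scikit-learn implementation and the placeholder texts are illustrative assumptions; the actual study runs over the full text of the journal articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch: represent each document as a TF-IDF-weighted bag of words, then
# compare the Wikipedia article to journal articles published before and
# after its creation via cosine similarity. The texts are placeholders.
wikipedia_article = "text of the newly created Wikipedia chemistry article ..."
journals_before = ["journal article published before the Wikipedia article ..."]
journals_after = ["journal article published in the six months afterwards ..."]

corpus = [wikipedia_article] + journals_before + journals_after
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(corpus)

# Cosine similarity of the Wikipedia article (row 0) to every journal article.
sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
sims_before = sims[: len(journals_before)]
sims_after = sims[len(journals_before):]

print("mean similarity before:", sims_before.mean())
print("mean similarity after: ", sims_after.mean())
```

In the study, as described above, this before/after comparison is run around each article's creation window rather than over toy snippets like these.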
One of the things we then do is check them: how similar are the treatment articles to the control articles? And the answer is: incredibly similar. Whether we look at the number of words per article, the number of figures, or the number of references, the two groups are very, very similar. That gives us some confidence that the control group is a good counterfactual for whatever happens with the treatment articles. And of course, what happens next is that we put the treatment articles up on Wikipedia, hold the control articles back, and watch how the scientific literature unfolds after the experiment. Just a little footnote for those of you who are eager for more Wikipedia articles: when the whole experiment was over, we did in fact put up the 21 control articles as well. All right, let me give you a feel for a new article we created. This was one of them, and you can see we have a nice, reasonably technical discussion, a number of diagrams on the right-hand side that our authors made to illustrate the point, and the references. If you look at the characteristics of our articles compared to articles already on Wikipedia, you see that they look very similar to many other articles out there, so we can have some sense that they're roughly equivalent, with one notable and important exception, which is that our articles are a little more advanced because they're at the graduate level. You can really see this if we look at the similarity here. On the x-axis we have the document similarity, that is, how similar these Wikipedia articles are to the scientific articles, and this is a density plot, so the higher the curve, the more articles there are at that value. Our curve has considerably more of a right tail than the natural distribution for Wikipedia articles, which falls off around 0.3 or 0.35. What that's telling you is that our articles are a little closer to the scientific frontier, which makes sense given that we were writing about graduate-level topics. The other point I skipped on the last slide, which I want to highlight here, is that we also got a large number of views on our articles: a little over 4,000 views per article per month over the time we ran this. That's very comforting, because one of the things that matters here is that when we add these articles, scientists actually see them, and we do see that they get a lot of views. Okay, I just want you to focus on one more thing before we leave this graph and get to the actual results, which is that there is this long tail, and it's important to note that from about 0.35 up we're in a fairly small tail of the distribution, but it's the part that is closest to the scientific literature. So we upload these articles, we wait, and then we look at the scientific literature that has come out since then, and this is what we found.
Again, on the x-axis we have the document similarity, how similar the Wikipedia article is to scientific articles, and on the y-axis we have the growth in the number of articles as a percentage of the articles in that group. As we saw on the last slide, the density above about 0.3 or 0.4 gets relatively small, so even though the percentage growth is going to be big, the actual count, while still substantial, is not in the thousands and thousands. So what is this graph telling us? If we focus at 0.35 on the x-axis and go up, you see something like 12% growth in the number of articles with that level of similarity. That's the dark line, and the two bands around it represent standard errors, so you can see that as we move to the right we do get statistical significance, showing that this is not just a coincidence: we really see a very different set of outcomes for the words in the treatment articles than for the words in the control articles. In addition to looking at it like this, we can also run a regression over all of the articles, and what you see is an increase in similarity of 0.3%. Now, that's a little hard to interpret, because we're measuring everything in terms of cosine similarity, which is not a metric people are very familiar with. So we've done some simulations to ask what that means in practice, and it means changing about one word in 300. That's across all of the chemistry articles in our sample; in fact, as we can see from this graph, the actual effect is concentrated on a smaller number of articles that are more similar, and if you restrict to that group, the effect gets considerably larger. So if we now revisit our diagram, we can see that there definitely is evidence for a common cause, but we now also have evidence that Wikipedia is having a causal effect. In the last couple of slides, what I want to do is start breaking apart that effect. The proposal here is that Wikipedia is helping with the dissemination of knowledge. First of all, can we actually see that, for example, in the type of journals that are most impacted by the Wikipedia articles? Here we're looking at the effect by journal quality, as measured by impact factor. On the right-hand side we have the highest-quality, most-cited chemistry journals, and on the left-hand side the less-cited ones. What we see is that the effect is concentrated at the lower end. That actually makes a lot of sense: if the groundbreaking journal in chemistry is publishing something that has never been seen in the world before, you would not expect the Wikipedia article to have been written beforehand and to have influenced that finding. Whereas getting that information out to the rest of the world is exactly where you would expect a Wikipedia article to matter, and that's really what we're seeing here, as follow-on work fleshes out what was done in the breakthrough. Let's also look at some of the distributional impacts.
One of the things I talked about at the beginning was that there might be a difference in terms of access to this kind of information, and here we again look at a quantile regression, but now based on the country: how rich is the country the researcher is from? At the bottom you can see the number of scientific publications in each group, and, not surprisingly, there are many, many more scientific publications coming out of the richest countries than out of the poorest. But what's interesting is that if you look at the size of the effect, you can see that as we go from the richest countries to the second quartile and then the third quartile, the effect gets larger. We don't see any effect in the fourth quartile, but there are so few scientific publications there that I wouldn't really expect to see much. It's pretty remarkable that we see such a large effect in the third quartile, and that's really telling you that this seems to have an equity-improving effect. All right, let me summarize. What I've told you today is that Wikipedia is a major repository for science: there is about one Wikipedia article per 120 scientific articles, which makes it a very major repository of scientific knowledge, and the coverage is very good at the undergraduate level and emerging at the graduate level. I've also said that Wikipedia not only reflects science but shapes science: adding an article to Wikipedia moves the related literature by about one word in every 300, and the effect is stronger for those with less access to journals. I didn't talk you through it today, but for anyone who's interested, the calculation is in our paper: we can show that this is a very cost-effective way to disseminate scientific knowledge. And so we hope this serves as a clarion call to people who would fund this kind of thing, or simply to scientists who believe in it, to actually go and do it. Darwin said he thought that popular and general treatises are important for the progress of science. I think we agree with that; I think we've put a little more detail on it, and we really encourage people who are thinking about contributing to go and do so, because I think we've shown that it really has a big effect on science. Thank you.

Thanks a lot. That's a fascinating study. I'm excited about the results, and I've got to dig much deeper into it. There are many people who have been following this question of how to engage experts, broadly defined, to contribute to Wikipedia, and I think this is going to be an important piece of that discussion. I'm going to ask Jonathan first whether there are questions. No, I see no questions at this point. Okay, so I'm going to start with one myself. I was wondering if you could say a few more words about the setup of the experiment. I remember when you first reached out, your first question back in the day was how to run a controlled study on Wikipedia that involves an intervention in a way that would be respectful of community policies. I was curious whether you could say something about how that process went and what you learned from it. That's my first question, I guess. Oh, and you're muted. I think there's both a process answer to that and a design-of-the-experiment answer.
In terms of the actual practicalities: what we did is we wrote the first article and put it up, and we said, okay, let's do this as a pilot. That one went very smoothly. So we wrote the larger set of articles and uploaded them, and I think maybe some of the Wikipedia editors thought, oh wait, this is someone spamming us. So there was a little bit of a negotiation that had to go on, and we said, no, no, really, we have good people in chemistry writing these articles. There was a little bit of rockiness there, but we got through it. In terms of the actual design of the experiment, one of the real challenges was that Wikipedia by its nature is self-evolving. When we add these Wikipedia articles, they have a whole bunch of text in them, but what happens almost immediately is that people move things around; they add things and take things away. So one of the things we had to figure out was how to adjust for that. What we did was base all the calculations I've shown you on the words that were in the documents as we first uploaded them, so we didn't account for these later changes. The reason is that we had both a treatment group and a control group: the treatment group was getting these changes, but the control group obviously wasn't on Wikipedia, so it wasn't. To keep them as good counterfactuals for each other, we had to use the versions we uploaded and track those words, not the words the articles evolved into.

Yeah, that makes a lot of sense. And can I ask a second question, Jonathan? Yes, okay. I was also very interested in your tracking of the traffic to these articles, and I saw from the example you gave that your students did a good job of cross-linking and cross-referencing the articles with existing articles. My question is whether you've done any analysis of the similarity of these articles to existing articles in chemistry, or in other scientific fields of a similar size, when it comes to link structure. The reason I bring this up is that we know that by far most people on Wikipedia use external search engines as an entry point, but the link structure also affects discoverability. I'm just curious whether you have any additional thoughts about how the links were created and how that compares with organically created articles on scientific topics in Wikipedia.

Yeah, so we have not looked at that. It's an interesting question. One of the things we did try to make sure was that the treatment and control articles looked the same when they were first created, but we haven't actually compared them to existing articles. It's a good question. Got it. All right, that's it from me for now.

One question from Giovanni on IRC: could you trace specific keywords in the treatment-group articles into the scientific articles published after the article went up on Wikipedia? Absolutely. In fact, that was really the heart of what we were doing: seeing that the words that had been put into the Wikipedia articles flowed into the scientific literature and were used more often, and in particular used more often than a similar set of words from our control group, which was the same in basically every way except that those articles didn't get uploaded to Wikipedia.
And so that's the power we think the experiment has: being able to trace those words exactly. Cool. And we have one more question, if we have another minute. One more question from IRC: it is often difficult to write about bleeding-edge research on Wikipedia, as it often ends up in a debate over original research; users often say we should not write about original research, or rather, that we should not do original research on Wikipedia. Any speculations on how to avoid that?

Oh, I feel like I need to turn this over to the Wikipedia folks for the best advice on that. I think what we saw in our experiment is that we ran into a little bit of this, and I think part of it is simply that, as Wikipedia gets to more and more graduate-level topics in chemistry, for example, it's going to be important to have both contributors and editors who are at that level, professors of chemistry for example, who can judge whether something is just somebody's theory or something the community now knows and that therefore deserves a page. But how to implement that, I leave to the experts.

Excellent. All right, no other questions from IRC. Okay, I think we still have some more time if people want to ask questions about both presentations. I'm very excited about these two talks today, so thank you both for being with us. Hopefully this is the beginning of a conversation with the community and with researchers going forward. Jonathan, anything else on IRC? If not... We have a comment from Computermacgyver, following up on, I think, the last question: one further thing that researchers could look at is the citations of the scientific articles referenced in Wikipedia. It might be that the science articles cited in the treatment articles also got a boost in citations.

So we actually did look for that; it was a hypothesis we had as well, and my suspicion was that it would be, if anything, a bigger effect than the one we saw. But we did not see it, so it looks like people are focusing on the words in the Wikipedia article rather than using it as a gateway to the scientific articles. Interesting. Yeah, I would have supposed that as well. Well, no other questions on IRC, so yeah. All right, I think this is a wrap. Thanks again to our speakers, a virtual round of applause, and I hope to see you all at our next edition in October. Yeah, thank you very much. Thank you, everyone. Thank you very much.