Okay. Very nice. Hi, everyone. My name is Leila Zia. I'm the head of research at the Wikimedia Foundation. Thank you so much for coming to this session, whether you're joining in the room or watching live now or in the future. I'm going to share with you some research findings from the work of the research team at the Wikimedia Foundation. I have prepared 10 of them, but the real purpose here is to help you engage with the research, so if we end up covering fewer of them, from my perspective that's okay. At points I'm going to ask you to pull up your phone or laptop if you want to engage with some of the research findings live; we can do that together.

We have a lot of venues through which we communicate research findings, and in some of them we go into a lot of in-depth explanation of the research behind the projects. For this particular session at Wikimania, our priority is to make sure you understand what research is happening, what findings we have, and how you can use those in your work. If you're interested in the research behind the scenes, I'll share some links at the end for how you can engage more deeply with the technical side of it.

Very briefly, about the mission of the research team. The research team at the Foundation is a team of 16 staff members; we were smaller and, as a result of reorganizations, we are now bigger. We have five contractors, one research fellow, and 17 what we call formal collaborators. These are researchers, usually at academic institutions across the world, who volunteer and work with us towards the annual plan commitments of our team and the organization. Our mission as a team is to develop models and insights, and the way we differentiate our work from a lot of the other research happening, whether within the organization or in the movement, is that we put an emphasis on using scientific methods. For us, contributing to the scientific literature and using scientific methods is important. The other part of our mission is our focus on strengthening the research community around the Wikimedia projects.

We do this work for two primary impact areas. One is that we want to support the technology and policy needs of the Wikimedia projects. When we're doing research projects we look at where they can help, in the technology space as well as the policy space. Policy is broadly defined: it can be some of the work happening within the Wikimedia Foundation, or it can be policies that the communities continue to evolve and create, which we would like to inform with our work where possible. The other area of impact we seek is advancing the understanding of the Wikimedia projects. Whether inside or outside our movement, we see it as part of our job to help people understand the Wikimedia projects and communities better.

We serve four primary audiences. Of course, we serve the Wikimedia Foundation. We also serve affiliates, and this is something we're putting more and more focus on. Traditionally we have worked with Wikimedia Deutschland through Wikidata, and you'll see some of the research we have done with the Wikidata team; now we're having conversations with some of the other affiliates.
We are generally very interested in working with organized groups because we can have more impact together. We also serve Wikimedia volunteer developers, or we would like to do more of that, by bringing research expertise to volunteer developers for the tool building and development they do on the Wikimedia projects. And last but not least, we serve the Wikimedia research community. This serving has two angles. One is that we listen to the needs of the Wikimedia research community, for example if they ask for data sets to be published or for a particular line of research to be done. The other is that we try to strengthen this community by bringing in researchers who are out there but not yet part of our community, and helping them become part of the Wikimedia research community.

We have three primary programs and areas of focus within the team. One is what we call addressing knowledge gaps. This is an area of research focused on defining the different types of gaps on the Wikimedia projects, measuring them, and helping address and bridge them. The second is what we call improving knowledge integrity. This is about the reliability and quality of content on the Wikimedia projects. At the moment we primarily see our role in this space as supporting volunteer editors, or volunteers in general, in enforcing content policies on the projects or updating those policies. The last part of our work is focused on strengthening the Wikimedia research community, and that's more around event organization, publications, and speaking at conferences and academic venues; that's how we engage with this community and strengthen it. If you ever want to do a deeper dive into what we do, we publish a report every six months, and you can find it on research.wikimedia.org under the report tab.

Okay, so now I'm going to start sharing some of the findings with you. For those of you who have been regular Wikimania attendees, you will see some continuation and updates on projects that you have heard about in the past. The first one is about gaps. We have been talking about gaps since 2018, so this is an update on years of work that has been happening since then. I want to encourage you to start thinking about knowledge gaps on the Wikimedia projects. Before I go to the next slide, maybe take a minute to see if you can name three different types of content gaps that you think we have on Wikipedia, three kinds of readership or reader gaps, and three kinds of contributorship or contributor gaps. If you want, just raise your hand and say one gap type that you have heard about or are working on. What kinds of gaps do we have when we say we have something missing? Yep, Martin: youth, okay; citations, okay; vocational knowledge; gender, okay. These were partly about content, but they can also be about contributors; and Martin, when you said youth, I don't know whether it goes on the contributor end. So we have a lot of gaps on the projects. In fact, in 2019-2020 we developed what we call a taxonomy of knowledge gaps for the Wikimedia projects. There are three main areas we consider for this taxonomy: readers, contributors, and content. For each of these, we have identified certain types of gaps.
I'm not going to go through the details of this, but basically the way we did it is that we read more than 200 academic and community publications on the topic of gaps, and we also talked with different community members to understand the different kinds of gaps they are trying to address. You in fact see some of them here. Martin, for example, what you mentioned falls under content age or recency, and on the contributor and reader sides we have age. And of course there are other types of gaps.

The problem we are trying to solve with this taxonomy and the measurements is that there are numerous types of gaps on the Wikimedia projects. In the absence of laying out all of these gaps, what usually happens on the decision-maker side is that they hear about certain types of gaps and assume these are the only kinds of gaps we have. That's a bias you can try to break, or at least mitigate, if you can lay out all the different types of gaps that should be considered. So part of the attempt of building this taxonomy is to break the assumption that we have only a few types of gaps. Once you have the taxonomy and you have defined the gap types, the question becomes: can we measure these gaps, and what are the metrics we should define for each of them? Why is this important? It is important because it can help us as decision makers, and decision maker is broadly defined: it can be an affiliate trying to figure out the next project to work on, or a tool developer trying to figure out the most impactful tool to develop next. We want to help decision makers identify areas of focus and impact. We want to diversify investments across the movement so that we don't end up putting all of our resources in one area, at least not knowingly; it can be that we know it and decide to do it, but that's different. And we also provide a way for all of us to monitor the impact we have in the areas we choose to focus on.

So we have developed the taxonomy, we have metrics for many of the gaps we have defined, and we have started providing measurements for some of these gaps. This is multi-year research: it has gone on for multiple years and we expect it to continue, because measuring and providing measurements for each of these gaps requires a significant amount of time and deliberation. We had a wonderful team that worked on this project for multiple years, and you can now learn about the state of knowledge gap metrics at the links provided here. There is a prerecorded Wikimania presentation linked from the meta page at the bottom of the slides, and if you want to go deeper into the topic of knowledge gaps and the taxonomy, I highly recommend you check that out.

I have one announcement to make here, which is that you can now access the measurements for four of the content gaps we had defined. These are the content gaps for gender, age, geography, and sexual orientation. We compute these on a monthly basis and they are published publicly; you can go and check them. And we have recently added one more content gap, the multimedia gap. This one is not refreshed monthly yet; we are hoping to be able to refresh it monthly as well. So that means on the content front we have five gaps covered so far. Yes, Daria.
So for the multimedia gap, the metric is defined around which percentage of articles on Wikipedia contain images. Yeah. I also want to mention that as you go to the links being shared here, you're going to run into CSV files, data that is published publicly as raw data to be used. We acknowledge that not everyone, particularly within the decision-maker cohorts, is comfortable working with CSV files and raw data in that way. We see a long path ahead of us to make this data truly accessible to all of you who are decision makers. We have started experimenting with notebooks. These are PAWS notebooks that, once the slides are shared, you can click on and explore. For two of the gaps, the geography content gap and the gender content gap, we have started developing PAWS notebooks where you can see some of the questions we have asked; we have queried the data and prepared some plots that you can access and learn from. As I mentioned, I expect us to spend a significant amount of time just making this data more accessible to all of you as decision makers. So if you're interested in being on that journey with us, asking and sharing the kinds of questions you want to ask of this data, please reach out to me after the session, because I would like to get in touch soon.

So we did all this work. What are some of the things we can learn today that we couldn't learn, or couldn't confidently learn, before? I'll share some sample findings. One is really nice news about the gender gap in content. Articles about underrepresented gender groups, women and non-binary people in this case, have much lower coverage on all Wikipedias. However, what we're seeing in the data is that the overall quality, and the average volume of page views, of these articles have surpassed those of the other biographies on Wikipedia, the articles about men. This kind of analysis and data sharing allows us to plot these kinds of things and see that, starting in 2021, articles about underrepresented groups are actually of higher quality. For any of you in this room who are contributing to closing the gender gap in content on Wikipedia: congratulations, and thank you so much for the years of work you have done. We all know that the quality of articles is very important; it's not just about adding new articles, but about improving their quality. We are now able to see these kinds of trends and patterns, and to see the effect of the work you have all been doing over the years.

Another sample of what you can do with this data: we know that the proportion of articles about underrepresented regions such as Africa, Asia, or Southeast Asia is low on Wikipedia. But what we see is that the coverage of articles about these regions is growing faster than for the other regions, and that is good news. If you look here, for example, you see some numbers around 8-9%, and up here for the US and South America you see around 6%. Again, this is something to keep an eye on. These differences look small as numbers, but they at least show that there is a difference in how fast articles are being created about each region.
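Since the published measurements are plain CSV exports, even a few lines in a PAWS or local notebook can reproduce charts like the ones just described. Here is a minimal sketch; the file name and column names are hypothetical placeholders, so check the meta page for the actual download locations and schemas.

```python
# Minimal sketch: load one of the published content-gap CSVs and plot a
# quality trend over time, in the spirit of the gender-gap chart above.
# NOTE: the file name and the column names below are hypothetical
# placeholders, not the real export schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("content_gap_gender.csv")  # hypothetical local copy of a monthly export

# Hypothetical columns: month, gender_category, mean_quality_score
trend = df.pivot_table(index="month", columns="gender_category",
                       values="mean_quality_score", aggfunc="mean")

trend.plot()
plt.ylabel("mean article quality score")
plt.title("Article quality by gender category over time (illustrative)")
plt.show()
```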
Coming back to those regional numbers: some of this can be because there is saturation of the content created about a particular region, but it's a good pattern to watch and be aware of as we make decisions about where to invest, or as we try to get a sense of what came of the investments we have made so far. Okay, so I'm going to switch to the second topic. Maybe I'll just take a minute here: if there are any pressing questions about the gaps, please ask. Tillman, will you be able to move? Yeah. Thank you. I was just wondering about these observatories. There are some existing efforts by the community, for example the Wikipedia Diversity Observatory, which was built some years ago and covers a lot of these dimensions; there seems to be a bit of overlap. My question in general is: how do you decide when and how to integrate with existing efforts from the community or from academic researchers, in this case both, since Marc Miquel also presented today, and when to build something new on your own? Yeah, thank you so much. Actually, Marc was on this project, so we had extensive conversations with Marc about this. On this particular example you mentioned, we decided not to join efforts because there are some challenges around the licenses of the data being used, and we want to make sure the data we publish is licensed properly, so we decided not to go with it. But in general, when we are deciding whether to do something or not, because we have done the literature review and surveyed the state of community work in this space, we generally try to bring the community in and work with them, or figure out if we can reuse what they have already done, because it doesn't make sense to recreate everything from scratch. But sometimes we have reasons not to continue, or not to use the community resource that may be available.

Okay, so now, from the macro level of metrics and measurements and gaps, let's look at another project. I would like to drill down with you on edits. Before I show you the next slides, I want to ask you to think about this. This is about Wikipedia, so if you edit Wikipedia, think about the last edit you made, and think about what kind of edit it was. What did you do when you edited that Wikipedia page, or talk page, or anything like that? Revert, very good. What did you revert? Vandalism, okay. Any other examples? Yeah, translate, okay. You participated, you basically added content to a discussion page, right? Yeah, Rosie. New page patrol, perfect. So we do different things when we edit Wikipedia, and on different pages. The problem we have is that we don't have a shared taxonomy of edit types for categorizing editor actions. We all talk about the different things we do, but we don't have a shared taxonomy. And even if we had that taxonomy, we didn't have a way to automatically assign edits to it. We're trying to solve this problem: we're trying to define a standard way to talk about edits, what we call the taxonomy of edit types, and we are also building systems that can automatically detect what kind of edit an edit is, across all Wikipedia languages. Why is this problem important? Because if we can solve it, we can answer some of the questions we may have. For example, what types of edits do newcomers make? Or what types of edits are most common in my language?
How have they changed over time? What kinds of edits get reverted more often? We could also support patrollers better in patrolling content, because we could help them focus on specific types of edits or specific parts of the incoming flow. And there are plenty of applications here; we can help organizers assess the impact of their campaigns in more effective ways. So this research is a multi-year effort where we built a taxonomy of edit types and built an experimental system that can automatically detect edit types at scale for all Wikipedia languages. You can read more about it on the meta page, but I want to share the taxonomy here. The taxonomy is built; of course, these taxonomies are never static and can improve over time. For those of you a little further back in the room, I'll read some examples. We keep track of whether you have edited a table, a reference, or categories, or added media, external links, templates, or whitespace. Did you add whitespace or remove whitespace? Did you work on a sentence or a paragraph? And on top of the bullet points you see here, we also consider four types of actions: whether you inserted something, removed something, changed something, or moved something. This gives us a taxonomy of the different types of actions that happen on Wikipedia. We then map this taxonomy to four broad categories that we want to consider. I want to emphasize that the choice of categories depends on what you want to achieve. Our team is very much concerned with the question of how we can support patrollers better in their work, so as you can see, vandalism and patrolling has a category, because we want to be able to better measure this kind of editing. Content maintenance, content annotation, and content creation are the other categories we're considering. So we can map from the more granular edit types to these higher-level edit types.

Now this is the time when maybe we can play with this together. If you have a mobile phone or laptop in front of you, please go to this link; I'll wait a minute or two while you get there. What is the best way to get it to you? Can someone at the front post it on the etherpad? Has anyone made it to the link? Yeah. Okay, perfect. Here is what you can do. You can choose the language you want in the first box: just enter the language code there. You can skip putting in a revision ID; if you don't put one in, it will randomly pick a revision. If you're curious about a particular revision, you can add that revision ID there. For the front of the room, I just put "en" so that we can hopefully all read it. Then you press submit, it randomly picks a revision, and you can start exploring what happened to Wikipedia as a result of that revision. What you see on the right-hand side is the different edit types, or edit actions, that have happened: there has been an insertion of whitespace, a section change, a template insert. I chose the simple option up here; there is also a detail option which gives you much more detail about the kind of edit that happened, so you can play with it.
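To give a rough feel for what the back end is doing, here is a toy sketch that fetches a revision and its parent wikitext from the MediaWiki API and counts a few coarse signals. This is not the team's actual model, which parses the wikitext structurally (see the meta page for the real taxonomy and mapping); the regex counts below are a deliberately crude stand-in.

```python
# Toy sketch of edit-type detection: fetch a revision and its parent from
# the MediaWiki API and count a few coarse edit signals in the wikitext.
# Positive count = that element was inserted; negative = removed.
import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_wikitext(revid):
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "revids": revid,
        "rvprop": "ids|content", "rvslots": "main", "format": "json",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    rev = page["revisions"][0]
    return rev["slots"]["main"]["*"], rev.get("parentid")

def coarse_edit_types(revid):
    curr, parentid = get_wikitext(revid)
    prev = get_wikitext(parentid)[0] if parentid else ""
    signals = {  # crude stand-ins for a few taxonomy items
        "reference": r"<ref",
        "template": r"\{\{",
        "category": r"\[\[Category:",
        "external_link": r"https?://",
    }
    return {name: len(re.findall(pat, curr)) - len(re.findall(pat, prev))
            for name, pat in signals.items()}

print(coarse_edit_types(1100000000))  # replace with any real revision ID
```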
Now, of course, this is an interface just for us to play with and get a sense of what it does. In the back end, we use this taxonomy to do a lot of other things that I'll talk about in the coming slides. Are there any questions about what this thing is doing? Then let's move on. So what can we find, what can we learn, once we can do such a thing? This is an example. We have started doing analysis based on this data; this is a preliminary analysis based on data from French Wikipedia. What you see on the x-axis, for those of you a little further back, is IP editors, users with 0 to 10 edits, users with 10 to 100 edits, and users with 100-plus edits. On the y-axis you have reverted vandalism and patrolling, content annotation, content maintenance, and content generation. What we see here, if you focus on content generation, is that across the different user types on French Wikipedia, roughly all user types spend around 20 percent of their edits creating new content. The rest of the edits go to maintenance, annotation, and vandalism detection. You can interpret these results in different ways, of course. For us on the research team, given that we know we want to support patrollers better, this is a clear sign that we need to better support editors in the work they do on the patrolling and maintenance of content. It's a clear signal: 80 percent of the work is happening on maintenance. I acknowledge this is French Wikipedia, an established project with a lot of content already there, so the situation may be very different for a new language project. But for French Wikipedia, what we take from this is that 80 percent of the work is maintenance, annotation, and vandalism detection, and it's very clear what needs to happen next: we need to support these people in the work they're doing. Someone else may look at this data and say, oh, we really need more content in this language, let's figure out how to bring more content in, because only 20 percent of the edits are on the content creation side. And that is okay too. Each of us, depending on where we sit, what our agendas are, and what we are trying to achieve, may look at this data and have a different interpretation or different ways of engaging with it. And that is okay; that's what makes us a movement with one mission and different ways of arriving at it.

So I'm going to move away from edit types. Any pressing questions on the edit types? Mako. Thanks, Santosh. The tool seems to give us very granular information about whether references or headers and these things were edited. How do I get from that to content maintenance and content creation? Yeah, you should go to the meta page, which Isaac has documented very beautifully; there he explains the behind-the-scenes of the mapping. Okay. Then let's move on to the next thing: newcomer onboarding. There are a couple of things you're going to hear from me, if there's enough time, about newcomers. One of them is about hyperlink recommendation. Basically, the problem that the Growth team at the Foundation posed to the research team was: we want to help newcomers become productive contributors to the projects.
They had figured out, through different studies and experiments, that giving newcomers structured tasks can help them do things on the project, find their feet, and become a little more confident. So what they wanted from us was structured tasks that can be generated automatically across many different languages and that are relatively accurate; they can't be all over the place. These are given to newcomers through what they call the newcomer dashboard, which is active in many Wikipedia languages right now. Sorry, I started already talking about the problem. So we initially made what we call a hyperlink recommendation model. This is a model that looks at the text of the article that is already there and recommends to you which hyperlinks could be added on the anchor text already in the article. We initially developed the model in 20 languages. The Growth team did some experiments, and they came to us and said: we need this in many more. So part of the effort over the past year has been building for more languages. We have now trained the model for 297 languages. There's a Phabricator task you can go to and learn more, and these models are currently being deployed on 185 Wikipedias. Although Martin, who is just about to walk out, has educated me about the fact that when we say deployed there are different layers of deployment, and until you actually see it on your Wikipedia it can take some time. The research is published, so there's research behind it, there's a team behind it, and there's an output API; if you're a developer you can go there and play with it.
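To give a feel for the core idea before we try it, here is a toy sketch of the candidate-generation step: scan the plain text of an article for unlinked mentions of known article titles. The vocabulary and the already-linked check are simplistic placeholders; the production model additionally ranks candidates with a trained, per-language classifier, as described in the published research.

```python
# Toy sketch of the candidate-generation idea behind "add a link": find
# unlinked mentions of known article titles in a text. The real model
# goes much further (per-language training, candidate ranking), so treat
# this as the first step only.
import re

known_titles = {"machine learning", "neural network", "statistics"}  # stand-in vocabulary

def link_candidates(text, titles):
    """Return (anchor, start, end) spans that match a known title and look unlinked."""
    candidates = []
    for title in titles:
        for m in re.finditer(re.escape(title), text, flags=re.IGNORECASE):
            # Skip mentions already inside [[...]] wikilink markup (very rough check).
            prefix = text[max(0, m.start() - 2):m.start()]
            if prefix != "[[":
                candidates.append((m.group(), m.start(), m.end()))
    return candidates

sample = "Deep learning is a branch of machine learning built on the neural network."
print(link_candidates(sample, known_titles))
```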
For the purpose of this conversation I'm not going to spend time on the API; instead I'm going to encourage us to try it and see how it works. I need to get the link to you, so let's use the etherpad: etherpad.wikimedia.org/p/Wikimania2023_Research, with the W and R uppercase and an underscore between. Okay, now a few people are in, perfect. Somebody removed the link; I can put it back. Perfect. So if you go to test.wikipedia.org and log in, you can go to your preferences. You have seen this before, right? Then go to Editing, and there's an option you can enable about add-a-link and the newcomer dashboard. Has anyone made it there? It doesn't show you the option? Okay, this one didn't go as successfully; let's leave it at this and say: try it. If you go there and activate the newcomer dashboard and the add-a-link recommendations, then once you go to your newcomer dashboard what you will see is recommendations: you enter an article and it will recommend hyperlinks to be added. It's on test.wikipedia.org at this particular link, so the edits you do there don't end up in Wikipedia; but in some of the languages it is already activated, so you can also test it in your language. Yes? I mean, I'm not a newcomer, so I don't have the newcomer option, because I have a user page. Most of the people who have been around have a user page, so we don't see the newcomer features, but it would be great to have this. I know it's another department, but wouldn't it be great to somehow include this on your user page or something like that? I don't know how, but... Yeah, so this is a product decision, of course, as you say. However, sitting in research and seeing some of the challenges my colleagues have, I think it would be nice to give them that feedback: you want this kind of technology available to you as an editor. They may have options in their basket, right? For everyone. Generally, one theme I would raise is that I would like all of us to be more open to experimentation. If we can find ways to do that without breaking the system in bad ways, I think it's just going to help all of us in the work we do. Okay, I am going to be corrected: actually, it's already working. On the Hungarian Wikipedia, for example, if you click on your user page it will take you to the newcomer page; there is a link to visit your user page. And I'm not a new user, don't get me wrong. So it's technically possible; I don't know how it's set in the settings, but it's working. Right, and I think the point is: what if your workflow does not include going to a newcomer page? Receiving this kind of recommendation from other places would also be nice. Suppose you're editing with the visual editor and it can tell you: oh, do you want to receive some link recommendations? That could also be nice. But it's great to see that it's already working for you on the Hungarian Wikipedia. Perfect.

Okay, so the learning I wanted to share with you here is a technical learning, but an important one, because I want you to know that we deeply care about bringing these types of technologies to many languages, and that results in a ton of work that needs to happen behind the scenes to do it accurately. In particular, it is very hard to split sentences in Wikipedia across languages. It sounds like a very obvious problem, but it's a problem that has not been solved by the natural language processing community, because many people really don't care; it's not their top priority to solve this problem for the 300, or 2,000, or however many languages exist in the world. So when researchers run into this problem, they usually opt for NLP packages for the languages where those packages already work well. We see it as part of our responsibility to fix this problem, because it's our mandate: we have to be accessible and equitable across many projects.
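To see why this is genuinely language-specific, here is a minimal sketch of a per-language sentence splitter. The three punctuation rules shown are real (Latin-script full stops, the Bengali dari, the Armenian full stop), but everything else about real splitting, such as abbreviations, quotations, and languages like Thai with no terminal mark at all, is out of scope for this toy version.

```python
# Minimal sketch of why sentence splitting needs per-language rules:
# the sentence-final mark itself differs across languages.
import re

SENTENCE_ENDERS = {
    "en": r"[.!?]",   # Latin-script full stop etc.
    "bn": r"[।!?]",   # Bengali dari
    "hy": r"[։!?]",   # Armenian full stop
}

def split_sentences(text, lang):
    pattern = SENTENCE_ENDERS.get(lang, r"[.!?]")
    parts = re.split(f"({pattern})", text)
    # Re-attach each terminal mark to the sentence it ends.
    sentences = ["".join(pair).strip() for pair in zip(parts[0::2], parts[1::2])]
    return [s for s in sentences if s]

print(split_sentences("আমি ভাত খাই। তুমি কি খাবে?", "bn"))
print(split_sentences("Hello there. How are you?", "en"))
```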
I want you to know that with each of these technologies, when we bring it to 300 languages, there is a ton of work happening behind the scenes. Even simple things, like how we know that a sentence has ended, differ: in Bengali the dari is used, and Armenian does a different thing. We need a way to understand where the end of a sentence is in order to do these kinds of things automatically. And that actually takes us to the next one. The challenge we ran into in the previous research on link recommendation was that we didn't have a good way to do this. Once they told us they wanted to go from 20 languages to 300, we were like: oh my god, we don't even know what a sentence is in these languages. So a new area of research opened up, around making Wikipedia's textual data more easily usable by researchers and developers: focusing on the text, can we break it down in ways that machines and tools can process better? I talked about the problem already: basically, the issue is that existing NLP packages don't solve this problem for these languages, and if they attempt to, they do it at poor quality, so we can't use them for Wikipedia purposes. If we can solve this problem, it allows us to solve many other problems; it's one of those unlocking problems, for things such as add-a-link for the newcomer dashboard and also for the library. So we now do have a package that can split Wikipedia content across 300-plus languages. It was a tremendous amount of work; congratulations to the team for making this happen, but we have it now. There are a few exceptions, such as Thai, where there is basically no written signal for the end of a sentence and we have not figured out what to do yet, but we are trying. So we have a relatively uniform way of doing this. This means that if you now give the machine text in a language, it can break it down into sentences, and it can also break each sentence into words, tokenizing them so that we know this thing is a word. And we can do that. The finding here is that languages are hard, and we really demand a lot; for our team it is an absolute privilege to be in this position to serve the movement in this way. However, let us not forget that each time we talk about adding a language and bringing equity in what we offer across languages, these are hard tasks, and many of them require a significant amount of deliberation and thinking. When we build a package, we can't just build it and put it out there; we need to evaluate it and make sure it actually works before we put it in front of people. And let's keep in mind that the NLP and AI community has forgotten these languages; the priority is on specific, larger languages with more speakers. We see it as our responsibility, and this is something we also try to raise awareness about within the research community, providing data sets to encourage them to do more of this kind of work. Okay. Languages. Any questions about the tokenization of sentences and words? Yes.
Could you say something about the differences among the languages when you were doing this research? As you mentioned, they are really completely different; they have different structures. In my case it would be Arabic, so I'd like to know how difficult it was there, but you could generalize the answer. I can't answer your question in detail, because I didn't do the research myself, so I don't know the specifics about Arabic. What I can do, if you send me an email, is follow up: there is a Phabricator task where we track the quality and how it's done in the different languages, and I know it's a very common challenge, so I can share that with you. Okay.

Another favorite topic: supporting organizers with list building for campaigns. This is, of course, a long-standing request. If you have been an organizer, you have spent time creating lists for the campaigns you want to organize. This has been flagged as a problem by organizers for many years; I've been in this movement for almost 10 years and it has always come up. So we want to help organizers with the list-building problem. List building is hard: it's manual, it's time consuming. If we can solve this problem, then of course time is saved and things are more efficient on the organizer end; they can decide to sleep a little longer, or do the kinds of work where their creativity is better used. The other thing I want to say, on the topic of campaigns, is that there are always a lot of discussions about campaigns and their value for editor retention and newcomers. Our team sees the value of campaigns as a multifaceted type of activity in the movement. We know campaigns are important for the movement; we have also seen that they can create a lot of energy and focus, which is a sign that campaigns can be excellent ways of bringing people into the movement or helping them become and remain contributors. And lastly, they are an important social aspect of the movement. This matters when we decide what to work on: this is a very sociotechnical movement, and we know that campaigns are social activators within it. So we want to do list building. There are a lot of opinions about how to do list building. For example, you can look at the readership of a Wikipedia article: if readers come to a Wikipedia article, what other pages are they reading? From that similarity of what they are looking for, you can decide whether to group those articles together as things of the same kind. You can look at the content of the article. You can use Wikipedia search to find similar articles. We did some research to understand which of these many approaches is better, and the answer, unfortunately, is that no single approach is better: we gain a lot by combining them. So that's what we did: we are combining multiple approaches to create a fusion of different methods for building lists for campaigns. The research documentation, again, is at this link; you can go and check it out.
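As an illustration of what "fusion" can mean here, the sketch below combines ranked candidate lists from different methods using reciprocal rank fusion, a standard technique for merging rankings. I'm using it as a stand-in; it is not necessarily the exact combination method behind the tool.

```python
# Sketch of one standard way to combine ranked candidate lists from
# different methods (reader co-navigation, content similarity, search):
# reciprocal rank fusion. Items ranked highly by several methods rise
# to the top of the fused list.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Each input is a list of article titles, best first. Higher fused score = better."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, title in enumerate(ranking, start=1):
            scores[title] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative candidate lists for a "Wonder Woman" campaign seed:
by_readers = ["Wonder Woman", "Superman", "Amazons (DC Comics)"]
by_content = ["Amazons (DC Comics)", "Catwoman", "Wonder Woman"]
by_search  = ["Wonder Woman", "Trial of the Amazons", "Catwoman"]

print(reciprocal_rank_fusion([by_readers, by_content, by_search]))
```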
And now we get to something that maybe you can try; I'm going to put it on the famous etherpad this time. It's at list-building.toolforge.org. Okay, if you go here, you can again enter the language code, then enter a page, and then say how many results you want to get. The task in this experimental setup is that you, as an organizer, give an article, and you get back similar articles, which can help you figure out, for example, which articles are similar and which ones you want to include in your campaign. We are working on a parallel piece of technology related to this, which is helping organizers find existing editors who may be interested, based on the similarity of their edits, to be invited to campaigns. But you can play with this one. Does one of you have a suggestion for an article I can try on English Wikipedia? Maybe something not super exciting. Starting the query; it's loading; we wait. One thing while we're waiting: I'm going to share the finding, which this is actually about to reveal. It is not at all trivial what is similar to what in a Wikipedia article. If you look at the recommendations as they come up, you sometimes find curious things in them. Last night I searched for rat, and I found some cat articles in there; they're clearly not exactly about rats. So keep your mind open as you interact with it, because it can sometimes give you curious things, and of course, if you have feedback about it, provide it. I can't get anything; has anyone gotten results interacting with it? For some reason this doesn't come up for me. Okay, some of you have seen it at least. Oh, okay, now it's here: Wonder Woman. Kate, I think you need to help me here: Amazons (DC Comics), Superman, Queen, Catwoman, Trial of the Amazons. So stay curious. Yes. And again, we are of course working with the campaigns team at the Wikimedia Foundation; eventually this is going to become something you can use in the proper places and in proper ways. This is just a tool: usually, as soon as we get to a piece of technology we can put in front of others, we set up a page like this so that at least people can interact with it and give us feedback. So the finding I already shared with you: stay curious, you sometimes find new things as you look at it.

We have around 10 minutes left, so let's see how far we can go. I want to talk about newcomers and, in this case, also tenured editors, going back to the question of what happens if you're a tenured editor. This is about image recommendations. What is the problem? The problem is that more than 50% of Wikipedia articles don't have any image in them. What we also know, and this is an ongoing line of research, is that images help readers understand, to some extent navigate, and also engage with the content of articles; readers tend to use images to contextualize the text available to them. Again, this is ongoing and there are a lot of unknowns in this space, such as how readers actually learn from images, but we see some signals on this front. In past years we had developed two sets of algorithms on the images front, which we are now combining. We had one algorithm that, given a Wikipedia article, would tell you which image, or set of images, could be added to it. And we had another algorithm, developed for the content translation product, whose job was to align sections across Wikipedia languages: if you have worked with content translation, when you want to edit a section, it needs to give you the corresponding section in the article in the other language, so we needed to align sections across languages.
So we brought these two algorithms together for this case, and we are doing image recommendation at the section level. It's a little more granular, and it helps the editor better, particularly because newcomers are involved: if you just give them a very long article and say "add something", it can be tricky. So we want to be able to recommend at the section level, and that is what this piece of technology does. There's a pilot happening to test this, in Spanish, Arabic, Bengali, and Czech. If you're in any of these languages you may interact with it; it is for newcomers, and you can see it in your newcomer dashboard. There are six other languages experiencing this through the Structured Data Across Wikimedia project. I'm going to maybe skip this slide; basically, if you're on these projects, you can go to your newcomer dashboard and test it. The way this algorithm works is relatively simplistic. We are not going to Commons and searching through everything; we are looking at Wikipedia itself, across the many different languages, and we are looking at which language has used which image for a particular article. We use those images that are already in use as a pool for recommending to an editor in a given language. The advantage of this is that there is less risk, because most likely that image has already been checked by an editor in another language. The disadvantage is that we may propagate biases. So it's very important, as we recommend things to editors, to make them aware that there is a machine involved, that this machine can make mistakes, and that there can be biases; a lot of human judgment needs to go into this. But at the very least, for some of the obvious cases, people can benefit from propagating knowledge in this way from one language to another.
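A simplified sketch of that cross-language pooling idea: given a Wikidata item, walk its sitelinks and collect the lead image each language's article already uses. The production pipeline does much more (section alignment, ranking, filtering), so treat this as just the raw candidate pool; the skip list and the demo cap are practical shortcuts of my own, not part of the real system.

```python
# Simplified sketch of cross-language image candidates: pool the lead
# images already used by other language versions of one Wikidata item.
import requests

SKIP_SITES = {"commonswiki", "specieswiki", "metawiki", "wikidatawiki",
              "mediawikiwiki", "sourceswiki"}  # not *.wikipedia.org domains

def image_candidates(qid, max_langs=20):
    sitelinks = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities", "ids": qid,
        "props": "sitelinks", "format": "json",
    }).json()["entities"][qid]["sitelinks"]

    candidates = set()
    for site, link in list(sitelinks.items())[:max_langs]:  # cap for demo purposes
        if site in SKIP_SITES or not site.endswith("wiki"):
            continue
        lang = site[:-len("wiki")].replace("_", "-")  # e.g. "enwiki" -> "en"
        r = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params={
            "action": "query", "prop": "pageimages", "piprop": "name",
            "titles": link["title"], "format": "json",
        }).json()
        for page in r.get("query", {}).get("pages", {}).values():
            if "pageimage" in page:
                candidates.add(page["pageimage"])
    return candidates

print(image_candidates("Q405"))  # Q405 = the Moon; lead images reused across languages
```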
I have almost 7 minutes left, and I definitely want to cover this one; it's very exciting. The issue we are seeing is that some articles are much less visible than others. Almost 15% of all Wikipedia articles are orphans: articles that are not linked to from any other article. A good chunk of these are disambiguation pages, for example, which is okay; they can be orphans and that's fine. But a lot of them are not disambiguation pages or the other acceptable types. So there is actually a problem here: we are creating articles that we are not connecting to the rest of the network of Wikipedia. The problem is particularly tricky because what we are seeing is that biographies of women are disproportionately less connected, and they make up a large chunk of the orphan articles. On English Wikipedia, 28% of the orphan articles are biographies of women; on Spanish Wikipedia it is 35%; and on Catalan Wikipedia, 42%. So there is work to be done here. We are going to work on technologies to help support and address this, but also, if your focus is creating content, particularly biographies of women, please keep in mind, and encourage others, that if you are creating an article, you should make sure it is not orphan and is connected. Why is this important? Because, of course, everyone can go outside of Wikipedia, search for something, and come back; but a lot of readers also browse serendipitously on Wikipedia, trying to get from one topic to another, and if the article is not connected there is no good way to reach it. In theory you can search, but again, with serendipitous discovery you don't know the person you are trying to reach. So I want to encourage all of you to talk with your friends who are in this space, and let's make sure the biographies of women are not orphans and can be connected, and of course everything else that is orphan and can be connected too. This study was done on 319 Wikipedia languages, and we see this problem persist across the languages; it's not just one or two. So we have measured the extent of this gap, and we have also done a quasi-experiment in which we showed that if you add a link to an orphan article, the page views for that article increase significantly. So there is value in doing this, because people actually can then find it; intuitively, of course, we knew it, but the data supports it too. The last thing I want to mention, and I've said pretty much everything on this slide, is the issue of structural biases that may exist. This was not something we expected; we ran into this problem while doing other studies, and then we were like: wait a minute, what's happening here, why are there so many orphan articles? It's a reminder that we need to be aware of the structural biases that may exist and that may affect particular groups and particular topics in specific ways. And there's a tool you can go try out. We have put an example on the etherpad, and you can try it with that; you don't have to use that particular example, it's just that if the article you enter is not orphan, you're not going to get anything back. This particular one somebody may eventually de-orphanize, but until then it works. I'll let you play with that, if you choose to, outside of this session.
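In the spirit of that tool, here is a minimal sketch of an orphan check using the MediaWiki linkshere API: an article counts as orphan if no non-redirect, main-namespace page links to it. The full study applies additional filters, for example around disambiguation pages, across 319 language editions, so this is only the core test.

```python
# Sketch of an orphan check: an article is orphan if no other
# main-namespace article links to it.
import requests

def is_orphan(title, lang="en"):
    r = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params={
        "action": "query", "titles": title, "prop": "linkshere",
        "lhnamespace": 0,        # only count links from articles
        "lhshow": "!redirect",   # ignore incoming redirects
        "lhlimit": 1,            # one incoming link is enough to rule out orphanhood
        "format": "json",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    return "linkshere" not in page  # key is absent when nothing links here

print(is_orphan("Ada Lovelace"))  # False: a well-linked article
```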
Given that we have a couple of minutes left, I would like to go to the last slide. Wikimania is an amazing opportunity for me and my team to connect with you, in this particular case through me, because the rest of the team was not able to attend. But this is not the only opportunity; it is just to reinforce the connections and help you be aware of the work happening within the Wikimedia Foundation to support you. If throughout the year you want to be in touch: we have monthly office hours, and the link is there. You can book a one-on-one session with a member of the research team and talk with us about anything; if you go to that page, it is clear who is active in which part of the programs, and you can choose whom to talk to and book it. We are here to do that, we are here to serve you, and if it helps to have a conversation, we are here for you. We have monthly research showcases if you want to dig deeper into the specifics of research and the research world; of course you are welcome. You can always go to research.wikimedia.org to learn about the latest happenings on the research end. And there is a public mailing list, and, now that Tillman is here, he is going to remind me of the @WikiResearch Twitter handle and the research newsletter. Thank you so much, everyone. I'm happy to have around one minute to take your questions. Thanks for hanging in here and playing with us. Everything was very clear; you have 40 seconds back. Thank you, and see you at the party.