Okay, we're live. Hi everyone, thank you all for joining. My name is Kinneret, and I'm the senior research community officer, newly joined on the WMF Research team, and I would like to welcome you to this month's research showcase. The research showcases are monthly convenings organized by members of our team to recognize and share recent research on or relevant to the Wikimedia projects. For those of you joining us live, we welcome you to ask questions in the YouTube chat or on IRC. We will monitor these channels and pass questions to the speakers at the end of their presentations. We kindly ask that attendees follow the friendly space policy and the Universal Code of Conduct.

Before we begin, we have a couple of community announcements to share. The first one is that registration for Wikimania is open. If you will be in Singapore, join us there; otherwise, registration for online attendance is still open. There will be a Research, Science and Medicine track, so please join to learn more. The next announcement: if you want to learn a little bit more about the work that our team has done over the past six months, research report number eight is out now, and Isaac will share the link. Thank you, Isaac. Next, the WMF Research team is hiring a research manager. This role is a fit for a researcher with research science or data science management experience; check out the job description in the link. And for our final announcement, I'll turn it over to our colleague Hal Triedman.

Hi, everyone. I'm going to share my screen really quickly. Can you all see this? Great. So I am a senior privacy engineer with the WMF privacy engineering team, and I'm proud to announce the release of eight years of differentially private page view data, partitioned by country, project and page. This is a longstanding request from many sides of the research community. It's a big dataset: 263 million rows, and it's growing every single day, and it encompasses 342 billion page views, including 255 million rows and 295 billion page views that were previously deemed too sensitive to release. There's a table here comparing the old dataset, which is available via the Pageview API, and the new data. As you can see, we're significantly more granular: there's a lower minimum release threshold, there's no limit to the maximum number of rows released in a given day, and the data is partitioned by both country and project. And finally, and this is a very exciting thing in my opinion, we were able to release about six and a half years of historical data that was previously deemed too risky to release. I think it's going to be a great resource for research on page views in the future. If you want access to the data, you can take a look at these links, and I think Isaac will also send them in the chat as well. So yeah, excited to see what the research community does with this. Thank you so much.

Thank you, Hal. And I will now turn it over to Isaac, who will introduce the theme and speakers.

Thanks, Kinneret. So I'm Isaac Johnson. I'm a senior research scientist at the Wikimedia Foundation, and I'm glad to introduce the July 2023 research showcase. Today we'll be hearing from several researchers on the theme of improving knowledge integrity on the Wikimedia projects. I'm excited about the showcase.
It's been a while since we've talked on the showcase about this very important facet of the Wikimedia ecosystem: essentially, how editors maintain content that represents all significant points of view and is reliable and verifiable. And this showcase is also about improving knowledge integrity, so our presenters have quite literally built on our understanding of knowledge integrity to design tooling to help support it.

Our first presenter will be Aitolkyn Baigutanova, a master's degree student in the School of Computing at KAIST in South Korea. She's been working with the team since 2021 and expects to graduate in February 2024. Very excited to have her. She'll be presenting work on the reliability of Wikipedia through the lens of references; in particular, she will introduce her longitudinal analysis of how sentences needing citations and low-quality citations have evolved over the years. And I think this work is particularly timely as the Wikimedia community grapples with what it means to be in this time of large language models that can generate very persuasive-sounding Wikimedia content that might not actually hold up when it's verified. With that, I'll pass it to you, Aitolkyn, to share your screen and present.

I'm sharing my screen. Can you please confirm that you can see it? Yes, thank you. Great. Thank you very much for the kind introduction. As Isaac mentioned, I have been working with Wikimedia's research team for maybe two years now, and in this time I've been focusing on this knowledge integrity theme, which I am going to present to you today. As a master's student, it's a great honor for me to work with the Foundation and present my research here. I will introduce our paper that was published this year in the WWW (The Web Conference) proceedings, and the title of my talk is Assessment of Reference Quality on Wikipedia.

Let me start by stating the core content policies of Wikipedia, which you may already know. The first one is the verifiability policy, which ensures that readers of Wikipedia can check that information comes from a reliable source. The other one is neutral point of view: here it basically means that the information presented is unbiased and all the significant views that have been published by reliable sources on the topic are considered. And the final one is no original research, which means that Wikipedia doesn't contain original research and that any material should be supported by reliable, published sources. All three of these content policies highlight the importance of reliable sources.

However, unfortunately, and as we know, not all articles follow these core content policies perfectly. There can be articles that cite where the information is coming from, like in this example, but in some cases the information may not be supported by any reference, and commonly these parts are marked with a citation needed tag. And even in those parts that are cited, the reference can be either reliable or unreliable. For example, the Oxford dictionary could be considered a reliable source, but when some information is cited from social media, that may not be as reliable. This definitely affects the readers of the encyclopedia, who directly get to see this information either on Wikipedia or on search engines. But from a more global perspective, it affects far beyond Wikipedia.
One of the reasons is that Wikipedia, along with all this information, can be used and is commonly used for various NLP tasks, for the creation of knowledge graphs, and for the training of today's large language models. Recognizing this importance of referencing on Wikipedia, the question that we wanted to answer in our research was mainly: how can we measure the verifiability, or the reliability, of Wikipedia articles through their references? To do so, we first operationalize the notion of reference quality. In our work, we define two metrics to measure reference quality. The first one is the reference need index, which represents the percentage of citation-missing sentences. The formula is pretty straightforward, so I will not explain it in too much detail, but we do release a tool that can be used to compute this score, and the details can be found in our paper. The second index, reference risk, is the proportion of non-authoritative sources among all the references in a given revision. For this second index, we refer to the perennial sources list, which is a Wikipedia community-maintained list of sources whose reliability and use are discussed. So basically, the Wikipedia community discusses the reliability of sources and collectively decides on a reliability label. The perennial sources list on Wikipedia has five categories, which are listed on this slide. The categories of interest in our research were the lowest two, namely blacklisted and deprecated, which we refer to as non-authoritative sources; in other words, we refer to blacklisted and deprecated sources as risky for use in Wikipedia articles.

Let me give you an example of how the two scores are calculated, to make sure that you understand what the two mean and how they can be computed. Imagine we have an article that has three sentences, as here: A, B and C. To calculate our first score, which is reference need, we look at the three sentences and suppose all of them need a citation. However, out of the three sentences that need to be supported by a citation, only two are supported. It means that one out of three sentences needs a reference. Now for the second index, which is reference risk, we only consider the references that exist in that article. In our case, only sentences B and C have citations. We then combine this with a list of reliability labels, in our case the perennial sources list, and we find that one of the two citations used is risky, which results in a reference risk of one out of two. So I hope now you understand the two metrics better.

With this, we created two datasets. The first is 20k randomly sampled pages, and the second is the 10k most viewed pages. We tracked the revision history for these 30,000 pages over a 10-year period, from 2010 to 2020. Here we give the size of the revisions, and for each revision, we also provide the reference quality scores. For those who are interested, the datasets are also released, so you can familiarize yourself with them as well.
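To make the two scores concrete, here is a minimal sketch of how they might be computed for a single revision, assuming we already know which sentences need a citation and which cited domains sit on the blacklisted or deprecated lists. The domain names and the exact denominator below are illustrative; the released tool and the paper are the authoritative definitions.

```python
def reference_need(needs_citation, has_citation):
    """Share of sentences that need a citation but lack one
    (check the paper for the exact denominator used)."""
    uncited = sum(n and not h for n, h in zip(needs_citation, has_citation))
    return uncited / len(needs_citation) if needs_citation else 0.0

def reference_risk(cited_domains, non_authoritative):
    """Share of references whose source is blacklisted or deprecated."""
    if not cited_domains:
        return 0.0
    return sum(d in non_authoritative for d in cited_domains) / len(cited_domains)

# Toy revision mirroring the A/B/C example: all three sentences need a
# citation, only B and C have one, and one of the two cited sources is risky.
needs = [True, True, True]                    # sentences A, B, C
cited = [False, True, True]                   # only B and C carry a reference
domains = ["dictionary.example", "tabloid.example"]   # hypothetical domains
risky = {"tabloid.example"}                   # hypothetical blacklisted/deprecated set

print(reference_need(needs, cited))           # 1/3
print(reference_risk(domains, risky))         # 1/2
```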
With this setup, the first research direction that we wanted to pursue was to assess the reliability status of Wikipedia through the lens of its references, or in other words, the reference quality of Wikipedia. To do that, we looked at the evolution of reference quality over the 10-year period. The first score I will show is reference need, and we can see that reference need has steadily decreased throughout the analyzed period; in the year 2020, the mean reference need score reached nearly 38% for the two datasets that we considered. What's also interesting is to observe the evolution by topic. Here we considered four meta categories: history, STEM, geography and culture. We can see that the culture-related articles, which are shown in the dash-dotted line, tended to have better citation coverage, and better citation coverage means lower reference need scores.

For our second metric, which is reference risk, we observed that throughout the analyzed period it remained below 1% for both datasets. What was also interesting is that it decreased for both datasets after 2018, when the perennial sources list that I introduced earlier, the reliability label list, was created. Regarding the topical distribution, we can see that STEM-related articles, which are shown in the dashed line, tended to have fewer risky references.

We showed the distribution for the higher-level categories, but we also wanted to show the distribution of reference quality for more fine-grained topics. Consistent with what we saw in the evolution plots, we see that STEM domains like biology, earth and environment, and libraries tended to have lower reference risk, while they also tended to have higher reference need; for example, mathematics, physics and also linguistics are on the right side of this plot. On the other hand, culture-related domains like media and biography tended to have higher reference risk, and higher reference risk means that they tended to cite unreliable references more often than other topics.

Now, once we assessed the status of reference quality, we also wanted to suggest strategies, or assess the factors, that affect the reference quality of articles on Wikipedia. As Wikipedia editors are the ones who create and maintain the content, the first thing that we did was to look at reference quality and editors. We conducted a quasi-experimental analysis using matching techniques, where we looked at experienced and novice editors and how they changed the reference quality scores. So we have these two scores, and ΔRN and ΔRR basically mean the delta, or the change, in the reference need and reference risk scores. We can see that experienced users, with the edits they made, tended to reduce the reference need and reference risk scores, meaning that they improved reference quality, while novice users tended to increase, or almost not change, the reference quality scores; so they had a positive change in reference need and reference risk. Another experiment that we conducted was to look at the interaction of editors with each other, and especially of novice editors with expert editors. In our experiment, the scenario was co-editing articles with expert users. So there are two groups of users: those who co-edited articles with experts and those who did not. We observed that novice users who co-edited articles with experts were more likely to avoid increasing the ΔRN and ΔRR scores, meaning that the increase in need and risk that they caused was lower than for those who didn't co-edit with experts.
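As a rough illustration of the kind of comparison behind this quasi-experiment (not the authors' actual matching pipeline), the sketch below computes the change in the two scores per edit and contrasts experienced and novice editors within the same topic; all column names, values and the experience cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical per-revision table: reference scores before and after the edit,
# the editor's prior edit count, and the article topic used for matching.
edits = pd.DataFrame({
    "prior_edits": [5000, 3, 1200, 10, 900, 2],
    "topic":       ["STEM", "STEM", "Culture", "Culture", "History", "History"],
    "rn_before":   [0.40, 0.40, 0.30, 0.30, 0.50, 0.50],
    "rn_after":    [0.35, 0.42, 0.28, 0.33, 0.45, 0.55],
    "rr_before":   [0.02, 0.02, 0.01, 0.01, 0.03, 0.03],
    "rr_after":    [0.01, 0.03, 0.01, 0.02, 0.02, 0.04],
})

edits["d_rn"] = edits["rn_after"] - edits["rn_before"]   # ΔRN (negative = improvement)
edits["d_rr"] = edits["rr_after"] - edits["rr_before"]   # ΔRR (negative = improvement)
edits["experienced"] = edits["prior_edits"] >= 100       # hypothetical experience cutoff

# Crude stand-in for matching: compare the two groups within the same topic.
print(edits.groupby(["topic", "experienced"])[["d_rn", "d_rr"]].mean())
```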
We also looked at the importance of community initiatives like the perennial sources list and how they correlate with reference quality. Here we look at the lifespan of risky sources one year before and after they get classified as bad, or unreliable, in the perennial sources list. In our experiment, we define lifespan as the time elapsed between the addition and removal of a reference. Not to complicate the picture too much, I will just explain the median lifespan. We can see that the median lifespan of risky references before they get classified as bad was about 150 days, while for risky references after they get classified as bad, the lifespan decreased to around 45 days, which is more than three times lower. This may indicate that once editors are aware of sources being unreliable, they tend to avoid or remove such references.

Now, I also want to briefly mention further extensions of this work, because our work was limited to English Wikipedia, but we recognize the importance of the multiple language editions of Wikipedia. So we also wanted to extend this and understand the reference reliability status across multiple language editions. We use the same classification of sources into unreliable and reliable as the one that I introduced earlier, and we looked at the distribution of reference reliability; here I show it for the 40 editions with the largest number of articles, and this gray area shows a 90% confidence interval. What I think is more interesting to observe here are the outliers below and above the confidence interval. For the outlier below the confidence interval, we can see English Wikipedia, which means that it has a larger proportion of articles citing reliable sources relative to other editions. This result is consistent with what I introduced earlier, because editors, being familiar with this classification into reliable and unreliable, try to remove such sources or avoid using them. On the other side of this distribution, I list three editions, Russian, Armenian and Chinese, which tend to have a somewhat larger proportion of unreliable sources compared to other editions. We were curious why this happens, and we qualitatively assessed what kind of sources are causing this trend. When we checked, we could observe that for Russian and Armenian it was caused by a Russian newspaper website, and for Chinese Wikipedia it was caused by a Chinese tabloid. These are culture- and country-specific information providers for those editions, and for the Russian and Chinese Wikipedias, the Russian newspaper that I mentioned and the Chinese tabloid were not yet classified as bad.

To sum up, our work shows that the reference quality of Wikipedia is improving, which is a positive sign. We also want to emphasize the importance of expertise sharing: maybe the platform should consider matching editors of different experience levels so that they can co-edit articles together and collaboratively create content. Also, we want to highlight the importance of community efforts like the creation of the perennial sources list, which can greatly help with maintaining the quality of the content.
Also, we hope that our work inspires more active discussions within non-English communities, and potentially the creation of a global reliability index, especially to aid smaller editions and smaller communities. These are the links to our dataset and our paper. Thank you for listening to the presentation, and here is my contact information in case you have any questions. I am open to questions now.

Yeah, thank you, Aitolkyn, that was an excellent presentation. I'm going to start with a question from YouTube, but then I'll open it up to the other folks in the room and any other questions coming in from YouTube or IRC. The first question is how you might evaluate sources that aren't on the perennial sources list for reliability.

Yeah, this is one of the concerns that we also had. We tried using external reliability indices in our work as well. I didn't present it in today's presentation, but you can have a look in our full paper. We assessed some of the popular external reliability indices, but what we found is that they may not cover sources as well as locally maintained lists, because the sources used on Wikipedia were better covered by the perennial sources list. That's why it was our choice for this work.

Yeah, that makes a lot of sense. I also want to call on Pablo, to put you on the spot here, because I know you've been thinking about this as well: how do we evaluate sources that maybe we don't have these community annotations for explicitly?

Yeah, well, this is definitely an interesting topic. Thank you, Aitolkyn, first for this work. This is really relevant because it shows the impact of these community efforts, and also the limitations, like how limited this list is, and then the need to extend that resource. There is some work at the moment that tries to quantify the controversy of sources as inferred from data, to try not to create another list, but maybe to have an alternative signal to assist editors in deciding whether a source is reliable or has consensus. And related to that, I have a question, Aitolkyn: when you did the analysis of the impact, you focused on novice and experienced editors, and I'm wondering if you also examined bot edits, like whether bots might be using, directly or indirectly, the perennial sources list, or might be effective at removing risky references.

Yeah, for our analysis, we excluded bot edits. So we didn't consider bot accounts, just the human editors. And we also excluded anonymous users, because we cannot judge their experience; the experience that we refer to here is the number of previous edits, and for anonymous users we cannot know how many edits they had made before. And yes, we excluded bots as well.

Pablo, I'm actually gonna forward on a question from YouTube that was asking whether you know of bots that do add sources, just as an additional piece of context for that question.

Sorry, I didn't catch that. Could you please repeat it?

Sorry, that was for Pablo. There was a question about whether there are bots that do add sources, and whether you know more details about that. I thought it was interesting context for the question.

Bots that add sources? Yes, actually there are a lot of bots. And since we also looked at bot edits,
we found that there were a lot of bots adding references. I remember there were bots that were cleaning up references, so there were a lot of edits from those bots that were editing the link in some way; also, many edits were coming from bots that changed the security protocol of the link, something like changing HTTP to HTTPS. Yeah, and as Diego mentioned in the chat, Internet Archive bots as well. I think there were many edits from the bots. And thank you for the question.

Thanks for those additional details. I'm gonna pass it now to Tanya, who had a question about the editor experience analysis you did.

Hey there, thank you so much for this fascinating presentation. Exactly on the topic of this slide, I was wondering if you found differences in this experiment between topics, so like culture et cetera, or whether you're looking into that.

You mean topics and editors, like for this experiment? Yeah, whether there were differences specific to topic based on this. So the way we conducted this experiment is that we tried to match similar revisions, where the only thing that would differ is that one revision is from an expert user and the other one is from a novice user. One of the variables that we used for matching was topic, so we matched editors editing basically the same topic of article, something like that. But I haven't looked specifically at which topics caused more difference in the score; I think it is an interesting thing to look at. Thank you for the question.

Aitolkyn, you had a really nice slide about bringing this to more languages. I know that's something that's gonna come up in the next presentations as well. And I think with references in particular, I'll just preface this by saying it's a lot of work to scale this research to more languages, because despite being very central to Wikipedia, references are often unstructured, and that makes scaling, and being able to identify templates and all those things, very difficult. So I was wondering if you could talk a little bit more about the work you did there and what challenges you're facing in doing some of this language scaling work.

Yeah, I'm glad that you brought this up, because I should have also mentioned it. We started with this question: we need to consider more editions, not just English. And then we needed to have some reliability labeling for this. But when we started to look at the reliability indices, most of the external ones that we looked at were very limited in geographical coverage; they would be specific to certain countries, more like Western sources. And we even looked into the perennial sources list. What we discovered is that the perennial sources list we used was from the English Wikipedia, but it exists in other editions as well; however, in other editions it was not actively maintained, or they had very small lists without explicit labels. So there were a lot of challenges there, in that we couldn't find this kind of reliability labeling in other Wikipedia editions as accessible as in English Wikipedia, unfortunately. So this is one of the things where I really hope our work, or these presentations, can inspire Wikipedia editors, especially in non-English communities, to raise this kind of discussion on reliability. Yes. No, thank you.
And I think, yeah, the tooling can also help by demonstrating that there is a pathway: if you maintain this content, there are ways to make it readily accessible to editors, and things like that. With that, I think we can pass to our next presentation. We're gonna have Diego Sáez-Trumper and Pablo Aragón, who are research scientists and my colleagues at the Wikimedia Foundation, and they'll be presenting their research on multilingual approaches to support knowledge integrity on Wikipedia. In particular, they're gonna present a novel machine learning system aimed at assisting the Wikimedia communities in addressing vandalism, which will be Diego's part, and then a dashboard that relies on that system to monitor high-risk content in hundreds of Wikipedia languages, which will be Pablo's. With that, Diego, the floor is yours.

Thank you very much, Isaac. Let me share my screen. In the meantime, let me congratulate Aitolkyn for the amazing presentation. She started working with us when she was an undergrad student and now she's finishing her master's, and it's amazing to see all the research that she has done in this time. So thank you very much, Aitolkyn. Let me start now with this presentation. Thank you very much, Isaac, for the introduction, and thank you to all of you for attending this showcase. As Isaac mentioned, today, with my colleague Pablo Aragón, we will be talking about our work on multilingual approaches to support knowledge integrity in Wikipedia.

In the last years, the Wikimedia Foundation Research team has been working on different aspects of knowledge integrity and from different approaches. We have written several papers explaining the role of Wikipedia in the information ecosystem, and we have also released a lot of datasets that can be used within and outside of Wikipedia to train machine learning models related to content integrity and content quality. We have also produced several insights, some of them shared as scientific papers like the one that Aitolkyn just presented. And our more recent work is on creating a new generation of machine learning models to support Wikipedia and Wikidata patrollers; I will cover a portion of this in a minute. But we are not alone on this path. Let me highlight some of the collaborations that we are having on this. We have collaborations with external researchers, like people from KAIST, and I don't want to start the big list because I will miss some people, but a lot of external collaboration to develop the scientific work, and also internal collaboration within the Wikimedia Foundation. I want to mention and highlight two teams that we are closely collaborating with. One is the Trust and Safety Disinformation team: we give them quantitative insights and the outputs of our machine learning models that they can use for their more qualitative analysis, and we receive that back, which helps us improve our work and, in general, knowledge integrity in Wikipedia. And also the Machine Learning Platform team, with whom we have been working on creating scalable and high-performing APIs that can support, especially, patrollers in detecting revisions that require human attention. And this is what I'm briefly going to introduce today: Revert Risk. Revert Risk is a family of models that we're designing to automatically identify revisions that might be reverted.
Meaning, these are revisions that have some content issues or problems and that probably should be reverted. Why are we doing this? As you might know, and I don't need to highlight the motivation that much here, content integrity in Wikipedia is important not only for the readers of the online encyclopedia, but also because Wikipedia is powering other sites, and a lot of the information that you see online comes from Wikipedia. However, patrolling, or keeping this content at high quality, is difficult, especially given the number of revisions that we receive per day. Currently, just considering Wikipedia projects, we receive around 16 revisions per second, which is almost 1,000 revisions per minute. So we need models that help to identify bad edits and help to revert them.

Now maybe you're thinking: yes, but we already have models for that, and we even have systems in the Foundation, like ORES. And by the way, ORES is a project that we have learned a lot from, and we are trying to apply all these learnings in this work. However, there are several limitations with these current models. Probably the main one is that they are language dependent. That means you need one model per language: every time you want to add a new language, you need to train a new model. This is especially difficult because right now we just cover a portion of the Wikipedia language editions, so language coverage is one of the current problems. Another problem with the current approaches is that they rely on manually annotated data for training, and this is difficult: it's hard to keep this updated and also to grow to other languages. And probably equally important is that these models are already a bit old, and nowadays we have a lot of advances in technology, especially for processing text. Most of you know the large language models based on transformers that are so popular today, and for sure they can help us on the task of detecting bad-quality edits.

However, these super cool models that everyone is talking about now have some problems. One of them is that they require a lot of computational resources: you need GPUs for training, and you need a lot of resources for inference too, so they are expensive in terms of resources. At the same time, they have some problems in terms of the languages they cover. There are many studies showing that these models work well in the most popular languages on the internet, but the performance decreases for under-resourced languages, and several languages are just not covered. So the problem is that, with these large language models, we can understand the content and do complex things in the more popular languages, but we cannot apply them in many of the other languages. The other approach we can take is to go with metadata and with some rules, like the old models do, but the problem with metadata is that it introduces a lot of bias, especially related to editor characteristics, and we have found that it's especially difficult for these models to perform well on anonymous, or IP, edits. So in a nutshell, these new models are cool but expensive and a bit slow, yet if we don't use them, we cannot cover complex edits. So what is our solution for this? Basically, what we did was to develop two models.
One is purely language agnostic, based mainly on metadata, and the other model is based on multilingual BERT, a popular large language model that can understand text more deeply. The language-agnostic model, as I said, is fast and covers all languages. With this model, which is already in production, we are supporting all Wikipedia projects; if a new language is added, we don't need to retrain the model, because it is not language dependent. However, the problem is that, as I mentioned, it misses some context, and it's not super good at evaluating more complex content problems. On the other hand, the multilingual model, the one based on BERT, uses these more advanced techniques. We have put special emphasis on the model being fair on IP edits, on anonymous edits; our model performs very well for anonymous edits, and I will show some results. But the limitations are the ones that I mentioned: nowadays we cover just 47 languages, which is still much larger than the existing models — we are increasing language coverage by 60% — but this is still heavy on computational resources. So our current recommendation, if you are a developer or a researcher and you want to use these models, is that you use the multilingual model based on BERT just for anonymous edits in the 47 languages that it covers, and then you use the language-agnostic model for all other revisions. In both cases, they perform better than previous models, so you can use them confidently.

Let me now, very briefly, in two slides, explain how the multilingual model works; if you want to learn more, I will send you some pointers, but conceptually the model is pretty simple. The technology is complex, but conceptually the model architecture is pretty simple. First, we use the edit types library (mwedittypes) to identify content that has been added, moved or removed. So we take the diff, but we don't only take the diff: we know, for example, when some content was moved, when some content was added and when some content was removed. By the way, this MediaWiki edit types library is a super cool library that was also developed within our research team, a project led by our colleague Isaac Johnson, who is presenting today, together with Jesse, who was an Outreachy intern and then a contractor. We use this library to extract the content, and then we pass, in different buckets, the content that was added, the content that was moved and the content that was removed to this multilingual BERT model, and this model outputs a prediction. Then we take that prediction and mix it with the metadata that we have about the pages and about the users, and with this we output a final prediction.
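Conceptually, the pipeline can be read roughly as in the sketch below. This is only a toy stand-in: a naive sentence diff instead of mwedittypes, and placeholder scores instead of the fine-tuned multilingual BERT and the real metadata features. It is meant to show how the pieces fit together, not how the production model works.

```python
import difflib

def toy_diff(old_text: str, new_text: str) -> dict:
    """Very rough stand-in for the edit-types step: bucket sentences into
    added / removed (mwedittypes does this far more carefully, including moves)."""
    old_s, new_s = old_text.split(". "), new_text.split(". ")
    matcher = difflib.SequenceMatcher(a=old_s, b=new_s)
    added, removed = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):
            added.extend(new_s[j1:j2])
        if op in ("replace", "delete"):
            removed.extend(old_s[i1:i2])
    return {"added": added, "removed": removed}

def revert_risk(old_text: str, new_text: str, is_anonymous: bool) -> float:
    """Skeleton of the prediction step: score the textual change, then mix it
    with metadata. Both scores and the weighting here are placeholders."""
    diff = toy_diff(old_text, new_text)
    text_prob = 0.9 if diff["added"] or diff["removed"] else 0.1  # a BERT model would go here
    meta_prob = 0.6 if is_anonymous else 0.4                      # toy metadata signal
    return 0.8 * text_prob + 0.2 * meta_prob                      # toy combination

print(revert_risk("The Earth orbits the Sun. It takes one year.",
                  "The Earth orbits the Moon. It takes one year.",
                  is_anonymous=True))
```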
Here you have some numbers. On the results, you have the comparisons across different setups of the multilingual model, which you see in green, and you can see that it outperformed the baselines in all cases. What you see on your left is how the model performs for all revisions, meaning anonymous edits plus registered user edits, and what you see on your right is the result for just the anonymous edits, where you can see that the difference is much bigger. This is very important because there has been a lot of discussion about IP edits, and most of the Wikimedia communities have decided that they want to keep IP edits, so having a system that is fair to them is very important. What you see here as the rule-based baseline would be a model that just rejects all IP edits, which would not be very good, and we are doing much better than that.

If you don't want to just believe these numbers and you want to test and play with the models, the model is already published on the Lift Wing API documentation page; Lift Wing is the system developed by the Machine Learning Platform team to serve machine learning models. If you go to that URL, which will be shared in the chat too, you can play with the models; it's pretty simple to work with them. Here you have an example of how they work: you can work from your command line or from a Python notebook. If you remember the fairness example that I showed in the slide about Mariah Carey, and why that revision needs to be reverted, you can test how the language-agnostic model and the multilingual model work with it, and you will get a prediction. It's important to highlight that a prediction of true means that the revision should be reverted, and here you also have the probability, so this is a highly probable bad edit.
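For instance, a request from a Python notebook might look like the sketch below. The endpoint path, model name and payload follow the Lift Wing documentation for the language-agnostic Revert Risk model as I understand it, but the exact details (including whether an access token from the API portal is required, and the precise shape of the response) should be checked against the documentation links shared in the chat; the revision ID is made up.

```python
import requests

# Any recent revision ID on the target wiki can be used; this one is made up.
payload = {"rev_id": 12345678, "lang": "en"}

resp = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict",
    json=payload,
    headers={
        "User-Agent": "research-showcase-example",
        # "Authorization": "Bearer <token from api.wikimedia.org>",  # may be required
    },
    timeout=30,
)
resp.raise_for_status()

# The response includes a boolean prediction (true = the revision should
# probably be reverted) together with the associated probability.
print(resp.json())
```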
Also, if you want to go deeper, you can learn more about this: all the machine learning models hosted on Lift Wing, all the machine learning models that the Wikimedia Foundation serves, need to have a model card, so you can find all the details about the model there. And you can not only read about it: following our open source policy, you can get the model from the repository, check it, clone it, and improve it, following the full open source path. And if you want to understand more of the scientific details of how we built the multilingual model, you can read our paper, which was just accepted and will be published at the KDD conference this August. I will put the acknowledgments at the end. With this, I will pass the microphone to my colleague Pablo Aragón, who will explain what other things we are doing, apart from supporting patrollers with this model. Pablo, the mic is yours.

Thank you, Diego. Okay, so we present now the Knowledge Integrity Risk Observatory. Next slide, please. Well, for some time Wikipedia was considered an unreliable source, for instance in education, but this perception has been changing in recent years. Analyses like the one on the left, about the last U.S. election, concluded that, in comparison to social media, Wikipedia was a refuge from big tech disinformation: core content policies like verifiability, and human-centered approaches to content moderation, were found to be key to preventing misinformation to a great extent. And also, as Diego mentioned, this not only affects Wikipedia readers: this story on the right, published yesterday in the New York Times, explained that in the era of AI and large language models, Wikipedia is probably the most important single source for training those models. So the integrity of knowledge in Wikimedia projects is key for the reliability of the online information ecosystem. Next slide, please.

However, when referring to Wikipedia, most people think about English Wikipedia, and there are more than 300 language editions. This might be known by everyone here in the room, and by most viewers of this showcase, but I should emphasize that reliability across wikis might be very different. The story on the left was published in Slate some years ago, around the time I joined the Foundation, and it covered different circumstances of the Japanese Wikipedia that might have favored the emergence of structural and widespread misinformation and disinformation, in particular related to historical revisionism. Similar patterns of historical revisionism and disinformation were found in an evaluation of the Croatian Wikipedia conducted by an external expert on the subject matter, which the Foundation published for transparency; that report also suggested different factors that might have influenced the emergence of these patterns. Next slide, please.

So we created a taxonomy of knowledge integrity risks from a literature review, which I presented in the December showcase, and then we built a system, a dashboard, showing metrics from these categories, computed for each language edition and each month. To accelerate the deployment and maintenance of the dashboard, we relied on the Superset infrastructure of the Foundation, which requires a Wikimedia developer account with staff or NDA access, also because some metrics rely on non-public data, like geographical edit and view data. In that regard, the work shared by Hal at the start of this showcase on differential privacy might bring new opportunities. Next slide.

To showcase this first version of the Knowledge Integrity Risk Observatory, I will examine some claims about the Japanese Wikipedia extracted from that story published in Slate. It was said that the English Wikipedia is viewed by people across the globe, but other language versions, such as the Japanese one, are primarily viewed and edited by people of one particular nation. This is true, and actually the Japanese Wikipedia is the most unequal among the largest wikis: by computing the Gini coefficient over the distribution of views by country and edits by country, it is highly concentrated in just one country, and drastically so for views. Next slide, please. It was also said that the Japanese Wikipedia has the lowest number of admins per active users across all language editions, and the metrics comparing these numbers, active editors and admins (considering sysops and bureaucrats), validate the claim. Next slide, please. But that claim is followed by another one, which says that a few dozen people within the Japanese Wikipedia have power over what goes on in the platform.
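As a side note, the concentration measure mentioned a moment ago, the Gini coefficient over per-country activity, can be computed with a few lines of code; the per-country view counts below are made up purely for illustration.

```python
import numpy as np

def gini(values):
    """Gini coefficient of a non-negative distribution:
    0 = evenly spread, values close to 1 = concentrated in one unit."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Hypothetical monthly page views by country for two language editions.
views_spread = [120, 100, 95, 90, 80, 75]        # views spread across countries
views_concentrated = [980, 5, 4, 3, 3, 5]        # almost everything from one country

print(round(gini(views_spread), 2), round(gini(views_concentrated), 2))
```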
Examining power over what goes on in the platform cannot be directly measured with the language-agnostic quantitative approaches of the Observatory, but we can observe that, among the largest wikis, the Japanese one has a pool of admins that is remarkably older, older in terms of the time since they signed up on the wiki. That is, editors with less than 10 years on the wiki hardly ever become admins; they are mostly outliers in those cases. So there is a concentration of editors from the first generation who retain admin rights. Next slide.

So, well, this was the first version of the Risk Observatory, which allows us to provide data to Trust and Safety about conditions that might favor the emergence of knowledge integrity issues. But several limitations exist, for sure, and two of them are very relevant. The first one is that this design is based on risks retrieved from a literature review, but they are not part of a formal or systematic approach to knowing when a project is under a knowledge integrity risk, so there is a need for a ground truth. And second, there are many metrics, and that brings complexity when examining whether a language edition is in a risky status. In alignment with Occam's razor, good models are parsimonious: they should achieve a good level of explanation with as few variables as possible. Next slide, please.

To address these two limitations, a second approach to the Risk Observatory was implemented. First, as we need a ground truth, we approximate one — it is not a true ground truth, but an approximation of a ground truth of knowledge integrity risk — by using the Revert Risk model to quantify the presence and the moderation of high-risk revisions. The intuition is basically that we expect common forms of knowledge integrity threats, for instance vandalism, to be captured through revisions with a high predicted likelihood of becoming reverted, using the machine learning model. I should note that, when I mention the machine learning model, for the Risk Observatory we are using the language-agnostic one, since we need to provide support for as many languages as possible; as Diego mentioned, the multilingual one has several advantages, but it is limited to the languages it covers. Then, for the second limitation, to simplify, we reduced the metrics to three simple ones: the ratio of reverted revisions, the ratio of high-risk revisions, and the ratio of high-risk revisions that get reverted. This creates a question that needs to be solved: what is a high-risk revision? Given a revision and its score, when do we decide that the revision is high risk or not? Next slide, please.

Well, we compiled a dataset of revisions in all wikis in the first three quarters of 2022; it's actually a dataset very similar to the one used for training and testing the model. And for 100 thresholds, that is, from 0.01, 0.02, and so on up to 1, we examined the probability distributions of revisions getting reverted and not reverted, and as you can see here in the results, the model works quite well in identifying whether a revision will be reverted or not in large wikis. Next slide, please. But obviously these classes, reverted and not reverted, are not balanced: fortunately, there are many more revisions that are not reverted, that are bringing value, that are part of knowledge creation. Also, there is a bias concerning IP edits, and the model was not trained with bot edits, since bot edits are generally expected to be done in good faith. Therefore, we conclude that a high-risk revision must be a revision done by a registered editor that is not a bot, and with a Revert Risk score, meaning the output of the model, above the threshold at which the model accuracy is maximum for that wiki. This is an example for one Wikipedia: a high-risk revision is a revision from a registered, non-bot editor with a score from the machine learning model above 0.96, which is the threshold at which the model accuracy is maximum, that is, where we capture the largest percentage of true positives and true negatives. Next slide, please.
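A minimal sketch of how such a per-wiki cutoff might be derived and applied is shown below; the column names, the toy sample and the use of plain accuracy are assumptions for illustration, not the Observatory's actual pipeline.

```python
import numpy as np
import pandas as pd

def best_threshold(scores, reverted, candidates=np.arange(0.01, 1.0, 0.01)):
    """Pick the score cutoff that maximizes plain accuracy
    (true positives plus true negatives) on a labeled sample."""
    scores = np.asarray(scores, dtype=float)
    reverted = np.asarray(reverted, dtype=bool)
    accuracy = [((scores >= t) == reverted).mean() for t in candidates]
    return round(float(candidates[int(np.argmax(accuracy))]), 2)

# Hypothetical sample of scored revisions for one wiki.
revs = pd.DataFrame({
    "score":      [0.05, 0.20, 0.55, 0.80, 0.97, 0.99, 0.30, 0.92],
    "reverted":   [False, False, False, True, True, True, False, True],
    "registered": [True, True, False, True, True, True, True, True],
    "is_bot":     [False, True, False, False, False, False, False, False],
})

t = best_threshold(revs["score"], revs["reverted"])
# High risk = registered, non-bot editor with a score at or above the cutoff.
revs["high_risk"] = revs["registered"] & ~revs["is_bot"] & (revs["score"] >= t)
print(t, int(revs["high_risk"].sum()))
```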
So, with these thresholds and an extended dataset of all revisions in all wikis in 2022, but also in 2021, we have generated a new prototype with a second version of the Knowledge Integrity Risk Observatory; again, to accelerate the deployment process, we are relying on Superset. Next slide. I will show you an example with the Spanish Wikipedia; I'm taking this one because it is my mother language, so I can provide you with a better understanding. We had said that there is a decreasing trend in the rate of revisions getting reverted over time, but interestingly, if we focus not on the overall revert rate but on the revert rate of high-risk revisions, we found a remarkable drop in the months of July 2021 and July 2022. That means that in those months there were lower rates of moderation of those revisions, and we are interested in gathering more data to check whether this is a seasonal effect; for instance, in the Spanish Wikipedia there are many editors based in Spain, and this is typically the time for summer holidays. To give a better understanding of one of these peaks, we have examined pages that presented many high-risk revisions in those months, in July 2022. Next slide, please.

As you can see, many of them are related to football teams and players. We inspected a sample of high-risk revisions on these pages, and many of them were simple acts of vandalism, like adding offensive nicknames to football players. However, not all pages are about sports: there are, in this top 10 ranking, two pages about politics, one the biography of Raúl Castro, the former president of Cuba, and the other the page on paramilitaries in Colombia. We inspected high-risk revisions with the Observatory — this data is also presented in the dashboard — and this is one example: a revision adding disinformation to the biography of Raúl Castro, stating that he was married to a commander of the Cuban revolution, which is not true. This could be an anecdotal case of vandalism, but if anyone examines the contributions of the editor, similar edits with this disinformation were made on pages about the recent history of Cuba on the Spanish Wikipedia, but also on Wikidata and Commons. So now we have one dashboard that allows us to navigate all this data very easily and very quickly. Next slide.
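For completeness, the three ratios shown in the dashboard can be thought of as simple monthly aggregates over a revisions table; the sketch below uses a hypothetical table with made-up values and is not the actual Superset query.

```python
import pandas as pd

# Hypothetical per-revision table for one wiki.
revs = pd.DataFrame({
    "month":     ["2022-06", "2022-06", "2022-07", "2022-07", "2022-07"],
    "reverted":  [False, True, True, True, False],
    "high_risk": [False, True, True, True, True],
})

monthly = revs.groupby("month").apply(
    lambda g: pd.Series({
        "revert_rate": g["reverted"].mean(),
        "high_risk_rate": g["high_risk"].mean(),
        # Of the high-risk revisions, how many were actually reverted (moderated)?
        "high_risk_revert_rate": g.loc[g["high_risk"], "reverted"].mean(),
    })
)
print(monthly)
```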
So, what's next? Next slide, please. One future line would be to extend this approach of using implicit annotations, like reverts, to other cases that are relevant for knowledge integrity, like new page creation and article protection. Also, there is a model for Wikidata that Diego and the team are working on, which is in the evaluation phase, and we are interested in creating new models. But we are also committed to analyzing the already existing models to continuously improve them. We are interested in examining whether the way these models are trained, using general dumps with data from all wikis, over-represents patterns of the larger wikis, and whether we should balance the data to capture the patterns and specificities of every individual project. Also, Diego mentioned that there is a bias against certain edits, like IP edits, and we are interested in examining other sensitive groups of community members, like newcomers; that relates to what Aitolkyn presented, trying to mitigate these models penalizing good-faith editors who are just missing some knowledge of how to contribute to Wikipedia. And last but not least, these models are using implicit annotations, like reverts, but we can also incorporate explicit annotations, like feedback from editors, to retrain the model; or the model could be retrained by monitoring data drift, examining whether a model needs to be retrained with more recent data that captures new dynamics. Next slide, please.

So yeah, these machine learning models that Diego has been leading were built with a team including Muniza Aslam, Mykola Trokhymovych and Ai-Jou Chou, with great support from the Machine Learning Platform team, and the work on the Risk Observatory has benefited from very valuable feedback from the Trust and Safety Disinformation team. So thank you all for these contributions; we are always committed to external collaboration, but these internal collaborations are as important as the external ones. Next slide. And thank you everyone here today for your attention; now we are pleased to start the Q&A.

Thank you very much, Diego and Pablo. As we kick off the questioning, I'm going to pass it to Martin in the room, and remind folks on YouTube: if you have questions, please put them in the chat there.

Thank you for the presentation, very interesting. I have a question around the Observatory and the potential scores or risks you could surface. I'm interested in whether there's a way you have in mind — I imagine it's very difficult — to validate the output of this model: okay, maybe it indicates there's a risk for a specific wiki at a specific point in time; is that something we could validate, whether that is true, maybe with specific events from the past where we know there were specific issues? I don't know if you have been thinking along these lines.

Thank you, Martin, this is an excellent question, and it has been pushing me for months or even years. This is one of the main challenges: how to retrieve high-quality ground truth data that can help us to validate. This was not the case for the first version, and formally it's not even the case for the second one; this is an approximation. So we're trying to work with the disinformation specialists, who are aware of possible situations since they are in contact with the communities, to check whether these metrics are useful to capture some risks. Some of the metrics are about community capacity, like tools; there are very effective tools for knowledge integrity, like the AbuseFilter, so AbuseFilter activity, in terms of editing the rules and the hits on the rules, might help to find issues at some point. Also, with some external collaborators, we were discussing whether to use the global block logs versus the local block logs as a comparison of whether a community is able to moderate issues internally, or whether there is a need for global moderation to solve those problems.
So yeah, this is one of the main things we are exploring. In the taxonomy, there were risks that were external, and in that case we were using external data: for the geopolitical risks, we were using the index from Reporters Without Borders, and then using the geographical data to see whether the views and edits of a community were coming from countries at risk in terms of press freedom, and using that as a proxy. But still, this cannot formally be considered ground truth data, so I would be pleased to discuss any contribution, because this is work in progress.

Yeah, I want to add something about ground truth in general; two things. One is, in terms of improving the machine learning models, one thing that we are doing is collaborating with the community, and especially now there's a project by the Moderator Tools team that wants to start using these models at a larger scale in all wikis. One thing that Pablo mentioned at the end is that we want to give the model the ability to be retrained considering specific, explicit feedback from admins, and I think that's very important, because this will always be an iterative process. And it's not only an iterative process; it's kind of an adversarial problem, because there are actors that want to introduce disinformation. Vandalism is one thing, but even with vandalism, people will learn what the model is able to capture, so there will always be some drift. So one thing that is very important for us is the ability to monitor the quality of the system and to retrain the model periodically, and this is something that we are studying this year; I hope that we get the resources to do it. And also that we develop a system that can take admins' feedback into consideration, but doesn't overfit on that feedback, because in the machine learning space that's the problem you have if you give a lot of weight to a few inputs. If we're analyzing 1,000 edits per minute and we receive, I don't know, 20 pieces of feedback, or revisions with feedback from admins, per day, there's a challenge there in how to incorporate this, and this is something that we are doing research on, hopefully this year.

Thanks, Diego. I wanted to also follow up on your work, connecting it with what Aitolkyn had shared, and I'm wondering whether you see value, or a possibility, in incorporating some of that work around reference risk and, you know, the perennial sources list into your modeling; and, more generally, where you see the most room for improvement in detecting vandalism, like what sorts of features are we thinking about?

Sorry, can you say that again? Can the others not hear me? All right, I think Diego got frozen for a while. Yes, sorry, the connection broke, but now it's good. Okay, I'll just repeat: I was thinking about how we can connect Aitolkyn's work to what you were doing with vandalism detection, incorporating features like that, or just features about other aspects of content.

Yeah, that's a super good question. We are in this learning process, because given that this is applied research, we always need to see what can be applied wiki-wide later. We have seen that even a model like BERT, which is not the latest
generation and is very far from these super large language models like GPT or the new LLaMA (there are a lot of them now), can be used to isolate some specific types of behavior, like for example adding unreliable sources. There's also a lot of interest in patrolling images, and for sure we can build models that understand images; there's a lot of work on computer vision, and we have an expert in computer vision on the team, Miriam Redi. So I'm sure that we have the scientific capacity to do this; the thing is how close we are to being able to apply this to evaluate every revision that happens on Wikipedia. And, as you're aware, there is the work that the Editing team is doing on revisions, so maybe we are able to understand some specific taxonomy of revisions, and that's why the edit types library is super useful, but also what the Editing team is doing, trying to work for example with references: if we know that this is a specific edit on a reference, probably via a tool or something in the visual editor that tells us to just go analyze this specific piece of text, I think it's something that we could consider. So scientifically the problem is not impossible to solve — it's not easy, for sure — but technically, how we do this at large scale, I think that is the most challenging piece.

I'm not sure whether what I'm going to mention relates to what Diego just said, but just to be sure: I see one connection between what was presented about predicting the likelihood of a revision getting reverted and all this work on the value of community initiatives like the perennial sources list. The analysis that has been done so far, and many use cases, are about content moderation once the content is published, but maybe there is an opportunity to provide input to editors before publishing the edit. I make this connection because we had this conversation with Isaac: maybe this could assist editors while creating an edit, knowing not only whether a revision is likely to be reverted, but also — and this is an important part of this model, and I think the API also brings some context about explainability — what the reasons are if there is a likelihood of getting reverted. That can help inform users about the conditions of that specific edit that might be problematic, so they can learn before the edit is published. And I make a connection with what Aitolkyn presented because, if a user is adding a source for which there is already a community consensus that it is unreliable, it might be relevant to inform the user before they save that edit. I think it could be another interesting application.

Thanks, Pablo. I think that's a good place to wrap this up, pointing out that we've got a lot of good work in this space, but also still lots of opportunity to support both the patrollers and the editors. And we'll wrap this up.

Thanks, Isaac, and thank you for the presentations; this was a great showcase. I'd like to close out with a few words of thanks: to Aitolkyn, Diego and Pablo for presenting today. This showcase was also made possible thanks to the coordination on the team with my colleagues Isaac and Pablo, so thank you both, and Isaac, thank you for taking the questions. Emeril, thank you for the AV support, and Janet for the coordination support. Thank you to everyone
who watched, listened and asked questions. Again, there will be no showcase in August because of Wikimania, so the next showcase will be in September. Our theme will be rules on Wikipedia, and this is in response to the attention this topic received at the Wiki Workshop. We're looking forward to seeing you all there. Thank you, everyone.