Ladies and gentlemen, please welcome Santosh, who is going to be talking about machine learning on our projects.

Hello, hello, hello. Hello and welcome everyone, and hello to everyone watching this online too. I hope everybody is doing well. This morning I attended two sessions, both on generative AI. There were very interesting discussions: whether generative AI can create articles, and how it will scale to the number of languages we support. But this session is about the current state of machine learning techniques used in Wikipedia projects. It's not about the future; it's about the current state. I am Santosh Thottingal, a Principal Software Engineer at the Wikimedia Foundation. In the next 30 minutes, I will briefly cover the existing major projects, and I will go through their use cases, guiding principles, product design, challenges, and their impact across the whole Wikimedia universe.

The guiding principles for using machine learning in Wikipedia projects are not complex; they are simple. It's all about empowering our users, whether they are editors or readers. We want to make sure that we are not discriminating based on language: all the tools we develop should be available in as many languages as possible. That is one of the guiding principles. The second is that editors and users stay at the center of the workflow. They are in charge of creating content; they are the masters. The AI is there only to assist them, to save their effort and their time. But instead of going through these abstract principles, I will use existing projects in the Wikimedia Foundation and explain how their product design, their use cases, and the way people use them exemplify these principles.

I will start with one of the major projects within the Wikimedia Foundation, called Content Translation. It uses machine translation, and it was started back in 2015. Content Translation is about translating an article from one language to another when it doesn't exist in the target language. This requires a slightly different skill from users: they should be bilingual, for obvious reasons. They don't necessarily need to know the complexities of wiki editing, like wiki markup, wikitext, or references, because the article already exists in another language. They need to translate according to the grammar rules of the target language so that the result reads naturally. That is the skill it requires.

We provide this tool to address one major problem in the Wikimedia universe: the language gap, a well-known problem that everybody is aware of. We have 6 million plus articles in English Wikipedia, 2 million plus in German Wikipedia, 657k plus in Indonesian Wikipedia, 84k plus in Telugu, and then a long series of languages with different sizes, across 320 languages in total. If you look at this illustration, please don't conclude that the knowledge gap does not exist for English. English faces the same issue: there are articles in German, Indonesian, or Telugu Wikipedia that do not exist in English, so English can also benefit from knowledge transferred from smaller Wikipedias. It's not one direction, from a single large Wikipedia to smaller ones.
So this is our language gap, and it is one of the core issues we are trying to solve with the Content Translation tool. But I want to emphasize that this is not the single solution; the knowledge gap is a complicated issue, and this is just one way to address it.

The way we designed the product is to make it easy for everybody to use. You have the source article on the left side, in the source language, for example English, and a placeholder where you can add more and more content in your target language, and you are presented with a lot of tools to work with. As a user, you click on these placeholders, like "Add translation", and you are presented with an initial version of the translation of the source article; we use machine translation, wherever available, to fill in this content. So you already have a draft version of the translation produced by machine translation. Then you need to edit it. If you think you can just take the machine translation and publish it as-is, you are mistaken; that's not the concept here. You are presented with an initial version, and you need to edit it and make sure it reads naturally in your language, because we know that machine translation is not always perfect. Its quality varies per language: sometimes it's good, sometimes it's bad; sometimes it gets facts wrong, sometimes it gets grammar wrong. You need to make all these edits, and only then can you publish. You also don't need to keep a literal translation of everything: you can edit it, remove certain parts, make it culturally appropriate, and then publish.

But for people who just want to publish without human edits, we do measure this: how much post-editing is done on top of the machine translation. We show you the amount of editing, the amount of curation you did, on a percentage scale, and we have a threshold for how much human editing needs to happen before a translated article can be published on a target Wikipedia. This threshold is not decided by us but by the community. The community tells us, for example, that 95% unmodified machine translation is allowed on their wiki, depending on the current quality of the machine translation engine for their language. We enforce that threshold, and if it is not met, we prevent people from publishing. This is how machine translation quality is assured on every Wikipedia.

To come back to the guiding principles: when you publish an article, it's not the machine translation engine publishing the article; it's the user. It's up to the users to make sure the article meets the quality standards of the target Wikipedia. The article is attributed to the editor. They use machine translation only to save their time and effort, to make the best use of their time. That is the principle.
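Just to make the threshold mechanism concrete, here is a minimal sketch of how such a check could work. This is an illustration only, not our actual implementation: the function names and the example limit are invented, and the real system works per wiki on the structured translation data, not on plain strings.

```python
import difflib

def modification_ratio(mt_text: str, published_text: str) -> float:
    """Fraction of the machine translation that the user changed (0.0 = untouched)."""
    similarity = difflib.SequenceMatcher(None, mt_text, published_text).ratio()
    return 1.0 - similarity

def may_publish(mt_text: str, published_text: str, max_unmodified: float = 0.95) -> bool:
    """Allow publishing only if the unmodified portion of the machine
    translation stays within the community-configured limit."""
    unmodified = 1.0 - modification_ratio(mt_text, published_text)
    return unmodified <= max_unmodified
```

In the real tool the limit is configured per wiki by the community, as I said, and the measurement is more detailed than a plain string comparison, but the idea is the same.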
So far, the impact: since this tool started in 2015, 1.6 million articles have been published using it. To get a sense of that, if you combined all these 1.6 million articles into a single Wikipedia, it would be one of the top 10 Wikipedias we have. We were also measuring whether the published articles meet the standards of their wiki, that is, whether these articles later get deleted based on community standards. We saw that 4% of these published articles were later deleted, for various reasons. To understand whether that 4% is good or bad: on average there is a 13% deletion rate for articles started from scratch on a Wikipedia by any editor. So if you start without translation, the probability that your article gets deleted is about 13% (a rough estimate), while for articles published using machine translation it is 4%.

There is no single machine translation engine that can support all 320 languages to integrate with the Content Translation tool, right? So we have been integrating various machine translation engines into the system since the beginning; only that way can we cover as many languages as possible. We started with Apertium in 2015. At that time, it was one of the existing free and open-source systems with good quality for many languages. I want to note that Apertium is not a machine-learning-based machine translation system; it is a rule-based system, and it's free and open source. We used Apertium for 45 languages. Then, based on community requests, we were asked to integrate Google's machine translation service as well. We collaborated with Google and got discounted API credits to integrate it. We also got community feedback asking for the Yandex machine translation engine for Russian and related languages, so we collaborated with Yandex and integrated it too. Then we got feedback from the Chinese community that the LingoCloud machine translation engine is good for English to Chinese and Chinese to English, so we integrated that as well. We also integrated Elia for Basque and related languages. Combined, these cover many languages: for example, Google covers 134 languages, Yandex covers 99, and LingoCloud and Elia support some others. But I want to mention that some languages are covered by multiple engines, and we provide this as a choice to the users, so they can choose the engine that best meets their needs and is best for their language. That's why we have multiple engines and give users the choice to pick the right one.

But even with all of these combined, we had problems. One: depending on these external machine translation engines is not sustainable. Two: if we want to scale up machine translation in our infrastructure to more and more use cases, we should have our own machine translation capability. So this year we worked on launching our own machine translation engine, hosted within the Wikimedia Foundation, that we can use for any purpose in the Wikimedia Foundation. We named it MinT, M-I-N-T, and it is self-hosted. We announced it last June. It is a free and open-source machine translation engine hosted in our infrastructure, and it provides a single unified machine translation API so that it can be integrated into various tools.
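To show what a single unified API means in practice, here is a rough sketch of a client call. The endpoint URL and the payload shape are assumptions invented for this sketch, not the documented interface:

```python
import requests

# Hypothetical MinT-style endpoint; the real URL and payload may differ.
MT_API = "https://mint.example.org/translate"

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Ask the unified machine translation service for a translation."""
    response = requests.post(
        MT_API,
        json={"source": source_lang, "target": target_lang, "text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["translation"]

print(translate("Wikipedia is a free encyclopedia.", "en", "ml"))
```

The point is that every tool talks to one interface, and the engines behind it can change without the tools changing.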
MinT is powered by various neural machine translation engines, starting with NLLB. NLLB is a project by Meta (Facebook); it stands for No Language Left Behind, and it supports a great many languages. The Wikimedia Foundation collaborated with Meta, and we also have machine translation models fine-tuned on Wikipedia, called NLLB-Wikipedia, to produce content that follows the Wikipedia style of writing. Then we have OPUS-MT, a project by the University of Helsinki, to support some very low-resource languages that are not supported by any other commercial or open-source MT engines. We also got feedback from the Catalan Wikipedia that the Softcatalà neural machine translation engine is the state of the art for English-Catalan, so we integrated it. All of these are under open-source licenses; I need to mention that. And recently, the IndicTrans2 project by AI4Bharat at IIT Madras released machine translation models for 22 Indian languages, and from feedback and testing we found that it is the state-of-the-art machine translation model for those 22 Indian languages, from English or from these languages to English. So we integrated that as well.

Combining all these backends, we provide a unified interface, and we now have the capability to support machine translation for 198 languages, which is, I think, probably one of the largest machine translation systems in production anywhere in the world. And when I say 198 languages, that's the number of unique languages; we allow translation between these languages, so counting pairs, we currently support 35,924 language pairs. You can translate between any of these languages.

Another advantage of having this self-hosted machine translation system is that we can integrate it with more and more products in the Wikimedia Foundation, to reduce the knowledge gap, make content accessible, and lower language barriers. We recently integrated it with our localization platform, translatewiki.net, where we translate and localize our software interfaces; this machine translation engine is now used in that system as well.
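I mentioned that some languages are covered by multiple engines and that MinT combines several backends behind one interface. Conceptually, the routing works something like the sketch below. The mapping is invented for illustration; the real configuration is maintained per language pair, and where engines overlap, the community-chosen default wins:

```python
# Invented routing table for illustration only; MinT's real configuration differs.
BACKENDS = {
    ("en", "ca"): "softcatala",      # community-preferred for English-Catalan
    ("en", "hi"): "indictrans2",     # Indian languages
    ("en", "ml"): "nllb-wikipedia",  # Wikipedia-style fine-tuned NLLB
}

def pick_backend(source: str, target: str) -> str:
    """Route a language pair to a backend, falling back to the broadest model."""
    return BACKENDS.get((source, target), "nllb")
```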
Let's talk about knowledge integrity. Here we are talking about vandalism prevention and patrolling: helping people review the edits happening on Wikipedia. This is one of the oldest machine learning projects within the Wikimedia Foundation. It was started back in 2015, and it got a lot of press coverage about how we use machine learning for knowledge integrity and vandalism prevention on Wikipedia, and for making reviewers' jobs easier. We also have research published in various places about making this participatory machine learning, where we collaborate with the communities to make the machine learning work for their language. In basic terms, it works like this: a lot of edits happen on every Wikipedia, and all these edits are classified into certain categories, such as "this is a good edit", "this is a bad edit", or "this needs review". This service was named the Objective Revision Evaluation Service, or ORES, in those days. That is the basic technique. These evaluations are then presented to reviewers on the Special:RecentChanges pages: whenever you go to recent changes, edits come with some highlighting so that reviewers can filter them, decide which ones are good and which are bad, and make their own decisions.

I need to note that this only informs the reviewers; the machine learning tool does not make any decisions about your edit. It does not revert content; it just informs the reviewers about the quality of the content. So here, the editors are in the loop. They are in charge, the reviewers are in charge, and they can also set thresholds for how strict the system should be on their wikis, according to their preferences.

This system is being modernized these days. The original system that started many years back, ORES, is being deprecated. The new model is called the revert risk model, and it is hosted on our modern machine learning hosting platform called Lift Wing. This is already in production. One reason for the modernization is to support more and more languages; it required a lot of work to support more languages with the old revision scoring. Currently, there are two experiments happening. One is to use a large language model to score these edits. The other, and I think this is the direction we want to move in, is a language-agnostic scoring system, where we can score edits independent of the language. This table shows a comparison. Whichever model we use, both are trained on past reverts, the history of all the edits happening on every Wikipedia. If we use a large language model, as you may already know, it requires a lot of computing power, training is expensive, and such models are not good at supporting all languages; I think that model supports 47 languages. But we want this to scale beyond 47 and be available for all languages, so we are working on the language-agnostic scoring system and rolling it out slowly. It's early days.

Let's talk about another use case of machine learning: structured tasks. Here we are talking about very simple tasks that we can suggest to newcomers, people who just registered on the wikis, so they get an idea of what to contribute. I know I can contribute to Wikipedia, but what exactly? This tool suggests a simple task: adding links to an existing article where they are missing. It takes a newly registered user through a simple workflow. First, the suggestions are presented on their newcomer home page. Second, in the onboarding workflow, the tool explains the importance of adding links and how it helps readers. In the third step, it informs you that this is a suggestion from a machine learning system, and that you need to make an informed decision about whether adding the link is OK, whether the suggestion is good or not. That means you are in control. Then you are presented with an article and a suggestion, and you need to check the details of the link: where it goes, whether it is the correct link, and whether it makes sense to add it there. Then you say yes or no; if you press yes, the link is added, and you can see the change published in the article. Later, you are encouraged to continue this kind of contribution on the wiki. So this is the simple workflow we take newcomers through.

How does this work? It is trained on existing sentences, existing links, and missing links; we then generate a list of candidate links we can suggest, score them, and present them to the users. That's the algorithm in very simple terms.
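As a toy illustration of that generate-candidates-and-score idea, the sketch below matches phrases against existing article titles and keeps only confident candidates. This is not the actual model; the titles, the score, and the threshold are all invented:

```python
# Toy link-recommendation sketch; a real model scores link probability in context.
EXISTING_TITLES = {"machine translation", "wikipedia", "neural network"}

def suggest_links(sentence: str, threshold: float = 0.7) -> list[tuple[str, float]]:
    """Return (phrase, score) link candidates worth showing to a newcomer."""
    words = sentence.lower().split()
    candidates = []
    for n in (1, 2):  # try one-word and two-word phrases
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in EXISTING_TITLES:
                score = 0.9  # placeholder for a learned confidence score
                if score >= threshold:
                    candidates.append((phrase, score))
    return candidates

print(suggest_links("Machine translation helps Wikipedia editors"))
```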
The impact we are seeing with these suggestions for newcomers is very positive. These are rough estimates of our impact so far. We see a 17% increase in activation: the probability that a newcomer makes their first edit rather than just registering and never editing. Second, we see a 16% improvement in retention: newcomers staying and continuing to edit on that Wikipedia. Then productivity: an increase in the number of edits newcomers make during their first couple of weeks. These are all rough estimates; it's somewhat difficult to measure. We also see a decrease in the revert rate compared to the edits a typical newcomer would make without these suggestions, about minus 11%. Again, these are rough estimates of how the tool is working and its impact. There are also a few challenges with this kind of system. If newcomers add links, moderators may say it's more work for them to review. And if we want to support more and more languages, we need more models, and to train those models we need more data; if a wiki has only a small amount of data, training may not be as effective as for a large language.

Let's talk about a simpler machine learning technique we use: optical character recognition. This is used on Wikisource for digitizing historic documents, or any other documents, so that we can host them on Wikisource. We use Tesseract, a free and open-source machine learning system for optical character recognition, to scan a document and convert it into text. Tesseract is available for 100-plus languages. In addition, we use Transkribus, an externally hosted system, for historic documents, handwriting, palm leaves, and other sources that are not printed. We also use Google Cloud Vision OCR for the same purpose.

I want to talk about Lift Wing, which is an effort by the machine learning team of the Wikimedia Foundation to modernize our platform so that we can easily host machine learning services in our infrastructure as simple microservices: people can easily host new machine learning models, we can deploy and iterate on them, and we get fault isolation and all the other properties we need from a modern machine learning platform. It roughly works like this. People need to prepare their model as a KServe-based service running on Kubernetes, but those are technical details; if they don't mean anything to you, just skip them. Each model is clearly documented using model cards. I will talk about model cards in a few minutes, but they are about transparency around these models, and they can be used to transfer that knowledge to the community. And all these models are exposed through a single API gateway, so that external tools and all the projects can integrate with this machine learning platform.
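For example, a client asking Lift Wing for a revert risk score sends a request along these lines. I am writing this as a sketch from memory; check the published API documentation for the exact model names and payload before relying on it:

```python
import requests

# Sketch of a Lift Wing inference call; verify the endpoint and payload
# against the published API documentation.
URL = ("https://api.wikimedia.org/service/lw/inference/v1/"
       "models/revertrisk-language-agnostic:predict")

response = requests.post(URL, json={"rev_id": 12345, "lang": "en"}, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. a probability that the edit will be reverted
```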
There are many other machine learning use cases; I'm going to list a few. I want to make clear that this is not the full list of machine learning projects we have. There are machine learning projects by the Foundation, and there are also many machine learning tools by our volunteers, bots, and several services, so I need to acknowledge that this is not a complete list; these are just some examples I chose for this presentation.

We use machine learning for topic classification, which helps us study the knowledge gap: given an article, it tells you which topic the article belongs to. By measuring this, we can say that some topic is lacking a lot of articles, or that some topic has more articles than others. That's one thing. Second, there is a language identification system: given a snippet of text, it tells you which language it is written in, and it supports 200-plus languages. We also have a mobile version of the Content Translation tool where you can translate articles; it works by translating one sentence at a time, and it translates sections between articles. We want to know whether a certain section is present in one language but missing in another; the section title might be different, and the content might be different, so we need to know whether two sections are talking about the same thing before we suggest translating that section. We use machine learning for that as well.

There are also a lot of machine learning tools that are not in Wikimedia Foundation infrastructure; we collaborate with several third-party vendors for these. I already mentioned Google, Yandex, and Elia for machine translation. For text-to-speech, we are trying to see whether we can have our own text-to-speech engine covering more and more languages. Machine vision is used for "depicts" statements: given an image, tell what is in the image. We use the Google OCR API and Transkribus for optical character recognition. For content moderation, we use Turnitin to detect plagiarism, and we use named entity recognition to find out which Wikidata entities are mentioned in a given paragraph from Wikipedia. So these are some of the third-party machine learning services we currently use. Not a full list, but some examples.

Finally, I want to talk about model cards. This is all about transparency in how we use machine learning inside Wikimedia. For every model we use in the Wikimedia Foundation, we want to be transparent with the community. We want to document the use cases: who are the users of this model, and what are its potential use cases? Which data was used for training it? Its ethical considerations, if any; we need to be transparent about those. Who developed the model, who is the current owner, and who is the contact person for it? The license, and the model architecture, so that people can understand how it works. This was briefly mentioned in the last session too; it's a recommendation from leaders in the machine learning industry about being transparent about how machine learning models are used in production and how they help people. So this is our effort to document all the machine learning tools we use. You will see model cards like this on Wikimedia; if you go there, you will see a lot of details about every machine learning tool we are using.
Okay, here is an example. All right, I think that's all I have today. I hope it was informative. We have only about two minutes if there are questions. I will see if there are questions online or from here.

Hello. Hi. I'm Diego from Spain. I take a lot of pictures. I have a question about the translation tool, which I have been using for a while. If you don't make many changes in the text, it will not be accepted; you had a screenshot about this. I was wondering when you will decrease the amount of text that needs to be changed for a translation to be accepted, because as the translation engines get better and better, the need to change something in the text gets lower and lower. So right now, what I'm doing is making artificial changes in the text so that it gets accepted, and later on I have to take them out.

Right, got it. Yeah, that's a good question. As the quality of machine translation improves, you might not need to make any changes at all to the machine translation provided by the system. We have been getting this feedback that we need to adjust our thresholds to allow these kinds of use cases, and we are already doing this: based on community requests, communities have asked us to relax the limits, and we are accepting that. It's very simple to do; we will increase the limit so that you will be able to publish with fewer changes than before, because that also reflects the advances in machine translation quality in newer and newer versions of the machine translation engines. We can do that; just file a ticket and we'll do it.

Okay, thank you. A second question: you mentioned that it's possible to choose the engine for the translation. How can you do that?

In the screenshot I was showing, in the tools on the right side there is a dropdown where you can choose which machine translation engine you want to use, Google, MinT, or you can even choose not to use any machine translation engine and write the translation from scratch. It's there in the tools, in the third column of the screen.

Okay, and is there somewhere an overview of which engines translate which kind of text better than others?

I didn't get that question.

You know the engines, but a normal user doesn't know which engine is better for which kind of translation.

That's a good question: which model to choose, which machine translation system to use? We ask the community to tell us which one should be the default. The community tells us after doing a lot of evaluation, and then we make that the default. If there are multiple choices, the default choice will always be the free and open-source one; that's based on our policy. But if the community says one system is better than another, we will make it the default. And once you choose an engine, you can also make that choice apply to all your future translations; that's possible too.

Okay, thank you. Thank you.

I think we are almost at the end of the time. Thank you for attending, and thank you to the people watching this online. Have a good day.

Thank you, Santosh. Everyone, we've got some lovely bento boxes for lunch at the back here, and come back to join us at 1:45 for the next session.