All right, good morning everybody, and welcome to the March edition of the Wikimedia Research Showcase. We're here in San Francisco on a rainy and sad day, but I'm going to warm your hearts with some amazing research, and we have a crowded room here. There are some special guests: we have Robert West from EPFL, we have Michele Catasta from Stanford, and of course the rest of the crew here. I'm delighted to host two presentations today that are part of a major program the Wikimedia Foundation research team has been conducting in collaboration with a number of institutions, including EPFL and Stanford. Tiziano Piccardi from EPFL is going to give the first presentation today, about the methodological challenges in using the category network in Wikipedia to extract meaningful relations. This is going to be followed by a presentation by our own Diego, who is going to present the second part of this work, about section alignment across languages in Wikipedia. All of this is part of a major program that we're investing in with these partner institutions to try to understand how to design recommender systems for article expansion. So I'm really excited to have you here. As usual, some house rules: the presentation is going to be broken down into two parts of 25 minutes, each followed by a short Q&A. Please stick around at the end of the showcase; we're going to have, as usual, our final open discussion at the end of the two presentations. Miriam Redi, who is here on the hangout, is going to be our host on IRC, so if you're following the discussion on the #wikimedia-research IRC channel, please ping her with your questions and she will relay them to the room. So with that, I think we're ready to get started, and Tiziano, the stage is yours. Okay, let me check the screen. Okay.
Hi everyone, and thank you for the opportunity to present this research today. As mentioned, this is part of a large project about section recommendations. I work on this project with Michele Catasta from Stanford University, Bob West, my supervisor at EPFL, and Leila from the Wikimedia Foundation. Today we are going to talk about taxonomies, and in particular how to clean the category network to extract a taxonomic structure from the categories. Let's start with the definition of a taxonomy. A taxonomy is a hierarchical organization of concepts. Typically these entities are linked with an is-a relation, like in this small example of a taxonomy where we have animals, mammals, and felines, and you can see there is an is-a relation between felines and mammals, and between mammals and animals. This is a very intuitive and natural way to organize concepts and to reason about and model the world. In taxonomy terminology, people typically use the terms hypernym and hyponym: a hypernym is a parent of a concept, and hyponyms are its children, so in this example mammal is a hyponym of animal. Taxonomies are useful in many different domains. For example, in image classification people use taxonomies, for instance with satellite images, to bind different levels of zoom to different levels of the hierarchy, so a forest could be labeled as a green area at a different zoom level. Another typical context is the medical domain, where people can model diseases and symptoms in a hierarchical structure and run automated reasoning tools on top of it, to find similarities between concepts and to generate new knowledge. Another application is question-answering bots: a program can infer new knowledge from the model and answer a question about an entity that is not explicit in the model.
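The transitive is-a inference described above can be sketched minimally as follows. The taxonomy data and the `is_a` helper are purely illustrative, not the project's actual code:

```python
# Minimal sketch (hypothetical data): a taxonomy stored as a child -> parents
# map, with transitive is-a inference by walking up the hierarchy.
TAXONOMY = {
    "domestic cat": ["feline"],
    "feline": ["mammal"],
    "mammal": ["animal"],
}

def is_a(concept, ancestor, taxonomy=TAXONOMY):
    """True if `ancestor` is reachable from `concept` via is-a edges."""
    frontier = [concept]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node == ancestor:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(taxonomy.get(node, []))
    return False
```

So `is_a("feline", "animal")` holds even though that edge is never stated explicitly, which is exactly the kind of inference the talk's examples rely on.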
Let's see an example to make this clearer. Again, we have a small taxonomy about animals. In the example labeled as number one, by following the transitive is-a relation we can infer that a feline is an animal, even if this is not explicit in the model: you can just follow the path of is-a relations and infer this new knowledge. The second example, related to question-answering bots, is the one labeled as number two. We have a statement that all animals must eat to live, and this is a property of the top-level entity, animals, as you can see on the left. If we ask the bot a question like "Does Garfield have to eat?", the bot can check that Garfield is an instance of a domestic cat and, by following the path of is-a relations, can infer the answer to the question, even if, again, this is not explicit in the model. But we are interested in Wikipedia, and one nice application of taxonomies is section recommendation. Leila presented this problem in the showcase of December, but to give you an intuition: basically, we are developing a recommender system to provide recommendations to editors about possible missing sections in an article. This is the interface envisioned for the next generation of the Wikipedia editor, and in this case we basically want to generate recommendations based on the concept. So if it's a biography, we want to extract a set of sections relevant for a biography. Let's see how this can work in practice. In this example we have people, scientists, and actors. Scientists and actors are children of people, and scientists and actors have a set of frequent sections, in particular "Early life", "Personal life", and "Death". As a specialization, for scientists we have "Awards", and for actors we have "Filmography". But if we go up and search for a common structure, we can generate a template composed of "Early life", "Personal life", and "Death" that is representative of the biography of people in general. The problem with this nice application is that generating a complete taxonomy manually is not feasible, because in Wikipedia we have many, many languages and potentially thousands and thousands of concepts, so we should find a way to automate this task. The good news is that we have the category network. I'm sure you are familiar with the concept of a category in Wikipedia, but let's see an example. In this case we have the page of Claude Picasso, the son of Pablo, and at the bottom of the page there is a list of categories assigned to this article; in this case we have "French artists", "French photographers", and "French journalists". If we click on "French artists" we can see the details of this category, and we have subcategories and parent categories. Potentially, we can see the subcategories as a set of hyponyms of the entity "French artists", and the parent categories as hypernyms. So it is very tempting to use the category network to generate a taxonomy. Unfortunately, it is not so easy, and people who work with the category network know it, because the category network is community-generated and it contains a lot of noise, in terms of both links and entities. Let's see a couple of examples of the noise we can find in the category network. One is that the network has loops: for example, if we check the category "Government" and we go up the hierarchy by following the path "Public administration", "Public economics", "Economic policy", we reach again the same original category, "Government". In our implementation we used the algorithm described in the paper "A fast and effective heuristic for the feedback arc set problem".
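A rough sketch of that greedy feedback-arc-set heuristic (Eades, Lin and Smyth) is shown below, assuming a tiny edge list; the graph data and helper names are illustrative, not the project's implementation:

```python
# Sketch of the greedy heuristic for the feedback arc set problem: order the
# vertices by repeatedly peeling sinks (to the right) and sources (to the
# left), otherwise taking the vertex maximizing out-degree minus in-degree.
# Edges pointing "backwards" in that order are the back edges to remove.
def greedy_vertex_order(edges):
    nodes = {u for e in edges for u in e}
    out_ = {n: {v for u, v in edges if u == n} for n in nodes}
    in_ = {n: {u for u, v in edges if v == n} for n in nodes}
    s1, s2, remaining = [], [], set(nodes)
    while remaining:
        sinks = [n for n in remaining if not (out_[n] & remaining)]
        for n in sinks:
            s2.insert(0, n)
            remaining.discard(n)
        sources = [n for n in remaining if not (in_[n] & remaining)]
        for n in sources:
            s1.append(n)
            remaining.discard(n)
        if remaining and not sinks and not sources:
            n = max(remaining,
                    key=lambda n: len(out_[n] & remaining) - len(in_[n] & remaining))
            s1.append(n)
            remaining.discard(n)
    return s1 + s2

def back_edges(edges):
    """Edges pointing backwards in the greedy order; removing them yields a DAG."""
    pos = {n: i for i, n in enumerate(greedy_vertex_order(edges))}
    return [(u, v) for u, v in edges if pos[u] >= pos[v]]
```

On a pure cycle like Government -> Public administration -> ... -> Government, this flags exactly one edge to drop, which is all it takes to break the loop.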
Basically, the feedback arc set heuristic is a way to order the graph topologically, based on the in- and out-degrees of the nodes, and once we have the graph sorted we can remove the back edges to convert the network into a directed acyclic graph. A second problem with the network is that not all parents represent a taxonomic relation. Going back to the previous example, "French artists" has a parent category called "French art", and this is for sure not an is-a relation, because we cannot say an artist is an art; even if the two concepts are related, you cannot include this in a taxonomic structure. This is an active research topic: many researchers are trying to extract or infer taxonomies using the category network. The most recent works are MultiWiBi and the head-based taxonomy approach. In both cases they start from the English version of Wikipedia and extract a taxonomy for English: in the case of MultiWiBi using two taxonomies, one for the pages and one for the categories, and in the case of the head taxonomy based only on the category network. But in both cases they use linguistic features to clean the network; for example, the head-taxonomy approach uses a heuristic based on plural head nouns to infer that a category is likely to be a taxonomic entity. These two approaches work quite well, but they have a different goal compared to what we want to do, because they aim for high precision on a taxonomy with a clean ontological structure. In our case, we are more interested in having a high recall on the category network, because in the case of section recommendations we want to generate the section recommendations based on the categories of the article.
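To make the plural-head heuristic from that related work concrete, here is a very rough sketch. It is not part of our method, and real systems use a proper NLP pipeline to find the head noun; the naive last-word and "ends with s" tests here are illustrative only:

```python
# Naive sketch of the plural-head heuristic: a category whose head noun is
# plural ("French artists") likely denotes a class of entities, while a
# singular head ("French art") likely denotes a topic.
def head_noun(category):
    """Assume the head is the last word of the category name (simplification)."""
    return category.split()[-1].lower()

def looks_taxonomic(category):
    head = head_noun(category)
    # "ends with s but not ss" is a crude plural test, illustrative only.
    return head.endswith("s") and not head.endswith("ss")
```

This immediately separates "French artists" (likely a class) from "French art" (a topic), which is exactly the distinction the cited works exploit.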
This means we want to maximize the number of recommendations we can generate from one article. For this reason, for our specific use case it is more important to keep the original topology of the network and only remove what can add noise to our recommendations. So let's see how our method works in practice. Let's assume we have some generic high-level types for each article in Wikipedia. If you think about a taxonomic structure, articles in a taxonomic category should share the same types. For example, if we have the category "Scientists", what we expect is to have only articles about people. In our case we use DBpedia to assign the type labels to the articles, but one could use Wikidata or other semantic knowledge bases. From DBpedia, we selected the 55 top-level types at the top of the DBpedia hierarchy; these types include people, organizations, events, works, so very, very generic types. Now we can introduce the concept of purity. We can define a category as pure if the distribution of types is homogeneous. In the case on the left, we have a category where the five articles are of the same type. We can visualize this with a histogram, where we see five elements of type "square" as part of this category. This is the case of "Scientists", where we have only people among the assigned articles. The other possibility is a non-pure category, on the right, where in the visualization we can see a distribution that is more heterogeneous. For example, the category "Football" is not taxonomic, and in our case we define it as not pure, because in "Football" we have people (players), events (matches), and organizations (it also contains teams). The idea, at this point, is to run a bottom-up approach to clean the network by using this concept of purity, and remove the categories where different concepts converge and create an impure category. In this example we have again "Football", with "Football people", "Football venues", and "Football teams": the bottom part is composed of pure categories, and "Football" is not pure. This is a real example taken from the real category network. So let's see in detail how this algorithm works. Imagine we have a tree, a small graph. Let's start with a leaf of the graph, the category on the bottom left of the slide. We can visualize its type distribution with a histogram, and we can see it is pure because we have only three articles, all of the same type. So we mark this as pure and we go to the next category; again we have the same structure and the same kind of distribution, and we can mark this as pure too. Then we can go up in the hierarchy and check the distribution of the parent category by considering all the articles in this subtree, because if the is-a relation is respected, they must be of the same type. So again, we have here a pure category: this could be the case of "Football people". Now let's move to the other leaf. Again we have a pure category, even if the distribution is based on a different type; this can be the case of "Football venues", a pure category with a type different from "Football people". Now we can move up, and we reach the category that in the previous example was "Football", where we have a very heterogeneous distribution, and here we can mark this category as not pure. In this case, we can decide to remove the category.
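A minimal sketch of this bottom-up purity pass is below. The category tree and article types are hypothetical, and purity is decided with a Gini-coefficient threshold over the subtree's type distribution, which the talk introduces next; the four-type universe and the 0.7 threshold are illustrative only:

```python
from collections import Counter

# Hypothetical data (the real method uses the Wikipedia category network and
# the 55 top-level DBpedia types).
CHILDREN = {"Football": ["Football people", "Football venues"],
            "Football people": [], "Football venues": []}
ARTICLE_TYPES = {"Football": [],
                 "Football people": ["Person", "Person", "Person"],
                 "Football venues": ["Place", "Place"]}
TYPE_UNIVERSE = ["Person", "Place", "Event", "Organisation"]

def subtree_types(cat):
    """Types of all articles in the category's subtree (bottom-up aggregation)."""
    counts = Counter(ARTICLE_TYPES[cat])
    for child in CHILDREN[cat]:
        counts += subtree_types(child)
    return counts

def gini(values):
    """Gini coefficient of a count vector (0 = evenly spread, high = concentrated)."""
    vs = sorted(values)
    n, total = len(vs), sum(vs)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vs))
    return 2.0 * cum / (n * total) - (n + 1.0) / n

def is_pure(cat, threshold=0.7):
    """Pure when one type dominates the subtree's distribution (high Gini)."""
    counts = subtree_types(cat)
    return gini([counts.get(t, 0) for t in TYPE_UNIVERSE]) >= threshold
```

With this toy data, "Football people" and "Football venues" each come out pure, while "Football", whose subtree mixes Person and Place articles, does not.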
Or we can remove the edges; it depends on the application. This method is more like a framework to clean the category network: you can use it for different purposes, depending on the use case. Consider that if you remove a category, you are not disconnecting the graph, because typically in the category network a category has multiple parents. For example, on the left side, "Football people" might have an additional parent, "Sports people", that is pure and that keeps the subtree connected to the taxonomy. The problem is that in the real world it is not possible to have perfectly pure categories, because of the noise; again, the network is community-generated, and some people assign categories to articles even if they are not completely related. For this reason, we are more interested in marking as pure the categories with a highly unbalanced distribution, like in this case, where we have a skewed type distribution with a high prevalence of people, and we can assume this category is pure. To measure the unbalancedness of the distribution, we use the Gini coefficient. The Gini coefficient is a metric typically used in economics to measure the inequality of wealth in a country; in our case, we want to compute the Gini coefficient of the type distribution and set a threshold to decide when a category is pure and when it is not. In this slide we have two examples: the top histogram could represent, depending on the threshold, a pure category, while at the bottom there is another distribution that is more heterogeneous, and if its Gini coefficient is below the threshold it would be marked as a non-pure category. Now you may be wondering how we select the threshold. Again, this depends on your specific use case. In our case, for section recommendations, we used a manually annotated dataset of seven hundred samples, where we annotated a set of transitive is-a relations, like "British scientists" is-a "People": is it a person?
Yes, so that is a positive example. And "Gaelic games ground" is-a "Football"? No, so that is a negative example. We used the transitive expansion of the is-a relation to basically mark as negative the paths that should not exist in the taxonomy. Then we pruned the network at different thresholds of the Gini coefficient, computed precision and recall using this annotated dataset, and based on these we decided the best value to get a balanced trade-off between precision and recall. In conclusion, we presented a language-independent framework to clean the category network. This is more of a framework because it's a sort of parametric method, and you can customize it for your specific use case. For example, you can choose different types from different data sources: you can use Wikidata, DBpedia, or other semantic knowledge bases. You can use a dynamic threshold: we didn't do it in our case, but you can, for example, choose the threshold for the Gini coefficient based on the size of the category in terms of articles, or on its depth in the category network, because categories at the top of the hierarchy are more likely to receive noise from the bottom categories. An additional customization is the way you select the threshold for the Gini coefficient: we used a manually annotated dataset, but potentially you can use a different method based on a different objective function; for example, you can optimize the threshold for a specific use case using a validation set. And finally, you can decide to delete categories or remove edges, again based on your use case, depending on whether you want to keep a super-high recall or prefer precision. Again, I repeat, this is part of a bigger project about section recommendations, so if you are interested in more details, there is the link to stay updated. Thank you very much. Thank you very much, Tiziano, for the presentation.
I think we have a few minutes for questions. Let me first ask Miriam whether there's anything she can relay from IRC. Yeah, I think you're the first in the queue, then. Sure. I was already familiar with this work, so it's really nice to see it presented in such a clear way. I have a question about something I've been thinking about for a while, which is the robustness of categories as a function of who added them. I don't have any good data to decide here, but I imagine that in some cases there might be batch additions of categories at large scale to an entire set of articles, performed by a single bot or a single user, just because they decided it was a good idea to add that category to thousands of articles. Conversely, I imagine there are articles where the addition of a category has been done more organically by a large set of users. I was wondering if that could be used as a signal that some categorizations are more robust than others, which are basically just someone going bananas. Have you thought about something like this? Yes. In our specific use case we didn't run into this problem; for section recommendations, what I presented was enough. But of course, you could apply additional checks on top of the Gini coefficient if you want to be more robust to these specific cases. Again, what I presented is more like a framework to clean the network, and you can plug in different logic to manage specific use cases, or this specific kind of noise. Let me jump in here briefly. I think this is a great point.
It also points at a whole different research direction, namely building a taxonomy of category usage: what are even the use cases for which categories are used on Wikipedia? Because it's not only this kind of pseudo-taxonomy, which is what we would ideally want; people also use categories to organize their work. There are categories that are called stubs, stubs around universities or something like that, so that shows you that there's this whole meta level. Characterizing the use cases, and whether they are successful or not, would be super useful in order to understand whether the category system is already good, or whether it should be redone or revised in some way. Yeah, I agree; there's a bit of an edit-persistence direction that could be applied to categories as well. Yeah, that makes sense. Cool, I want to open it up if there are other questions now. Do you have anything from IRC? Not for the moment. Okay, we're right at the half hour, so we can move on to the second presentation. Thanks, Tiziano; stick around until the end. The second presentation is by Diego, and it's going to be about the second big part of this project, section alignment. Diego, the stage is yours. Yes, okay. So hello everybody, and thanks for attending this talk. Today I will talk about aligning sections across multiple languages, beyond automatic translation. Hopefully... I'm sorry, I have to interrupt: can you press present? It's pressed. We are not seeing it yet on the screen. Okay, wait, okay. What about now? Just give us a second. No, not yet. Okay, let me stop sharing and share again. Share screen. Okay. Now, yes, now it's working. Thank you.
Okay, right. So, as I was saying, I will present this work about aligning sections across multiple languages. This is joint work with Tiziano, Bob, and Leila. The main idea of this part of the project is to create a dataset with section names aligned across languages. By alignment we don't mean only translation; I will explain the difference between translation and alignment, and why this matters. This has many different applications. In the context of this project, the most important for sure is doing cross-lingual section recommendations: if someone is writing an article in one language, and this article already exists in many other languages, we can use the sections there to recommend sections in the new language. In general, for languages for which we don't have much information, as everyone who builds recommender systems knows, not having much data is a problem. So if we can connect all the languages, then for every single language you will have the knowledge from all of Wikipedia, which would be very useful. For sure it is also useful for improving the content translation tool, and in general it's a good step towards a general and abstract ontology for section titles, so we can later think of our sections as items rather than just titles; this has many applications for understanding content across languages. So what do we mean by alignment? Here, for example, you have six languages and you have two section names, or section titles, in each language, and what we want to learn is the mapping between these sections.
For example, here you can see that "Premios" in Spanish is the same as "Awards" in English; I was learning how to say it in Russian, but now I forgot, but you see it there in Russian, and in Japanese and Farsi. And on the other side, you have the alignment for biography: "Biografía", and the same in Russian, Japanese, and Farsi. So you can say: okay, that's just translation, no? Why not simply use automatic translation services? Can we trust them? Anyone who has used any of the publicly available translation tools knows that they have limitations. They're far from perfect: usually good for understanding, but they don't produce excellent, human-level translations. Also, doing some tests for our specific use case, we found that the accuracy depends a lot on the language pair. For example, we tested English to Spanish and English to Farsi, and the quality of the translations was very different: while English to Spanish was acceptable in many of the cases, for English to Farsi the results were much worse. Moreover, this problem is a bit different from translation, because there are some constraints, or challenges, that we have. We want, for example, to keep the style and conventions of each Wikipedia edition: there are ways that people write sections that are not necessarily translations of each other across languages, and we want to keep that; we don't want one language dominating the style of the other languages. For example, in English you will find that "Notes" and "References" are usually two sections, while in French this is usually one section with these two things inside. The other important thing is that we want to give equal importance to all languages, so we don't want to use pivot languages, meaning that we don't want, for example, to translate every language to English and then from English to the other languages; we want to use all the information possible. So if we have information in 200 languages, we want to use that information to learn the mapping for language 201. And the other big challenge, considering these two constraints, is building the ground truth: given that we don't want to use a pivot language, we need to build a ground truth from people who speak all pairs of languages. For example, if we wanted to align Albanian and Bengali, we would need to find someone speaking these two languages, with encyclopedia-writing quality. But alongside these constraints and challenges, we also have assets and opportunities. We have this very active and committed community that is interested in these kinds of things in Wikipedia; we have people self-reporting their language skills with the Babel template, which I will explain in a minute; and we have entity links across languages: thanks to Wikidata, we can know that a page talking about global warming in Spanish is about the same entity as pages in French, Bengali, Arabic, etc. The other good part is that, since many people in NLP now use Wikipedia data, we already have pre-trained models; there is a big set of different pre-trained models trained on Wikipedia data across different languages. In particular, today I will explain a bit how we use the models from the Babylon project. So let's start with what we are doing to build the training dataset. I forgot to say: this is work in progress, so this is what we are doing exactly now. The first thing we did is try to build this training dataset, and we put some constraints on what we want to have in it: basically, select a set of languages with different scripts, not just Latin but Cyrillic, Arabic, and others, and coming from different families, not just the European ones, but other families of languages.
On the right-hand side you can see the languages; if you don't speak all of them, maybe this is easier: we will consider Arabic, Russian, French, Spanish, English, and Japanese. We will use these for learning how to do the mappings, and then we want to apply it to all the other languages. So even if we don't have someone speaking Albanian or Bengali, we will learn from here and we will try to transfer it to those languages. So what is Babel? Babel is a template where people can self-report their language skills: basically, they annotate from zero to five, with five being the most proficient and zero meaning that you don't speak the language. So people can say which languages they speak and at which level; then we can go to the public query service, query there, and get the usernames of all the people that self-report that they speak these languages. Surprisingly or not, for the six languages that we selected, we found some overlap for all of them. Most of them speak English, but we also found people speaking Arabic and Japanese, or French and Russian, or French and Japanese, or Japanese and Russian, etc. These are the people we are contacting now, through their user pages, to ask them to help us build this ground truth; this is what we are doing exactly now, and later on I will give some pointers if you want to help us with that task. So what exactly is the problem definition? Basically, given two sections s1 and s2, where s1 and s2 are from different languages, we want to know whether s1 is the translation, or the mapping, of s2, and repeat this across multiple languages. We can consider this a link prediction task.
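The link-prediction framing can be sketched as follows: nodes are (language, title) pairs in a k-partite graph, and candidate edges exist only across languages. The section titles are hypothetical, and scoring candidates with a normalized edit-distance feature here is purely for illustration; the real system combines several features (translation, embeddings, co-occurrence counts), as discussed next:

```python
# Hypothetical section titles per language.
SECTIONS = {
    "en": ["Awards", "Biography"],
    "es": ["Premios", "Biografía"],
    "pt": ["Biografia", "Prêmios"],
}

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """1.0 for identical strings, lower the more edits are needed."""
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

def candidate_edges(sections=SECTIONS):
    """All cross-language pairs, each with an illustrative similarity score."""
    nodes = [(lang, t) for lang, titles in sections.items() for t in titles]
    return {((la, ta), (lb, tb)): edit_similarity(ta, tb)
            for la, ta in nodes for lb, tb in nodes if la < lb}
```

As the talk notes, the edit-distance feature is most informative for close languages; "Biografía" (es) and "Biografia" (pt) score near 1.0, while distant scripts contribute nothing and need the other features.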
So if you are familiarized with the graphs This will be a multi-part graph or K party graph which Where each key group it corresponds to a language and Notes are sections titles and the links represent the the translation or the mapping Just a quick reminder if you don't remember exactly what is a key party graph. So a Key part graphs are graph Was a vertices are or can be partitioned in two different independent sets? So the most usual maybe you have here is a bipartite graphs So typical example of this is you have a graph where you have notes that are actors and movies and So you will have linked between actors and movies But not between movies or between actors in our case the key will be equal to the number of languages in the in our data set for for this training data will be equal to six and As you can see in the right that would mean that we'll have mappings from one language to other and not inside within the language Whether the features that we will consider for this we will use automatic translation for sure It's not perfect, but it's a strong signal that we can use We will use the Levenstein distance or also known as edit distance that basically measures the difference between the different characters between two strings and this is especially Useful for languages that are close like our Portuguese in Spanish either Italian and Portuguese and Or Dutch to German, etc. Also We're planning to use the upcoming links within the sections. So basically where they are linked to But the two things that I will explain today. They are the most innovative and interesting I think it's so we will use word embeddings across different languages and they what we call the co-currence count So an embedding what is an embedding? 
Maybe you have heard about this; it has been kind of trending in recent years in the NLP world. Basically, when you embed an object, you take a set from one domain and project it into another domain, preserving some structure that you are interested in. More specifically, a word embedding is when you take words and project them into a vector space, and in this vector space the structure that you preserve is the semantic structure: two words that are close in the vector space should be close semantically. Here I'm projecting, using t-SNE, these section titles; you see them in two dimensions, which is what the t-SNE algorithm produces, but in fact we are using vectors of three hundred dimensions. You can see that, for example, you have some clusters that are pretty clearly related, like "Career" and "Carrera"; in the top right you see "Bibliography", "Filmography", "Discography", "Publications", "Works", "Gallery"; and at the bottom you see "Awards" and "Honors". This is very useful. Usually it's done in one language, and you can then measure the distance between two vectors, for example using cosine similarity: two words that are close in the vector space are semantically close, semantically related. The point is what happens if you do this in multiple languages. The embeddings we're using in this case are fastText; maybe you have heard about word2vec or GloVe, but in this case we're using fastText. If you train them in different languages independently, you won't get these characteristics across languages, because similar words won't be close in the vector space. Here you see that, just taking the cluster about bibliography and filmography, on the left you have the section titles in Spanish, and you see that both languages generate clusters, but the clusters are not close to each other. This is where the Babylon project comes into play. These people basically do alignment between the vector spaces of different languages: a dictionary-based alignment. They give the model information that some points are the same, for example using a dictionary or proper nouns, and with this they perform a linear transformation, which allows putting a pair of languages into the same space. After applying this linear transformation, you can see that all the words are clustered together. The alignment is not perfect, just as the embedding itself is never perfect, but now we have a clearer sense of distance, which gives more information about semantics. So this is one of the main things that we're using to understand whether two titles are translations, together with the other features, plus the automatic translation, and also what we call the co-occurrence counts. Basically, Wikidata allows us to align articles: we can know that an article is talking about a Wikidata item, which can be a biography or any item of Wikidata, a place or whatever. So we can align the table of contents of every single article in every single language, and if an article exists in two languages, we can count the co-occurrences. Basically, we learn that in the same article where "Personal life" appears, "Primeros años" usually appears in Spanish, or "Publications" and "Works", or "Publications" and "Publicaciones".
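The co-occurrence counting over Wikidata-aligned tables of contents can be sketched as follows; the Wikidata IDs and tables of contents below are hypothetical data:

```python
from collections import Counter

# Wikidata item -> {language: table of contents} (hypothetical data; in
# practice this comes from aligning articles via their Wikidata item).
TOCS = {
    "Q101": {"en": ["Early life", "Works"], "es": ["Primeros años", "Obras"]},
    "Q102": {"en": ["Early life", "Awards"], "es": ["Primeros años", "Premios"]},
}

def cooccurrence_counts(tocs, lang_a, lang_b):
    """Count, over aligned articles, how often each cross-language
    section-title pair appears in the same article."""
    counts = Counter()
    for langs in tocs.values():
        if lang_a in langs and lang_b in langs:
            for sa in langs[lang_a]:
                for sb in langs[lang_b]:
                    counts[(sa, sb)] += 1
    return counts
```

Title pairs that keep showing up together across many aligned articles, like "Early life" and "Primeros años", become strong alignment candidates.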
Doing this co-occurrence counting turned out to give good information about the alignment.

As I said, this is ongoing research, and we are contacting people who speak one or more of these six languages. If you are one of those people, please help us: you have the pointers here, and if you cannot note them down now, send us an email and we will put you in contact. You also have the Phabricator task at the bottom.

So that is the mapping. But to do a good mapping we also need to understand synonyms within the same language: we want to know when two section titles mean exactly the same thing, or partially the same thing. This is very important when you are doing the alignments, because it helps to produce better alignments and to get a better understanding of this hierarchy of sections, this graph of sections. So this is an intra-language problem: given two titles from the same language, we want to know whether they are synonyms. One interesting thing we learned here is that sometimes you have exact synonyms, with exactly the same meaning, but many times what you have are partial synonyms: two section titles that are usually used in similar ways, whose meanings overlap but are not identical. To detect this we are again using the embeddings, the monolingual embeddings you already saw, and again the edit distance. The edit distance is very useful, and it relates to the third feature, which is subset detection: many times one section title overlaps with another in the title string itself. "Education and background" overlaps with "Education" and with "Background".
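A rough sketch of the string-based features, edit distance and subset detection, might look like this; `difflib`'s ratio is a stand-in for whatever edit-distance variant the project actually uses:

```python
from difflib import SequenceMatcher

def edit_ratio(a, b):
    """Normalized string similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_subset(a, b):
    """True if every word of one title appears in the other, e.g.
    "Education" is contained in "Education and background"."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta

print(edit_ratio("Discography", "Discografia"))               # high: near-identical spellings
print(token_subset("Education and background", "Education"))  # True
print(token_subset("Awards", "Band members"))                 # False
```

High edit similarity or a token-subset relation flags a pair as a candidate synonym, which the other features (embeddings, co-occurrence) can then confirm or reject.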
This is very useful for spotting synonyms, but the feature I want to explain today is what we call TF-IDF similarity. Our intuition is that two sections are likely to be synonyms if both tend to co-occur with similar sections but almost never co-occur with each other. For example, "Background" and "Education" will co-occur with similar sections, but it would be rare for them to co-occur with each other. The problem with applying this directly is that the most frequent sections tend to co-occur with almost everything: "References", "See also", and similar sections are present everywhere, so we need a way to discount them. One option is a kind of stop-word removal, basically a blacklist, but some very frequent sections are not equally distributed across all languages, and we don't have such a stop-word list for every language we're working with.

The solution we developed was to represent sections as TF-IDF vectors. If you know what TF-IDF is, think of a section represented as a bag of sections, where the term frequency is the number of co-occurrences of this section with each other section. Then we can apply the standard cosine similarity between a pair of section vectors, and additionally divide or filter by the number of co-occurrences of the two sections themselves. We got very good results with that. Here you can see some examples, considering the TF-IDF similarity only for sections that never co-occur, so the co-occurrence between them is zero.
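The TF-IDF representation can be sketched with invented co-occurrence counts; note how the ubiquitous "References" column gets an IDF of zero, so it stops dominating the similarity:

```python
import numpy as np

# Invented co-occurrence counts: each section of interest is a "bag of
# sections", counting how often it co-occurs with the column sections.
columns = ["References", "Career", "Personal life", "Discography"]
cooc = {
    "Education":     np.array([90.0, 40.0, 35.0, 0.0]),
    "Background":    np.array([85.0, 38.0, 30.0, 0.0]),
    "Track listing": np.array([80.0, 0.0, 0.0, 50.0]),
}

# Document frequency per column section; "References" co-occurs with every
# row, so its IDF is log(3/3) = 0 and its column is zeroed out.
df = np.sum([v > 0 for v in cooc.values()], axis=0)
idf = np.log(len(cooc) / df)

def tfidf_cosine(a, b):
    """Cosine similarity between two sections' TF-IDF weighted vectors."""
    u, v = cooc[a] * idf, cooc[b] * idf
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(tfidf_cosine("Education", "Background"))     # high: synonym candidates
print(tfidf_cosine("Education", "Track listing"))  # low: unrelated
```

Without the IDF weighting, the shared "References" counts alone would make every pair look similar; with it, "Education" and "Background" stay close while "Track listing" falls away.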
You can see that there are pairs that are very easy to get, but there are also pairs like the last one, "Line-up" and "Band members", that would be difficult even with the embeddings, yet the TF-IDF similarity handles them pretty well.

As I said, we are still working on this. We are now building the ground truth for both section alignment and synonyms. If you don't speak two or more of the languages we're working with but you speak one of them, you can help us with the synonyms; you can again refer to the link posted earlier. And if you speak more than one of these languages, it would be very helpful if you could help us label the alignments. We're testing different frameworks for the link prediction, or link detection. Another thing that maybe won't be part of this particular research, but that we will surely work on, is improving these word embeddings. Right now the embedding alignments are based on a dictionary or synonym list between languages, but with the information we already have in Wikipedia, and especially in Wikidata, we can come up with a more effective way to align embeddings across languages. You have some pointers there, and I'm happy to hear your questions.

Thanks very much for the presentation, Diego. With that, we're going to open the room to questions, as well as anything coming up on IRC. Miriam, is there any question from the channel?
We have a question from the YouTube channel, from Dumisani, asking what language translation or localization engine is used for this project.

As I mentioned, we are not basing everything on a translation engine. The tests we have done so far were with Google Translate and also with Bing, but for the rest of the project we are not tied to either of them, so we might use other engines. I forget the name of the one used by the translation tool, a different open-source one that we might use for production. Is it Yandex, Diego, or no?

I don't think so. I think it's something different, but I don't remember.

It strikes me that this work, aside from the use case we're trying to solve internally, the section-recommendation problem, addresses a much broader problem, and the solution you are designing and the data we're collecting are going to be potentially useful for a much broader set of machine-translation and NLP use cases, right?

Yeah, absolutely.

One comment on that: I think Diego mentioned this in his presentation. The focus here is not just to get a translation that is understandable by the audience, but to get a level of translation that is encyclopedic. This is a highly specialized kind of translation, for the case of an encyclopedia, which is really useful in addition to everything else the machines are considering. Machine translation on the web is not done within context, right? Now we're saying: refine and constrain yourself to Wikipedia, and then what would the translation be?

Yeah. And it relates, I was thinking as you were presenting, to a question we hear in Product a lot: is machine translation sufficient for creating stubs or adding to articles, from one language to another, and what is the reaction of the people who might read that?
So I think this could be very useful, I'm not sure exactly which parts, but it could be very useful to the Product department, so we might want to ask you about that.

Sorry, just to jump in: another potential use case here is Wikidata descriptions, labels and descriptions. I know there is a complex problem of how to map articles to the variety of entities in Wikidata, but it strikes me that for some sections of Wikipedia articles that map to entities, this work could be really helpful to identify good candidates for descriptions or labels for those items. That's just me thinking aloud, but I think Wikidata could also benefit significantly from this work.

Yeah, for sure. Section titles usually carry a lot of semantic power, so if we are able to do this, maybe labels for different types will be easier. And then anything we can do after this: the concept of alignment is very interesting for any cross-lingual work we do in the future.

I have a question. I think you mentioned that in some cases there is not a one-to-one correspondence between languages: some section titles in one language might map to two different things in the other language. A year ago we had an intern who worked on just counting the most frequent sections, and we had some observations like this. One example was "Voir aussi" on the French Wikipedia: it's "See also", but it also covers external links, where "See also" usually works with internal links. And I think the same happens with "Literature", "Bibliography", "Notes", and "References"; there are a lot of different approaches. Does that map into this framework? Can you recommend two different section titles as the translation of one section?

Yes, that will be one of the goals, because if you have two sections mapping to one section...
Let's take the example with French: you have two sections in English and one section in French. If we know that these two sections are not synonyms in English, and we know that "Notes" and "References" are not the same thing, yet both map to one section in French, that means this one section represents two different things in the other language. So it's something we are considering, and the synonym-detection part helps with that too: if we know that two titles are synonyms, we can collapse them into one node and then select one of them, by popularity for example, though there may be other criteria. And if we know that they are unlikely to be synonyms, that helps to distinguish a one-to-one translation recommendation from a one-to-many one.

We have another question from YouTube, about whether there are any plans to support languages not covered by Google Translate, mostly African languages.

Yeah, for sure. As I said, we are not relying only on Google; automatic translation is one of the features we are using, but not necessarily the main one. And we already have what the Babylon project did; I think they have around 90 different languages listed there. And given that this is based on fastText...
...which is an open-source project, we can align to any embedding for which we have information. Basically, if we have a Wikipedia in a language, we can create embeddings for that language and then we can easily, and I do mean easily, align them with the other languages. For languages where we don't have much content, not many pages, the quality will be lower than in a heavily populated Wikipedia, but it will certainly be better than nothing, and it could be useful for recommending the most popular sections.

I think that's all from the channel. Thank you, Diego.

Cool. Can I make one remark? I want to start with the first part of the presentation. Tiziano started working on the project of what we call cleaning up the category network, with all the caveats he talked about, in September 2016, and we have been meeting basically every week to discuss this project since then. There were weeks where we just left after the meeting and I was not sure whether Tiziano was going to come back the next week. I just want to say that he has done an incredible amount of work, on one of the hairiest problems that Bob, Michele, and I have seen in our research careers, and it's amazing to see it reach this level. I'm really looking forward to the data releases we have been discussing, basically empowering others to use what Tiziano has brought to this project. We have also been discussing blog posts and how to give this back to the community, as feedback on the work the community has been doing on cleaning up parts of the category network, with the caveats that Bob mentioned. So thank you for all of that work, to the three of you; but Tiziano, you specifically did a lot of heavy lifting over the past year and a half. And of course...
...to Diego, Bob, and Bahar: this is the project we started, I think, six months ago, and it's been moving fast. It's a really exciting area. I attended WikiIndaba remotely over the weekend, and as Dumisani and others have raised on YouTube, there is a lot of discussion about what our role is in the smaller languages. What do we do if the internet, or the web, is not supporting these small languages? The work we're doing in this space is critical to empowering users, editors, and readers in these languages to have access to the content. So let's push as much as we can in this space and get the results, the data, everything out. Thanks for everything you do.

I also want to ask Michele, while we have the opportunity to have you here, whether you have any additional questions or comments. Sorry for calling on you like this.

Oh, yeah. Can you help me soothe my daughter?

Yeah, I think we'll pass.

One thing I want to mention about the languages that are not well represented in Wikipedia: the quality of word embeddings improves with the amount of text. So I hope that if we can produce medium-quality recommendations in these small languages, this will help people to write more, and as they write more the word embeddings will improve, and with that improvement the alignment will get better and we will produce better recommendations. I think it's very important to work in this kind of space, because it can break the inertia. This happens with everything.
It's a kind of network effect. But even with a few thousand articles we can start working with embeddings, so I think we can do this for almost all the Wikipedia projects. It's a good starting point.

Yeah, and I also want to say that I just retweeted the call for participation in the labeling campaigns. Obviously, if we want to cover all languages, including underrepresented languages, it's really critical that we get multilingual speakers to help us out. So please, if you're watching this and you want to help, check the links that you'll find in the presentation, on Twitter, or on the research list.

Sweet. Anything else, Miriam, from the channel or from the room?

Not really. No, I think we're good.

All right, so I think we're going to wrap it up. Thanks, everybody, for joining; thanks to our speakers Diego and Tiziano, and to everybody else who joined us here, on IRC, and on YouTube, of course. And we'll see you all next month, in April, for the next showcase.