Okay, we're live. Awesome. Hi everybody, welcome to the September edition of the Wikimedia Tech Talks. Today we have Diego Sáez-Trumper, a research scientist on the Research team, talking about how to compare text across multiple languages. You'll find the YouTube stream here, and afterwards we will take questions on the IRC channel or on the YouTube stream. If you do have a question later, maybe because you're watching this not when we broadcast it but afterwards, you can send questions to me directly on IRC, and I will collect and post the answers to our Tech Talks page. I'll have a link here to our Tech Talks page. If you're a member of the technical community, you're more than welcome to propose a talk; we'd love to have you. And without further ado, let's get started. I'll pass it on to Diego.

Thank you, Sarah. Thank you, everybody, for attending today, or for watching this video later. So let me share my screen. I need Sarah to stop presenting so I can share. Okay. Great. I'm down here. So can you see my screen now? Yes. Okay. Thank you.

So today we'll talk about an introduction to cross-lingual word embeddings, or how to compare words across languages using word embeddings. Let me introduce myself very briefly. I'm a research scientist on the Research team at the Wikimedia Foundation. My background is in machine learning, especially social network analysis, graph theory, and coordination. I'm trying to apply NLP techniques to cross-lingual problems, and I'm also starting a project now on disinformation on Wikipedia, so if you are interested in that topic, you can contact me to talk about it. You can find me on IRC or at my Wikimedia email.

Let me explain a bit what we will talk about today. This is an introduction intended for people who don't know much about word embeddings. If you are already an expert on embeddings, I will give a disclaimer here: I'm trying to explain this at a high level, so I will make some generalizations that you might not always find as deep as you would like. Please, if you see anything you don't agree with, send me a comment or ask a question. But the scope of this is a general audience, and what I expect this audience to learn from this talk is basically three things. First, understanding what a word embedding is, how word embeddings work, and what they can be used for. Second, how you can use cross-lingual word embeddings, in which cases they are useful and in which cases they are not. And third, and especially, what you can't do with word embeddings. Many times you show NLP results to people and people say, okay, but this is not completely correct, or it is not amazing. One thing you should keep in mind is that natural language processing is improving a lot, but machines are still not as good as an average human. I will explain why cross-lingual word embeddings can still be very useful, even with these caveats.

So let's start by defining what an embedding is in general; I'm not talking about word embeddings yet, but embeddings in general. The definition you can find on the Wikipedia page for embeddings is: taking a set in one domain and representing it in another domain, preserving some notion of distance.
So what do we mean by a notion of distance? If we are in the domain of words, we are thinking in words, and we want to create an embedding that maps them to the domain of vectors, to a vector space, and the distance we want to preserve is the meaning of the words. That means that words that are close in meaning will be close in the vector space. If you don't know what a vector space is, don't worry, I will go through some examples later, but basically it means that words become numbers, you can measure the distance between those numbers, and a big distance means the words are not similar, while a small distance means the words have similar meanings.

But you can create embeddings not only for words, you can create embeddings for many things. Just some examples: you can create embeddings not just for a word but for a full document, and the distance you want to preserve there is maybe the document's topic. So documents that are close in these numbers, in this vector space, will be close in terms of topics; they will talk about similar topics. You can also create embeddings for images. In our team we have an expert on that, Miriam Redi, who can explain more about this, but basically the idea is the same. You create a vector that represents the image, and if two images have similar content, so if two images are pictures of cats, they should be close in the vector space, but if one is about a cat and the other is about a car, they should be far apart in the vector space.

So that is the general definition of an embedding. In particular, we are talking about word embeddings. As I said, this is transforming words into vectors. We want to transform words into vectors because this allows machines to understand similarity between words. Machines are very good at measuring distances between numbers, but they are not very good at understanding text. So, as I said, the distance we want to preserve is the meaning, the semantic distance between words. And here you have a toy example. Imagine that we are projecting these words in two dimensions. You can see that the word cat and the word tiger will be similar: in the first position you see that they have similar numbers, and in the second position they are exactly the same. Maybe the first position is capturing whether they are an animal, and the second position is capturing whether they are a machine. So you will see that cat and tiger are similar in this vector space, and car and truck will also be similar to each other, but very different from cat and tiger. If this is not clear enough, you can also think in sentences. You can create embeddings not only for single words but for sentences. Two sentences like "this is a great day" and "this is a beautiful day" will have similar numbers, while "open source is great" should have a completely different vector. In this case I'm showing examples in two dimensions; when we work with real word embeddings or sentence embeddings, we work with many more dimensions. I will show some details later. But it's good to think in two dimensions, because in two dimensions you can visualize things. And this is an example with section headings in the English Wikipedia. Every word or sentence that you see there is a section heading in the English Wikipedia; I'm basically taking the most popular ones here.
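To make the idea of distance in a vector space concrete, here is a minimal sketch in Python. The two-dimensional vectors are invented for illustration and are not taken from any real model; cosine similarity is one common way to measure how close two word vectors are.

```python
# Toy illustration: words as small vectors, with cosine similarity as the
# notion of semantic closeness. The vectors are made up for illustration.
import numpy as np

toy_vectors = {
    "cat":   np.array([0.9, 0.1]),
    "tiger": np.array([0.8, 0.1]),
    "car":   np.array([0.1, 0.9]),
    "truck": np.array([0.2, 0.8]),
}

def cosine_similarity(a, b):
    """Returns a value close to 1.0 for vectors pointing in similar directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(toy_vectors["cat"], toy_vectors["tiger"]))  # high: similar meaning
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))    # low: different meaning
```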
And you can see that, for example, the word awards and the word honors, so the section Awards and the section Honors, are close in the vector space. Or Career and Club, they are close. And you can see more interesting things: for example, here you see Bibliography, Filmography, Discography; everything related to people's publications is close together in the vector space.

So, and this is what I was saying before, what is a vector space? In this case you have a vector space of two dimensions that we are visualizing, so you have the X and the Y. In a perfect world, with an ideal word embedding, each of these dimensions would have a meaning. You could have a dimension saying whether this is a verb or a noun, or another axis explaining whether this is about personal attributes or about, I don't know, some other semantic dimension. In reality it is not like that, and the positions here don't have a meaning by themselves. The fact that one word is at the bottom and geography is at the top doesn't mean anything. What is important is the relationship between words, the distance between them, not the position itself. Going back to this example, it means that this 0.8 doesn't mean anything, and this 0.7 doesn't mean anything on its own; what is important is the distance between these two words.

Why are word embeddings so different from other approaches? Before word embeddings, and still for many things, you can use string similarity. And for string similarity, things like cat and car will be very similar, but for an embedding they are completely different. This is the cool thing about embeddings: it is completely about semantics, not about how similar the strings are. If you are interested in string similarity, you can take a look at other metrics like the edit distance, or Levenshtein distance, which is useful for comparing those kinds of things. With this kind of distance you count how many character edits separate two strings. But as you see with this example, cat and car will be very similar for it, and you are not interested in that, because semantically they are not similar. The other cool thing about embeddings is that, since they are not making this string comparison, embeddings are not script dependent. You can use embeddings with basically any script. I will show you some examples with the Latin alphabet, Cyrillic, and Arabic, and we have used Japanese and Chinese characters. For sure, the morphological properties of each language and how it is written in each script will have some impact, but still, with large enough text you can apply this and get good results even for languages written with symbolic characters. This is one of the coolest things about embeddings: everything is about semantics, the characters themselves are not important.

And why do we want to represent words as vectors? We said that we want to represent words as vectors because machines are very good with vectors but not very good with words. But the cool thing we can do when we have words represented as vectors, as numbers, is that we can do mathematical operations. This is an example, an ideal example, so in reality it might not be exactly like this, but you can do this kind of operation with word embeddings. The most classical example is that king minus man plus woman should be equal to queen.
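As a small aside, here is a minimal sketch of the edit distance mentioned above, just to show why string similarity is misleading for meaning: "cat" and "car" are only one edit apart even though their meanings are unrelated, while "cat" and "tiger" are far apart as strings despite being semantically close.

```python
# Minimal Levenshtein (edit) distance: counts the insertions, deletions and
# substitutions needed to turn one string into another.
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("cat", "car"))    # 1  -> very "similar" as strings
print(levenshtein("cat", "tiger"))  # 5  -> very "different" as strings
```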
You can see this as an addition, or you can see it as a relationship, like a proportion, but the effect is the same. Or you can do cool things with entities too, like taking France, subtracting Paris, and adding Portugal, and that should give you Lisbon. In the first example, what we are capturing is gender. In the second example, what we are capturing is the relationship of being the capital of a country. Or you can even take the tense of a verb: you can take a verb and add "past" and you should get the verb conjugated in the past. For sure, this doesn't always work as well as in these examples, but I will show you some examples with real vectors that show this is not that far from the ideal case.

If you want to play with this yourself, you can go to this link, the word2vec tutorial. word2vec is one of the implementations of word embeddings, and the example I'm showing here specifically works on embeddings trained on a large corpus of Google News. The Google News dataset was released years ago, and some embeddings were trained on it. It is very important to know that to train word embeddings you need a large corpus, so a lot of text, basically. I will explain why later, but you can train your own word embeddings on your own corpus; you just need that corpus to be big enough. That's why embeddings usually work better in languages where you have a larger corpus. But even with some smaller languages, embeddings perform well; not as well as in the very big ones, but they work well.

As I was saying, with this news corpus you can do these kinds of cool things. You can take a company like Microsoft and say Microsoft is to Bill Gates, who is the founder of Microsoft, as Apple is to Steve Jobs. What you see in red here is what the algorithm captures; this is the output. So Microsoft, Bill Gates, and Apple would be the input, and Steve Jobs would be the output. You can go and play with this, and you will also find, I don't know, the biases that are in Google News. You can try to put some sports heroes in one domain: you can say basketball is to Michael Jordan as baseball is to..., and you will find who the dataset, plus the word embeddings, considers to be the equivalent of Michael Jordan in baseball or in soccer. So you can do these kinds of cool things: not only learning relationships that are defined in a dictionary, but also relationships between entities. In this case, the vector we are supposed to capture is the "founder of a company" relationship.

This is the example that I showed before, and this is how it works in reality. So it's not that bad. What you see at the top here is the top candidate; basically this works as a ranking system, and what is shown in the red box is the top one in the ranking. So the embedding considers that France is to Paris as Portugal is to Lisbon, which is the correct answer. But the second candidate was Madrid, and it was pretty close. So here we were basically lucky that the algorithm worked well; it could have made the mistake of thinking that the capital of Portugal is Madrid, or said that the capital of Portugal is Porto, which is not that far off. If you look at them, all four first candidates make some sense: Lisbon is the right one, but then Madrid is a neighbor, Spain is just beside Portugal.
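A minimal sketch of the same analogy queries with the gensim library, assuming gensim and its downloader are installed; "word2vec-google-news-300" is the identifier gensim's downloader uses for the Google News vectors (a download of well over a gigabyte), and the exact candidates and scores you get back depend on that model.

```python
# Sketch of analogy queries against the pre-trained Google News word2vec model.
# king - man + woman ~= queen, and France : Paris :: Portugal : ?
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # downloads the vectors on first use

# "king" - "man" + "woman": the top candidate should be "queen"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The "capital of" relationship: Paris - France + Portugal -> Lisbon (hopefully)
print(model.most_similar(positive=["Paris", "Portugal"], negative=["France"], topn=4))
```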
So we can think that these countries and cities usually appear together in the corpora. Porto is also in Portugal, and São Paulo is much farther away, but it is also in a Portuguese-speaking country, although São Paulo is not a capital. If we go back to the example of Microsoft is to Bill Gates as Apple is to Steve Jobs, the algorithm could also have answered with the same company, Apple. That would be a bigger mistake, but you can see that its probability is much lower. So here it was not that difficult for the algorithm to find that Steve Jobs was the one with the highest probability. And then you also have a lot of noise: you have Apple, which is not a person, you have iPhone, which is not a person, and you have Steve Jobs written with different capitalization, where not all the words are capitalized. What you can take from this example is that the algorithm is completely blind: it doesn't distinguish between Steve Jobs and Apple, there's no knowledge base behind this. This is good to remember, and if you are into machine learning you will understand better what I'm talking about: this is a completely unsupervised approach, so there is no knowledge base behind it.

And you can do other kinds of things. You can learn relationships, or you can try to get the most similar concepts to a concept that you provide. Considering the discussions in the last hours and days, I put in the example of global warming. For this specific case, we see that the most similar concept to global warming is climate change. You will see that it considers climate change the most similar, and then global warming written with a capital G, or with the G and the W capitalized, but it also somehow relates it to greenhouse or man-made global warming. So you are getting concepts similar to the concept you queried, and this is the property that we will use when we work with cross-lingual word embeddings, so please keep it in mind.

But embeddings are far from perfect. As I was showing you, we can say that we were lucky, or that I showed some success cases, but there are many cases where this won't work perfectly. The embedding in this case thinks that the capital of Bolivia is Caracas. As we all know, it's not; Bolivia has more than one capital, Sucre and La Paz, but for sure it is not Caracas, nor Bogotá. Why is this happening? It can happen for many reasons. It might be that the words Bolivia and Caracas appear together many times in the corpus. It can also be that neither of them appears very often; the fewer times a word or concept appears in your corpus, the less accurate the results you get will be. And you will also find some biases. There are some interesting studies showing the biases of word embeddings: you will capture all the biases of the people who wrote or created the corpus. So you will find some sexist examples, you will find some racist examples, because if this was in the corpus that the embedding was trained on, it will be reproduced by the embedding. Remember that machine learning usually reproduces, and under certain conditions can amplify, the biases that are in a corpus. For example, if you have a corpus that is full of typos or misspellings, it's very likely that the results you get will also not be very good. That's why people usually train their embeddings on corpora that are supposed to have some quality.
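The nearest-neighbor query works the same way; a minimal sketch below, again assuming the gensim Google News vectors. Whether a multi-word token such as "global_warming" exists depends on that model's vocabulary, so treat the exact key as an assumption.

```python
# Sketch: nearest neighbors of a concept in the pre-trained Google News model.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

query = "global_warming"  # phrases use underscores in this model's vocabulary
if query in model.key_to_index:
    # Top concepts closest to the query, e.g. climate change related terms
    for word, score in model.most_similar(query, topn=5):
        print(word, round(score, 3))
else:
    print(query, "is not in this model's vocabulary")
```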
It's very unlikely that you will train a good word embedding on Twitter if what you want to capture is formal language. If you want to capture maybe more slang, Twitter can be interesting, but usually people use larger and more carefully written text, like Wikipedia or text written by professional journalists. So also keep in mind: embeddings are far from perfect.

And this is something I already mentioned, but it is super important to keep in mind, and I have already had conversations with people who are trying to apply embeddings and say that embeddings are not working well: remember that embeddings are corpus dependent. If you train your embeddings on Google News, don't expect that they will work well on Wikipedia. Or if you train your embeddings on a specific topic, I don't know, articles about living people, and then you want to apply them to science, it won't work well, because everything depends on the context. It's like the example that people living in places with a lot of snow will have many different words for white. If you train your embedding on their language, you will have a lot of different concepts referring to the color white, but it won't be the same if you train on languages or corpora that talk a lot about, say, forests; there you might have a lot of different ways to talk about green. So this is super important. And the good news is that there are a lot of freely available word embeddings already trained on Wikipedia, so you don't need to train your own embeddings; you can use the versions that already exist for Wikipedia.

And why are embeddings so corpus dependent? This is the only more theoretical part that I will talk about today, but the main concept you should keep in mind is this definition that the researcher Firth gave in the 1950s: you shall know a word by the company it keeps. Even earlier, and clearer but maybe less catchy, is the definition given by Harris: words that occur in similar contexts tend to have similar meanings. And this is exactly how embeddings work.

There are two ways to train embeddings. The first one you will find if you are playing with embeddings is called CBOW, or continuous bag of words, which, given a context, tries to predict the missing word. So if you have "yesterday was a really ___ day" and you want to predict the word there, very strong candidates would be "nice" or "beautiful". Depending on whether the corpus you are using was written by positive people or not, it might instead be a "horrible" or "terrible" day, but the strong candidates will be words like these. CBOW works very well when you are trying to do exactly that: given a context, predicting the word that goes there. There is another approach called skip-gram, which, given a word and a context, tries to predict whether they match. If you take the same example, "yesterday was a really ___ day", nice and beautiful will be the most common, and "delightful" will be less probable; but if you ask whether it is plausible for "delightful" to go there, it is. So depending on whether you are working with very common words or with more unusual words, you will apply one method or the other. But for sure, in both cases you know that the answer for this missing word is not "bicycle".
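Coming back to that example in a moment; first, as a practical aside, the choice between CBOW and skip-gram is usually just a flag when you train. A minimal sketch with gensim (assuming gensim >= 4.0), on a toy corpus that is far too small to give meaningful vectors:

```python
# Sketch: training CBOW vs skip-gram with gensim. Real corpora have millions
# of sentences; this toy list only shows where the sg flag goes.
from gensim.models import Word2Vec

sentences = [
    ["yesterday", "was", "a", "really", "nice", "day"],
    ["yesterday", "was", "a", "really", "beautiful", "day"],
    ["open", "source", "is", "great"],
]

# sg=0 -> CBOW: predict the missing word from its surrounding context
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the context from the word; often the better
# choice for small corpora and rare words
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv.most_similar("nice", topn=2))
print(skipgram.wv.most_similar("nice", topn=2))
```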
You will never say "yesterday was a really bicycle day." You can say other things, but not bicycle. So just keep this in mind; this example of the beautiful day I took from a Stack Overflow explanation that is pretty good. You will prefer skip-gram when you have small amounts of training data; it works well even with rare words or phrases. But if you have enough data, it's better to work with continuous bag of words, because it will be faster and slightly better with frequent words. So this depends a lot on the application. I'm including this because, if you are already playing a bit with word embeddings, this is one of the parameters you need to select, and it's a question I get a lot; so here you have an answer. In general, I would say that if you have enough time, try both, because it will depend a lot on your application.

So until now I've talked about the theory of word embeddings. Let's talk very briefly about implementation, so how you can use this in practice. Now that you understand that a word embedding is a transformation of a word into a vector, and that this allows machines to work with words, basically by measuring the distances between these words and building mathematical relationships between them, you can go directly to the implementations. I think there are three implementations that people most usually work with. The first one to become popular was word2vec, which was done by people at Google. You can download word2vec; it's on GitHub, it's open source, and you can play with it. There is another approach that is a bit different, GloVe, which, if I'm not wrong, comes from researchers at Stanford. They have different applications, but the idea is always similar, the idea we have been discussing. And then you have fastText. fastText has been developed by researchers at Facebook, but it's also open source and you can download it for free from GitHub.

Oops, I'm having some problems with that. I don't know what happened. Let me refresh this; give me a second. Okay, I have a problem with some slides. I will share my full screen for a moment and then I will go back to this, okay? Please, can you confirm that you're... Yeah, I'm seeing your whole screen right now. Okay, let me go here. So I will show these two slides that for some reason are not loading, and then I will go back to presentation mode.

So I was saying that we have these three versions, word2vec, GloVe, and fastText, and I'm working with fastText for many reasons. One is that it can work standalone, so you don't need to use Python; you can just use it directly. And there are many applications, because it allows both supervised and unsupervised tasks. I told you that this is completely unsupervised, meaning that you are not doing classification here, but with fastText you can also give it a document and a tag, let's say a topic, and then ask fastText to categorize new documents into topics. It also uses some subword information, so it takes advantage of knowing that dog and dogs are kind of similar. And the most important thing, and the reason why I'm working with it, is that they are sharing pre-trained embeddings in multiple languages. I don't remember exactly how many languages now, but I think there are more than 100, I think it's 170. So you don't need to train your own embeddings.
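A minimal sketch of getting those pre-trained fastText vectors, assuming the fasttext Python package; fasttext.util.download_model fetches the newer Common Crawl plus Wikipedia models (several gigabytes on disk), while the Wikipedia-only files mentioned in the talk can be downloaded manually from the fastText website.

```python
# Sketch: download and query pre-trained fastText vectors for one language.
import fasttext
import fasttext.util

# Downloads cc.en.300.bin into the current directory if it is not already there.
fasttext.util.download_model("en", if_exists="ignore")
model = fasttext.load_model("cc.en.300.bin")

print(model.get_dimension())                 # 300-dimensional vectors
print(model.get_word_vector("history")[:5])  # first values of one word vector
print(model.get_nearest_neighbors("history", k=5))
```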
If you want to play with this, you can just download them and play directly with the embeddings. And this is something where I'm not sure I'm being clear enough, but you have two parts here. You have the part where you train the embedding, which I'm not really talking about; that is when you convert the words to vectors. And then you have the part where you use the embeddings, which is, for example, the example I showed before with Steve Jobs. So you don't always need to train your own embeddings; you can just download pre-trained embeddings, and then you already have the numbers to start playing with.

I think this slide was also not loading, so I will show it here. In practice, word embeddings, which I showed you in two dimensions, usually have between 150 and 300 dimensions. So instead of the two positions that we can visualize in a plot, real word embeddings have that many dimensions. The pre-trained fastText version for Wikipedia, if I'm not wrong, works with 300 dimensions. So let me see if I can present again. Yeah, now that's okay.

So this was the first part, about what a word embedding is. Remember: a word embedding transforms strings into vectors; words with similar meanings will have similar vectors; and embeddings are computed using the words' context, which means they depend a lot on the corpus you are using. These are the three main takeaways from this first part.

Now that we know what an embedding is, and that embeddings are vectors, we can start talking about cross-lingual word embeddings. The use we are giving them now in our projects, and where we have seen they work very well, is when you want to compare text across different languages. Here, for example, you have the section headings in five different languages: Russian, Japanese, English, Farsi, and Spanish. And you want to know whether "Premios" in Spanish, or a word in English, corresponds to this word in Farsi that I cannot read. So you can start measuring these distances and try to create these kinds of alignments, and this is very, very useful for many reasons; I will show some examples later.

Why this is technically difficult, and where many people are contributing, and where we have shown that we can also contribute, is that, as I said before, the values of the vector, the positions you saw before, don't have a meaning by themselves. Basically, every time you train your embedding on a different corpus, the positions where things end up will be different. And if you train in different languages, you will usually be training on different corpora: one corpus will be the English Wikipedia and the other will be the Spanish Wikipedia, so the positions of the vectors will be different. If we go back to this example, as I was saying, this position here doesn't have a meaning in itself; it has meaning only because of its relationship with the other sections. And if I train another embedding for Spanish, words that are similar will be in different positions, because the red ones were trained in one run and the others were trained in another run. So they are basically in different places, and what you want to do here is to align them somehow.
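To see the problem in code: a minimal sketch below, assuming local copies of two pre-trained fastText models (here cc.en.300.bin and cc.es.300.bin, downloaded as in the previous sketch). Both give 300-dimensional vectors, but because the spaces were trained independently, a raw similarity between an English and a Spanish vector is essentially meaningless until the spaces are aligned.

```python
# Sketch: two independently trained embedding spaces are not directly comparable.
import numpy as np
import fasttext

en = fasttext.load_model("cc.en.300.bin")  # assumes the files are already downloaded
es = fasttext.load_model("cc.es.300.bin")

v_history = en.get_word_vector("history")    # 300-dimensional
v_historia = es.get_word_vector("historia")  # also 300-dimensional

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same dimensionality, but the axes of the two spaces are unrelated, so this
# number carries no cross-lingual meaning before alignment.
print(en.get_dimension(), es.get_dimension())
print(cosine(v_history, v_historia))
```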
And you have two solutions for that. One solution is to use pre-aligned word embeddings; the most famous one now, which you can play with, is Facebook's LASER. Basically, what they do is use some anchors from the beginning, forcing words into positions they know in advance, so they can create aligned word embeddings from the start. Or you can learn your own transformation using a bilingual dictionary, and then you can basically rotate the space to make the embeddings fit. What we are doing now is, for example, using Wikidata information to create these kinds of alignments, and we are doing some work to show that this alignment using Wikidata works better than other approaches. Basically, out of these four points, you know the mappings for two of them in the other language: you know that "Discografía" is the same as "Discography", or that "Publicaciones" is the same as "Publications", and then you can rotate the space to put them in similar positions. And as you see, the ones we knew beforehand are more or less aligned, but the ones that were not in our mapping are now also aligned. So then, using something like what we saw before, where climate change was the most similar concept to global warming, we can start creating these cross-lingual alignments.

It is important to keep in mind that these alignments are far from perfect. Again, as in the example I showed before, we are learning analogies, not identities. Many times the words that you find will be similar but not identical, so you shouldn't use cross-lingual embeddings for direct translation. They are not designed for translation and won't work well for it; that is not the aim. The aim is to measure the semantic distance between words in two different languages, to know whether they are related or not, and to rank them by some notion of similarity. Where this really works well is not translation, but when you have a small set of candidates. If you try to do this alignment with all the words in Spanish and then get the most similar word in English, it won't work well, because there are too many candidates for the alignment. But it will work well when you have a small set of candidates. In the example of the section alignment that I'm showing, you have on the order of tens of thousands of section headings in each language. Then it's easier, because the algorithm doesn't need to compare "Premios" with all the words in English or all the words in Japanese, just with those tens of thousands, and it will know that "Premios" is more likely to be "Awards"; it is more similar to "Awards" than to "Biography". And keep in mind the same works for sentences: the algorithm will be good at knowing that the distance between "Buenos días" and "good morning" is smaller than the distance between "Buenos días" and "thank you", which would be "gracias". So if you want to compare some candidates and rank them, this is a very good use of cross-lingual embeddings. If you want to do automatic translation, cross-lingual word embeddings are not for that; for that there are other kinds of solutions, like machine translation, which is a quite different approach.

I'm almost finishing. Let me show you some examples. I gave you a lot of theory; I hope you are not worried at this point. So I will show you how we are using this.
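A minimal sketch of the "learn your own transformation" option, using the standard orthogonal Procrustes solution: given pairs of vectors from a seed dictionary (for example, pairs of labels taken from Wikidata), learn a rotation that maps the source space onto the target space. The matrices below are random stand-ins; in practice they would hold the real vectors of the dictionary word pairs.

```python
# Sketch: align two embedding spaces with orthogonal Procrustes.
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 300, 1000
X = rng.normal(size=(n_pairs, dim))  # source-language vectors of dictionary words
Y = rng.normal(size=(n_pairs, dim))  # target-language vectors of their translations

# Orthogonal Procrustes: W = U @ Vt from the SVD of Y^T X.
# W rotates a source vector into the target space: W @ x ~= y.
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

X_aligned = X @ W.T  # all source vectors rotated into the target space

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After alignment, cosine similarity across languages becomes meaningful
# (with real vectors; with these random stand-ins it is just a demonstration).
print(cosine(X_aligned[0], Y[0]))
```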
The example that I'm using for all these plots is the section headings alignment across languages. For many reasons, we want to create an alignment between section headings, and we don't want translations, because we don't want to impose the style of one Wikipedia on another. So if in the English Wikipedia people are using the section "Early life", we want to see what the most similar section title is in Spanish, or in French, or in Japanese, or in Russian, without translating literally. The results I'm showing here use mainly cross-lingual embeddings, but also other features that I don't have time to explain now, and for this we are creating an API.

And this is more or less what we are getting out of it. If we take an input language and an output language, English, and we ask the algorithm what the most similar sections in English to "Historia" are, we'll see that the first one is very accurate, "History", but then you have "Story", which is not the best, though they are kind of similar. The number you have here is the probability of being similar: for this one we have 0.97, so we are pretty sure that "History" is the best mapping of "Historia", and we know that "Origins" is somehow related, but it's much less probable than the top one. There's an API you can play with that is online now; the documentation is on Meta. If you look for the section alignment API on Meta, you will find a lot of examples of this.

And it's not only Spanish to English; it's good to show that this also works across different scripts. You can give the section "History" in English and ask for the most similar output in Russian, which is written in Cyrillic, and if you speak some Russian you can see that the mapping is also pretty good. And also Spanish to Japanese, and this is one of the cool things: you won't find many people who can translate from Spanish to Japanese, but with this kind of approach that is not as important, so you can basically learn unusual language pairs, like this mapping of "Historia" to Japanese. If there's a bilingual Japanese and Spanish speaker in the audience, they can give some feedback. English to Russian I think I already showed. What we currently support is Arabic, French, English, Spanish, Japanese, and Russian; this is just because we were looking for languages from different families and with different scripts, but the code is there and we will be happy to extend this to other languages.

And in fact, this is what we have done for another application, which is the alignment of named template parameters. This is something we were working on, and it is already implemented now in the Content Translation tool; so thank you to the Language team, Santhosh and Pau, the people I have been working with most on this. When you're translating articles with the Content Translation tool, it does a lot of improvements on top of what the machine translation engines are doing, but one problem they had was translating templates, especially templates with named parameters: templates where you have parameters with names that differ in every language. Using Wikidata information, you can know that two templates are supposed to be the same in two different languages, but then for the parameters inside them you have no mapping, and they don't even have the same number of parameters.
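This is not the actual API, but a minimal sketch of the "small set of candidates" idea behind it: given vectors that already live in an aligned cross-lingual space, rank a handful of English section titles by their similarity to a Spanish title. The vectors below are random placeholders standing in for real aligned embeddings.

```python
# Sketch: rank a small set of candidate section titles by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(query_vec, candidates):
    """candidates: dict mapping a section title to its aligned vector."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Random placeholders; in practice these would come from aligned fastText models.
rng = np.random.default_rng(1)
historia_es = rng.normal(size=300)
english_sections = {title: rng.normal(size=300)
                    for title in ["History", "Story", "Origins", "Awards", "Discography"]}

for title, score in rank_candidates(historia_es, english_sections)[:3]:
    print(title, round(score, 3))
```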
So the template about a sportsperson in English has a different number of parameters than the template about a sportsperson in Spanish, but we want to find at least the ones that are most likely to be the same. And the solution, which is not perfect but is at least better than what the Content Translation tool had before, is the use of cross-lingual embeddings. We have implemented this for the most popular language pairs in the Content Translation tool, creating the mappings across all of them. I think there are 40 languages, and you can check more about this in the Phabricator task listed at the bottom of this slide.

Just to show one example: this is the infobox about a motorcycle rider, which in English has a lot of parameters. You can see here just part of them; it continues on the next page, and this is not even all of them. And we want to show that we can map this to the Hebrew version: Hebrew, a different script, a language from a different family, even written right to left. I cannot read Hebrew, but what I can see is that it has many fewer parameters. So we have several complexities here: the mapping is not one-to-one, so they are not exactly the same parameters, but we want to find the similar parameters across languages from different families and different scripts. And we don't do that badly. If you speak Hebrew, you can check it here. In this case the number is a distance; before I was showing a probability, where higher was better, but here smaller is better because it is a distance. So we know that between "bag number" and this word in Hebrew that I cannot read, the distance is very small. For "name" the distance is higher, but we are setting a threshold, and from what I checked, all these mappings look more or less good. Here are some examples in Spanish, in case you speak these two languages; the Infobox publisher also maps pretty well.

And one cool thing is that, thanks to the subword characteristics of fastText, this is very resilient to things like these underscores. In many cases we have "surname 1", "surname 2"; this kind of thing also works well in fastText. You can map them even when here you have one word and there you have two words, or one word with an underscore; fastText will do a good job with that. If you have "image 1" and "image 2", this will also be captured by the fastText embeddings.

One last warning: cross-lingual word embeddings won't be as good as bilingual humans at creating alignments. As I said at the beginning, computers are very good with numbers, and artificial intelligence and machine learning are very good at many things, but in natural language processing they are still very far from an average human. But they have some advantages, and there are use cases for them: they can do the work really fast and can work all day, so you can do this parameter alignment in a short amount of time. With the current implementation and the process we are using, in theory you can apply this to all the languages in Wikipedia, even the smaller versions. As I said before, the fewer articles a language has, the less accuracy you will get, but still, for things like the ones we saw, like mapping "name" or "image size", it will work well. And maybe the coolest thing is that this works well for unusual language pairs.
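A minimal sketch of how such a parameter mapping with a distance threshold might look, relying on fastText subword vectors so that tokens like "surname_1" or "image_2" still get usable representations. The model file names, parameter lists, and threshold value are all hypothetical; the talk's actual pipeline is in the linked repositories.

```python
# Sketch: map template parameters across languages using aligned fastText
# vectors and a distance threshold. Everything named here is illustrative.
import numpy as np
import fasttext

en_model = fasttext.load_model("aligned.en.300.bin")  # hypothetical aligned model
he_model = fasttext.load_model("aligned.he.300.bin")  # hypothetical aligned model

en_params = ["name", "nationality", "bike_number", "image_1"]  # illustrative
he_params = ["שם", "לאום", "מספר"]                              # illustrative

MAX_DISTANCE = 0.6  # pairs farther apart than this are left unmapped

def distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

for p_en in en_params:
    v_en = en_model.get_word_vector(p_en)
    best = min(he_params,
               key=lambda p_he: distance(v_en, he_model.get_word_vector(p_he)))
    d = distance(v_en, he_model.get_word_vector(best))
    print(p_en, "->", best if d <= MAX_DISTANCE else "(no mapping)", round(d, 3))
```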
When we were evaluating the results of the section alignments, we needed to find people who could translate and tell us whether the mapping from Spanish to Japanese was good, and that kind of person is super difficult to find. But with cross-lingual word embeddings, even if there is no one who can speak those two languages, you can learn on a pair of languages that you already know and then apply it to a completely unknown or unusual pair of languages. And that's a pretty cool application.

So just to close: if you need to remember three things from this talk, they are that word embeddings allow machines to understand similarity among words, that cross-lingual embeddings allow machines to compare words across different languages, and, please, if you are working with Wikipedia, use word embeddings trained on Wikipedia. Either train your own embedding or download a model pre-trained on Wikipedia itself. If you use legal text or some other kind of news corpus, the results won't be as good as if you train on Wikipedia. And to finish, if you want to know more about other possible uses of embeddings (here I was talking just about word embeddings, but as I said, there are document embeddings and there are image embeddings), you can check our white paper, written mainly by Adnan Houthaker, about topic embeddings, so how we can use embeddings to find similar topics in documents. And if you want to do some hands-on work, please don't hesitate to contact me, but you can also go and clone this repository, which is the Jupyter Notebook version of these slides. You will find similar things to what I mentioned here, but with code in Python, so you can start playing with your own embeddings. With a little knowledge of Python, you can start playing with it. And I think that's all, so I'll be happy to take questions. Thank you.

Awesome. Diego, I'm going to check the IRC and the live stream, and I'll send some questions your way in just a second. Thank you. So we have a question from the live chat: what about sense embeddings? Wikidata has a notion of senses for lexemes. I notice that senses can also be used to build embeddings. That was from Thomas on the live chat.

Yeah, so Wikidata is a bit different. In our team, Isaac Johnson is doing some work on that using word embeddings, but another approach that can be interesting for Wikidata is to work with graph embeddings. In a graph embedding, you take advantage of the network: basically, you create a graph, a set of connections between different Wikidata items, using the properties as edges. And with this you can learn that, I don't know, two cities that connect to the same country are kind of similar. There's a lot of work now in graph embeddings; there's an algorithm called GraphSAGE, S-A-G-E, that can be used for these kinds of purposes. And the idea would be similar: you represent a Wikidata item with a vector, and you can do these kinds of operations. I'm not aware of results on that, but it's for sure something that I would love to explore, and if there are people who want to work on that, I'll be happy to collaborate.

Awesome. And then I see a question from Subu on the IRC chat: I am curious how the vectors are computed in the first place. I know you mentioned CBOW and skip-gram as training models for computing these. And how do you pick the number of dimensions? Is that also an output of the models?

Yeah. So basically, this is done by one-hot encoding.
You define a window. For example, and this is not a perfect example, let's take this sentence, say "predicting words by the context". You take a central word, which in this case will be "by", and then you take the two words surrounding it. You can treat this as a bag of words, or you can take the positions into account. But basically, you start building a one-hot encoding that represents "by" in this case, plus the two words before and the two words after "by", and then you move the window to the right, like a sliding window. With this, we compute the probability of a word being surrounded by, or co-occurring with, other words. There are different ways of initializing the values; many approaches randomize the initial values and then adjust the vectors so that they match these probabilities.

In terms of the dimensions, no, this is a parameter of the model that you need to define in advance. I've seen some work showing that the best results for word embeddings are obtained between 150 and 300 dimensions. If you use more than 300 dimensions, the accuracy won't improve. Basically, they have a list of words that they know are related, and they can measure accuracy with that; after 300 dimensions the accuracy doesn't improve and it uses more resources. So what is used most now is between 150 and 300. If you use this in production, you might want to go for 150, if the small difference in accuracy between 150 and 300 is not very important for you, because you will use less memory, fewer resources. But this is an input, and all the pre-trained models you will find in practice have between 150 and 300 dimensions, usually exactly these two values; I haven't seen much with 200 or 250, I always see 150 or 300.

Okay, awesome. I am not seeing any other questions on the live stream or on IRC. We can probably give it about one more minute just to see if something shows up.

Let me use these seconds to say that all the code for creating the alignments, for the APIs we are using, and for the template parameters, which is based on dumps, is all on GitHub. I can share the pointers, and there are also these notebooks, including the specific notebook I pointed to, if you want to play a bit. If you want to play with this, my only recommendation is to do it on a machine that has some RAM, like 16 gigabytes; each Wikipedia model takes around five gigabytes, and you don't need all that memory all the time, but for loading you will need at least those five gigabytes of RAM.

Thanks, Diego. And any of those links that you have, we can also share on the Tech Talks page, so that people can go directly from there. I don't see any further questions, so thank you again for presenting. This is super interesting, and I'm really glad that you were able to share this information with us. I will post information to the Tech Talks page after this, including a link to this talk, and have Diego upload the slides as well. And then we'll have another talk next month, and again, just a reminder that folks are welcome to join and participate in these talks; all they need to do is reach out. Okay, that's it. Thank you. Thank you.
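As a closing illustration of the sliding-window step described in that last answer, here is a minimal sketch that only extracts the (center word, context) pairs a CBOW or skip-gram model would then be trained on; the sentence and window size are just the ones from the example.

```python
# Sketch: slide a window over a sentence and collect (center, context) pairs.
def context_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        pairs.append((center, context))
    return pairs

sentence = "predicting words by the context".split()
for center, context in context_pairs(sentence):
    print(center, "->", context)
```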