There we go. So good morning everyone, 9:30 a.m., New York City, live. Here we are for another class of deep learning, and this time we have Marc'Aurelio Ranzato with us, who is going to be talking about translation and many interesting things, especially the low-resource setting, where there is a scarcity of data. Marc'Aurelio is a research scientist and manager at Facebook AI Research, in the lab here in New York City. He is generally interested in machine learning, computer vision, natural language processing, and more generally artificial intelligence. His long-term endeavor is to enable machines to learn from weaker supervision and to effectively transfer knowledge across tasks, possibly leveraging the compositional structure of natural signals. Moreover, he's from Padua, Italy, and he's a fellow engineer: electronic, electrical... I guess it's "electrical" in English. After spending several months at Caltech, he started a PhD in computer science at New York University with our common supervisor here, Yann LeCun. Afterwards, in 2009, he joined Geoffrey Hinton's lab as a postdoc. In 2011 he moved to industry and was one of the very early members of the Google Brain team, and in 2013 he joined Facebook and was a founding member of the Facebook AI Research lab. So, with no further ado, I'm looking forward to listening to this spectacular lecture from Marc'Aurelio.
In fact, you could say that Marc'Aurelio was at Facebook before I was, preceding me by about six months. You could say, to some extent, that he kind of hired me at Facebook.
Nice, nice. All right, I will disappear from the camera.
Thank you for having me. It is a great honor to be here and to tell you a little bit about machine translation, and in particular machine translation when you don't have a lot of labeled data. The idea is that hopefully you will learn about a practical application of deep learning, in this case machine translation, but there are some principles that you can hopefully generalize to other applications.
So let's review what machine translation is. Everything starts with data, so let's assume that we have what we call a parallel dataset, which is a dataset that consists of a collection of parallel sentences: a sentence in the foreign language, let's say Italian, with a corresponding translation in English, which is, let's say, our target language. So we have a large dataset with a lot of sentences in Italian, each with a corresponding translation in English. Okay, so this is our labeled data, and now let's review how you train a machine translation system to fit this data and hopefully generalize to new sentences, in Italian in this case.
The way this is done these days is by training a neural machine translation system. The architecture these days is a transformer, and it is trained by stochastic gradient descent. The way it works is that you model the joint probability of the ordered sequence of tokens in the target sentence given the source sentence, and of course you try to maximize this probability over the set of parameters of your model, right?
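In symbols, this is the standard way to write the objective that gets unpacked next; T is the length of the target sentence y, x is the source sentence, and θ are the model parameters:

```latex
\max_{\theta} \; \log p(y \mid x; \theta)
  \;=\; \max_{\theta} \; \sum_{t=1}^{T} \log p\left(y_t \mid y_1, \dots, y_{t-1}, x; \theta\right)
```

Each term of the sum is an ordinary classification problem: predict the next target token given the prefix and the source.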
Now, one way to maximize this: we use the product rule of probability theory, and so the joint probability of the whole sequence of tokens, let's say "the cat sat on the mat", where each token is one word, so you have a string that is composed in this case of a handful of tokens, you can write as the product of the probability of each token given the prefix and the source sentence. In log space this product becomes a sum, and that's what you see here (I hope you can see my cursor): you write this joint distribution as the sum of the log probabilities of one token on the target side, in this case a word from the English sentence, given the prefix and the source sentence. So essentially everything boils down to a bunch of classification tasks, right? It's very easy, and all these classifiers share parameters.
So let's dive a little more into the details to make sure we understand how this works. Let's say that our source sentence is an Italian sentence which is translated into English as "the cat sat on the mat". Now I'm going to abstract away the details of the architecture: it really doesn't matter whether it is a recurrent neural net, a convolutional neural net, or a transformer; what I say applies to all of these, because I'm talking about the algorithm rather than the specific architecture. I borrowed this old diagram from Yann, where circles are variables and boxes are transformations between variables. In this case I'm depicting a recurrent net, but very much the same holds if you have a transformer: it just means that this hidden state is produced by a transformer block that depends on the input token embedded here and on all the previous hidden states.
In this case we're doing language modeling, because we are not considering the source sentence. In language modeling, given the word at time t you are trying to predict the word at the next time step; so from "cat" you're trying to predict "sat". But now we also have the source sentence, right? So we should be able to do better than a language model, because we actually have access to the source sentence, and we should be able to make a better prediction if we also consider the information from the source. The way we do this is by taking the feature representation at time t plus one, call it z_{t+1}, and, at the same time, representing each word in the source sentence. So let's say this is your recurrent neural net or transformer: for every input token you eventually produce a representation. If it is a transformer, for every token at the input you produce another vector at the output, right? So you get, let's say, a d-dimensional vector here, which matches the dimension of z_{t+1}. And what you do is take a dot product between this hidden state and all the hidden states coming from the source tokens, and after the dot product you take a softmax, so now you have a distribution over source tokens.
So you have weights that are positive and sum to one, and then you produce a representation of the whole source sentence that is conditioned on this z_{t+1} on the target side, by multiplying these weights by the values, where the values could be some transformation of these input representations of the tokens. After you take a weighted sum with these weights you get a vector, and then this vector gets combined with z_{t+1}: let's say you concatenate them, you do some sort of transformation, you may have multiple layers, and eventually you predict a distribution over the next word. And you can apply a cross-entropy loss on this distribution over the next word, because you also have the ground truth: you know what the next word in the sequence is. You do this for every token at every position, and then you can backprop through the whole thing. So that's a bit of an overview of the mechanics of how to train a sequence-to-sequence model, for machine translation but not only.
The cool thing about this is that you don't need to align. For a long time, people in machine translation were a little obsessed with how to align a word on the target side with a word on the source side. Here this is learned by the model, and it is done in a soft way, essentially by using this softmax. The other nice thing is that if you are using a convolutional neural net or a transformer, all these representations can be computed in parallel, and so this is also very efficient.
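Here is a minimal sketch of the dot-product attention step just described, in PyTorch, with made-up dimensions; real systems add learned projections for queries, keys, and values and use multiple heads, which the lecture abstracts away.

```python
import torch
import torch.nn.functional as F

d = 512                      # hidden dimension (arbitrary choice)
S = 7                        # number of source tokens

h_src = torch.randn(S, d)    # encoder states, one per source token
z_t   = torch.randn(d)       # decoder state at the current target position

# Dot product between the decoder state and every encoder state,
# then a softmax: positive weights that sum to one.
scores = h_src @ z_t                   # shape (S,)
alpha  = F.softmax(scores, dim=0)      # soft "alignment" over source tokens

# Weighted sum of the (possibly transformed) source representations:
# a single vector summarizing the source, conditioned on z_t.
context = alpha @ h_src                # shape (d,)

# Combine with the decoder state (here: concatenate) and predict
# a distribution over the next target word.
vocab_size = 10000
W = torch.randn(vocab_size, 2 * d)     # stand-in for the output layer
logits = W @ torch.cat([z_t, context])
p_next = F.softmax(logits, dim=0)      # fed to cross-entropy against the ground truth
```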
Okay, so this is how you train a neural machine translation system. But what we described here, with this joint probability, is scoring: for a given source sentence, you produce a score for a candidate sentence in the target language. This doesn't mean that you can generate. The only thing you can do is: if I give you a sentence in Italian and another sentence in English, I can give you a score that tells you how plausible it is that this sentence is actually a translation of the source sentence. So how do we generate? There are many ways. The most popular, and also a very effective one, is beam decoding. Beam search tries to find, in an approximate way, the maximum of the probability distribution of y given x: your target given the input. You don't have y, so you maximize over y; you are given x, and you are searching over the space of y's. Of course, the space of y's is very, very large, so you cannot do it exactly, and therefore you use an approximate technique like beam search.
The way beam search works is that you start with a beginning-of-sentence token; that's the very first step of your decoding. From the beginning-of-sentence token you produce at the output, at the very top, a distribution over what the first word of the sentence is, and each of these words gets a log-probability score. So here we have as many branches as there are tokens in your vocabulary. In this case I'm using words, but it could also be characters or n-grams.
So let's say "life" has a score of minus one and "today" has a score of minus 0.5. Then what beam search does is, among all these branches, select the top k scoring ones. Let's say that for us k is equal to two: among all these scores, we keep the two highest, "life" and "today", which score minus one and minus 0.5. After this, beam search goes to the next step and says: okay, how about I expand "life" into all possible next tokens? So you expand vocabulary-size branches from "life" (you get "life is" and so on and so forth) and the same with "today". For every next token you also get a log-probability score (or a score in general; it doesn't need to be normalized), and now you compute a score for each path: the score of a path is simply the sum of the scores on the arcs that compose it. In this case you get minus one plus minus one, which is minus two, and similarly for "life is", "today this", and "today that". Again, you have in this case two times vocabulary-size paths, and you select the top k with the highest score. So between, let's say, minus two, minus 1.5, minus three, and minus 3.5, you select "life was" and "life is", and you repeat the process: you drop the other branches, keep "life was" and "life is", expand each by one arc for every token in the vocabulary, compute the score of each path, and select the top two scoring paths, in this case "life is beautiful" and "life is great". You keep proceeding like this until you hit an end-of-sentence token, and at the very end you select the path with the highest score, in this case this path. So that's the basics of beam search, which is a greedy procedure. There is a trade-off: the bigger the value of k in your top-k selection, the better the approximation, but also the higher the computational cost. In practice, beam search is pretty effective at finding high-scoring paths.
But it also has its own issues. For instance, let's say that in your dataset "La vita è bella" was translated two thirds of the time as "life is beautiful" and one third of the time as "life is great". When you do beam search, it always picks the most likely path, so, assuming the model is well calibrated, it is going to translate the sentence as "life is beautiful" 100 percent of the time. That means the translations your system produces are going to be biased, just because of the nature of the max selection. And this may not be good. For instance, say you go from a language that is not inflected, like Chinese or English, to one that is inflected, like Italian or French. If you say, I don't know, "The truck driver is strong": well, in English I don't know the gender of the truck driver, but in Italian or French, for sure, you need to specify the gender. Now, if you use beam search, you are going to choose one of the two genders 100 percent of the time, whichever was more frequent in your dataset, and therefore you are going to introduce some sort of bias. So this is one of the problems with beam search: it doesn't handle uncertainty well.
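A minimal sketch of the beam search procedure just walked through; `log_prob_next` is a hypothetical stand-in for the model, returning log-probabilities over the next token given a prefix.

```python
import heapq

def beam_search(log_prob_next, vocab, k=2, max_len=20, eos="</s>"):
    # log_prob_next(prefix) -> {token: log-probability} wraps the model (hypothetical).
    beams = [(0.0, ["<s>"])]               # (path score = sum of arc scores, prefix)
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix[-1] == eos:          # complete path: carry it over unchanged
                candidates.append((score, prefix))
                continue
            lp = log_prob_next(prefix)
            for tok in vocab:              # expand one arc per vocabulary token
                candidates.append((score + lp[tok], prefix + [tok]))
        # greedy step: keep only the k highest-scoring paths
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        if all(p[-1] == eos for _, p in beams):
            break
    return max(beams, key=lambda c: c[0])  # highest-scoring complete path
```

The parameter k is exactly the trade-off knob mentioned above: larger k gives a better approximation at a higher computational cost.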
Yes: inflected means that the language specifies, for instance, gender and number, so the ending of a word changes with gender and number. So, let's say, "beautiful" in English goes for both male and female... actually, this is a bad example. All right, take the noun "driver": in English it applies to both female and male, but in Italian you need to specify the gender, right? Okay.
So now you may say: well, since we have a scoring function that is a normalized probability distribution, what you could also do is generate by sampling from the model. And that's a totally reasonable thing to do, except that, because the distributions are produced by a softmax, you never assign exactly zero probability to any token in the vocabulary. And if you look at the actual data distribution, it's very spiky: there are a lot of words that have zero probability mass, but the model must spread probability mass everywhere. That means that when you sample, you are pretty likely to hit places that are not very good. So usually sampling produces sentences that are not very fluent: very diverse, but not very fluent, not very good quality.
So people have come up with ways in between, like top-k sampling, where you say: at every time step I have a distribution over the next token; I select the top k highest-scoring words and sample from those. So it's a biased sampling. And there is a huge literature on how to generate, how to decode. There are re-ranking approaches, and discriminative re-ranking in particular, where you essentially backpropagate through beam search, so that instead of training for scoring you actually train to generate. If you're interested in these topics, I'd be happy to follow up with you and send you references. It's very nice, and I think it also connects with what was discussed earlier in the course. Okay, so for this part I will stop here, unless you have questions. This was just covering the basic background of what machine translation is: the data, how you train it, and how you use it.
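A sketch of the top-k sampling idea just mentioned: restrict the distribution to the k highest-scoring tokens, renormalize, and sample.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=10):
    """Biased sampling: keep only the k highest-scoring tokens, renormalize, sample."""
    topk = torch.topk(logits, k)                   # values and indices of the top k
    probs = F.softmax(topk.values, dim=-1)         # renormalize over the top k only
    choice = torch.multinomial(probs, num_samples=1)
    return topk.indices[choice]                    # index back into the full vocabulary
```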
Okay, so there is a question from a student here. The question is the following: how does this work intuitively? "I think the first step is what we did in homework three, which was handwriting recognition, to get a set of embeddings. For the second part, I am not quite sure how the dot product and softmax replace the alignment. Is it something like: the dot product is trying to find which embeddings align with z_{t+1}? How are the source and target sentences aligned?"
Okay, so the idea here is that you have some representation coming from the prefix, and now you want to include a representation of the source. In particular, in order to predict the next word, you need to figure out, essentially, what part of the source sentence the prefix maps to. In this case, in this diagram, "sat" corresponds to this sequence of two words on the source side, and the model should pump up the probability of the translation of the word which in English is "on". If I were to do it by hand, I would say: I need to align; I need to figure out that "sat" corresponds to these two source words, and the next thing to translate is this word, which in English translates to this one. That's manual alignment. What I described with this diagram essentially does it for you: you let the model learn the alignment.
How does it work? You embed each token, as you were saying, and you have some way to refine these representations to produce contextualized embeddings, for instance if you have a transformer block here. Now, from these representations of the source words, we take a dot product with this hidden state. That essentially tells you which word in the source sentence matches the current word on the target side; that's what the dot product does. When you take the softmax, you are converting those scores into something that is normalized, like a probability distribution: the sum of all these numbers is one and they are all positive. Let's say you get a high score here, for this token that corresponds to this word. Now you need to convert these scores into a vector, because that's what you need to plug into the decoder on the target side. To do that, you take, for instance, these vectors, maybe you transform them, and you take a weighted sum with these scores, and that gives you a single vector that represents the whole source sentence for this specific target word. So it does alignment because the dot product is doing a kind of matching, and in practice, if you look inside a machine translation system at these scores, they actually correspond to a somewhat interpretable alignment of the target sentence with the source sentence. I don't know if that was clear.
"Yeah. And on the right-hand side, that z_{t+1} comes from a model that is also untrained, right? Or is it pre-trained?"
So, initially... it depends. If you have a very large labeled dataset, then you start from randomly initialized weights, and so initially this representation is going to be random. But you backprop through everything: you get the cross-entropy loss here, gradients flow here, you update the parameters there, and the gradients also flow through the encoder. Now, if you don't have a lot of labeled data, then you may want to pre-train, and we'll go into that in the next few minutes.
"And so, yes, you can pre-train with a language model, like a language model on the right-hand side?"
Yes, you can pre-train with a language model; you can pre-train with a language model that is trained on multiple languages at once, sharing parameters between the encoder and the decoder. There are many ways, like BERT, which I think you have seen. Yes, you can apply BERT on multiple languages at once, and we will do that too.
"Okay, makes total sense."
Oh, so the student is actually following up: how do you feed back the fact that some sentences are uncertain? If you train for a long time with only one target sequence per source, aren't you going to eventually bias this network? Isn't it going to end up assigning essentially all the probability mass to that one translation, unless you stop early?
Could you repeat the second part, the very last sentence?
If you train for a long time with only one sequence-to-sequence pair, aren't you going to eventually bias the network?
Okay, so let's say... it's very much like with MNIST. Say that for the class of zeros you add an extra class, class 11, and whenever you see a digit zero, with probability 0.9 you assign the label zero, which is the ground-truth label, but with probability 0.1 you assign class 11. So now, whenever you see a digit zero, it could be class zero or class 11. You can train this thing, and because you are trying to match the output distribution, assuming you're not overfitting and all of that, what happens is that the output distribution is going to have 0.9 probability for label zero and 0.1 probability for label 11. Now, if you predict by taking the max, you're always going to predict label zero, 100 percent of the time. But if you were to sample, then, assuming the model is well calibrated and doesn't overfit, you would actually match the distribution in the data. Same here: if you have a sentence that translates in multiple ways, the model should give you a probability distribution over all possible translations, assuming the model is well calibrated.
Okay, makes sense.
Yeah, and this is actually a whole topic, because if you want, you can help the model deal with uncertainty by introducing latent variables. You can have a latent variable where a certain state of the latent variable allows you to translate in a certain way. Let's say you go again from a non-inflected language to an inflected one: then you can have one state of the latent variable that produces everything in feminine singular, another state for feminine plural, and so on and so forth. So you can certainly play with the model to better represent this uncertainty.
I think we saw recently some news about Google Translate and how, translating back and forth between Turkish and English, the translation basically goes one way only.
Right, and it's not surprising, again, just because usually when you decode with beam search the translation is very fluent, but it is, you know, the majority class. Of course, if you go back and forth between something that is not inflected, like English, and Turkish, which is highly inflected, you're going to lose a lot, because the system doesn't give you a distribution over translations; it just gives you one. And this is a problem also with, I guess, the UI.
Any other questions? These are excellent questions, by the way. Thank you for asking, and feel free to ask more.
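To make the MNIST thought experiment concrete, here is a tiny simulation: the max always returns class zero, while sampling from the (well-calibrated) 0.9/0.1 output distribution reproduces the data distribution.

```python
import random

p = {"class 0": 0.9, "class 11": 0.1}    # well-calibrated output distribution

argmax_pick = max(p, key=p.get)          # always "class 0", 100% of the time

counts = {k: 0 for k in p}
for _ in range(10000):                    # sampling matches the 90/10 split
    r = random.random()
    counts["class 0" if r < p["class 0"] else "class 11"] += 1
print(argmax_pick, counts)
```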
I'll share the slides with you, so even if I don't go through everything, you will have access to the material, and I prefer to make this interactive.
So, to conclude the review of machine translation at a pretty high level: we talked about training, and we talked about inference, how to generate from the machine translation system. The last part is evaluation. And I'll just say that, unless you do human evaluation, automatic evaluation is very simple. Typically you have a source sentence, of course, you have a human reference (a human translation), and you have a system hypothesis: the translation predicted by the machine translation system. What you want to do is match these two strings somehow. There are a lot of metrics, it's a huge literature, but the most common metric used today is called BLEU, and it's simply a geometric average of precision scores, where the p_n here are unigram, bigram, trigram, and four-gram precision scores. A unigram is a single word, a bigram is a sequence of two words, a trigram a sequence of three words, and a four-gram a sequence of four words. A precision score simply checks whether a certain n-gram that is present in the system translation is also present in the human reference; you count how many matches you have and compute the precision score from that. So it's essentially string matching at the level of n-grams. That's all you need to know. It is a number usually given between zero and one hundred (it should really be between zero and one, but it's reported between zero and one hundred), where one hundred is a perfect match and zero means you don't match a single n-gram. And of course, as we said before, for a given source sentence there are multiple plausible translations, so it is very unlikely that you get to one hundred. For, let's say, English-French, very good translation systems, on domains that match the training set, get to something like 50, because of this uncertainty.
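A simplified sketch of the metric as just described: a geometric mean of 1- through 4-gram precisions, scaled to 0-100. (The full BLEU metric also adds a brevity penalty and corpus-level counting, which the recap above glosses over.)

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # count n-grams of the hypothesis that also appear in the reference
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(1, sum(hyp_ngrams.values()))
        log_precisions.append(log(max(matches, 1e-9) / total))
    return 100.0 * exp(sum(log_precisions) / max_n)   # 0..100, 100 = perfect match

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```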
Okay, so this was the recap about machine translation, sequence-to-sequence modeling applied to the translation task. Now let's step back and think about the assumptions we made implicitly. The first assumption is that the languages under consideration are Italian and English, two European languages that have quite a bit in common. The second assumption is that we have a labeled dataset, a lot of parallel sentences. Why do we need a lot? Because a neural machine translation system has a lot of parameters: a standard textbook would tell you that you need roughly three data points to estimate each parameter, right? And here we are talking about models that have hundreds of millions of parameters, if not more. Do we have three or four times that many parallel sentences? Maybe not.
It turns out that there are six thousand languages in the world, and we, at least the community working on machine translation, are fairly English-focused and very focused on European languages. And it turns out that only five percent of the world population are native English speakers. In fact, the top languages account for only 50 percent of the population, meaning that the distribution of speakers across languages is very heavy-tailed: a lot of people speak languages that each have very few speakers overall, but the sum of these small populations makes up a big fraction of the world population. This is a problem, because if you look at machine translation engines like Google Translate, they typically only translate to and from English, and only for a hundred or so languages. That means there is a huge tail of the distribution of languages that we are currently not able to translate. Honestly, I think there is little hope for the languages in the far right of the tail: many of them are only spoken, they are not even written, and you barely find any digitized resources. But for the intermediate section, shown here in yellow, I think there is hope. For these languages, you may not have a parallel dataset, but perhaps you have at least raw text in each language, and the question is: can you use this unlabeled data to build a machine translation system?
And why is this a problem? You can see it here. The gray area is how much parallel data into English there is for a bunch of languages, ordered from the highest-resource to the lowest-resource, and the darker area is the performance of a current machine translation system. You can see that there is kind of a phase transition: below a certain amount of parallel data, the quality degrades a lot, and in this region the machine translation system is doing so poorly that it's not useful. So currently you need a lot of labeled data to train machine translation systems.
So let's think about how we would build, say, an English-Nepali machine translation system. Nepali is a language spoken in Nepal, a lovely country with 25 million people, so not very few people, right? And it turns out that the amount of parallel data is very, very small in this case; it's not what we were expecting. If you're interested in machine translation, or in general in corpora translated into multiple languages, for text classification or whatever purpose, you can go to this website, OPUS. It is a public repository of all the available parallel data, and it is very, very nice. If you enter English-Nepali, for instance, it gives you this list of datasets, which you can download for free. And if you look at the number of sentences, you realize that the sources with the largest number of sentences are JW300, which is Jehovah's Witnesses magazines, and then GNOME, Ubuntu, and KDE4, which are software manuals. So you see the problem, right?
So the problem is essentially that, (a), you don't have a lot of parallel sentences, and, (b), the parallel sentences you do have are in domains that are not super useful: they are either noisy, like WikiMatrix, which is automatically aligned and so not very high quality, or, if they are high quality, they are in very niche domains. Generally you're not interested in translating new sentences from, you know, an Ubuntu handbook, right?
So what does machine translation practice look like for the large majority of languages? It looks like this. Let's represent sentences with boxes: the blue boxes are sentences in English and the red boxes are the corresponding translations in Nepali. It turns out that a parallel dataset is an object that is a little more complicated, because it is composed of some sentences that originate in English and some that originate in Nepali. So now I'm going to show the human translations with empty boxes. In this case, the red empty box corresponds to the human translation of these sentences in English, which come from the Bible domain, while these sentences in Nepali originated in Nepali, come from parliamentary data, and were translated into English.
Now, you may agree with me that translating all those sentences from the Bible is not the most interesting thing to do. Maybe what you really want is to translate, let's say, news from English into Nepali. But the problem is that you don't have parallel data in the news domain. If you're lucky, what you have is monolingual data: just text, raw text, in a given language without the corresponding translation. So maybe you have news in English, from the BBC or whatnot, and you have news in Nepali from local news outlets over there, and eventually what you want to do is translate novel news data, your test set, from English to Nepali. How are you going to do that?
This was not obvious to me when we started looking at this problem. And it turns out that the problem is actually a bit more complicated, and by being more complicated it becomes, in a way, simpler to solve, in the sense that, besides having data for English and Nepali, you perhaps also have data in other languages. For instance, Hindi. Nepali and Hindi belong to the same family, and Hindi is much higher-resource than Nepali, so perhaps you can find a large parallel dataset between English and Hindi, although perhaps in yet another domain. And if you take this a step farther, you find that what you can collect is a dataset with multiple domains across the rows and multiple languages across the columns, and what you want to do is translate news data from English to Nepali. The question is: can you leverage this whole grid of datasets to improve the generalization of your machine translation system? This is a Mondrian-like learning setting, which I don't think you can find in your textbook, and one goal of my lecture is to tell you how to tackle it.
Any questions so far?
No, I don't see any questions here.
Excellent. Okay. So, the take-home message: I love doing modeling, I love coming up with new algorithms and new architectures, and oftentimes we think that's the coolest thing to do, right?
When we're given a task and a dataset, we spend 90 percent of the time figuring out a good architecture and a good learning algorithm. But in practice the picture is a little different. In practice: where do you get the data? Given a task, figuring out what a good dataset for solving it looks like is a challenging problem, and it takes a lot of time to come up with a good dataset. And if you don't have a lot of parallel data, a lot of labeled data in general, you need to come up with ways to fantasize data, or with auxiliary tasks. So I would say that, in my experience, there is a lot of science on the data side that is often neglected when we study machine learning. Moreover, after you get some data and come up with some model, a lot of effort also goes into analyzing what the model does and analyzing the properties of the data, and there are a lot of feedback loops: you go back from the model to the data, from the analysis to the model and the data, in order to clean up the data, extend the data, refine the model. Not to mention, when you go to deploy, there are other considerations: is my model fitting the computational budget I have? Then you need to go back to the model. Or: is the model performing well across the full input distribution? In which case you may need to go back to the data. So it's an iterative process, and I think it is important to understand the full picture. Oftentimes we focus on the model side because we are machine learning practitioners and we love that, but if you want to solve an application, you need to have the full picture in your mind. The goal of this lecture is not to talk much about architectures; in fact, I won't talk about architectures at all. I will talk about some algorithms, but we will also touch on data and analysis.
The second tip is that whenever you don't have a lot of labeled data, you can do two things. You can downscale the model and say, hey, I'm going to do small-scale learning, but usually that's not a good idea. Instead, I would encourage you to think about ways to gather more data and to come up with auxiliary tasks, which could be unsupervised or related supervised tasks; come up with ways to enlarge your dataset and do large-scale learning. Usually that is what generalizes much better.
So what is low-resource machine translation? Given that a model has on the order of 100 million parameters or more, for me low-resource machine translation is a machine translation task where the number of parallel sentences is less than 10,000. When you have so little parallel data, the performance, if you just use the parallel dataset, is very poor. And the challenges are, of course: as I said, where to source data to train on; coming up with high-quality evaluation datasets, because if you don't have a good way to measure performance, it's very difficult to make progress; and, related to that, challenges with the metrics. How do you do human evaluations? It's not easy to find people who speak low-resource languages, or people who speak them but are not fluent in the other language. And it's not easy to do automatic evaluation either.
For instance, if you take Burmese, it's not a language where the standard segmentation works, and so you need to come up with different ways to measure the error in your BLEU. Then there are of course modeling challenges: what is a good learning paradigm when you have so many datasets, from so many languages and so many domains? And because it is a large-scale learning problem, how do you scale up efficiently? In addition to this, there are the general machine translation challenges, some of which we discussed before, like exposure bias: the fact that we usually train for scoring but are eventually interested in generation. So there is the question: can you train for generation, since that's ultimately your task? How do you model uncertainty? What are better metrics than BLEU? If you have a computational budget, how do you train taking that into account? Or, how do you model the long tail of the distribution, meaning that not all errors matter the same? Let's say you are translating news, and the news is "The Coca-Cola stock price fell by 10%", and now you replace Coca-Cola with Pepsi. I'm just making up a silly example, but you can see that a simple word replacement can have a huge effect downstream, right? So how do you detect these errors, how do you measure them, and how do you fix them?
Okay, so what I plan to do today (this was the introduction, an hour-long introduction) is to walk you through the research cycle, at least the way I think about it. The research cycle, for me, goes around three pillars. One is data: you need to source data to train on, and you need to figure out how you are going to evaluate, and depending on what you want to explore you need to come up with different datasets; there is a whole science of data collection. After you get some data, you go to the model; of course, that's the thing we love doing, so you design an architecture and come up with an algorithm to fit the data and to generalize, whether to data from the same distribution or from a different one. After the modeling comes analysis. You can analyze several things. For instance, in this paper we were analyzing how well the model distribution fits the data distribution, so we were analyzing the model. In this other paper we were analyzing the metric. And in the paper I'm going to talk about today, I'm going to analyze the data. Depending on the issues that arise from this analysis, you may figure out that you need a new dataset that lets you explore what you want in a better way, and perhaps you then need a new model, and you keep iterating this process. So that's a bit of how work in this area goes, and I think it applies to other applications as well.
So, I'm going to highlight three words here, and let me start with the data part, unless there are questions, or unless you want to take a break.
Let's go. No breaks, no questions. All good here.
All right. Okay, so let me go back to my English-Nepali example. As I said, you go to this OPUS repository.
You find that this is the data you have available. Another way to draw it is with this table, where essentially you have some Bible data that originates in English and has been translated into Nepali, some GNOME and Ubuntu data that also originates in English and is translated into Nepali, and then a bunch of monolingual data, which is just raw text in each language, from Common Crawl, which is just, you know, the worldwide web: you can get the data and filter it for a specific language. You have Wikipedia, which is a subset of Common Crawl, and perhaps what you want to do is translate Wikipedia documents from English into Nepali. Now the question is: how are you going to evaluate? There is no such parallel dataset for Wikipedia, right? And it's a pretty generic domain, too.
So a couple of years back we started an effort to build an evaluation benchmark for low-resource languages, and we started with Nepali, Sinhala, Khmer, and Pashto. We took documents from Wikipedia in each language and translated them to and from English, so now we have a pretty good evaluation benchmark in these four languages. You may say: well, so what? Why would I care? Well, it turns out that, to the best of my knowledge, there is no clear documentation of how to even collect a machine translation dataset: what guidelines to give to translators and evaluators, and what the pipeline is. And there are a lot of interesting questions there.
These were not parallel texts, right? These were just articles?
Yeah, which we then translated to and from English, to build a parallel dataset for evaluation.
And so the question is, given that none of us speaks any of these languages, unfortunately, how do you make sure that whenever you send a sentence to a translator, the translation that comes back has good quality? It's not obvious. The way we did it was with this pipeline: after translating, we have a set of automatic checks, and there are several. For instance, we train a language model on each language and then check the perplexity of each translation: is it fluent enough, essentially, according to a language model? Another check is about transliteration: is the translator simply transliterating? Which means, if you go from English to Nepali, writing the English sentence in Nepali characters, without translating, just rendering the sounds in the characters of the other language. Another check is making sure that the translator is not simply copying from a machine translation engine like Bing or Google Translate, and you'd be surprised: that was the most common issue we had. That relates to: well, how much uncertainty is there in translating this sentence? Could it be that there are very few plausible translations, and it just so happens that the engine's translation matches? I'm not going to go into the details now, but there are very interesting scientific questions about how to measure the quality of these translations, and also about deciding when to stop. Here I have feedback loops where you say: okay, if the perplexity is too high, we go back and tell the translator they need to improve the translation; or if the human evaluation is not high-scoring enough, we go back for another round of translation. But when do you stop? When is a translation good enough?
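As a concrete illustration of the perplexity check in that pipeline, here is a minimal sketch; `lm_nll` is a hypothetical scorer wrapping the language model trained on the target language, and translations whose perplexity stands out from the batch get sent back for re-translation.

```python
import statistics
from math import exp

def flag_for_retranslation(translations, lm_nll, z_threshold=2.5):
    # lm_nll(sentence) -> average per-token negative log-likelihood in nats,
    # from a language model trained on the target language (hypothetical scorer).
    ppl = [exp(lm_nll(t)) for t in translations]       # per-sentence perplexity
    mu, sigma = statistics.mean(ppl), statistics.stdev(ppl)
    # Outliers relative to the batch distribution go back to the translator.
    return [t for t, p in zip(translations, ppl) if (p - mu) / sigma > z_threshold]
```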
So, eventually... I don't think I have a definitive answer to these questions. What we did was look at distributions: for instance, for language modeling, we looked at the distribution of perplexities and looked for outliers. Looking at the distribution of sentence scores in each language, and across languages, and looking for outliers: that's essentially the method we used.
Here are some examples of sentences from the Sinhala Wikipedia that were translated into English, and you can see, for instance, that the second sentence, even though we had all those checks and controls, is not super fluent, right? Another thing you may notice is that if you read these translations, and then you read sentences from the English Wikipedia, the topics are a little different. In fact, a lot of documents in the Sinhala Wikipedia are about religion, or about the history of that country. This is quite interesting, and we will see the effect of this difference in domains later in the lecture.
So this is a sustained effort. You can find this data publicly on this website, and in a month we are going to release an even bigger dataset with more than 100 languages. We are also organizing a competition at WMT, which is a yearly workshop on machine translation that runs a yearly competition; we are going to have a competition on more than 100 languages, and we are giving compute grants to qualifying participants. So if you're interested in machine translation, or work on machine translation, or have good ideas about how to solve machine translation for those languages, please take a look.
Okay, so the point of this part is that data collection is not trivial: there are a lot of tricky questions, and it's very important to look at the data to get an idea of the issues that are there. I'll come back to this later. The next part, unless you have questions on this one, is about modeling, and that's the meat of the lecture.
No, I think we are good for now.
Okay. So I'll start by explaining how standard machine learning algorithms can be applied to low-resource machine translation, then I'll go over some validation of these approaches, and then give some perspectives. Let me remind you: this is the kind of setting we have, where we have multiple languages and multiple domains, but eventually we want to maximize translation accuracy on a certain domain for a certain language pair. If we think about the kinds of data we have in machine translation, it is interesting to think about how they map to machine learning techniques. If you have just a parallel dataset, that's a labeled dataset, and we are talking about standard supervised learning. Now, you may have monolingual data, which is just text in a language without any translation; that corresponds to having, besides pairs of x and y, also x's alone.
So that's the typical semi-supervised learning setting. If you have only x's and only y's, maybe it is about doing unsupervised learning. When you have multiple language pairs, let's say you're interested in English to Nepali but you also have English-Hindi, then it's a little like multitask learning, where you're interested in one task and you add another classification head for a related task. If you do, say, Nepali-English and you also have Hindi-English data, that resembles multimodal learning, in the sense that you add another modality at the input for the same prediction task. And we also have multiple domains, and when you have multiple domains you naturally think about domain adaptation techniques. So it's going to be interesting to see which domain adaptation techniques are applicable to machine translation.
Let's start with the simple setting of supervised learning. In supervised learning you have a parallel dataset. Say you want to translate from English to Nepali: you have a bunch of sentences in English translated into Nepali, that's your training set, and you have a test set that you want to translate into Nepali. So the dataset is this set D of pairs (x, y): x is an English sentence, y is the corresponding Nepali translation. As I said before, you can train this by maximum likelihood; this is the usual cross-entropy loss, where essentially you are trying to maximize the log probability of y given x. One way to represent this is with this diagram, where you have a blue encoder, blue because it processes English sentences, and a red decoder, red because it produces Nepali translations. The decoder doesn't really produce a single prediction but a distribution over the space of y. And you have a human translator who has produced, given the x, the y reference, so you can compute your cross-entropy loss, backpropagate, and update the model parameters to train your machine.
Now, if the parallel dataset is very small, you need to regularize, and there are standard ways to do that. You can do dropout, where you inject noise into the hidden states: you zero out hidden states at random. You can do label smoothing. I don't know if label smoothing was explained in the class, but the idea is this: whenever you do a classification task, you usually have a one-hot target. You say, okay, the next word is "on", so the target probability is one for "on" and zero for all the other tokens in the vocabulary. With label smoothing, instead, you give up a little bit of probability for "on": instead of setting its probability to one, you set it to 0.9, and the remaining 0.1 you spread uniformly across all the other words in the vocabulary. This is good because it prevents the model from overfitting to the small dataset: you spread a little probability mass across all the other words in the dictionary. And remember that with the log loss you always need to regularize, otherwise the weights blow up to infinity.
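A minimal sketch of label smoothing as just described: 0.9 on the ground-truth word, the remaining 0.1 spread uniformly over the rest of the vocabulary.

```python
import torch

def smoothed_target(target_index, vocab_size, eps=0.1):
    """One-hot target softened: 1-eps on the ground-truth word,
    eps spread uniformly over all other words in the vocabulary."""
    t = torch.full((vocab_size,), eps / (vocab_size - 1))
    t[target_index] = 1.0 - eps
    return t

# Cross-entropy against the smoothed distribution instead of the one-hot one:
log_probs = torch.log_softmax(torch.randn(10000), dim=0)   # model output (random here)
loss = -(smoothed_target(42, 10000) * log_probs).sum()
```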
So that's simple, and this should be easy. Now, what happens if you also have, let's say, monolingual data in English? You have a bunch of sentences in English; how would you use that to improve the machine translation system? This is a question for the audience; I'll give you a minute. If in addition to these pairs of x's and y's I give you also a bunch of x's, how would you use this additional data to improve generalization? Alfredo, can they answer on the chat here?
They can answer on the chat, they can give you suggestions, or you can ask for yes-or-no answers as well.
Give me a minute, I don't know how to access the chat... okay, let me see.
There is also a question about transliteration that I didn't ask you before.
Okay, what was the question?
How do you tell good transliterations from bad ones in the automatic checks? Since some languages do use their characters for proper nouns from the English language; the Chinese version of "Harry Potter", for instance, is just a transliteration with Chinese characters.
Yeah. So first of all, transliteration doesn't mean word-by-word translation (I'm responding to the next question, sorry for that); it means that you're using the characters of one language to render the sound of the word in the foreign language, very much like the person from the first question was saying: it's like using Chinese characters to make the sound of the English word. Now, it's true that sometimes you do need to transliterate for good reasons, but the check we are doing is: are, let's say, 80 percent of the words in the sentence transliterated? If that's the case, then we flag the sentence for re-translation. And word-by-word translation is monitored by using the perplexity.
That's right, the language model would catch that.
Yeah, I mixed the two things, my bad.
No worries. Okay, so let me repeat the question I asked. We have a parallel set of x's and y's, and we can train our cross-entropy loss with that. Now, if I also give you an additional dataset of x's, how would you use it to improve generalization?
Okay, so Jeffrey says: if the x's come from the same article, then I can predict the next x as a basis for translation. That's a very good suggestion, although, in general, it's very rare that you have x's from the same document, so that's a very strong assumption. Someone says semi-supervised learning, and then Ganesh says we can do essentially what BERT does. Very good, very good. Anybody else? There are like 56 more people that haven't answered... Okay, fine. These are very good suggestions, and I'll go with one of them.
So, one way to leverage this additional dataset, call it M_s, monolingual data for the source side, a bunch of x's, is to try to model p(x). That's, I guess, the semi-supervised suggestion: in addition to p(y|x), you also try to model p(x). And one way to model p(x) is with a denoising autoencoder, which I'm pretty sure you covered in your course. With the denoising autoencoder, you take a sentence from this dataset M_s and you inject noise, for instance by dropping words and swapping words. So if the sentence is "the cat sat on the mat", it becomes, for instance, "the sat cat on the": you drop "mat", you swap "sat" and "cat", you drop a "the". You inject a little bit of noise, and then you try to reconstruct the clean input, again with a cross-entropy loss on the reconstruction. Why is this useful? Because this blue encoder can then be shared with the machine translation system, right? So you are going to improve the encoder by doing this, and you can either pre-train this way or train with both losses at the same time. One thing to be careful about is the amount of noise: if you inject too much noise, you destroy the whole input sentence and you're just training the decoder, just doing language modeling, without training the encoder, which defeats the purpose. If you don't inject enough noise, the encoder-decoder has a very easy time copying the input, and you're not going to learn anything useful. So there is a little bit of trickiness in how you set that knob. And this is the semi-supervised, BERT-style approach that was proposed a moment ago.
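A minimal sketch of the corruption step for this denoising objective; the drop and swap rates are exactly the "knob" just discussed.

```python
import random

def noise(tokens, p_drop=0.1, p_swap=0.1):
    """Corrupt a sentence: randomly drop words, then swap adjacent words."""
    kept = [t for t in tokens if random.random() > p_drop]
    for i in range(len(kept) - 1):
        if random.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return kept

clean = "the cat sat on the mat".split()
corrupted = noise(clean)        # e.g. ['the', 'sat', 'cat', 'on', 'the']
# Training pair for the denoising autoencoder: (corrupted -> clean)
```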
Another way to leverage unlabeled data on the source side is to do what is called self-training, or pseudo-labeling. This is an algorithm from the '90s, really very old, and the idea is a little crazy. It goes like this: you take a sentence from the monolingual dataset, you inject noise into it, you encode it and decode it, and you make a prediction. Notice that here you have the encoder and decoder of the machine translation system, the blue-red system that translates into Nepali, so you are producing a translation. But you don't have the ground-truth translation; you don't have humans here. So what you do is use a stale version of your model, of your machine translation system, to produce the desired output, and the input for producing it is the clean version of x. Essentially, you train the parameters by minimizing the sum of the standard cross-entropy loss on the labeled parallel data plus this loss on the monolingual dataset.
It's pretty crazy, right? So let me explain again how this works. You take the parallel dataset D and train your machine translation system p(y|x). Then you repeat the following two steps. The first step is to decode the monolingual dataset with your current p(y|x), so that you associate to each x_s a ȳ, your prediction of what the translation should be. Then you retrain p(y|x) on the union of the original parallel dataset and this fantasized, artificial parallel dataset A_s. You hopefully get a better model, and then you can repeat the process: re-decode and retrain, re-decode and retrain.
Now, people should ask: why is this going to work? There are two reasons. The first reason is that when you produce the desired output, you typically do beam search. So when you then try to match the prediction to this desired output, you are learning beam search. Usually beam search gives you an extra two or three BLEU points, and here you are trying to learn the beam search procedure. The second reason is that you inject noise on the input, but you don't inject noise when you produce the target. By injecting noise you are smoothing the output space a little, and the combination of these two things is pretty critical to making self-training work. In practice it can be pretty effective, as we shall see later.
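A minimal sketch of the self-training loop just described; `train` and `beam_decode` are hypothetical helpers (fit a model on sentence pairs; decode with beam search), and `noise` is the corruption function sketched earlier.

```python
def self_training(parallel, mono_src, rounds=3):
    """parallel: list of (x, y) pairs; mono_src: list of source sentences x
    (sentences as token lists). train and beam_decode are hypothetical helpers."""
    model = train(parallel)
    for _ in range(rounds):
        # Step 1: pseudo-label the monolingual source data with the *stale* model,
        # decoding from the clean x but training on the noised x.
        pseudo = [(noise(x), beam_decode(model, x)) for x in mono_src]
        # Step 2: retrain on real pairs plus (noised x, predicted y) pairs.
        model = train(parallel + pseudo)
    return model
```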
Are there any questions on this?
It looks pretty impressive.
Yeah, it's pretty crazy, right? And this has been very useful for speech recognition too: people in the past year or two have found it very beneficial there as well. In object detection, people have been using this approach in computer vision. It is very old; by no means did we invent it, we rediscovered it for machine translation.
You mentioned that people should be puzzled. Doesn't this reinforce the model's mistakes, in a sense?
Yes, that's what I was thinking too.
It depends how you do it; if you do it right, it may not. So, the idea again is that these ȳ's are not correct. However, you produce the ȳ's by doing beam search, and, assuming that beam search improves upon standard greedy decoding, when you train the model to predict the output of beam search, you are going to improve, because you are going to learn the beam search process. And the second reason, again, is that you inject noise on the input. By injecting noise you make sure you are smoothing things out. Imagine you are doing classification, for simplicity, with a single-token output: maybe before, you were overfitting and your prediction surface was very irregular; by injecting noise you are going to smooth that out, and that is going to help generalization. So in practice you are not reinforcing the mistakes, if you do it this way.
I do have a question. Last week we talked about... so this would be like the Viterbi, right? Beam search finding the lowest-energy path?
Yes.
We also talked about, instead of doing Viterbi, using the forward algorithm to have more than one correct solution. Is that also done in translation?
A little bit, yes. What happens is that for low-resource languages the model is typically pretty poor, so beam search is usually what we do; it's relatively efficient and improves performance compared to the baseline of greedy decoding, greedy meaning beam search with k equal to one. For high-resource languages, for which the model is very good, so that the distribution it produces at the output is well calibrated and fits the data pretty well, what you can do is top-k sampling. Then every time you see an x, you produce a different sample, which is a little similar to what you're saying, and that usually works better. But the model has to be good enough to do that.
I see, I see.
And then there is also a tradeoff on compute. Every time you do machine learning you have a budget; maybe you meet your PhD advisor once a week, that's what I used to do with Yann, and that gives you a time frame to work within, right? So there is a tradeoff between how much compute you spend on each example and how much data you see.
And then there is also a tradeoff with how much compute you spend. Every time you do machine learning you have a budget: maybe you meet your PhD advisor once a week (that's what I used to do with Yann), and that gives you a time frame to work in. So there is a tradeoff between how much compute you spend on each example and how much data you see. You can produce, say, 10 translations for every single input, and that costs you 10 times more than producing one; or, with one translation each, you can see 10 times more examples. It turns out that it is typically better to see more data than to spend more compute on each data point, provided you have enough data, and here you usually have a lot of unlabeled data.

Okay, makes sense.

Okay, so let's go to the next case, where instead of having a monolingual dataset on the source side, we have a monolingual dataset on the target side. The way you can approach this is by first training a reverse machine translation system, a backward model. We are interested in going from English to Nepali, and instead of training that, we train a Nepali-to-English machine translation system; that's the encoder-decoder you see here, which goes from y to x. So you take your sentence y from the target-side monolingual dataset and produce some translation x-bar. Now you use this x-bar as a noisy source for the machine translation system that goes from x to y, which is the one you want to train, and the prediction target is the y you had at the beginning. So the algorithm works as follows. First, you train a backward machine translation system that goes from y to x on the parallel dataset you're given. You use that to decode the sentences from the target-side monolingual dataset, producing an additional parallel dataset of pairs (x-bar, y). Again, these x-bars are not correct: they are produced by the backward machine translation system. Finally, you do what you actually want, which is to train your forward machine translation system mapping x to y on the union of the original parallel dataset with this additional fantasized parallel dataset D_T. This algorithm is called back-translation. If you zoom out, it looks like an autoencoder where the encoder is one machine translation system and the decoder is another; each of these is itself an encoder-decoder. Typically you do just one iteration, although you could iterate further, and when you train you never backpropagate through the backward machine translation system; you could, but it's very expensive and usually not worth doing. The idea here is that the prediction target is correct: you know it is correct because it comes from human-written sentences in Nepali. That means the decoder is going to improve a lot, because it predicts targets that are correct, not corrupted ones as in self-training. The issue is that there is some bias and noise in the source sentences, but usually this is a very good way to do data augmentation, and for the same amount of data, if the domains match, back-translation works much better than self-training. That's because the targets are correct.
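Again, to make the algorithm concrete, a minimal sketch with the same hypothetical `train_model` and `beam_search_decode` placeholders as before:

```python
# Minimal back-translation sketch.

def back_translation(parallel_data, mono_tgt, train_model, beam_search_decode):
    """parallel_data: list of (x, y) pairs; mono_tgt: list of target sentences y."""
    # 1) Train the backward system p(x|y) on the flipped parallel data.
    backward = train_model([(y, x) for (x, y) in parallel_data])
    # 2) Decode the target-side monolingual data into noisy sources x_bar.
    synthetic = [(beam_search_decode(backward, y), y) for y in mono_tgt]
    # 3) Train the forward system p(y|x) on real plus synthetic pairs;
    #    here the targets are human-written, only the sources are noisy.
    return train_model(parallel_data + synthetic)
```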
I see a lot of things going on.

No, no, I was clarifying that we call the other one the predictor, because we move from the x space to the hidden representation of the y space.

Okay, thank you. Okay, so if there are no questions, then I'll try to combine these two algorithms; this part is all about algorithms, and I hope this is okay. The idea here is that now you have your little parallel dataset, but you're also given a source-side and a target-side monolingual dataset. What you do is straightforward, because you're combining self-training with back-translation, and the algorithm is here on the right. You first train, on the little parallel dataset, a backward machine translation system that maps y into x and the forward machine translation system that maps x to y, which is the one you are ultimately interested in. Then you repeat an iterative process that alternates between two phases. One phase is the decoding phase, in which you use the backward machine translation system to decode the target-side monolingual dataset, producing a dataset of parallel sentences D_T, and you use the forward machine translation system to decode the source-side monolingual dataset, producing another parallel dataset D_S; both D_T and D_S are machine generated. In the other phase you retrain both the forward and the backward machine translation systems on the union of all these datasets, and you repeat the process. Again, you train by minimizing the cross-entropy loss, very much as if it were just a simple dataset. It's very simple and very effective.

Maybe, since this is a machine learning course, something to think about is how this resembles EM. You can think of this as some sort of hard EM: the sentence in English can be seen as a latent variable; it is latent because it is not observed, but it turns out to be, in this case, an English sentence. The decoding phase is like the E-step, where you infer the latent variables; by doing beam search you're doing hard EM. And the training phase, where you update the parameters, is the M-step. So that's another view of what's going on here, if you put everything together in a single probabilistic model. Any questions? I guess it is all clear, I hope.
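Here is one way the combined loop could look, as a sketch under the same assumptions (hypothetical `train_model` and `decode` helpers):

```python
# Minimal sketch of alternating self-training and back-translation.

def combined(parallel_data, mono_src, mono_tgt, train_model, decode, rounds=3):
    flipped = [(y, x) for (x, y) in parallel_data]
    fwd = train_model(parallel_data)   # forward system p(y|x)
    bwd = train_model(flipped)         # backward system p(x|y)
    for _ in range(rounds):
        # Decoding phase: fantasize two synthetic parallel datasets.
        d_t = [(decode(bwd, y), y) for y in mono_tgt]  # back-translation
        d_s = [(x, decode(fwd, x)) for x in mono_src]  # self-training
        # Training phase: retrain both directions on the union.
        fwd = train_model(parallel_data + d_t + d_s)
        bwd = train_model(flipped + [(y, x) for (x, y) in d_t + d_s])
    return fwd
```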
So then the next question is: how do we deal with multiple languages? Let's say that we have our parallel dataset for English and Nepali, but we also have some parallel dataset between English and Hindi, and perhaps also between Nepali and Hindi. So we have a lot of parallel data, presumably, between English and Hindi; we can build a machine translation system; and the question is how we can transfer knowledge between the English-Hindi machine translation system and the English-Nepali one. We work with neural nets, and that's beautiful, because it makes this transfer learning very easy: we can share all parameters except for one embedding that specifies the language. So we can have a single machine translation system, a single encoder and a single decoder, and again feed a source sentence at the input and produce a translation at the output. Which language? We specify it by prepending an additional token on the source side; this token specifies the language you want to translate your sentence into, so we have one language-ID token per language. This takes care of translating from English to Nepali, English to Hindi, all combinations, and you share all parameters except for this language embedding. That's super simple, I hope, and I think that's everything I wanted to say here. If you were to code this, you would use the simple cross-entropy loss, stack all your datasets together, and just remember to add the extra token on the source sentence that specifies the language you want to translate into.

How do we deal with domain adaptation? We said that oftentimes the training dataset is in a different domain from the test dataset. Now, if you have a small in-domain validation dataset, what you can do is apply some domain adaptation techniques. One very simple technique is fine-tuning, and you'd be surprised: I would say this takes you 95 percent of the way. We tried so many other things, but this one is super effective and super simple. Essentially, you train your system on domain A, then you do a few weight updates on the validation dataset, and then you deploy and you're good. Another thing you can do is add another token: if you know that a sentence comes from a dataset belonging to a certain domain, you can also prepend a token that specifies the domain, and that's called domain tagging. So that's another way to do domain adaptation, where you say: I give you the source sentence, I give you the token for the target language, and I also tell you that this is about news, so that the model has a way to factor out the topic from the translation.
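Mechanically, both the language-ID token and the domain tag are just extra tokens prepended to the source. A small sketch (the tag spellings here are invented for illustration):

```python
def tag_source(tokens: list[str], tgt_lang: str, domain: str | None = None):
    tags = [f"<2{tgt_lang}>"]           # e.g. "<2ne>": translate into Nepali
    if domain is not None:
        tags.append(f"<dom:{domain}>")  # e.g. "<dom:news>": domain tag
    return tags + tokens                # the model embeds tags like any other token

print(tag_source(["the", "cat", "sat"], "ne", "news"))
# ['<2ne>', '<dom:news>', 'the', 'cat', 'sat']
```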
Okay, so the basic idea is that there are several pretty basic machine learning approaches that can be used and combined for low-resource machine translation. At a very high level, it's all about coming up with ways to do data augmentation, and that turns out, in machine translation but also in computer vision and other domains, to be the most powerful way to improve generalization. The approaches I described are very generic, and they can be applied to other domains. There is one thing that is quite specific, and that's the fact that machine translation is roughly a symmetric task. It's not exactly true, because you can go, for instance, from a language that is not inflected to one that is inflected, and that makes things not quite symmetric; but roughly speaking it is symmetric, and that's why we can use back-translation. If instead you were to, say, observe images and classify each image into, say, 10 categories, then going from the category to the image is a very difficult task, because you need to learn a generative model, and that's perhaps harder than the classification task in the first place. In machine translation it's simpler, because going from x to y and from y to x has roughly the same complexity. That's the only task-specific property we use; other than that, there is nothing specific to the language pair. Whether we do English-Nepali or English-French, we use pretty much the same techniques, and that's the beauty of it: you let the model learn from data how to solve the task, and this allows us to translate 100 languages at the same time using the same toolbox. Are there questions? I don't see anything here.

Okay, so the conclusion so far is that there are several training paradigms that can be combined. How to combine them is actually very tricky, because sometimes even a little bit of domain mismatch may help generalization, since it acts a bit like noise, and whether you need to weight one technique more than another depends on the language pair, on the capacity of the model, and on how much data you have. So it gets pretty empirical in practice, and it takes some understanding to combine these different approaches. Unfortunately, I think there is quite a bit of work to be done to abstract out principles for how to combine these things, and I think the field right now is trying to figure out how to automate this process of combining the different algorithms.

So, unless there are questions, I'm going to go over some examples of this, starting from unsupervised machine translation. Let's say that you have only monolingual data, say in English and French.
That's not a very realistic use case, but for simplicity, let's say that you want to translate from English to French without any parallel data. What you could do is take a sentence from the target-side monolingual dataset; you have some machine translation system that goes from French to English; you feed in that French sentence and produce some sort of English translation. Initially it's going to be just random words, and you don't have a ground-truth reference. So what you can do is feed this translation to another machine translation system that goes from English to French. That's a little similar to the back-translation idea we saw before. And here, notice that I'm using color to indicate which language each block operates in: this decoder is blue because it operates in English, and this block is blue because it also operates in English, while red refers to French.

Now, if you just do this, it is not going to work at all, and the reason is that there is no constraint on the x-bar: there is no reason to believe that x-bar is going to be a valid English sentence. The same trick, cycle consistency, is also used in computer vision for style transfer, to turn zebras into horses and so on. In their case this little algorithm worked because they added a constraint on the x-bar: a discriminator made sure that whatever was produced was a valid item from the desired domain. In our case we cannot really add a discriminator, because we have a discrete sequence of tokens; it's difficult to backpropagate through, it's a little messy. It's not that you can't, but it is difficult to get it to work, at the very least. So one way around this is to make sure that the decoder produces valid English sentences. We can't guarantee that, but we can try to get it by adding a denoising autoencoder term to the loss function: we take a sentence from the source-side monolingual dataset, inject noise, and go through the encoder-decoder. If you do this, the decoder is going to learn a good language model; it's going to produce fluent English sentences. And now, when you plug this decoder in over here, it should also produce fluent English sentences. You do the same with the encoder and decoder in red that operate on French.

Now, if you do that, it's still not going to work, and the reason is that, while this may work, the blue decoder may only work when it is fed the output of the blue encoder that operates on English sentences. There is no reason to believe it will work when fed the output of the red encoder that reads French sentences, because there is no reason to believe there will be modularity, that you can swap these modules any way you want. It could be, and it turns out to be the case, that the model partitions the feature space in such a way that it works well when fed from the blue encoder but very poorly when fed from the red encoder. So what do we do about it? One way to fix this is to share all parameters of the encoder and the decoder, so that the feature space is shared no matter whether you feed in a French or an English sentence. So now we have only one encoder and one decoder, not a red encoder and a blue encoder; we specify the language with a language ID. With this parametrization and these two loss functions, we can get the machine translation system to learn even without a single parallel sentence, in some cases. And again, you see that we have been using the three building blocks I described before: iterated back-translation, denoising autoencoding, and multilingual training.
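Putting the two loss terms together, here is a rough sketch of one training step; the `model.loss`, `model.translate`, and `add_noise` interfaces are hypothetical, and in practice the translation step is done without backpropagating through it:

```python
# Sketch of the unsupervised-MT objective: denoising autoencoding in each
# language plus on-the-fly back-translation, with one shared encoder/decoder
# selected by a language-ID tag.

def unsupervised_mt_step(model, x_en, x_fr, add_noise):
    # Denoising autoencoder terms: reconstruct each clean sentence
    # from its noised version, staying within the same language.
    l_dae = (model.loss(add_noise(x_en), target=x_en, tgt_lang="en")
             + model.loss(add_noise(x_fr), target=x_fr, tgt_lang="fr"))
    # Back-translation terms: translate with the current model (treated
    # as fixed, no gradient), then learn to map the translation back.
    y_fr = model.translate(x_en, tgt_lang="fr")
    y_en = model.translate(x_fr, tgt_lang="en")
    l_bt = (model.loss(y_fr, target=x_en, tgt_lang="en")
            + model.loss(y_en, target=x_fr, tgt_lang="fr"))
    return l_dae + l_bt  # minimize the sum by stochastic gradient descent
```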
How does the model use this LID? Is it like a multiplicative interaction that changes the weights, like a hypernetwork?

No, no: you add one extra position to the input and you feed it in, just as if the sentence had one extra token. You embed that, and it goes into your transformer block and so on, like any other symbol in your sentence.

And this x is the one-hot encoding of the input, or something else?

This x is the sequence of tokens: it is "the cat sat on the mat" followed by the language ID. The language ID is just another token in the vocabulary; all of these are embedded and fed to the transformer block.

My question was whether these tokens you're feeding in are one-hot encoded or something else.

Yes, they are one-hot and then embedded; each is embedded and then fed to a transformer block. So for each language you're going to have a different one-hot encoding, and the LID token basically tells you which vocabulary to use. And oftentimes there is overlap between the vocabularies, particularly if you break words down into character n-grams, as we usually do, and this also helps with alignment. For English-French in particular, the vast majority of the tokens are shared, and so the English LID token just tells you: hey, I want to translate into English, as opposed to producing a French sentence.

I see. And if you're using Chinese or other languages that don't share the same n-grams?

In that case, and this goes a little bit into preprocessing: if you do English-Chinese translation, you have a lot of monolingual data in English and a lot of monolingual data in Chinese. You treat them as a single dataset, and you learn a way to compress the data using character n-grams. That's called byte-pair encoding, and it's a bit like Huffman coding: treat this big dataset as one long string, and ask how you can break it down into character n-grams so that you compress it the most. Of course, because it is English and Chinese, there is very little overlap, but that's okay: there will be some overlap on numbers and foreign words, and that's usually sufficient to learn a shared representation. What matters more in practice is really the domain of the source and the target data.
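For the curious, here is a toy sketch of how byte-pair encoding learns its merges: repeatedly fuse the most frequent adjacent symbol pair, so that frequent character n-grams become single tokens. This is a simplification of real BPE implementations:

```python
from collections import Counter

def merge(w: list[str], pair: tuple[str, str]) -> list[str]:
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    corpus = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        corpus = [merge(w, best) for w in corpus]
    return merges

print(learn_bpe(["lower", "lowest", "low"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```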
So how does this work? Since 2018 this has been improved by quite a lot, but here is the idea. Consider just the blue curves for now. The dashed blue line is what you get on this dataset, a standard machine translation benchmark, if you train the neural machine translation system in an unsupervised way: you get 25 BLEU, which means pretty fluent, pretty decent translations, actually. If you train in a supervised way, with the amount of parallel data on the x-axis, you get the solid blue curve. So it turns out that by using about 10 million monolingual sentences in each language and training in an unsupervised way, you get the same performance as with about 100,000 parallel sentences; it's like saying that each parallel sentence is worth about 100 monolingual sentences. That's pretty interesting, and this is the case where the languages are very similar and the domains match perfectly. If you take English-Chinese, which is much more distant, and if the domains of the two monolingual datasets are more different, this ratio increases by a lot. And that tells you why low-resource machine translation is actually very large-scale learning: you need to compensate for the lack of direct supervision by adding more and more data and by making the model bigger and bigger, so that it learns a lot of things, and among all of those things there will be something useful for the task you're interested in.

Okay, if there are no questions, I'll go on to the second example, which is Nepali-English, the running example throughout this lecture. In this case the evaluation dataset is from FLoRes, the collection effort I described before. For training we don't have any in-domain parallel data from Wikipedia; we have some out-of-domain data, such as the Bible and GNOME, and we have a bunch of monolingual sentences in each language. Because we don't have a lot of parallel data and because it is out of domain, if you train supervised you get pretty poor performance, 7.6 BLEU, so the output is barely understandable. If you do unsupervised learning, it doesn't work at all in this case, because the two monolingual datasets are from different domains and there is essentially no way to find correspondences. If you train using the parallel dataset plus the monolingual dataset, you do quite a bit better than supervised learning alone, and if you iterate this back-translation, you improve further. Now, if you also add English-Hindi data, you improve dramatically across the board; in particular, the unsupervised setting starts working. It is unsupervised in that you have no parallel data between Nepali and English, but supervised in that you have parallel data between English and Hindi. Because of that, and because Hindi and Nepali are similar, you can bootstrap this iterated back-translation process together with the denoising objective, and then unsupervised learning works for English-Nepali too.

Why use Nepali and Hindi here? Is it because the two languages are grammatically similar? Would this show good performance with, say, Nepali and Italian?

Yes, it is because they are similar.
That's why we picked them, yeah. It is very useful if you have a low-resource language and there is a related nearby language that is high-resource; that's super helpful. If you don't, it's the same story as before: you can also use English, Italian, other languages that are not so related. That usually helps, but it doesn't help as much, and for it to help as much you need to add so much data from those languages that it becomes super large-scale, because you need to compensate. This is pretty much what I'm saying here, and I'm making a deliberately silly argument, just to give you the idea: if, in the supervised setting, each datum gives you x bits of information to solve the task, and you have n examples and need a model of size y megabytes, then in the unsupervised setting each datum gives you only a fraction of that information, say a thousand times less. But if each datum gives you a thousand times less, that means you need roughly a thousand times more samples, and also a much bigger model. And to come back to your question about why Hindi: if you use something less related, or if the domains mismatch more, then you need to scale up even further, because the amount of information you get from each datum is even smaller. I hope this is clear. To me this is something I didn't know before I started working on this; it is maybe obvious in retrospect, but it is worth thinking about when you start working on an application.

Okay, I have 15 minutes left. I could go over the analysis part, unless there are questions. You're welcome. Okay, I feel sorry that you don't get a single break, but hopefully I'll leave you five minutes at the end; I hope you don't have another class right afterwards.

So let's talk a bit about analysis, and let's think about simulating the low-resource machine translation setup. Take French-to-English, which we are all very familiar with; take a dataset of Europarl, the European Parliament proceedings; extract 20,000 parallel sentences to simulate a low-resource setting; take some monolingual data, 100,000 sentences on the target side; and apply back-translation. We find that the BLEU score goes from 30 to almost 34. That's great, a very good improvement: you can publish at a top venue if you get an improvement of 0.5 BLEU, and here we got 3.4. So that's great, right? I was very excited when we did this, so we applied the very same thing to public posts in English and Burmese, and we got 0.1 BLEU. How come? We checked the optimization, we checked the initialization, we checked the training; nothing worked. So what's going on? You can already see it in the FLoRes dataset: we were saying that if you look at the data, these translations from Sinhala, compared to original sentences from English Wikipedia, cover topics that are pretty different from those in English Wikipedia. Is that the problem? Well, maybe. Look at English Wikipedia versus Chinese Wikipedia: we trained classifiers on documents randomly taken from the two Wikipedias, and the topic distributions of documents sampled from English Wikipedia and Chinese Wikipedia turn out to be quite different.
For instance, in English they talk much more about film, while in the Chinese Wikipedia there are many more documents about animals, for some reason. Even if you take two countries that speak roughly the same language, the US and the UK, and look at the same topic, say sports magazines, you find that sports fans in the US talk more about football and baseball, while in the UK it's more about soccer. So people in different places of the world talk about different things. The distribution of topics is different, for a variety of reasons. One is that people talk about what happens where they live: if there is a snowfall in your city, it's unlikely that there is a snowfall in Hong Kong, and people will talk about different things. There are cultural factors as well, and then, even for the same topic, there is a different distribution of words.

So, in machine learning, and also in machine translation, we usually consider a domain mismatch between the training and the test distribution: we assume that we train on some data and then test on a slightly different domain, and therefore we need to do domain adaptation; before, we talked about fine-tuning and domain tagging. But here we are talking about a different kind of domain mismatch: a source-target domain mismatch. We have some data that originates in the source language (written by people, with human translations) and belongs to a domain D_s, and we have some other data that originates in the target language and belongs to another domain, and these two domains don't match. Could this be a problem? I think it could, because now, if you use back-translation, even if you were to translate this target-side monolingual data perfectly, without any mistake, this data is out of domain; and if you're interested in translating from the source domain into the target language, this data is going to be much less useful. So the question I have for you is: how can we test this hypothesis? Am I fantasizing a problem, or is the problem real? How can we study it?

Let me tell you a little bit about this. First of all, we need a bit of an abstraction. One thing you may want to do is measure how much domain mismatch there is, because if you can measure it, you can quantify it, and you can really see whether there is such a problem. So let's say that, in an abstract concept space, there are two domains, and from these two domains we can sample sentences; this is like an interlingua. From these sentences, from the two different domains, we can produce realizations in each language: say, sports news in English, and, I don't know, politics news in Nepali. Then let's assume that we can do perfect human translations, with no mistakes. Now we have sentences in the same language, which we may be able to compare, and one very simple way to compare is to compute a TF-IDF matrix of this data: you count how many times each term appears, normalize a little bit, and then apply an SVD factorization. Now each sentence in each corpus is represented by a distribution over topics, and all you need to do is compare these two matrices.
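A minimal sketch of this pipeline using scikit-learn; the corpora and the number of topics `k` are placeholders (`k` must be smaller than the vocabulary size), and this is an illustration of the idea rather than the lecture's actual code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def topic_vectors(corpus_s: list[str], corpus_t: list[str], k: int = 16):
    """Return per-sentence topic mixtures for each corpus, plus the mean
    cross-corpus similarity (average dot product between the two sides)."""
    vec = TfidfVectorizer().fit(corpus_s + corpus_t)  # shared vocabulary
    svd = TruncatedSVD(n_components=k).fit(vec.transform(corpus_s + corpus_t))
    S = svd.transform(vec.transform(corpus_s))  # topic rows, source side
    T = svd.transform(vec.transform(corpus_t))  # topic rows, target side
    S = S / np.linalg.norm(S, axis=1, keepdims=True)  # unit-normalize rows
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return S, T, float((S @ T.T).mean())
```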
So: the matrix on the top comes from the source language, and the matrix at the bottom comes from the sentences in the target domain. If we can compare these topic distributions, that's all we need in order to say whether the two datasets are similar. One way to do this is to take one row here, corresponding to one sentence, compare it to every row of the dataset below, and average across sentences; essentially, you take dot products between the two matrices and average the score. That gives a similarity between dataset S and dataset T, and then we arrive at a score for the similarity of the two datasets by normalizing by the within-dataset similarities of the source and the target domain. If the two domains match perfectly, the cross-products behave like products of a vector with itself, which are roughly one, so you get one plus one divided by one plus one: the score is equal to one. If the two topic distributions are non-overlapping, totally orthogonal to each other, the cross-products are zero, so the numerator is zero, the denominator is two, and the score is zero. So the score goes from zero to one depending on the similarity of the two datasets.

Now, to verify whether this is a good scoring function, we built a controlled setting, because if you want to understand a problem, it's good to build a controlled setting in which the only thing that varies is how much domain mismatch there is. We took two very different datasets: one from Europarl, which is European Parliament proceedings, and one from OpenSubtitles, which is movie subtitles. We pretend that the Europarl data originates on the French side, and that the OpenSubtitles data originates on the English side; if we do this, the two domains essentially do not overlap very much. Then we define the target domain as a convex combination of these two datasets, with an alpha between zero and one. By varying alpha, the amount of data, both parallel and monolingual, stays the same, but we vary how much the target domain is in-domain with the source domain: if alpha equals zero, the two domains are very different; if alpha equals one, they match perfectly. So let's see how this scoring function, this STDM score, behaves as we vary alpha: the relationship is pretty linear, which is what we want, so the scoring function we came up with seems to work pretty well. Let's see how it behaves on real datasets. If you use a WMT dataset, which is highly curated, you have very mild STDM (source-target domain mismatch), except for Chinese-English; if you look at the data and the translations, you realize that, given how the dataset was constructed, the Chinese news are much more local, and that's why you see a drop in the STDM score. At the bottom here we have Facebook data, and the Facebook data for Nepali-English or Japanese-English has a much lower STDM score than, say, German-English, which is what you would expect. So we are good.
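Written out, the normalization just described plausibly amounts to the following (this is a reconstruction from the verbal description, not a formula shown on a slide):

$$
\mathrm{STDM}(S,T) \;=\; \frac{\mathrm{sim}(S,T)+\mathrm{sim}(T,S)}{\mathrm{sim}(S,S)+\mathrm{sim}(T,T)},
\qquad
\mathrm{sim}(A,B) \;=\; \frac{1}{|A|\,|B|}\sum_{i\in A}\sum_{j\in B}\mathbf{t}_i^{\top}\mathbf{t}_j,
$$

where $\mathbf{t}_i$ is the normalized topic vector of sentence $i$. Perfectly matching domains give a score near one; orthogonal topic distributions give zero on the numerator and roughly two on the denominator, hence a score of zero.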
And now, how is STDM affecting training? On the x-axis we have the value of alpha, where alpha equal to one means the target domain matches the source domain, and zero means the target domain is totally different from the source domain. Again, the amount of parallel data and the amount of monolingual data is the same; what changes is the domain. As you can see, as you make the target side more in-domain, performance improves. The dashed line is supervised learning, and the dash-dot line is back-translation: back-translation suffers a lot when there is a lot of domain mismatch, while self-training is much more robust, and if you combine the two you get the dotted line. So our hypothesis was correct: in this controlled setting we really see that back-translation suffers from this domain mismatch. However, if you increase the amount of monolingual data, which is what you see here on the x-axis, back-translation can catch up with self-training: with three times more monolingual data on the target side, you get the same performance as self-training.

Okay, so this is a little bit of what I wanted to tell you; there are also practical applications, but let me give you some time to ask questions, and let me recap before that. We talked about low-resource machine translation as a practical application where you don't have a lot of labeled data, and besides modeling there are two other important things, data and analysis, that shouldn't be disregarded; they all help each other. In terms of modeling, I think everything boils down to figuring out efficient ways to do data augmentation, and many of the techniques we described are applicable to other domains as well. The other take-home message is that whenever you train with little parallel data, it is really about large-scale training, because you need to compensate for the lack of direct supervision. So with that said, I'd be happy to take your questions, if you have any.

It looks like there are no questions. People must be exhausted.

No, no, I think it was very clear, and it was a very interesting topic. We really managed to follow along all the way; at least I managed to follow what you were saying.

Yeah, so I want to mention: feel free to email me if you have questions or if you need pointers. There is a lot of literature, with a rich history, behind some of these topics, and I'd be happy to follow up, or to discuss anything related to this.

Okay. The email is on the slide; should I give it to the students later on?

Yeah, you can give them my email.

I guess that's it. Okay, thank you for attending.

Yeah, it was really a pleasure having you with us today. Have a wonderful day.

You too. Bye. Thank you. Bye-bye.

Bye, Marc'Aurelio. Thank you again. Bye. Thank you.