Hi everyone, good evening, and thanks for coming to this talk. I am Suneel Marthi and this is Jörn Kottmann. We both work on OpenNLP and a bunch of other Apache projects, and we are members of the Apache Software Foundation. Today we will be talking about machine translation.

Before we start, a quick show of hands: how many of you actually have a use case that involves machine translation? And how many of you use neural networks, or neural machine translation, as opposed to statistical? Okay, great.

The agenda for today: we will talk about what machine translation is, what neural machine translation is and what its challenges are, and how you train your models offline in batch mode and then call them from an inference pipeline in real-time streaming mode, and we'll have a small demo. I do need some help and support from you for the demo, because the demo gods have not been good to me of late. How many of you here are native German speakers? Okay, three. I was hoping you could tweet in German with the "first time" hashtag and we could translate those tweets in real time. But anyway, it's okay; without the hashtag we'll just go with German tweets. Let me just warn you that German tweets mostly die down after 4 p.m. Berlin time, which is when everyone heads to the beer bars, and before that they are either racist or profane or in one of those other categories. With that, let me hand over to Jörn; he'll take us through the next few slides.

For this talk we're using Apache Flink, a distributed streaming pipeline framework, and Apache OpenNLP, which we use for pre-processing the text we feed to the machine translation system in the demo.

Let me start with: what is machine translation? We have a definition here. E is the string in the target language, the language I speak and would like to translate into. F is the string in the source language, maybe one I don't speak, Spanish for example. We want to maximize the probability of E given F, and the best translation from the model is the one with the highest probability (written out as an equation below).

So let's look at how we can translate one word. For a single word you would just go to a dictionary and look it up. Here we have the example of "Gebäude", which could translate to three words: building, house, or tower. Two of those words, building and house, are much more frequent than tower, but you would still need to look at the context to decide which word to use.

Which brings us to parallel text: a sentence in some source language together with its translation. Here is an example: "Das Gebäude ist hoch", and the English translation is "The building is high". You can see I have four words in German which more or less directly translate to English, and with this sentence I would probably get pretty far with just a dictionary. But there are more complicated cases, for example this one: "Das ist ein Hochhaus". Here "Hochhaus" translates to three words in the target language, and my model needs to handle that, because the sequence length is no longer the same; I have more words in the output. In other cases there can be swaps: a word I translate might end up at the beginning of the sentence, or somewhere else than where it was in the original sentence.
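To write out the definition above (this is a reconstruction in standard notation; the exact formula from the slide is not in the transcript): the best translation is the target sentence with the highest probability given the source sentence, and a neural model produces that probability one word at a time.

```latex
% Best translation: the target-language sentence E with the
% highest probability given the source-language sentence F.
\hat{E} = \operatorname*{arg\,max}_{E} \; P(E \mid F)

% A neural MT decoder factorizes this probability word by word,
% each word conditioned on the source sentence and on the words
% already produced:
P(E \mid F) = \prod_{t=1}^{|E|} P(e_t \mid e_1, \dots, e_{t-1}, F)
```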
So now we come to the neural machine translation part. There is this very nice slide about the encoder-decoder architecture, which is what is used most these days. It is a bit of a simplification; we will extend it in the next slides and add more.

On the left we have the encoder part, on the right the decoder. What we are doing here is inputting one sentence, "he loved to eat", and we hope to get the translation out. On the left we have the embedding layer. We take the word "he" and push it into the model. The first thing we have to deal with is that we can't really use the word itself, so we go to a dictionary, and this dictionary gives back a vector representation of the word; we have some slides about this later on. So we have a vector for the word "he", and this gets pushed into the encoder, and the process is repeated for every word. Here it is just written out, but it could be more steps or fewer, depending on how many words you have.

At the end of this process, the encoding phase, we end up with this s in the middle. This s is a vector representation of the sentence. Basically, imagine somebody tells you a sentence and you have to remember it; that is roughly what is inside this s. This meaning of the sentence, now captured in a vector, is then handed over to the decoding part of the system.

The decoder has two inputs. The first is the s; the second input, at the bottom, is the start token. Now decoding begins: I would like to decode the first word of my translation, which in this case would be "er". Then I just repeat this: I again give it the s as input, and I also feed in the first word that was just produced, "er". This process is repeated until the entire sequence is decoded and I have my translation.

To make this work better, it is usually done with something called attention. Attention means that on the decoding side I can look back at the source sentence. It's like when you have a sentence written down and you can always go back and see what's written there while you're writing your translation, so you don't really have to remember everything. With attention, the decoder can look back at exactly the part it currently wants to translate. Say the sentence is "I am a student": when I get to "student" at the end, I can look at the word for student in the input sentence and see how it should be translated.

Of course this is much more complex in practice, and these days people usually use something called a transformer model, which has many, many layers of attention, and at the end a softmax which outputs the word. This comes from the paper "Attention Is All You Need"; if you want to learn more, have a look at that paper, though it probably takes a lot of time to understand in detail. The upper function on the slide is the softmax, which is used to calculate the probability of the output words, and the second function is the cross-entropy, the loss function used for training.
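To make those two functions concrete, here is a small NumPy sketch (not the code behind the demo model; shapes and values are illustrative assumptions) of the softmax, the scaled dot-product attention it is used in, and the cross-entropy loss for one output word.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability, then normalize
    # so the values sum to 1 and can be read as probabilities.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (tgt_len, d), K and V: (src_len, d).
    # Each decoder position scores every source position, turns
    # the scores into weights with a softmax, and takes a
    # weighted average of the source values.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)   # attention weights
    return weights @ V, weights

def cross_entropy(probs, target_index):
    # Loss for a single output word: negative log probability
    # the model assigned to the correct word.
    return -np.log(probs[target_index] + 1e-12)

# Tiny example with made-up numbers.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 target positions
K = rng.normal(size=(4, 8))   # 4 source positions
V = rng.normal(size=(4, 8))
context, attn = scaled_dot_product_attention(Q, K, V)

vocab_logits = rng.normal(size=10)   # scores over a 10-word vocabulary
probs = softmax(vocab_logits)
loss = cross_entropy(probs, target_index=3)
print(attn.shape, probs.sum(), loss)
```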
The loss matters because during training you initialize everything randomly and then you want to figure out the parameters that can actually be used for translation. You run the computation forward once, you get something out, and the loss function tells you how you should change your parameters. This process usually repeats millions and millions of times until you have weights that actually give you the translation. So training can take a long time, and it also depends on how much data you are training on.

Why do we do NMT these days? Because it just works much better than what we had before. Around 2015 NMT started to work, by 2016 it was already better than SMT, the statistical machine translation we had before, and since 2017 nobody is doing SMT anymore. With that I'm handing back to Suneel, and he will go through some samples.

Before we go through the samples, a quick note on the previous slide. What happened in this graph between 2015 and 2016? In 2016 the orange bar is way above the blue bar. That's when Google Translate switched from traditional statistical machine translation to neural machine translation using the attention mechanism. The advantage of attention is that translation happens based on context: you're not just looking at, say, the three or four words before the word you're translating, you're looking at the complete context, roughly three sentences ahead and three sentences behind, to come up with a better translation.

On that note, sometimes the machine translation goes completely wrong and you see something very frivolous and, you know, fun come out of it. Some examples here. Those of you who speak German, what do you think of this one: is it good, or nearly good? Okay, next example. Even better if you have been using Twitter in the browser: can anyone tell me the difference between the original and the translation here? The original is not Dutch; it's a mathematical expression, and when you put spaces between the plus signs the system decides it's Dutch and "translates" it. It becomes Dutch, right? Of course not. That's Microsoft Bing, which is what Twitter uses for translation.

So how do we avoid those kinds of problems with machine translation? The first thing to look at is what kind of input machine translation expects. Since we are dealing with neural machine translation, with neural networks, the input to most neural networks is a vector, and a vector is an array of numbers. And since we are dealing with sequence-to-sequence models, the training data is a parallel corpus: two aligned sets of text, parallel text. Now the challenge is how you represent a word from the text as a vector, how you turn it into a feature vector.

The most common way of doing that today is something called an embedding layer; you create an embedding layer as your input layer. Say my input vocabulary is dog, cat, giraffe, fox, bird, and so on. For each of them I start with a random initialization: "dog" is a vector of numbers, and the same for "cat" and the rest. Then I run it through something called word2vec.
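A minimal sketch of that starting point, assuming a tiny vocabulary and an arbitrary embedding size: one randomly initialized vector per word, looked up from a table. In practice these vectors are then learned, either with word2vec or directly as the translation network's embedding layer.

```python
import numpy as np

# Toy vocabulary from the slide; real systems have tens of
# thousands of entries (or BPE subword units, see below).
vocab = ["dog", "cat", "giraffe", "fox", "bird"]
dim = 8  # embedding size, chosen arbitrarily here

rng = np.random.default_rng(42)
# One randomly initialized vector per word. Training (word2vec or
# backpropagation through the translation model) would move these
# vectors to more useful positions.
embeddings = {word: rng.normal(size=dim) for word in vocab}

def embed(word):
    # Look a word up in the embedding table; unknown words are a
    # problem we deal with later via byte pair encoding.
    if word not in embeddings:
        raise KeyError(f"'{word}' is not in the vocabulary")
    return embeddings[word]

print(embed("dog"))  # the current (random) vector for "dog"
```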
Once it has gone through word2vec it learns real values and comes up with much better vectors than the random ones I started with.

One big challenge in building machine translation models, especially for languages like German with a huge vocabulary and extremely long compound words that are impossible to pronounce, is how you deal with unseen vocabulary, because the text you get for training has a very limited vocabulary. For example, the demo I'll be showing you is a German-to-English neural machine translation model, and it was trained with only 30,000 German words, but I found that it can handle the rest of the unseen words. So how do you handle the unseen vocabulary? If my training text has only 30,000 words, what do I do about the rest of the vocabulary, how do I account for that?

This was one of the challenges solved by something called byte pair encoding, or BPE, and there's a paper about it from the University of Edinburgh. The way byte pair encoding works: if this is my input text, "positional", "additional", "contextual", I take the most frequently occurring pair of consecutive symbols and replace it with a new symbol. In this example I see that "ti" occurs most often, so I can replace "ti" with an x. Then I can go down that path recursively: I replace the next most frequent pair with something else, and my input keeps shrinking, from this to this. I can keep going recursively, so what you had at the start finally becomes this. It's kind of like model compression: your eventual model size is going to be much smaller than it would have been without byte pair encoding, and the model learns the translations on these units and can decode them back.

The other challenge is that when you're training deep learning models you're dealing with a cluster of GPUs, and your input sentences come in different lengths. One sentence is very short, only five words, and the next sentence has twenty words. If your input is not sorted up front, you are not optimally utilizing the GPU cluster, and you end up dealing with so-called jagged tensors. This is how it looks: if the longest sequence in the batch is, say, 17 tokens, but the first one has only 14, the next one fewer again, and so on, you can see it's not sorted. So it makes sense to pre-sort your input by length before sending it to the GPUs for training. This is how you would do that, and then you can break it up into chunks and send each chunk to a different GPU on the cluster for training, as different batches.

So that is how you do the model training. The things to take away here: one, create an embedding layer; two, use byte pair encoding to account for unseen words; three, sort your inputs to avoid jagged tensors and optimally utilize your GPUs. (Small sketches of the BPE merges and of the length sorting follow below.)
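Here is a small sketch of the BPE merge loop described above, in the spirit of the Sennrich et al. reference implementation; the word list, frequencies, and number of merges are made up for illustration, and a real model would run tens of thousands of merges.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count how often each pair of adjacent symbols occurs,
    # weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen symbol pair with a
    # single merged symbol, respecting symbol boundaries.
    pattern = r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)"
    replacement = "".join(pair)
    return {re.sub(pattern, replacement, word): freq
            for word, freq in vocab.items()}

# Words stored as space-separated symbols; frequencies are made up.
vocab = {
    "p o s i t i o n a l": 5,
    "a d d i t i o n a l": 6,
    "c o n t e x t u a l": 3,
}

num_merges = 10  # toy value; real systems use tens of thousands of merges
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)

print(vocab)
```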
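And a minimal sketch of the length sorting, assuming sentences are already tokenized: sort by length, cut into fixed-size batches, and pad each batch only up to the length of its longest member, so the "jagged" padding stays small. The batch size and example sentences are arbitrary.

```python
# Token counts vary a lot from sentence to sentence; sorting by
# length before batching keeps similar-length sequences together.
sentences = [
    ["das", "gebäude", "ist", "hoch"],
    ["das", "ist", "ein", "hochhaus"],
    ["er", "liebte", "es", "zu", "essen", "und", "zu", "trinken"],
    ["hallo"],
    ["guten", "morgen", "berlin"],
]

batch_size = 2  # per-GPU batch size, chosen arbitrarily here

# Sort by length so padding within each batch is minimal.
by_length = sorted(sentences, key=len)

# Chop into batches; each batch is padded to the length of its
# longest member and could be sent to a different GPU.
batches = [by_length[i:i + batch_size]
           for i in range(0, len(by_length), batch_size)]

for batch in batches:
    max_len = max(len(s) for s in batch)
    padded = [s + ["<pad>"] * (max_len - len(s)) for s in batch]
    print(padded)
```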
So you have trained your attention-based machine translation model offline in batch mode. Now how do you deploy it in a streaming pipeline for real-time inference? If I'm Amazon and I want to translate my content from German to English, or English to French, how do I do that in real time? That's the next part of the talk: streaming pipelines.

For this particular demo we used Flink; since we had the talk about Beam this morning, you could just as well use Beam with the Flink runner. The normal steps in any natural language processing pipeline are: you do language detection first, then sentence detection, so you break up your input into sentences based on the language; then you tokenize each sentence into individual words; and then you run your byte pair encoding. The model in this case I've deployed on Amazon SageMaker, but you could just as well run it locally behind an RPC server, using a model server.

Here is the complete Flink pipeline for the demo, and this is how the inference pipeline looks. I'll be using Twitter as the input source, running it through a couple of OpenNLP components for sentence detection for German, and I'm using an RPC client for running the inference. Here is where I need some help from you: if the German speakers can start tweeting, I would really appreciate it; if not, I'll just pull the German tweet feed as is. Okay, ready? Let me uncomment my German source. So it's starting up a Flink cluster and I'm running this as a Flink job; I'm putting the job on the cluster, and it's going to make a REST API call to SageMaker to get the model inference. Let's wait for the tweets to come. "First time" hashtag German tweets, please: first time 2019 or first time 19, either one. If that doesn't work, that's fine, I can just pull the normal German tweet feed.

While that's running, any questions you would like me to answer? Okay, I'm not having any luck with the hashtag in German, so let me just comment that out and run the normal feed. Okay, that's not a "first time" tweet, but anyway, it's a German tweet. The first line is the actual German tweet and the second one is the translation into English. So what do you think about the translation, is it good? I think it's pretty good compared to Google Translate; it's as good as Google Translate, if not better. "I'm critical of Poland when you start to shut down Windows, namely." Okay, yeah, anyway. Some of them are abusive and racist, so let's ignore those.

Okay, so, any questions? Here are some links, including the attention paper from Google, "Attention Is All You Need", and we'll update the slides. Questions, please.

Yes, we use byte pair encoding. With byte pair encoding you take successive symbol pairs and replace them with a common symbol. The advantage is that your model size is much smaller; it's kind of like data compression, so it's a smart technique.

Okay, so one thing I did not mention is that this model was trained using the WMT (Workshop on Machine Translation) corpus, the Europarl corpus, and it's all open source. I used only 30,000 German words in the actual training of the model, but nevertheless it works fine for all kinds of tweets that come through. For technical material I would say the same: if you train it on Wikipedia you should be fine too, whether it's Wikipedia or any technical text. If you're using byte pair encoding, I think you pretty much cover most of it; you can generalize your model to cover technical text as well as tweets as well as news.

We're out of time, but thank you. I can answer this one.
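Going back to the inference pipeline described earlier in this section, here is a minimal, plain-Python sketch of the per-tweet steps (language detection, sentence detection, tokenization, BPE, then a REST call to the translation model). It is not the Flink job from the demo; the helper functions, the endpoint URL, and the request/response format are placeholder assumptions, since a real SageMaker endpoint's payload depends on how the model was packaged.

```python
import requests

# Placeholder endpoint; a real deployment would call the SageMaker
# runtime (or any other model server) with its own URL and auth.
TRANSLATE_URL = "https://example.com/invocations"

def detect_language(text):
    # Stand-in for OpenNLP's language detector.
    return "deu" if any(w in text.lower() for w in ("der", "die", "das")) else "unk"

def split_sentences(text):
    # Stand-in for OpenNLP's German sentence detector.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Stand-in for OpenNLP's tokenizer.
    return sentence.split()

def apply_bpe(tokens):
    # Stand-in for applying the learned BPE merges to each token.
    return tokens

def translate_tweet(tweet):
    if detect_language(tweet) != "deu":
        return None
    translations = []
    for sentence in split_sentences(tweet):
        subwords = apply_bpe(tokenize(sentence))
        # REST call to the deployed model, mirroring the demo's call
        # from the streaming job to SageMaker.
        resp = requests.post(TRANSLATE_URL, json={"source": " ".join(subwords)})
        resp.raise_for_status()
        translations.append(resp.json()["translation"])
    return " ".join(translations)

if __name__ == "__main__":
    print(translate_tweet("Das Gebäude ist hoch."))
```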
So how do you translate long German words, right, when you don't really have them in your dictionary? Is that the question? Okay. The way this works is that BPE compresses the word into smaller units, and the model knows how to translate these little parts of the word. So it doesn't really look at the entire word; it looks at the new tokens that come out of the BPE process, and that's why this works. Any more questions? Okay.