Okay, so what is that — I think it's about 2:45. This is the optional session, so thank you for coming. My name is Jaya Mathew. I'm a data scientist with Microsoft, and we've been working on a number of different customer use cases, primarily with external customers. Based on that experience I thought I'd give a short talk on how to quickly add multi-language support to any of your AI applications.

Let's quickly look at the agenda. Before we get into how it's done, it's useful to spend one slide introducing the topic of machine translation: why we need it, a short description of its evolution, and its applications. Then we'll focus on neural machine translation, which is where most of the work and research is happening right now. To piece all of this together, I'll show a sample use case in which speech in one language gets translated to speech in a different language, and how the different APIs can be chained to do that end to end. I have some useful links towards the end — lecture videos, papers, Microsoft tools, documentation — so you have something to follow up on if you're interested. I hope to have a couple of minutes at the end for questions and answers; otherwise I'll be in the room if there are any questions.

So let's get started: what is machine translation? Most of us have used either Google Translate or Bing Translator, and that's one of the classic use cases. Here's a screenshot where I type the word "hello"; it auto-detects that "hello" is English, and I want to figure out how to say it in Hindi. It gives me the Hindi in Devanagari script as well as a transliteration, which tells a user who cannot read the script how to actually say it. That's what machine translation is doing: it's the task of automatically converting text from a source language to a target language.

Now, why do we really need it? If you look at the current landscape, most content is generated in a single language. If you want to cater to an audience much larger than the speakers of that one language, you need to somehow translate it from one language to another. Doing that manually is extremely costly, so machine translation is primarily there to augment human productivity: you can't have people convert every single tweet or Facebook post from one language to another, so the machine has to do it. It's more scalable and it drives down the cost. The performance of machine translation systems has yet to reach human parity — there are still issues — but it works at a much more rapid clip.

Okay, now let's quickly look at the evolution; it's interesting to see how the field has changed over the years. The first wave was rule-based machine translation (RBMT), where expert linguists who understood both languages wrote a set of rules specifying how to translate from one language to another. This approach is cumbersome: it takes a lot of work and effort from linguists who are experts in the languages to spell out how to go from one language to the other.
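To make the rule-based idea a bit more concrete, here is a toy sketch of what a dictionary-plus-rules translator looks like; the tiny English-to-Spanish word list and the single reordering rule are invented for illustration and bear no resemblance to a production RBMT system.

```python
# Toy illustration of rule-based machine translation (RBMT):
# a hand-written lexicon plus a hand-written reordering rule.
# The vocabulary and the rule below are invented for illustration only.

LEXICON = {  # English -> Spanish entries written by a "linguist"
    "the": "el",
    "red": "rojo",
    "car": "coche",
}

def translate_rbmt(sentence: str) -> str:
    words = sentence.lower().split()
    # Rule 1: look every word up in the bilingual lexicon.
    out = [LEXICON.get(w, f"<{w}?>") for w in words]
    # Rule 2: in Spanish the adjective usually follows the noun,
    # so swap the adjective/noun pair the lexicon knows about.
    for i in range(len(words) - 1):
        if words[i] == "red" and words[i + 1] == "car":
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

print(translate_rbmt("the red car"))    # -> "el coche rojo"
print(translate_rbmt("the red house"))  # unknown word shows how quickly rules break down
```

Even this tiny example shows why the approach doesn't scale: every new word and every word-order difference needs another hand-written rule.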
Then came the next wave, statistical machine translation (SMT), where you try to translate text from one language to another given a large number of examples. You have parallel datasets — say, data in English and the same data in Hindi — and you let a statistical model work out the rules on its own instead of having a linguist do it. Within that you have word-based, syntax-based, and phrase-based models. This was effective at the time, in the 1990s, but eventually the performance of SMT models started to plateau, and that's when a lot of work, especially by Google and other companies, moved into neural machine translation. Computing became cheaper, it became easier to run deep learning models like RNNs and LSTMs, and it was simpler to have one big neural network do the entire translation from source to target.

Over the next couple of slides we'll quickly look at what statistical machine translation is. As I mentioned, it's word-based, phrase-based, or syntax-based translation, and there is an alignment step that has to be done, because often a word doesn't map to a single word in the other language — it might map to multiple words. Then there's neural machine translation, which, as I said, models the entire process end to end: you just give it the data, and one big artificial neural network — typically an end-to-end encoder-decoder built from RNN, GRU, or LSTM units — does the whole job.

Before we get into neural machine translation, which is the focus of this talk, I thought that in addition to the Bing Translator example I'd outline a few more examples. These are some of the APIs that are readily available, so we don't have to reinvent the wheel to translate from one language to another. There are APIs that automatically detect what the input language is; that translate a word or sentence into one language, or into multiple languages at once; that do transliteration — say you want text rendered in Chinese but you don't know how to read the Chinese script, so the transliterated output lets you say it; and that act as a bilingual dictionary, with examples of human-translated sentences and synonyms for some of the words. These are a few of the applications we support; I'm sure there are many others.
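As a rough illustration of how these APIs are typically called, here is a short Python sketch against the Microsoft Translator Text REST endpoint; the endpoint, parameters, and response fields follow the public v3.0 documentation as best I recall, and the subscription key, region, and text are placeholders, so treat it as a sketch rather than production code.

```python
# Sketch: translating "hello" to Hindi with the Microsoft Translator Text API (v3.0).
# The key/region values are placeholders; the endpoint and parameters are from the
# public documentation as best I recall and may need adjusting for your account.
import requests

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
params = {
    "api-version": "3.0",
    "to": "hi",          # target language: Hindi
    "toScript": "Latn",  # also return a Latin-script transliteration
}
headers = {
    "Ocp-Apim-Subscription-Key": "<your-translator-key>",       # placeholder
    "Ocp-Apim-Subscription-Region": "<your-resource-region>",   # placeholder; not needed for all resource types
    "Content-Type": "application/json",
}
body = [{"Text": "hello"}]  # source language is auto-detected when "from" is omitted

response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
result = response.json()[0]

print(result["detectedLanguage"]["language"])                  # e.g. "en"
print(result["translations"][0]["text"])                       # Devanagari output
print(result["translations"][0]["transliteration"]["text"])    # romanized output
```

The detect, transliterate, and dictionary-lookup operations mentioned above live on sibling routes of the same service, so the calling pattern is essentially the same.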
Now let's get into neural machine translation, because that's what we use and what I think most companies use right now. This is just an overview slide — we'll do a deeper dive on the next one — but at a high level neural machine translation is essentially two steps: encoding and decoding. In the encoding phase the model represents each word based on its context within the full sentence, producing a large vector. The decoding phase then takes that vector and translates it, word by word and in context, into the target language.

That looks quite simple, but in reality it isn't, so let's look at how it works in practice. Assume we have an English sentence that we want to translate into French, Italian, or any other language. In the first layer, each word is modeled in context: a neural network creates, say, a 500-dimensional vector for each word, and then a 1,000-dimensional vector that also captures the context of the word. This is done for every word in the sentence — here it's not a very long sentence, maybe about 10 words. You then repeat this process: it isn't just one encoder and one decoder, the layers are stacked, so you repeat it several times to capture the context much better, and you end up with a final input matrix. That input matrix is what goes into the decoder step.

If you simply have one encoder and one decoder, though, the model struggles with very long sentences — it can't hold on to the context and forgets some of the information from earlier in the sentence. So nowadays the attention mechanism is quite popular: an attention layer sits between the encoder and the decoder and acts as a kind of memory of what happened earlier. The information from this attention layer, along with the final input matrix, goes into the decoder. Compared to the previous slide this looks much messier in terms of how many layers there are, but the decoder essentially takes those 1,000-dimensional representations, uses the attention mechanism, and translates one word at a time based on context. That's how the whole NMT system works.

I alluded to why we don't simply use a plain encoder-decoder. With a vanilla sequence-to-sequence model, especially on very long sentences — typically longer than about 30 words — the results tend to be poor, because the decoder has to work from a fixed-dimensional representation. That's where the attention layer comes in: it lets the earlier parts of the sentence influence the prediction of the next word. The next question is, given attention, which part of the encoded vectors should the decoder focus on? You build a context vector as a weighted average over the encoder states to decide that. These are just two of the issues; there are others, but I picked these two.
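To make the "weighted average" idea concrete, here is a minimal NumPy sketch of dot-product attention over a handful of encoder states; the dimensions and values are toy numbers, and real NMT systems add learned projections and many other details on top of this.

```python
# Minimal sketch of attention as a weighted average of encoder states.
# Dimensions and values are toy numbers, not a real trained model.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(10, 1000))  # one 1000-dim vector per source word
decoder_state = rng.normal(size=(1000,))      # current decoder hidden state

# Score each source position against the decoder state (dot-product attention),
# turn the scores into weights, and take the weighted average of encoder states.
scores = encoder_states @ decoder_state        # shape (10,)
weights = softmax(scores)                      # attention weights, sum to 1
context = weights @ encoder_states             # 1000-dim context vector

print(weights.round(3))  # which source words the decoder "attends" to
print(context.shape)     # (1000,) — fed to the decoder along with its own state
```

In a trained model the scores come from learned parameters rather than raw dot products, but the weighted-average structure is the same.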
To wrap up the NMT piece: why is neural machine translation so much more popular than statistical or rule-based machine translation? First, it's simply end-to-end training — all the parameters are optimized simultaneously to minimize the loss of the entire network, so it's one step and you don't have to maintain multiple small models. Then there are distributed representations, so word and phrase similarities are shared and the context of a word in a sentence is exploited better. It also produces more fluent text. It has shown very good results of late, so we tend to use it most of the time now.

Now let's actually see how this works in a speech translation use case. Here's a speech translation tool: a client app captures some voice, the audio is sent to a web API, the web API does its work, and it sends back the translation, either as text or as voice. Let's look at how that works step by step. The audio comes in, and the first thing an end-to-end tool does is automatic speech recognition, where the speech gets transcribed. Assume the speech is "can can you hear me". When we humans talk there are pauses, we sometimes repeat words, and sometimes we say the wrong word, and that's hard for the tool to handle automatically, so the raw transcript needs to be cleaned up. In this case maybe I stumbled and said "can can you hear me", and the recognizer captured everything, so it needs to remove the duplicated word; the transcript might also contain "here" (h-e-r-e) where I meant "hear", so the system needs to look at the context and decide which one makes sense. We have a component called TrueText that helps correct the captured text: first it sees there is a duplicated word and removes it, then it looks at the context, decides that "here" doesn't make sense in this sentence, and corrects it to "hear". It finally corrects the whole thing to "can you hear me", which is what makes sense in this context.

Now that the transcript is ready, we send it for machine translation. In this case we're translating text from English to Chinese; the Translator API uses NMT and does the translation. Finally, you can use text-to-speech to turn the translated text into Chinese speech and play it back to the customer. That's how you piece together the different APIs we have, without having to build the models yourself — and you can always customize them, because there are options for that. Throughout the process, partial transcripts, final transcripts, and translations are all available in case you need them for something else, such as training another model. The number of supported languages is what's shown here, but it keeps changing — the scope of language support keeps increasing.
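To give a feel for one small part of what the transcript-cleanup step does, here is a toy Python sketch that collapses immediately repeated words; the real TrueText component does far more (context-aware fixes like "here" versus "hear", punctuation, disfluencies), which this little function makes no attempt at.

```python
# Toy sketch of one piece of transcript cleanup: collapsing stuttered repeats.
# Illustration only; the actual TrueText step is far more sophisticated.
def remove_repeated_words(transcript: str) -> str:
    cleaned = []
    for word in transcript.split():
        # Keep the word only if it differs from the previous one (case-insensitive).
        if not cleaned or word.lower() != cleaned[-1].lower():
            cleaned.append(word)
    return " ".join(cleaned)

print(remove_repeated_words("can can you hear me"))  # -> "can you hear me"
```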
Now, many times you don't want to use the API as-is because it doesn't fit the industry or the context you're working in, and that's where customization comes in. We have support for Custom Speech, Custom Translator, and Custom Voice. If we have our own data we can bring it in and build on top of the pre-trained models, so we don't have to do all the heavy lifting of building a model from scratch — there is already something that works, trained on the data that's available, and we just bring in our additional data on top. For example, if you're dealing with specific IT-related terminology, a general model may not have all of that, so you build on top of one of the generic models. Training a model from scratch is complex and expensive, and most companies don't want to go through that whole process, which is why we recommend using Custom Translator.

The way we evaluate this is with BLEU scores. A generic model on an arbitrary domain gives you a BLEU score of roughly 20 to 30. A human translator is about the best you can do, sitting somewhere between 50 and 75. But if you provide additional industry-specific data, you can push the BLEU score up — not as good as a human, but somewhere in between, say 30 to 45, depending on your data.

Here's an example where we translate a sentence, I think from French to English, using the generic model. If you read the output, it doesn't really make much sense. So what you do is build a custom model, and it's a simple four-step process. First you upload the data: you need a parallel corpus for whatever language pair it is — Hindi-English, French-English, whatever — essentially two aligned files, which can be plain text, and there are various formats we support. You upload that data into the system. Second, you train on that data to build the custom model. Third, you test it, either with new data or with data you withheld from training. The result shows that even with this small amount of data there is some improvement over the general model, and if you give it more, and more pertinent, data it improves quite a bit. Finally, you deploy it. If you compare the output with the earlier one, the custom model produces a noticeably better English sentence. That's the whole aim of customizing with your own data: it takes away the complexity of building any of the NMT models from scratch, you leverage the work that has already been done, use the APIs, do a little bit of tweaking, and it's very quick to do for any of your applications.
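As a side note on how that BLEU comparison might be checked in practice, here is a small sketch using NLTK's sentence-level BLEU on one made-up held-out example; a real evaluation would use corpus-level BLEU over the full test set, and the sentences below are invented purely for illustration.

```python
# Sketch: comparing a generic vs. a customized model's output with BLEU (NLTK).
# The reference and hypothesis sentences below are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the patent covers a method for wireless charging".split()
generic_output = "the patent is about a way to charge without wires".split()
custom_output = "the patent covers a method for wireless charging of devices".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
score_generic = sentence_bleu([reference], generic_output, smoothing_function=smooth)
score_custom = sentence_bleu([reference], custom_output, smoothing_function=smooth)

print(f"generic model BLEU: {100 * score_generic:.1f}")  # on the 0-100 scale used on the slide
print(f"custom model BLEU:  {100 * score_custom:.1f}")   # comes out higher here
```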
Yes — I think I have a couple of minutes left, which is perfect, because I wanted to leave everyone with a few links. The first two are really interesting blogs; then there's a good read on LSTMs and RNNs; then there's the paper on attention models, which is well worth reading and is where I got much of this information. The last link is a Stanford lecture series on NLP — a series of about 18 or 19 videos, roughly an hour and twenty minutes each — and it's really worth watching. And the last item is some information about our Translator as well. I think that's pretty much what I had, so if there are any questions I'm here; I'm almost out of time, but I'll be around if anyone has questions.

So — if you have a lot of data that is patent-related, and you have samples in your destination, or target, language, then if you train the custom model on it, it should be able to do it. You do need the parallel corpus — English and whatever the other language is. For some of the things we do, the initial step was actually to hire people to manually translate and provide sample data, because without that there is nothing to train on. But the requirement for how much data is needed is only about 10,000 sentences. So if you have about 10,000 sentences of that patent material translated into the other language, you can feed it in, upload it, and see whether it works; we've seen quite a bit of success even with just 10,000 sentences. Yes — the first step, getting the data into that format, is hard and expensive. And even initially, when some of our own tools started off, we hired people to do the translations, because without that we don't have the training data.

There was a question here about the attention layer of the model: when I say it's learning from the past, does that mean it already has data stored in the target language? No — it has information about the source language, the source sentence. If the sentence is "this is my name ...", the attention layer remembers the very first few words; otherwise the model forgets them, loses the context, only looks at the immediately previous word, and makes mistakes. So it's the source language, not the target. Okay, thanks.