So, good afternoon, and thank you for joining this session. My name is Alberto, and today we are going to talk about an unsupervised, free CAT. A CAT is not only the animal: in this case it is software, what we in the language translation industry call a computer-aided translation tool.

I'm 36 years old and I started working more than 10 years ago. I have a master's degree in computer science and I've spent most of my career in startups. During the day I look after SRE in the DevOps business unit of the company that sent me here, Sourcesense, and I'm also the machine learning manager for that unit. My forte is Kubernetes, monitoring, continuous integration, cloud, Hadoop and big data, Python, and everything around Jupyter notebooks, scikit-learn, and TensorFlow. At night, in the spare time I have, I attend metal concerts.

I'd also like to thank the company sending me here, Sourcesense. We are a consulting company completely devoted to open source software. We have existed for 20 years and we have several branches in Italy, plus one in London, so no longer in Europe. We are DevOps people, coders, and data scientists, and we cover everything we can, 360 degrees.

The outline for today is as follows. We start with the need for a standard tool chain for translating and localizing, as we say, a software project. Then I will talk about the CAT tool I'm bringing here today, which is Matecat; then the problem with today's machine translation, which is the most fundamental supporting system in a translation pipeline; and finally the solution that is coming to the rescue and what we have here in practice.

So why do we need a standard tool chain? Although there are a lot of solutions out there to translate documents and software, it feels like too many times we have too many ways to skin the very same cat. This is because a lot of people are not in the translation industry, don't understand it, and reinvent it from scratch. That is always very dangerous because it fragments the effort. The software industry, and particularly the FOSS world, needs to adapt to the industrial way of doing things, which is the only viable one, because the industry actors, the language service providers (LSPs), have long settled on a consolidated set of processes, technologies, and file formats. There is even a consortium, OASIS, which standardizes the file formats and the interchange technologies that allow a viable translation industry to exist. This presentation is about exposing battle-tested technologies to the community, so we can soon get back to what we love, which is hacking, and not coming up with new ways of translating a piece of software or a document.

The standard tool chain for a translation company, or for an open source project that wants to be translated, starts with a CAT, a computer-aided translation tool, which is an editor that parses a bilingual file. A bilingual file is a kind of envelope, a container that represents a file in the process of being translated. You put the file in this envelope, everything is extracted, and you can manipulate the strings, which are the text: the buttons, the labels, the menu items, but even the contents of a simple docx document.
You can translate those strings into other languages, and among other things the tool has to provide two main capabilities. The first is tag editing, so you can handle the untranslatable entities such as markup. If you think about it, a lot of the content in a well-formatted file is the formatting itself, which does not need to be translated. You must not translate it, you must preserve it, otherwise you are going to break the markup. So you have to be able to manipulate the text while leaving the markup untouched. The second is format preservation. You convert from the original file to a bilingual format, you translate all the strings, and then you pack the strings back into the original file format, preserving the formatting. You don't want to break the file just because you changed the strings.

Another fundamental component is the translation memory. You can think of a translation memory as a database of past translations that you can recycle, or adapt, for the next incoming document. The idea is that once you have translated something, you don't throw away the individual translations; you keep them, because the next time, for example for the second revision of a dishwashing machine manual, you want to reuse the very same sentences and translate only the new ones. It is a matter of style guide, and a matter of preserving the experience. For instance, "Control Panel" in the Italian version of Windows is "Pannello di controllo". You cannot change that to something else, otherwise all the Windows users who are accustomed to seeing that wording in the menus will be confused. You must preserve the style and everything that is part of the experience of the software, and the translation memory serves this purpose.

The last component is machine translation: a server that provides translations on the fly, like Google Translate, for example.

Now, some dominant standards worth knowing. The first is XLIFF, the XML Localization Interchange File Format, an XML-based file envelope that separates the strings from the markup. You put the docx into an XLIFF envelope; inside it there is the binary of the original file, processed by removing all the strings and substituting them with placeholders. The placeholders point to a map of strings. You have the strings for the original language, and then you can add new keys for the other languages you want to translate into. Whenever you want to render the document, for example in English, you add the keys for English, then you tell the software to read only the keys in that language code, and it packs the translated strings back into the original document: it replaces the placeholders with the new strings, the document is saved, and what you end up with is the same file format in a different language. It feels like magic the first time you see it, and I'm going to show it here.

Then there is TMX, an XML exchange format for translation memories. If you want to export your translation memory, or hand it to another member of the community, you use a TMX, which is a standard file format.
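To make the translation memory idea concrete, here is a minimal sketch of loading a TMX export into an exact-match lookup table. It assumes TMX 1.4-style markup (tu/tuv/seg elements with an xml:lang attribute) and a file called memory.tmx; a real translation memory also does fuzzy matching, which this sketch skips.

```python
# Minimal exact-match lookup over a TMX export (illustrative only).
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def load_tmx(path, src_lang="en", tgt_lang="it"):
    memory = {}
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):          # one translation unit per past translation
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            memory[segs[src_lang]] = segs[tgt_lang]
    return memory

tm = load_tmx("memory.tmx")
print(tm.get("Control Panel"))  # e.g. "Pannello di controllo" if it was translated before
```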
The last format worth talking about is PO, the specialized file format of the GNU gettext library. It is the one you put your strings into, and the software is expected to find its strings in that format, so anything compiled against the gettext library can be localized this way. When you start localizing a project, you look for the gettext PO files. How not to do it? iOS and Android, for example, are long-standing offenders, because they came up with their own proprietary file formats, Localizable.strings and strings.xml, although there are tools to convert that stuff to the PO format.
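As a small aside on the PO format: checking the state of a gettext catalog from a script is a one-liner with the third-party polib package. A minimal sketch, assuming a catalog called messages.po:

```python
# Inspect a gettext PO catalog with polib (pip install polib); illustrative only.
import polib

po = polib.pofile("messages.po")
translated = [e for e in po if e.msgstr]      # entries that already have a translation
missing = [e for e in po if not e.msgstr]     # entries still waiting for one

print(f"{len(translated)} strings translated, {len(missing)} still to do")
for entry in missing[:5]:
    print("TODO:", entry.msgid)
```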
So the standard workflow is: you come in with an original file, you put it into a CAT tool or a CAT server, and that generates the XLIFF container, the envelope around the original file. Then you query the translation memory or the machine translation, and you fill in all the content. At any time you can export the container to an XLIFF file and send it to an independent translator to have it translated. This is particularly important because you don't want to send the original file: you don't want the translator to gain access to the original document, which may be confidential. Also, the file may be so big that you want to split the load among ten translators, so you produce ten small XLIFF files with the ten parallel pieces of work. At any time you can export the translation memory as a TMX, or the machine translation as a model file you can deploy somewhere else. And at any point you can take the XLIFF container with the strings translated up to that moment and repackage them back into a translated file, which is what goes back to your customer, or to the community if you are working on a project.

So today our CAT tool of choice is Matecat. Matecat is an enterprise-grade, completely free and open source, web-based CAT tool. It was funded by the European Commission under the Seventh Framework Programme; it took 3 million euros and two and a half years to develop. It started with a four-people team, including me for the very first release. It is open source software, so you can find it on GitHub, and it has since evolved and is operated as a service by Translated, the company that won the tender to develop this technology.

So now I'm showing you this. There is a hosted version, but the hosted version is not really interesting: today we are deploying it on a laptop in real time. And this is the first thing I'm bringing here, Docker Matecat, which is a proper packaging of this tool as a Docker container, so that it is easy to deploy. You just have to clone the repository and issue a few really small commands. So in this case, now we are... whoops. Let's start by bringing up the MySQL container, then initialize it and test that it is reachable. And there we go: everything is up. My keyboard just switched to English as a side effect. Then we bring up the rest of the tool. I started from some default settings, but you are free to customize everything; I tried to move everything from build time to runtime, so you can just change a runtime variable and the tool changes accordingly. It only works on Chrome, but that is by design.

The one really nasty thing, which could be improved, is that it expects you to have a dev.matecat.com hostname, for a matter of cookies and of knowing where to find the API domains, and that is hardwired in a configuration file. I will make sure you can change it at runtime in a later version, but for now you just point dev.matecat.com at localhost in your hosts file, because your PC must know where it is.

And this is the CAT tool interface. Here you can upload anything, and I'm starting with this Docker and Kubernetes presentation, which is a file in English about Docker and Kubernetes. Let's just load the... oops, sorry, this is my fault, I forgot to change the permissions, otherwise it complains. The file has been uploaded, and now it has been converted to the intermediate representation. The tool supports over 72 file formats, so all the office stuff, strings files, a really big array of formats: it does the conversion for you and packs the result back, which is really nice. Now it starts analyzing. Of course you can sign in with a Google account; these are my development keys, so it complains that the app is not verified, but it's a development environment, don't worry. A nice feature of this tool is that it can fetch your docs from Google Drive, have them translated, and re-upload them back to your Drive, which is really handy. Now it is analyzing all the words to go, looking for duplicated content, so you don't have to translate anything twice, and for matches in a translation memory, which I have. And we just open the workbench.

Here you can see I have already partially translated some of it, so it is locked: I cannot touch it until I unlock the string. Never mind, because there are other strings here. For example, I click here and there is a suggestion from a machine translation server, which in this case is Google Translate, and we can accept the translation or not. "The new containers": we don't translate "container" into Italian. Then we confirm the translation and it moves on to the next one. Here we have the tags. These tags are the formatting of this PowerPoint, and the problem is that this string sits in the middle of a tag, which may be bold or italic, and I have to make sure it stays inside. For example, "deploy your Spring application": Spring in this case has been translated literally, as the season, which is absolutely wrong, and "Spring application" sits between two different, nested tags. So what I'm doing here is leaving it as it is; it tries to guess all by itself... come on... okay, accepted. And I can go on and on like this until all the strings have been translated. "A namespace to store your images": okay, fine. At any time I can download a preview of how it is going: it just downloads the file, and I can open it. You can see it has been translated into Italian while all the formatting has been preserved, which is pretty impressive.

So, critically, all is good, but for most of my time here I have been relying on a machine translation server to suggest the correct sentences; I just had to accept or edit the suggestions, and this gives me a big productivity boost. But no matter how much we translate, we will never have enough memories to reuse for the next project coming in. Machine translation systems are there to fill the gaps, which are always bigger than what you have covered. Machine translation systems are machine learning systems trained over datasets named parallel corpora.
A parallel corpus serves as a bidirectional labelled dataset. It is a very, very long list of sentences in one language, for example English, and a corresponding list of the same sentences in the language I'm trying to translate into, for example Italian. So I have one million sentences in English and the same one million sentences translated into Italian. It works as a bidirectional labelled dataset because I can show an English sentence and then show the corresponding Italian sentence as its label, but I can also go in the opposite direction: if I want to go from Italian to English, I start with the Italian sentence and show the English one as the label of what the system should come up with.

The problem is that you have to feed this machine learning system, so you have to come up with a lot of data, and this has been made particularly worse by the advent of neural technologies. Neural machine translation, being a neural network, requires a lot of data, in the hundreds of millions of aligned sentences. A lot, a lot of stuff. And since the technology is pretty much the same for all the players, the winner is the one who has more data, which, namely, is Google, always. That is the only really serious provider of machine translation technology; all the others are niche players who pretend to be very good, but Google Translate is the top-notch player. They started first, they crawled a lot of data, they aligned a lot of data automatically or with manual effort to bootstrap, and they ended up with billions of words.

So the problem is that we have to find this data. There are efforts to procure parallel data for free, such as OPUS, an open collection of parallel corpora which grows every day. For example, they crawl pages published in different languages and align them, or they take books and align the sentences, even though a literary translation is never exactly the same text. One book which is really, really easy to align is the Bible: it has notation for all the verses and chapters, it is the most alignable book in the world, and it exists in all the languages of the world, so although its way of writing is a little old, it works very well.

But do we have enough data to come up with a decent translation system just by using this stuff? There are millions, tens of millions, of sentences in these open corpora, but only for FIGS, which is French, Italian, German, and Spanish. If you have another language, for example Norwegian or Sardinian, you have no way of coming up with a decent machine translation system: you simply do not have enough parallel resources. So we are stuck, because the technology we have is very good at producing translations, but only for a selected set of languages. And since those are the languages that already benefit from the presence of a translation industry, more and more resources are created in them, because more and more documents and interactions are produced, and so they get more and more data: it is a virtuous circle. The same does not happen for Italian–Norwegian, for example, or Japanese–Hebrew, because there is not enough interaction between the Japanese and Hebrew cultures, so you don't have an aligned dataset.
So you say: okay, what if we have Japanese–English and English–Hebrew data? We can pivot: translate from Japanese to English and then from English to Hebrew. Yes, and you lose a lot of fidelity. In Italy we have a game, I don't know what you call it abroad, the telephone game: you whisper a sentence to somebody, who has to pass it to somebody else, around a ring of people, and at the end the fun is finding out what sentence actually arrived. Pivoting in machine translation is really the same thing: you are taking the output of one machine system and feeding it into another machine, which adds a lot of distortion, so you will not be very happy with the result.

Since we are at a dead end, maybe we should take a step back. The dead end comes from the fact that we have a supervised system that needs labelled data. What if we go with unsupervised training instead? Unsupervised training is a class of machine learning which is not concerned with finding correlations between data and a label, but with finding a hidden structure in a corpus which has no labels. In this case we don't need parallel examples to learn a language, and in fact that is not the way humans learn languages: we learn our mother tongue, Italian in my case, and then we learn another language separately from it. Nobody relentlessly shows us sentences in our mother tongue paired with their English translations. I learned English separately, and then I started building my own mapping between the two languages. We can do the same with a machine: learn two languages, and then try to map between concepts, which is easier, because it is easy to get a very, very vast corpus of just Italian and just Norwegian, learn each of them independently, learn a model of how each language is structured, and then put some bridges in between once the two languages have been mastered.

The problem is that we are dealing with a machine, so in order to map between languages a computer needs to build a representation of each language. How can that be accomplished exactly? We can use language models. A language model is a technology that lets a machine come up with a hidden structure, because we are just showing it data samples and not teaching it the rules, and it builds a model of how one word relates to another. Word embedding is a technique that maps every word into a vector space. Remember geometry at school, where you have all these shapes in a 3D environment described by vectors: a vector is an array of numbers, and when the numbers are close, two vectors are close in space. Words with similar meanings will have nearby vectors in that space. For example, Paris and Rome will be near each other along the dimension of being capitals, and both will be far from Milan, because Milan is not a capital city; Paris, Rome, and Berlin will be very near, and Milan will maybe be near Frankfurt along the dimension of cities that are not the capital of anything. Boy and man will be near along the dimension of sex, which is almost a binary dimension, just two values. Anyway, we end up with these very big vectors of around 300 dimensions and a way to map every word onto one of them, and we can do even crazier things, because these vectors are just numbers, so they allow us to do computations: you take the vector of Paris, subtract the vector of France, add the vector of Italy, and what you get is a vector which is almost the same as the vector of Rome.
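Just to make that arithmetic concrete, here is a toy sketch with made-up four-dimensional vectors; real embeddings are learned from a corpus (word2vec, fastText) and have around 300 dimensions, but the mechanics are the same.

```python
# Toy "Paris - France + Italy ≈ Rome" demo with invented vectors (illustrative only).
import numpy as np

emb = {
    "paris":  np.array([0.9, 0.1, 0.8, 0.0]),
    "france": np.array([0.1, 0.1, 0.9, 0.0]),
    "italy":  np.array([0.1, 0.9, 0.8, 0.1]),
    "rome":   np.array([0.9, 0.9, 0.7, 0.1]),
    "milan":  np.array([0.2, 0.9, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = emb["paris"] - emb["france"] + emb["italy"]
best = max(emb, key=lambda word: cosine(emb[word], query))
print(best)  # with these toy numbers: "rome"
```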
And this means that if we analyze enough sentences in one language, we start to develop a very structured model of how that language works internally, which is a language model, exactly the language model we need. So we could induce a parallel corpus between two independent languages just by mapping concepts between their latent spaces. We can bootstrap with a vocabulary: for example, if we have a dictionary for Italian and Norwegian, we already know how some words translate, so we can start from those and then map similar concepts around them. Or we can use unique entities and frequencies: the English article "the" has roughly the same frequency as the corresponding Italian article "il", and "home" has roughly the same frequency I can expect for "casa" in an Italian corpus, so we can use this as a heuristic to guide our mapping in this very big latent space.

A very legitimate question is: won't the result be really, really noisy? Won't you end up with a lot of false positives? Yes, but we can use statistics, computing averages over how many times "house" seems to align with "dog" versus with "casa", to infer the true positives and throw away the false ones. And here comes phrase-based machine translation, the old technology that was completely scrapped when neural machine translation arrived. Phrase-based machine translation revolves around the idea that co-occurrences of words are statistically significant. For example, you see this character here: what do you think is its translation? Shrimp, because it is the only character that always occurs alongside "shrimp", so you can assume it is its translation. And this one is probably broccoli: if you happen to meet "broccoli" again in another sentence and you only see this character, maybe this is broccoli, and this other one is a particle, some connective which is peculiar to the Chinese language. This is the way the old Google Translate used to be trained: now Google Translate is neural, but in the old times it was based on this technology, which lives in an open source toolkit called Moses, developed by the University of Edinburgh and in Trento.

So you count co-occurrences between words and between sequences of words. In "I've been at school", "school" can align with the Italian word for school, but "I have been" is also a phrase that always translates to "sono stato" in Italian. So I count not only the co-occurrences of single words like "school", but also of phrases like "I have been" and "sono stato": I treat the phrase as a unit of translation, subject to co-occurrence counting too, and I can use those counts to calculate translation probabilities. A probability here is just the idea that seven times out of ten "casa" is observed to co-occur with "house". So I can come up with the probabilities and choose the most probable option among all the candidates that happen to be aligned. These probabilities are stored in a special database called a phrase table, and creating a phrase table is the most expensive operation in phrase-based machine translation.
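Here is a minimal sketch of that counting step, with a handful of made-up observations; a real phrase table is extracted from word alignments over millions of sentences, not from toy counts like these.

```python
# Turn phrase co-occurrence counts into translation probabilities (illustrative only).
from collections import Counter, defaultdict

observed_pairs = [
    ("house", "casa"), ("house", "casa"), ("house", "abitazione"),
    ("I have been", "sono stato"), ("I have been", "sono stato"),
]

counts = defaultdict(Counter)
for src, tgt in observed_pairs:
    counts[src][tgt] += 1

def translation_probabilities(src):
    total = sum(counts[src].values())
    return {tgt: n / total for tgt, n in counts[src].items()}

print(translation_probabilities("house"))
# {'casa': 0.67, 'abitazione': 0.33}: pick the most probable option,
# guided by the language model for fluency.
```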
So the Moses toolchain is based on three technologies. There is the language model, IRSTLM or KenLM, which calculates the probability of a sentence being meaningful. Then there is the aligner, GIZA++, later superseded by fast_align. And there is Moses itself, which decodes the incoming message: Moses projects the input sentence onto the phrase table to retrieve translation options, searches among all the different options guided by the language model as a heuristic, and stops by itself when the whole input sentence has been covered.

And today we are presenting the second technology, which is Monoses. Monoses stems from the paper by these three scientists, and it is basically a toolkit that creates phrase tables from two monolingual datasets through word embeddings. Then it creates a Moses model from this noisy phrase table, does some fine-tuning, and iteratively augments the dataset by translating with itself, so you can start with one million sentences and rapidly get to ten million, because you try different combinations of the sentences you started with. It is noisy, sure, but the sheer amount of data you end up with averages out the noise in the long run.

And this is what I brought today: the second demo is Monoses. My small contribution has been packaging this really wild research prototype into a developer-friendly format, so it is easy to deploy and easy to use. You have the pipeline to train: you take two big text files, for example one with Italian sentences and one with Norwegian sentences, and these sentences must not be related to each other. You can just take the Italian Wikipedia and the Norwegian Wikipedia, or a news dataset for Norwegian, and stash them into two separate files; they do not have to be parallel, only monolingual. Then you launch the training, and it takes about a week. When the process is finished you will have several gigabytes of files inside the train directory, which are the models and the phrase tables. Then you can launch the translation server with the following syntax: you point it at the directory of the model, and you can query your server as follows. This API is another thing that I built, because Moses has no API server, so I built a Flask-based Python HTTP API. Note the query here, which is "ulver" in this case, and the source language. Did I say Norwegian? This is Swedish, yes: Moses does not ship a Norwegian tokenizer to analyze the sentence, so I used the Swedish one, sorry about that. And the target is Italian. Since this server is actually online, I can also give you a working demo here. It is really, really slow, because in this first version it loads the model into memory on every request, then the memory is thrown away, and when you send another sentence it is loaded again. "Ulver" is the translation of "wolves", and this is the JSON output that you get.
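For reference, querying that HTTP API from a script looks roughly like this; the endpoint path and parameter names below are illustrative only, so check the repository's README for the real ones.

```python
# Rough sketch of a call to the Flask-based translation API (names are assumptions).
import requests

resp = requests.get(
    "http://localhost:8080/translate",
    params={"q": "ulver", "source": "sv", "target": "it"},
    timeout=300,  # this first version reloads the model on every request, so be patient
)
resp.raise_for_status()
print(resp.json())
```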
Now, in order to show you how this works inside the CAT tool, I go here and select Swedish, then I disable the lookups, because I am purposefully using the wrong language (I have to fix this some day) and I don't want suggestions for Swedish. Then I add an empty engine and choose Moses as the provider, because I tried to imitate, to just implement, the Moses HTTP API; I add the engine name, and for the key you just type a dummy value, you don't need it. Now we have to look for a file, which I have somewhere and I actually forgot where I stored it, which is embarrassing; I think I lost that file.

I know I don't have an original file right now to show you, but the idea is that you end up with a very, very slow server that will actually provide translations to you, not at Google Translate speed but in a very slow fashion, and the translations are not even perfect, but they are a good starting point for translating your software. The point is that even if your language is Norwegian or Sardinian, with a very small presence online in terms of data produced, and especially of parallel data produced, with this kind of technology you can just throw in enough monolingual data from books, articles, cooking recipes, movie subtitles; and there is a lot of that, if you go to OPUS and take only the monolingual part you are interested in. Then you might be able to build a very, very big model. The proof is that the model I showed you today is just a test model: the serious model for Italian–Norwegian has been training for five days on a 50-core machine and hasn't finished yet, so I cannot really show it to you. It is not looping, it just goes through a really, really long refinement process. Inducing the first phrase table takes 10 hours, the first rough model about 20 minutes; then it starts doing some parameter fine-tuning, which takes approximately another hour or so; then it goes to the monolingual corpora, translates them with the model it has just built, and generates two parallel corpora, because it takes the Norwegian text, applies the Norwegian–Italian model to generate an Italian translation, and does the same in the other direction. Now you have two very big parallel corpora that are used to augment the rough model you started with. It is a bootstrapping process: you start from a rough model and use it to improve your own training. So if, like me, you supplied 700 million words to this model, it takes a long time to translate 700 million words; then you do another fine-tuning round, and then you do it all again. Ten tuning iterations times three iterations of back-translating your whole corpus from scratch takes a ridiculous amount of time, so I hope to post some updates on how the experiments actually went.

So, in conclusion, what do we have here today? We have a Docker-packaged version of Monoses, ready to use to generate a training set, and an HTTP API server to query the model obtained this way; it is all available in this repository. And we have a packaged version of the Matecat tool, which also includes the MySQL server, a message-queue instance, the various daemons that perform the analysis and the translation in the background, and an Apache 2 web server with a very humongous PHP web app. It runs on docker-compose, but I aim to support Kubernetes, so you can deploy it anywhere and start your own very small LSP at home for your open source projects.

Kudos to the four knights portrayed on the cover of my presentation: Philipp Koehn, for inventing phrase-based machine translation; Tomas Mikolov, for inventing word2vec, the mapping of words into those vectors and the language model that goes with them; Adam Paszke, for creating PyTorch, without which we would not be able to train anything neural; and Mikel Artetxe, for putting it all together and being the author of the paper I took inspiration from. Thank you very much. Are there any questions?

Is it about the faces of the people on the cover? The cover is Sgt. Pepper's Lonely Hearts Club Band, yes. So, your colleague is asking:
when you translate a sentence which contains some markup into another language, where the words happen to be remixed into a different order, is the tag formatting preserved? Because you may end up, for example, with a sentence between tags that gets split into different pieces with something else in between, and you would have to duplicate tags. This problem has been solved by Christian Backe, a Google Brain scientist who also worked at the University of Edinburgh, and he basically employed a trick: he factorized the tagging. If you have a sentence where five words are tagged, you put tags around each of the five words separately, so you multiply the tags and each single word is, say, bolded independently. Then you translate and let the words be remixed, and if in the target language you end up with two adjacent tags, you can collapse them into one tag around the two words. That is how you solve it. The technology I have developed does not feature that trick yet, and it is one of the first things I'm going to add, because it is a very naive trick but it works very well. I will scavenge it from the Matecat project, because Matecat, whose name means machine translation enhanced computer-assisted translation, shipped, back when it was a research project, with an array of little tricks like this that improved the quality of the engine a lot. I will try to go back to the old wrapper from the research-project days and pull those back in; for now I developed mine from scratch, because it was just much easier for the purpose of this conference.

Okay, so your colleague asks: you don't need a parallel corpus to train Monoses, you happen to use two monolingual corpora, so would two parallel corpora be easier, or faster? Not faster, but less noisy. The mappings usually end up being very noisy when you use monolingual corpora, because you get a lot of false positives. If you happen to use a parallel corpus, those mappings end up being high-quality mappings, so your training is of better quality: you end up with a better model with less noise, you have to spend less time doing fine-tuning and back-translation, and you don't have to create this really huge amount of data only for the sake of averaging out the noise, Monte Carlo style, and keeping the good stuff.

Sorry, I don't think I understood the question. Yes: all of these file formats, so the office documents, the Excel and PowerPoint files, also web pages, and also scanned files, although that last one is true only if you go through an advanced filter, which is probably Translated's, that does OCR on PDFs. You can also directly translate a TMX or an XLIFF, plus desktop publishing stuff and localization formats like .properties and .strings. Whenever you put something in, an XLIFF, the envelope container, is generated behind the scenes, and the strings are inserted in a temporary location; when you translate, you replace those strings, and when you do a download or a preview, those strings get packed back into the XLIFF, and rebuilding the blob is just a matter of replacing the placeholders and packing the binary again, so you end up with a PowerPoint or a docx file.

The real question from your colleague was: what is the process of adding another file format? You basically have to implement a new class behind it, the class for that file format. The idea is that the interface expects: here's a blob, give me the XLIFF strings. So you just have to implement that kind of process.
It might be really easy: for example, if it is YAML or JSON, you can come up with a template XLIFF and convert the JSON into XML in the XLIFF format, so you just take the strings and paste them into that kind of structure. If it is a more advanced format, and I'll pick the most difficult format out there, AutoCAD, which is a total nightmare because it is undocumented, heavily binary, and the strings are basically shredded all over the file, then you have to come up with something that is able to read the strings, place a placeholder in the original file, pack it all into an XLIFF, and reconstruct the sentences from there.

Let me show you an XLIFF, because it is not really difficult once you see it. Here we are. As you can see, in an XLIFF file you have a very big blob, which is the Base64-encoded version of the original file with the placeholders already put in place. Then you have units: each one is a translation unit, a sentence, and it always has a segment with a source and a target. For example, here is unit 2 with its segment and source sentence, and here you have the corresponding target; this file goes from English to Japanese, because of this header here. And as you go down you keep having units, segments, sources, and targets. This is the very basic unit of work. So to come up with another way of producing an XLIFF from a proprietary file format, you have to come up with a way to construct this XML. Once you have the sentences it is really easy, you can compose this XML. The really difficult part is that, starting from the original file, you have to find a way to take the strings out, put in a placeholder carrying the ID of the unit, and then Base64-encode the result and stash it here.

Let me show you a real one, since we are at it; I can show you a real example of something we have here. Nope... this is a PowerPoint, loading, loading, still loading. So here you can see the translation units. This is XLIFF version 1.0; the one I showed you on Wikipedia was 2.0, sorry for the mismatch. My GPU is crunching; how big was this file? In the meantime let me just set word wrap. Here you can see the Base64-encoded file, the PowerPoint encoded as Base64; if I took this and decoded it, I would see the binary pptx. You can see the editor here at 100% CPU. And here you have all the different translation units, where you have the segment with the source and the target. The Docker and Kubernetes document was basically generated as a copy of this one: whenever you translate, you actually write in here, you write into the database, and when you export, you pack all the translated strings in the place of the original ones. You can even see the tags here, the g tags that were shown inside the editor boxes. It is a very nasty file format with a very heavily documented standard. You just use this as a working table, as a workbench, and when you have all the strings you Base64-decode the blob, substitute the placeholders, each of which points to a single unit, save the blob as a file, and hopefully you have the translated file as output. This is the most complicated part of extending this CAT tool, but it has been done for each of these formats.
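A very rough sketch of that repack step, just to show the mechanics. The element names, the ###id### placeholder convention, and the assumption of a plain-text original are all simplifications for illustration; the real Matecat filters handle XML namespaces, binary formats like pptx, and inline tags properly.

```python
# Simplified XLIFF repack: collect translated targets, decode the embedded
# original, replace each placeholder with its translation (illustrative only).
import base64
import xml.etree.ElementTree as ET

def repack(xliff_path, out_path):
    root = ET.parse(xliff_path).getroot()

    # 1. collect the translated target string of every unit
    targets = {}
    for unit in root.iter("unit"):
        target = unit.find(".//target")
        if target is not None and target.text:
            targets[unit.get("id")] = target.text

    # 2. decode the embedded copy of the original file (placeholders already in it)
    blob = base64.b64decode(root.find(".//internal-file").text)
    text = blob.decode("utf-8", errors="replace")

    # 3. swap every placeholder for its translation and save the result
    for unit_id, translation in targets.items():
        text = text.replace(f"###{unit_id}###", translation)

    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
```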
Any other questions? Sure. Okay, so I chose Italian–Norwegian because I knew there were not a lot of resources, just some millions of words, whereas for a well-covered pair the corpora are huge, on the order of 80 million sentences; Italian–Norwegian has maybe 12 million sentences, or 6. And, honestly, because I was listening to Ulver. For the really low-resource pairs it does not work, though: with 10,000 sentences it is impossible to come up with a mapping, you need some millions of words. But that part is easy, because you just need to crawl Sardinian websites, Sardinian books and so on; it is not difficult to find the monolingual corpora. The real challenge is to find the parallel corpora, and this technology turns the problem into scavenging monolingual corpora only, which is just more content you can crawl in a single language, totally unrelated to the other language of the pair. It is simply an easier task.

Yeah, Google has it, sure. With Google Books, for example, they have crawled and aligned a billion words in various languages, and that is the top source of their translation quality. When you are on Google Maps and you say a random address into it and it understands you, that is because it has a trillion-word language model behind it; it is so much data that it is always able to understand you, because it has always seen something similar.

So, the model I wanted to show you is still training, sorry about that; it always takes weeks of training, but in the end you will have a model trained on a lot of data. Moses, which is the last link of the chain, is not the best translation server out there; it has been superseded by neural machine translation, which is much more fluent. But since Moses is phrase-based, the counts are discrete counts, so it is always able to come up with probabilities even in the absence of huge amounts of examples: it is a discrete system, as opposed to a continuous system, in the mathematical sense, like neural networks. The translation quality is lower: it gets a score of 26 on the BLEU scale, which is the standard scale for translation quality, against the 40 points that Google achieves with its neural system, which is outstanding; but 26 is okay as a starting point. I think our time is up. Thank you so much for attending.