Welcome to the first talk this morning. First of all, thank you for having me here and for being such a great crowd; I really enjoyed the talks and the conversations over the last two days. Thank you.

This morning I want to present an extension that I wrote over the last six months. It does neural machine translation, and the idea was to have something very simple: with a single click you can translate a document in Writer, and we try to make the translated document look as good as it can, based on the old document. So if you are into corporate identity, I have a title slide for that too. I'll tell you a bit about my background, because I come more from the neural machine translation side than from the LibreOffice side, and then I'll tell you a bit about what's going on with the extension and how I tried to wrap it nicely into LibreOffice. I'm hoping to get some input from you on how to make the extension more useful and how to make it feasible for people to train their own models too.

So, my background: I did some pattern recognition way back in 2000, but then I moved to mathematics, and I spent my PhD explaining the pattern that you can see at the top right. Then I did a lot of insurance-risk modeling, and recently I moved to machine learning, where I do a lot of modeling. I have two tiny patches in LibreOffice, and those three buttons here, I implemented a special version of those, even though... Yeah, just turn it off. The mic one's broken, sorry. Yeah, okay, if I'm not speaking loudly enough you will just complain, I know. So, these three little buttons: when I first made them they were not looking nearly half as nice as they do now, but this was my first patch in LibreOffice, and I think you're doing an admirable job at making it easy to contribute a little feature like that.

Okay, so the idea is to have a very simple way of translating documents. I have a sentence in English, "The quick brown fox jumps over the lazy dog", and the extension provides a new Translate menu. You just click on "English to German" and the text is translated to German. It also gets some formatting, which here appears as if by magic, but usually there is a correspondence; I'll come back to that. I should also acknowledge that the project was funded by the German Federal Ministry of Education and Research through the Prototype Fund. Some of the Germans here will know the Prototype Fund: it's a program that gives you a short-term, six-month grant to work on software for the public good, and this extension fit in very nicely.

So, I wanted to build a translation function into LibreOffice, and the first question I had was: how much of a user interface do you really want for such an extension? On the one hand you want a nice user interface; on the other hand, if you can get by with less user interface and make it more seamless, that's certainly an advantage. So I first researched what other people do. On the right you can see how it looks in some of the Microsoft products: a little pane pops up in the sidebar, the original text is put at the top, the translation suggestion at the bottom, and then you can hit Insert to replace the text. But what is the advantage of doing it that way?
The only thing I could come up with, when I reflected on it systematically, was that you can see the original text and the translated text at the same time. But if you want to edit the translated text, there's a more natural place to do it than the sidebar: the natural place to edit your document is the document itself. So this is what I did in the LibreOffice extension. I take the original text and simply replace it with the translated text, and so that the user can still see what the original text was, in case something is not ideally translated, there is an annotation containing the original text. Yesterday evening during the hackfest I actually worked on making those annotations optional, so if you're very sure that you will get good translations from the model, you can disable them.
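To make the replace-and-annotate step concrete, here is a minimal sketch of how it can be done with the UNO API. This is my illustration rather than the extension's actual code, and `translate` is a hypothetical stand-in for the model call:

```python
def annotate_and_replace(doc, translate):
    """Replace each paragraph with its translation and attach an
    annotation holding the original text, so the user can review it.

    doc       -- a Writer document model (XTextDocument)
    translate -- hypothetical stand-in for the actual model call
    """
    text = doc.getText()
    paragraphs = text.createEnumeration()
    while paragraphs.hasMoreElements():
        par = paragraphs.nextElement()
        if not par.supportsService("com.sun.star.text.Paragraph"):
            continue  # skip tables, text frames, etc.
        original = par.getString()
        if not original.strip():
            continue
        # An annotation text field carries the original for review.
        note = doc.createInstance("com.sun.star.text.textfield.Annotation")
        note.Content = original
        # setString drops character formatting; the real extension works
        # sentence by sentence and restores emphasis via attention weights.
        par.setString(translate(original))
        text.insertTextContent(par.getStart(), note, False)
```

The actual extension is more careful than this, but the principle is the same: the translation goes into the document, and the original goes into the margin.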
Here is a little example from Wikipedia, from the machine translation article in German. You can see that "the humanity dream" probably doesn't strike you as particularly idiomatic English; maybe "a dream of mankind" or something like that would be better. But the next paragraph, "understanding a language without learning it is an ancient dream of mankind", is something you could probably put in the English Wikipedia, if there weren't a written translation already.

In terms of UI, I first had fairly large ambitions. The problem with neural networks in general is that they're kind of a black box, and whatever drops out of them you have to accept. So there is research into explaining how a model arrives at its predictions; this is work by Strobelt and colleagues, from the vicinity of the translation system that I use. For example, they show not only the best translation, which will be the output of the system, but also the most probable alternatives. But the question is: is this a useful UI if there are no intervention options for the user, if you can just look at it but you can't say "here I want something different"? In the end I figured it was not that useful, so I didn't even try to implement a user interface like this in LibreOffice. Obviously, if you have ideas about how to make a function like this useful, I'd be very glad to have a chat, because I want to make the extension as useful as possible.

So, who of you knows a lot about neural networks? Yeah, very good. So it's not everyone. One of the things that neural networks are supposed to do better than, say, a dictionary is to take context into account. The traditional way of getting this context is a network where you feed in one character at a time. More recently, if you've heard about this BERT model that Google trained, or the OpenAI text generator that was "too dangerous to be disclosed" and made the news last year, there's a newer approach built around an attention mechanism. The original text is fed into an encoder, which produces a vector representation for each token. The decoder then goes word by word, but at every step it can look at all the encoded vectors. This attention will play a rather neat role in what follows, because we can make good use of it in the extension.

So what happens is that the decoder is first fed a start token, this little "s" that means "go". It then generates the first output token, "Der", which is fed back into the decoder, and so it produces one word at a time. The result is that we feed in our sentence, "The quick brown fox jumps over the lazy dog", and we get the translation, but we also get the attention weights, which are essentially a map between the source text and the target text. You can see things roughly lining up along the diagonal here, which means the structure of the two sentences is more or less the same. But the decoder also always looks a little at the end of the sentence, and one might hypothesize that this is natural: like humans understanding language, you usually have to read the full question before you give an answer.

What can we do with this attention map? Well, we can use it to format our translated text. If we have an emphasis, for example, say we want to emphasize the word "quick", then the attention map tells us that "quick" corresponds to "schnellen", and probably also to "Der", which is spurious, and so we emphasize "schnellen" near the beginning of the sentence. This is what the extension does; I'll show a small sketch of the idea in a moment. We also need to keep the paragraph formatting, so that headlines and such stay formatted, and the extension currently does this by translating paragraph by paragraph, in fact sentence by sentence. That is one of the constraints of these translation models: the library I built on needs the text one sentence at a time, so there is a pre-processing step where you have to split the text into sentences. That can be tricky, and in particular it's not clear whether it works for all languages; I wouldn't know how to do the sentence splitting in Chinese, for example.
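Here is that small sketch of the emphasis transfer. It assumes you already have the decoder's attention matrix and a flag per source token saying whether it is emphasized; the extension's actual heuristic may differ in the details:

```python
import numpy as np

def transfer_emphasis(attention, src_emphasis, threshold=0.4):
    """Carry emphasis from source tokens to target tokens.

    attention    -- (n_target_tokens, n_source_tokens) array; each row is
                    the decoder's attention distribution for one target token
    src_emphasis -- booleans, True where a source token is emphasized
    threshold    -- how much attention mass must fall on emphasized tokens
    """
    attention = np.asarray(attention)
    src_emphasis = np.asarray(src_emphasis)
    return [row[src_emphasis].sum() >= threshold for row in attention]

# With a roughly diagonal attention map for
# "The quick brown fox ..." -> "Der schnelle braune Fuchs ...",
# emphasizing "quick" would mark "schnelle" in the output.
```

The threshold is what guards against the spurious attention mass on tokens like "Der" that these maps tend to produce.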
The two things that I used in building it, in addition to LibreOffice of course, were a framework called PyTorch, which next to TensorFlow is one of the largest AI frameworks and one that I personally like a lot, and the project OpenNMT, Open Neural Machine Translation. OpenNMT is a quite comprehensive framework, originally developed by the natural language processing group at Harvard, and it's one of the few that are more library- and application-like, where many other machine translation code bases are basically "here's the code to accompany my paper on machine translation". So this was a very nice building block to have, and with that the extension was developed fairly quickly.

But now the next question: you have the extension, but it will not do anything for you unless you have a model. So, who is impressed by translating German to English? Two. So apparently you would be more impressed by other language combinations. What would you need in order to train something like that? The first thing is example sentence pairs. To train the German-to-English and English-to-German models I used four and a half million sentence pairs; for many other language combinations you don't have nearly as many, and this is actually one of the fairly harsh limitations. Then you need one of those fancy GPUs; I used a gaming one that I ran for one or two weeks. This is one of the weaknesses: to train one model you run a GPU for two weeks at 150 to 200 watts, so you've spent 30 to 70 kilowatt hours training the model. For comparison, my wife and I have three kids, and if I train ten models, my GPU alone takes about 400 kilowatt hours per year, so my wife will ask me what's up with the electricity bill. At least it's green power, but nonetheless, I would like to work on getting this down, maybe halve it or get it down to a quarter.

Then, the steps you take to train a model. First you prepare the vocabulary, and if you looked very closely, you saw that I don't actually feed words into the model but word parts, which is one of the fancy things you do in machine translation; this is the easy part, and there's a small sketch of it below. Then you have the training, which takes very long. And then you need to do an evaluation: you can try to do this automatically and get some score, but in the end you will have to feed it a few sentences and see whether they make sense, and this is also why you need someone who actually speaks the languages involved.

One other thing I want to mention: I showed you an example of translating Wikipedia, and that invariably works relatively well. But if you have a special domain, for example you want to translate emails within an automotive company, or you want to translate legal texts, you would do better to specialize the general model, into a legal-German-trained model for example. You do this by running a few more training steps on a more specialized set of example sentences. One of the next steps in developing the extension is to make it easy for the user to run this kind of model training themselves.
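About those word parts: the vocabulary-preparation step trains a subword model on the corpus, so rare words get split into smaller, reusable pieces and the vocabulary stays small. As an illustration, here is how this step might look with the sentencepiece library; that library is my choice for the example, the actual pipeline may use a different subword tool:

```python
import sentencepiece as spm

# Train a subword model on one side of the parallel corpus.
# 32,000 is a typical NMT vocabulary size; tune per language pair.
spm.SentencePieceTrainer.train(
    input="train.en",        # plain text, one sentence per line
    model_prefix="bpe_en",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
pieces = sp.encode("The quick brown fox jumps over the lazy dog.",
                   out_type=str)
print(pieces)  # e.g. ['▁The', '▁quick', '▁brown', '▁fox', ...]
# Rare words come out as several pieces instead of one unknown token.
```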
How many of you have a GPU in your computer? So, that's far from everyone, which is probably one of the things that limits adoption in the end.

I also wanted to talk a bit about what I did to package the extension. I've already had a lot of hints about what I could do better, and if you have more, I'd be most grateful. The situation is that I have those powerful building blocks, PyTorch and OpenNMT, but this comes at the cost of quite a few dependencies. Some dependencies are pure Python code, but some are what Python calls extensions: C or C++ modules that have to be compiled. So I need a way to provide those Python modules to LibreOffice's Python instance. My current solution is to build a simple OXT that lacks most of these dependencies, install that extension into a fresh LibreOffice, install all the modules into LibreOffice's Python with the Python installation tool pip, and then copy the folder containing the installed packages back into the OXT zip file, which increases the size of the OXT by a factor of five. But this leaves me with a problem: how do I automate the build? Installing LibreOffice and installing packages into LibreOffice's Python is something that still needs some work to automate. So one of my goals to make this easier would be to shed some of the dependencies, and one can also think about moving the extension from Python to C++ in order to make things simpler; the neural network bits would lend themselves quite naturally to that. But if you have ideas, I'd be most grateful.
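Roughly, the manual procedure I just described corresponds to something like the following sketch, with illustrative names and paths rather than the extension's actual build script, and assuming a base OXT has already been built:

```python
import pathlib
import subprocess
import sys
import zipfile

# Install the dependencies into a staging directory with pip...
staging = pathlib.Path("build/pythonpath")
subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--target", str(staging), "torch", "OpenNMT-py"],
    check=True,
)

# ...then copy them into the OXT, which is just a zip archive.
# LibreOffice puts an extension's pythonpath/ folder on sys.path
# for its scripts, so the modules become importable. Note that
# compiled modules must match the target platform, which is part
# of what makes automating this awkward.
with zipfile.ZipFile("translate.oxt", "a") as oxt:
    for path in staging.rglob("*"):
        if path.is_file():
            arcname = "pythonpath/" + path.relative_to(staging).as_posix()
            oxt.write(path, arcname)
```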
There are two more things in the UI that I did, and I'll show them really quickly. You install it as an extension, just like any extension, so that part is quite nice. But then you have to install the language models, the translation models. What I did was put a new configuration page in the language settings: you download a translation model as a zip file, and then you get a file dialog where you point it at that file's path. I thought this was a relatively reasonable way to distribute those models. But if we wanted to do this at scale... I mean, how many languages does LibreOffice support? Thirty? Thirty, yes. So if you wanted translation models for all of those, the first thing is that you really have 30 times 29 translation models, one per direction. And the next question would be: would you want to set up a central repository for translation models, and what is the best way to make them available to the user?

So here are my questions for you. How much is machine translation a desirable standard feature that you would want to have in the product? You know, for example, the Google browser has that "translate this page" thing that pops up whenever you look at a page in a foreign language. Is there a good way to make this work on something other than Writer text? Machine translation, in my opinion, works best when you feed it sentences, so Writer was kind of a natural match, but one could think about also having it in Impress, for example; I wouldn't know how to do that yet. Is there interest in supporting more languages, from a user perspective, and also in people providing their own language models? And what size of language model would be acceptable? The model that I trained came out at about one gigabyte, and there are a few ideas for how to get this smaller, at least by a factor of two, but it would still be fairly large, even compared to the not entirely trivial amount of space that LibreOffice itself takes. I'll have a written quiz on this at the end.

Other ideas to make this more useful would be better integration with Writer. For example, one of the things I don't do yet, which might be obvious, would be to set the language property after translation; there's a small sketch of that at the end. And how many of you would consider training your own language model if you had an easy script? So that's a good part of the audience, not everyone, but maybe a third. This would mostly be something you don't do from within LibreOffice, so you would need some sort of well-written instructions. There's an initiative that collects open parallel corpora, open examples of sentence pairs, and incidentally one of the sources they have is user interface translations. So one could indeed consider whether one could also take bits of the LibreOffice documentation to seed such a translation model.

And then one can try to pick up things from current research. There's a fairly influential conference every year called WMT, and they regularly have shared tasks where you're supposed to train models to achieve certain goals. For example, one of the current challenges is automatic filtering of noisy corpora: if you have a corpus where you don't know whether the pairs are really good examples or whether there are mismatches, they work on detecting which pairs are good to feed into a model and which are not. And obviously, another question one might have is: can we use traditional dictionaries? Currently we just throw all these sentences into the model and it learns something, but there's not much of an opportunity to say "this particular word should correspond to this other particular word", like it would in a dictionary. Of course, any other ideas you have on this would be very interesting.

And with that, my part of the talk is done.
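Here is the promised sketch of setting the language property after translation, so that spell checking and hyphenation pick up the right dictionaries. This is an illustration of the idea, not code from the extension:

```python
import uno

def mark_range_as_german(text_range):
    """Tag a translated text range as German, so that spell checking
    and hyphenation use the German dictionaries."""
    locale = uno.createUnoStruct("com.sun.star.lang.Locale")
    locale.Language = "de"
    locale.Country = "DE"
    text_range.CharLocale = locale
```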