Thank you. Hello everyone. So yes, I will be talking about evaluation — the work that we did and the framework that we created for evaluating different machine translation engines.

A couple of words on where we're coming from. My colleagues are also here in the room. We work at a company called Optum. It's a big healthcare organization, and we work in a team that drives innovation inside the company: we try to de-risk different machine learning technologies and make them usable for the broader organization. One piece of work we did was around machine translation.

Machine translation is the task of automatically converting text from one language to another. It has a lot of applications in a big enterprise. It can be interactive translation, where people translate some text just to communicate with each other; it can be large-scale batch translation, where we translate a lot of marketing materials or our products into other languages; or it can be part of a chatbot.

Our goal was to understand what technology is available out there, what we can use, and what the quality looks like. And because we work in the healthcare domain, there was another question: common machine translation models are usually tuned for general-domain text, and our texts are more specific, so what will work better on our type of data?

Why do it this way at all? Because machine translation is a rapidly evolving field. It has gone through several stages, but in recent years, with all the developments in deep learning and multilingual models, it got a huge boost. New models come out every year, even every month, so it was important for us to have a framework in place that would let us test new models as soon as they come in and compare them with the previous ones. And just to give you perspective on how big the field is: if you want machine translation, there are a lot of different models available, a lot to choose from — which is exactly why it's important to be able to evaluate them against each other and pick the best one.

So, as I said, our goal was to develop a framework that lets us choose our own datasets, choose different machine translation engines of different types, compare them against each other on different metrics, and see how well they perform — and to be able to run this framework frequently, adding new models and new datasets as we go.

Now a little bit about the implementation details — what we did. The first issue, my personal pain point, is languages. If you work with machine translation, you need to know what languages you're working with: from which language to which language you want to translate. As an example, here are three models from Facebook, made by the same team, each about a year apart — and each one uses a new format for the language codes: "en", then "en_XX", then "eng_Latn", and so on.

And if you start going deeper, it's actually more complicated, because there are a lot of English variants that we treat as the same language, but Portuguese in Portugal and Portuguese in Brazil are arguably different languages, and Chinese is really two different written languages. So the problem is how to reconcile all of this when you work with different engines. I've seen a lot of code where people just take the first two letters in lower case — please don't do this, it doesn't work. Luckily, there are people out there who don't just complain about the problem like me, but do something about it: there's a library called langcodes that handles all of this for you. The way we use it: we try not to think too hard about what language people pass to us. Each model has a list of supported languages in whatever format it prefers, and given the language you want to evaluate, we have a function that finds the closest supported language and just uses it.
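As a rough illustration of the pattern — the supported-tag list here is a made-up example, while closest_match comes straight from the langcodes library:

```python
from langcodes import closest_match

# Hypothetical list of tags that one particular engine supports; every
# engine publishes these in its own preferred format.
SUPPORTED = ["en", "pt-BR", "pt-PT", "zh-Hans", "zh-Hant", "es"]

def pick_supported_language(requested: str) -> str:
    """Map whatever tag the caller passed in to the closest supported one."""
    tag, distance = closest_match(requested, SUPPORTED)
    if tag == "und":  # langcodes returns "und" (undetermined) when nothing is close
        raise ValueError(f"No supported language close to {requested!r}")
    return tag

print(pick_supported_language("pt"))     # -> one of the Portuguese variants
print(pick_supported_language("zh-CN"))  # -> "zh-Hans" (Simplified Chinese)
```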
The next piece is the actual translation engine. When people think about translation, they usually think in terms of: we have a string, we translate the string, that's it. But if you need to translate a large amount of data, especially for evaluation purposes, there is a lot more to do. You need to parse the file. You need to skip text that was already translated before — it can be a long process that may fail at some point, and you want to restart not from the beginning but from the point where you don't have translations yet. You need to preprocess the text. You need to do sentence splitting properly if your text is not already separated into sentences. Then you need to split it into batches; then your engine comes in and translates the text; and then you need to put everything back together, maintain the order of the lines, and catch errors if there are any. And again, a huge batch translation may take a lot of time, so you want to catch errors, just log them, and not stop the whole process — it's really annoying when you check your batch script in the morning and it has simply fallen over.

To support all of this, we created a base translator class that takes care of all this for us. Then, for an actual translation engine, all we need to implement is three functions. First, initialization, where we load the model, authenticate to a cloud engine, or whatever. Second, the actual function that translates the lines — and we make sure outside that function that the number of lines coming in is suitable for the model, so if the engine says its batch size must be no bigger than four lines, we guarantee that no more than four lines come in at that stage. Third, a function to set the language pair — this matters mostly for multilingual models that can translate between different language pairs, so again, just a model-specific way to set the language.

Using this architecture lets us plug in very different engines. We're not limited to, say, only Transformer models, or only cloud models, or our own APIs; we can easily create adapters for any model we have. We can create an adapter for Azure that just sends requests to its API, an adapter for a local Transformers model like Marian, and adapters for our in-house APIs as well, if we want to compare external engines with what we already have.
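Here is a minimal sketch of that base class, with names invented for illustration — the real implementation also handles resuming partial runs, preprocessing, and sentence splitting:

```python
from abc import ABC, abstractmethod

class BaseTranslator(ABC):
    """Owns the boring parts: file parsing, batching, line ordering, error handling."""

    max_batch_size = 8  # engines override this; the base class enforces it

    def translate_file(self, src_path: str, out_path: str) -> None:
        # Simplified: the real version also resumes from partial output.
        with open(src_path, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f]
        self.initialize()
        translated = []
        for i in range(0, len(lines), self.max_batch_size):
            batch = lines[i : i + self.max_batch_size]
            try:
                translated.extend(self.translate_lines(batch))
            except Exception as err:
                translated.extend([""] * len(batch))  # keep line order, log, move on
                print(f"batch starting at line {i} failed: {err}")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(translated))

    @abstractmethod
    def initialize(self) -> None: ...  # load a model or authenticate to a cloud API

    @abstractmethod
    def translate_lines(self, lines: list[str]) -> list[str]: ...

    @abstractmethod
    def set_language_pair(self, source: str, target: str) -> None: ...
```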
The next problem was datasets: when we want to test machine translation, what data do we test it on? The usual story when new models come out is that there are several common datasets, usually from the Workshop on Machine Translation (WMT) from different years, but they're very limited — English to German, or English to Romanian for some reason. That may be good for presenting scientific results ("here's the dataset, our model is better, look how great we are"), but not when you want to evaluate models against each other from different angles and across different domains.

Luckily, there's already a resource for this. The people behind OPUS have created a public store of — not all, but most — common translation datasets, and they have very good APIs: you just say "I want this dataset, translated from English to Spanish", and they give it to you. The datasets are very big, so on our side we just support downloading them and, because they usually run to millions of lines, extracting some test data from them at random. In the end, we have our datasets with source data and reference target translations, and then we produce translations from the different engines and compare them with each other.

Another problem we tried to solve is metrics. There are multiple valid ways to translate a sentence into another language, so when you have a reference sentence, the question is how to compare what came out of the translation engine with that reference. Any metric you have will be an approximation. For most metrics there are third-party libraries, but they usually come with questions about how to use them properly. What we did in the end — again, since we built a pluggable framework where you can use different metrics — was to wire up several third-party libraries covering the six most common metrics. It turned out that in our use cases we can mostly just use BLEU, the most common machine translation metric, because we usually care about big differences in quality; the other metrics matter more if you want to improve a model by a small amount. For metrics, again, we just created a base class that you fill in: do you need tokenization, do you need the source or just the target and the reference, and what your languages are.

Okay, so for the evaluation itself, we created just three scripts: one to download datasets, one to translate them using different engines, and one to evaluate the results. Everything is managed by Hydra configs. If you want to reproduce our results, you can just call the scripts without any parameters. If you want to run on any other dataset from OPUS, you create a config that says: this is the name of the dataset on OPUS, I want this many lines out of it, without duplicates, and with a minimum string length of so many characters. Then, when you run the scripts, you just set the datasets to download to your new quick evaluation set — roughly along the lines of the sketch below.
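For example, a dataset config could look something like this — the field names here are hypothetical, just to show the idea; check the repo for the actual schema:

```yaml
# my_quick_eval_set.yaml -- hypothetical Hydra dataset config
datasets:
  my_quick_eval_set:
    source: opus          # fetch from OPUS
    corpus: EMEA          # OPUS corpus name
    src_lang: en
    tgt_lang: es
    n_lines: 5000         # random sample to extract
    drop_duplicates: true
    min_length: 20        # skip lines shorter than this many characters
```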
Then you pass the same dataset name on to the next scripts, and everything is done for you. If you want to evaluate on your own datasets — and that's what we do in-house, because in addition to the public datasets we have our own — you create your dataset by whatever means you have, and you just provide a Hydra config that says: this is my dataset, call it private_dataset; this is the source file, this is the target file, this is the number of lines; and that's it. You obviously don't need the first, download step anymore — for the scripts you just say "translate this private dataset of mine", and then evaluate it.

The same goes for engines. If you want to support a new engine, you create a translation class that inherits from the base translator class, implement the three functions — initialization, translation, and setting the language — and then put it in the config: this is my class, call it "custom" (you also need to add this name to the enumeration inside the code), and these are the settings for my engine. And then in the script you just say that your translation engine is your private engine, and off you go.
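To make that concrete, here's a sketch of what such an adapter could look like on top of the BaseTranslator sketch from earlier, wrapping a Hugging Face Marian checkpoint — class and method names are illustrative, not the exact ones in the repo:

```python
from transformers import MarianMTModel, MarianTokenizer

class MarianTranslator(BaseTranslator):  # BaseTranslator from the earlier sketch
    """Adapter around a local Hugging Face MarianMT checkpoint."""

    max_batch_size = 16

    def set_language_pair(self, source: str, target: str) -> None:
        # Marian ships one checkpoint per language pair, e.g. opus-mt-en-es.
        self.model_name = f"Helsinki-NLP/opus-mt-{source}-{target}"

    def initialize(self) -> None:
        self.tokenizer = MarianTokenizer.from_pretrained(self.model_name)
        self.model = MarianMTModel.from_pretrained(self.model_name)

    def translate_lines(self, lines: list[str]) -> list[str]:
        batch = self.tokenizer(lines, return_tensors="pt",
                               padding=True, truncation=True)
        generated = self.model.generate(**batch)
        return self.tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Used directly, that would be: `engine = MarianTranslator()`, `engine.set_language_pair("en", "es")`, `engine.initialize()`, `engine.translate_lines(["How are you?"])` — with the base class taking care of batching when whole files are translated.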
Okay, so the next piece is the evaluation we actually did. We chose several engines to evaluate — the ones that were interesting from our perspective. First of all, we took two cloud engines, Azure and Google. Then we evaluated Marian MT: Marian is a very good architecture for machine translation, originally written in C++, so it was — and is — very fast. It was later ported to Hugging Face Transformers, and we actually use the Hugging Face implementation. One of the advantages of Marian is that the models are very small — around 300 megabytes per model, something like that, if I remember correctly. Another engine is NeMo: NVIDIA has its own natural language processing suite with models for different use cases, like speech recognition, summarization, and so on, and machine translation is one of them. And then we also tested three models from Facebook: mBART-50, M2M-100, and No Language Left Behind (NLLB), the latest model from Facebook, which was actually released last week. Many thanks to the great contributors at Hugging Face who added support for it within a week, so just a day or two ago I had a chance to plug it into our evaluation.

We included the multilingual models here mostly for completeness. The usual use case for them — the reason Facebook works on them — is low-resource languages: languages with small amounts of parallel data, for which nobody has trained a dedicated translation engine. In our case, we evaluate translation from English to Spanish, probably the most common translation pair, so we didn't expect much from these multilingual engines. But we felt it was important to test them anyway, because when you go to other languages, it's much better to have one model in your pipeline instead of many — if they prove useful, it's good to know.

Another piece was datasets. As I said, for the public evaluation we just used public datasets from OPUS. One of them is actually in the healthcare domain: EMEA, a corpus of documents from the European Medicines Agency translated into different languages. The others include extracts from Wikipedia, TED Talks, OpenSubtitles, books, ParaCrawl, and CCAligned, which is aligned text from web crawls. For each dataset we tried to extract around 5,000 lines. It depends on the dataset — some of them, especially OpenSubtitles, have a lot of duplication and a lot of short lines, so the sample turns out smaller — but for most datasets we used a fairly stable 5,000 lines for evaluation. The datasets themselves are much, much bigger, millions of lines, and you can use them if you want a bigger evaluation.

And then the metrics. There are different ones; we mainly focused on BLEU, and I will show on the graphs why — it turns out that, at least in our use cases, the metrics were well aligned with each other. BLEU is the traditional metric: a number between zero and one that measures the similarity of machine-translated text to your reference translation. It does this by calculating the overlap of word n-grams between your translation and the reference text, plus a penalty if the lengths of the texts differ. It's a very traditional method, and it actually correlates well with human evaluation: if you ask people how good a translation is, their judgments correlate well with BLEU scores. Other metrics like TER, chrF, and ROUGE try to do essentially the same thing — compare the reference text and your translation via different kinds of overlap. The other two metrics, BERTScore and COMET, are, let's say, a new generation of metrics: they use actual multilingual models for inference (BERTScore builds on BERT; COMET, I think, uses its own model), build sentence embeddings for your reference translation and your translation, and measure how close those embeddings are — if they're close, it's a good translation — and the models are fine-tuned specifically for this use case. And we see more and more research moving from BLEU towards BERTScore and COMET, especially COMET.
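For BLEU specifically, we rely on a third-party library; a typical way to compute a corpus-level score is with sacrebleu — the sentences here are made up:

```python
import sacrebleu

hypotheses = ["The patient should take two tablets daily."]
# Outer list: one entry per reference set; inner list: one reference per hypothesis.
references = [["Patients should take two tablets per day."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus BLEU; sacrebleu reports it on a 0-100 scale
```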
Okay, so what's left is the results. For the general comparison — I'm sorry, it's probably hard to read because it's a lot of data — what you see here is different datasets, and for each dataset, the first two columns are the cloud engines, Azure and Google; the next ones are the open source engines, NeMo and Marian; and the last three are the multilingual ones: mBART-50, M2M-100, and NLLB. I know it's hard to read, but if you look closely at these results, the main conclusion for us was that the open source engines are actually on par in quality with the cloud solutions. If you talk with people about which translation to use, they usually tend to say: oh, let's just go with Google or Azure, it's probably better. Turns out, not always — the open source solutions are actually on par with them. So one important conclusion for us was that we can implement open source machine translation models inside our company instead of using the cloud.

Another piece was around the multilingual models. Here's a comparison where the first column is Marian, which I put there for reference, and the next ones are different versions of the multilingual models — big, small, and so on. The first multilingual model from Facebook, mBART-50, was really bad. Again, as I said, it has a different use case, but for English-to-Spanish translation it was really bad compared with Marian. But then they improved: M2M-100, released about a year ago, actually has better quality. And then, last week, they released NLLB, which is actually on par with Marian on some datasets — not all of them, but on some it's on par. So maybe in the future it will make sense to just use one multilingual model for everything. The remaining problem is that Marian is about 300 megabytes and NLLB is five gigabytes, with the corresponding difference in speed.

Another test we ran was around greedy search. Again, we want translation to be fast. The open source models are basically Transformer models — encoder plus decoder — and for decoding they usually run beam search to generate the results. It turns out that, for all the models, there's not much difference in quality between plain greedy search and beam search with a beam of, I think, four. The models score almost the same — in this chart, each pair of bars is the same model with greedy versus beam search. That was an important conclusion for us, because greedy search is obviously faster — usually about two times faster — so, again, easier to use.
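The switch is a single generation parameter in the Hugging Face API; roughly, for the same Marian checkpoint as in the earlier sketch:

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")
batch = tokenizer(["The patient should take two tablets daily."], return_tensors="pt")

greedy = model.generate(**batch, num_beams=1)  # greedy: take the top token each step
beam = model.generate(**batch, num_beams=4)    # beam search, keep the 4 best candidates

print(tokenizer.batch_decode(greedy, skip_special_tokens=True))
print(tokenizer.batch_decode(beam, skip_special_tokens=True))
```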
And, as I said, about the metrics: this is a different graph, with the different translation engines sorted along the X axis, and each line is a different score — BLEU, chrF, BERTScore, COMET, ROUGE, TER. As you can see, if you just sort the engines by, let's say, BLEU score, from smaller to higher, all the other scores end up sorted as well. There's no divergence: if one model is better than another on one score, it will be better on any other score too. Again, that holds in our use case because we compare models with big differences in quality; if you care about one or two BLEU points, you would probably start looking at the other metrics, because then the differences matter.

And a couple of words on why we're doing this at all — one of the reasons is the cost of translation. If you want to translate one million characters, which is around 300 pages, it will cost around $10 in Azure, $20 in Google, $15 in AWS — I don't know where they get these numbers, to be honest. But if you deploy the same open source engine — which, as we saw, has the same quality — on an Azure GPU machine, it will take around five or ten minutes to translate all that data, and you pay about 30 cents, if you're ready to support it yourself.

And all of this framework I've been talking about is actually open source: you can find it on GitHub, under Optum — the NMT repo. If you have a need to evaluate machine translation engines for whatever reason, it's there, you can use it. And if you have questions, you can reach out to me or to Sahil here in the room. And yes — people made me put this here — we're actually hiring. We have a lot of positions open here in Dublin and across the world. If you're interested, again, reach out and we can talk. Thanks.

Okay, thank you. We have about six minutes for questions. If anyone in the room would like to ask a question, please come to the microphone.

Yeah, thank you for your talk. You came to the conclusion that the open source models are about as good as the commercial ones, but in the graph you do see a slight difference. So how do you draw the conclusion that they're on par?

Yes. So first of all, one of the conclusions was, as you saw, that quality differs between datasets — sometimes one engine can be genuinely better. That's actually a known problem in machine translation evaluation: for example, here on ParaCrawl, NeMo is much better, but that's probably because it was trained on it — again, these are public datasets, and people use them. But for our use case we don't care much about small gaps. For example, if you look at this dataset, the difference between, say, Google and Marian is around two BLEU points, and in the real world that means nobody notices the difference. It matters for some applications, but if we're talking about, say, an application where an agent chats with a customer and uses machine translation to get suggestions in the other language so they can communicate better — nobody will notice two BLEU points. That's why I say that for our use case they're basically the same. And another piece is that we actually want to fine-tune the models. For our use case that's also possible to do in the cloud, but it's much, much more expensive; with open source it's much easier and much cheaper, and then you actually see the big difference.

Thank you.

Thank you. Can you tell me something more about the licensing of the open source models? Can you use them commercially? How does that work?

In short, it depends on the model. I don't remember the licenses for all the models we used, but usually they allow commercial use as long as you say that you're using them. It's important to check, though, because some of them are actually released under non-commercial licenses. For example, as far as I remember, the Facebook models — we included them just to see how well they perform, but I'm not sure we could use them in production, because they're probably non-commercial.

It doesn't look like we have any remote questions either, so that ends the session. Once again, thank you, Anton, for the interesting talk.

Thank you.