Welcome to the 13th edition of the TAISIG Talks, which are a co-production of Tilburg University and MindLabs. We are broadcasting live again this afternoon from our building in the vibrant and lively Spoorzone. Our team today consists of my co-host, Marie Postma, and our technical colleagues, Martin and Lucille. We are very proud to be hosting two speakers today: Javad and Stefan. Marie will introduce them properly in a bit. Marie, the floor is yours.

Thank you. Javad is a PhD candidate and lecturer in the Department of Cognitive Science and Artificial Intelligence at Tilburg University. In his research, he addresses problems in neural machine translation that arise from data scarcity, for example by providing additional data for domain-specific systems or for low-resource languages such as Persian. Before coming to Tilburg, Javad was a member of the NLP group at the University of William in Greece. Besides machine translation, his expertise and interests also include sentiment analysis and unsupervised language acquisition. There are other interests that he pursues in his free time: Javad is an avid runner, he follows the performance of the Iranian soccer team on their way to the World Cup, and he likes coaching. Javad, the floor is yours.

Thank you, Marie, thanks very much. Before we start, I'd like to thank the TAISIG Talks organizers for having me, and also the people here who are attending. Let me briefly introduce myself. My name is Javad Pourmostafa. As for my educational background, I did my undergraduate and postgraduate studies in my home country, Iran, where I studied computer engineering. During my master's degree I was also affiliated with a national NLP lab in Iran, as Marie said, and for my master's thesis I designed a sentiment analysis system. Around the same time last year, I moved to the Netherlands and started my PhD at the Department of Cognitive Science and Artificial Intelligence at Tilburg University. I'm working with Dr. Dimitar Shterionov and Dr. Pieter Spronck. Like many of my colleagues at the department, I am broadly interested in AI, and more specifically in machine translation; I would say my primary interests are data augmentation, data selection, and low-resource machine translation.

Today I'm going to talk about one of our recent papers, titled "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts". Before delving into the details of the paper, I will briefly explain why and where machine translation is used. People typically break down the major tasks of machine translation as follows. The first one is assimilation. In an assimilation task, I as a reader initiate the translation and only want to get at the content of the input; at this stage, I am fully aware that the translation quality will not be perfect. Let me give an example. As far as I know, some webshops only support two languages, say Dutch and English. However, if I want to buy something, I go to the website, turn on machine translation, and get a translation of the whole page. At this stage, I don't care about the translation quality in terms of fluency, but correctness is very important.
It's worth noting that a lot of research has been done in this direction, and a lot of ongoing research falls into this direction as well.

The next task is communication. Let's say you are going to a foreign country and you cannot speak the local language; then you can rely on machine translation. In the communication task, the good thing is that whenever something is not clear, whenever there is something you cannot understand, you can ask a follow-up question, while being aware that the translation cannot be perfect. Another good thing is that we can have machine translation units in different applications: we can connect them to chat rooms, and we also have the option of having them on our hand-held devices. For communication, machine translation is often combined with a speech recognition unit, so that our utterances are converted into input text. The input text is fed to the machine translation system, and once we get the translated text, we can pass it on to a speech synthesis unit.

The last task is dissemination. Let's say I write a book and I want to publish it in different languages. Here, in contrast to assimilation, translation quality is very important. This is the reason that this task is done by human translators rather than machine translation, but we can still use the help of machine translation: in this scenario, we first translate the input with machine translation and then the human translator post-edits it.

Now that we are more familiar with the major tasks of machine translation, the next question is how we create machine translation systems. To train machine translation, we have two main paradigms: statistical machine translation, which relies on statistical models, and neural machine translation, which relies on neural networks. I'm going to talk about the latter. Let's say the ultimate goal is translating from English to French. Like any machine learning task, we need training samples, but in a supervised setting, to train an NMT model, we need parallel data. Parallel data is nothing but training samples that consist of a source sentence and a target sentence: for each source sentence, we need a corresponding target sentence. To train an NMT model, we follow a number of steps. The first step is generating the input data: as you can see on the slide, we have our source sentence, followed by a special start token, and then we have our target sentence. The next step is that we have to vectorize the input data; for vectorization, we use a source embedding matrix and a target embedding matrix. The next step is that we feed the source and target embedding vectors to a sequence-to-sequence model. As you can see in the figure, on the left side we have the encoder and on the right side the decoder. For the encoder, we don't need to calculate any output probabilities, which is why that block is empty, but for the decoder we do need to calculate output probabilities, and I have put some example numbers there; these numbers show the probability of each output token. Once we calculate the loss for the neural network, we backpropagate it through the entire network in order to update its weights and biases. We repeat these last steps until the model has converged.
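A minimal sketch of the training loop described above, in PyTorch; the vocabulary sizes, dimensions, and random toy batch are illustrative assumptions rather than the actual system:

```python
# Minimal encoder-decoder (seq2seq) training sketch in PyTorch.
# Toy vocabulary sizes, dimensions, and data are illustrative only.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 100, 32, 64

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)   # source embedding matrix
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)   # target embedding matrix
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)          # output probabilities (logits)

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.src_emb(src))    # encode source, keep final state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                      # logits over the target vocabulary

model = Seq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy "parallel" batch: source sentences, shifted target input, target output.
src = torch.randint(0, SRC_VOCAB, (8, 10))
tgt_in = torch.randint(0, TGT_VOCAB, (8, 12))
tgt_out = torch.randint(0, TGT_VOCAB, (8, 12))

for step in range(3):                                 # in practice: repeat until convergence
    logits = model(src, tgt_in)
    loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # backpropagate through the network
    optimizer.step()                                  # update weights and biases
```

In a real NMT system the recurrent layers would typically be replaced by a Transformer, and the batches would come from tokenized parallel text rather than random indices, but the train-until-convergence loop is the same.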
Now I'm going to use the underlying concepts of NMT that I just described to define the research problem. There is a common belief among machine translation practitioners and researchers that more training data is better, which is true. However, training an NMT model on large datasets requires substantial amounts of resources, such as memory and time. Let me give an example. Let's say you are dealing with 31 million sentences, and all of these 31 million sentences, according to the previous slide, need to be converted into embedding vectors. You can imagine that these embedding vectors have to be loaded into memory, and in our example this embedding matrix has 31 million rows and x dimensions. Usually people don't choose a small number for x, because x represents the dimensionality of the vectors and we don't want to lose information. And even if we can solve the memory issue, we still, according to the previous slide, need to spend time training the model and updating its weights and biases, so it is very time-consuming.

Now there is another problem: we do not have parallel data for all language pairs. Let's say you want to create a model that can translate, for example, Persian to Dutch; in such low-resource scenarios, you don't have parallel data. And it gets even worse when it comes to translating specific domains of interest, such as medical texts or legal texts. I think it was two or three years ago that COVID hit all aspects of human life. At that time, it was very important to timely and accurately communicate and share medical information across the globe in multiple languages. But the question is: could we do that instantaneously? If we don't have parallel data for the domain of interest, such as COVID, data-driven paradigms such as NMT may perform poorly for domain-specific translation. It doesn't matter how many sentences, how much parallel data we have in general; we cannot get good results just by training the model on it.

So here we face a two-sided challenge, and we need to define our research question: what amount of parallel in-domain data is necessary to achieve state-of-the-art translation quality, given limited computation and data capacities? To be fair, the research community has made many efforts through a technique called domain adaptation. In domain adaptation, there is an existing model and people fine-tune it to a certain domain, the point being that this domain is different from what the model was originally trained for. Let's say you have an out-of-domain, general-purpose model and you want to fine-tune it for the medical domain. This was a good motivation for us to propose a method, a data selection method, in order to improve domain-specific translation in low-resource scenarios, as I said, for example Persian to Dutch. The way we implement this is by extracting in-domain samples from an out-of-domain corpus, and ultimately we want to use the selected or generated sentences in the context of domain adaptation for training. Briefly, our main contribution is proposing a language-agnostic data selection method for selecting or generating a parallel in-domain corpus. I want to draw your attention to this point: we do this using only a monolingual, domain-specific corpus.
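To make the resource argument above concrete, a rough back-of-the-envelope estimate; the 31 million sentences figure comes from the talk, while the embedding dimension and 32-bit floats are assumptions:

```python
# Back-of-the-envelope memory estimate for holding sentence embeddings in RAM.
# 31M sentences is the figure from the talk; dim=512 and float32 are assumptions.
num_sentences = 31_000_000
dim = 512                       # assumed embedding dimensionality "x"
bytes_per_float = 4             # float32

total_bytes = num_sentences * dim * bytes_per_float
print(f"{total_bytes / 1024**3:.1f} GiB")   # ~59 GiB for this configuration alone
```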
To put that monolingual requirement in practical terms: anybody can write a web crawler and crawl, for example, Twitter, as long as they make sure they are collecting data from their intended domain. Using that collected data, they can generate parallel data that can ultimately be used for training an NMT model. And as I said, we don't need to translate anything, we don't need to align any sentences, and we can still create a parallel in-domain dataset.

Here you can see an overview of the proposed approach. On the left side of the slide, we have our out-of-domain dataset, which has 31 million sentences, and we also have an in-domain dataset. As you can see, the in-domain dataset can contain either the source or the target side, so it is a monolingual dataset. The next step is that we feed our sentences to a pre-trained model; we use Sentence-BERT (SBERT) as the sentence encoder. There are many good reasons for choosing SBERT, but that is beyond the scope of this talk. The next step, once we get the embedding vectors, is to pass them to the data selection algorithm. The data selection algorithm gives us not just one, but several new in-domain datasets. As you can see, each new in-domain dataset consists of source and target sentences, meaning that we can use it for training: for each new in-domain dataset, we can train one NMT model.

In this slide, we can see one iteration of our data selection algorithm. In step one, we take the embedding vectors given by SBERT, so now we have embedding vectors for the in-domain sentences and embedding vectors for the out-of-domain sentences. We use cosine similarity to compute the similarity between the embedding vectors. As I said, this is only one iteration of our data selection algorithm. Once we get this scored dataset, shown in the last block, we need, in step two, to sort it, so that the highest-scoring sentences end up at the top of the list and the lower-scoring sentences at the bottom. At the end, we select the top-N new in-domain sentences. In our case study, we tried six values of N. It is very difficult to determine the right N; it varies from dataset to dataset.

Here you can also see an example of our data selection output. In this table, in the right-most column, we have the score, that is, the similarity score against the out-of-domain sentences, and in the first row we have our query, a monolingual in-domain sentence. As you can see, the query itself has no similarity score, because we used it to retrieve the relevant in-domain sentences. Then, in the next rows, we have the top-N parallel in-domain sentences. I forgot to tell you that we conducted this experiment with an English-French dataset; this is the reason you see English and French in the table. For the top-1 sentence, for example, the similarity score is 90.10; for the top-2 we have 86.16; for the top-3 we have 85.96. As we go to larger N, the similarity score decreases.

So if I want to tell you what we showed: we showed that even though an NMT model is trained on a large amount of out-of-domain data, it will not perform well on in-domain translation. As a result, more training data alone is probably not sufficient for in-domain translation. We also tried to mix in-domain sentences with the out-of-domain data: after creating our in-domain dataset, we mixed it with the out-of-domain data and then trained the model from scratch. The model improved, but not significantly.
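A minimal sketch of this kind of embedding-based selection using the sentence-transformers library; the model checkpoint, toy sentences, and value of N are illustrative assumptions and not necessarily the paper's exact setup:

```python
# Sketch of embedding-based in-domain data selection with Sentence-BERT.
# Model name and sentences are illustrative; the paper's exact setup may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint

in_domain = ["The patient was treated for a severe viral infection."]   # monolingual query
out_of_domain = [                                                        # source side of the large corpus
    "The hospital reported new cases of the virus this week.",
    "The parliament passed the new budget yesterday.",
    "Doctors recommend vaccination against seasonal influenza.",
]

q_emb = model.encode(in_domain, convert_to_tensor=True)
c_emb = model.encode(out_of_domain, convert_to_tensor=True)

# Step 1: cosine similarity between the in-domain query and each out-of-domain sentence.
scores = util.cos_sim(q_emb, c_emb)[0]

# Step 2: sort by similarity; Step 3: keep the top-N sentences (and their target sides).
top_n = 2
ranked = sorted(zip(scores.tolist(), out_of_domain), reverse=True)[:top_n]
for score, sentence in ranked:
    print(f"{score:.2f}  {sentence}")
```

With a 31-million-sentence corpus the encoding would be done in batches and the similarities computed against many in-domain queries, but the rank-and-select idea stays the same.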
This was expected, because the model was biased toward the out-of-domain sentences rather than the in-domain data. We also compared our work with state-of-the-art methods in terms of translation quality, and our approach outperformed them. Another important point is that our in-domain dataset was relatively small, and according to the NMT training procedure we discussed, this resulted in less training time.

So if I want to wrap up the research, I would say that we proposed a method that helps machine translation mitigate the lack of parallel data in in-domain settings. As I said, it is a language-agnostic technique, so it can be used for any language pair. The selected data can be applied directly to domain-specific machine translation: once we have created our parallel sentences, we can use them to train or fine-tune an NMT model. The proposed selection pipeline works as follows: first, all sentences are embedded for semantic search; we have a query, as in a search engine, and once we have the query, it is compared with our out-of-domain sentences in the embedding space; in the last part, we rank the in-domain data with a sorting algorithm that sorts our sentences. For future work, we intend to use the generated corpus for further domain adaptation, or even in multilingual settings. In case you are interested in reading the full paper, in which we approach this research from different angles, you can scan this QR code and access the paper. So that's it, thanks for your attention. If you have any questions, I would be happy to take them.

Thank you, Javad, for this presentation about your research. I have one question, a general question about translation quality. You mentioned that you evaluated your system and that it performed better than the state of the art. Can you say something about the evaluation: how are translation systems evaluated, and how does that pertain to our perception as human users?

Yeah. So what we usually do is that we have a gold standard, a reference, that is usually created by a human translator. Next, we have some metrics, such as BLEU. What we do is compare our output, based on its n-grams, with the gold-standard reference, and then we compute a score, similar to the F1 score or accuracy that we have in machine learning; it is that kind of metric, and it is based on n-grams. For example, we take the first word of the machine translation, then the second word, and we compare those two words with the corresponding words in the reference, and so on. This is the metric we commonly use to evaluate machine translation performance.

Can you say something about how that pertains to what we as human users would consider to be acceptable or good quality translation?

It's quite controversial. I mean, one sentence can be translated in different ways. What we usually do is that we have the reference and, at the end of the day, we compare our output against that reference. We can also consider fluency and adequacy, but most of the time, as I said, it is more related to assimilation, so we pay attention to accuracy rather than fluency. If something goes wrong, say in the word order or in the format of the text, we don't really care so much about that, unless we want to do something like dissemination, for example publishing in different languages.
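A small sketch of the n-gram-based evaluation described above, using the sacrebleu library as one common BLEU implementation; the example hypotheses and references are made up:

```python
# Minimal BLEU evaluation sketch with sacrebleu; example sentences are made up.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he read the book quickly"]          # MT output
references = [["the cat is sitting on the mat", "he read the book fast"]]    # one human reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, based on n-gram overlap with the reference
```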