The first speaker is Roger Labahn. He is from the University of Rostock, where he leads the Computational Intelligence Lab. The next speaker is Alejandro Toselli, a researcher at the Pattern Recognition and Human Language Technology Research Center at the Universitat Politècnica de València. They are both going to talk about their different systems for keyword searching and spotting. So I will hand over to Roger first. Thank you very much, Roger.

Thanks for the introduction. I am going to introduce keyword searching here in the context of Transkribus, and the idea of my talk is really to present to you a little bit of the foundation of everything. We have a practical demonstration in the tools session tomorrow. Once we started, there was the question whether we really need a good, complete transcription of the text for searching, and I will also introduce a new term, following Günter's proposal, of investigating texts. At some point we saw that a complete transcription is probably not necessary, or most likely not necessary, and came up with a new concept which I would like to introduce here, together with its applications. I want you to learn a little bit about our technology, just to make sure that when we talk to each other we understand each other; it really is a technical talk from the developers' side, and also a little bit about the configuration and behaviour of these features, and maybe also about bugs in the software. I would really like to remind you that our point of view is that what we are delivering here, with Transkribus in general but in particular with the search engine, is not an engine that executes something which is finished; it is much more a tool for continuously working on your material. We would like to enable you to work with these tools and to improve them while working. So you yourself have to adapt the tool and the whole machinery to your specific challenges, and in order to be able to do that you have to understand it, of course, and you have to be able to interpret what comes out of the tool. That is the ambition of this talk.

Okay, a little bit about the foundations from the technology point of view. Today I treat the recognition engine as a black box: the engine produces something, and the something we work with is a matrix of character confidences per position. These scores, as I would like to call them while we speak about them here, come from some neural network, or equally well from some other model output, and we assume, mathematically we cannot prove it, but we assume, that these numbers estimate the probability of a character at a certain position. What you might do at first glance is choose the character per position which is most likely, and then you end up with a string. We use these strings, of course; they are presented to you as the raw reading result, or you can also call it a free reading result. Free, because nothing is considered about the document context, for instance, or about the time the writer lived in; the engine just reads, and you take the most likely character per position. And then, from time to time, people apply post-OCR correction: you use external sources, like those I mentioned already, to improve the strings, detect the errors, and correct them.
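As an illustration of this raw, best-path reading, here is a minimal Python sketch (not the actual Transkribus engine): a toy ConfMat with an invented alphabet, from which the free reading result is obtained simply by taking the most confident character per position.

```python
import numpy as np

# A minimal sketch: a ConfMat is a matrix of character confidences per
# position; the "free" / raw reading result takes the most likely character
# at every position. The alphabet and the tiny matrix below are invented.
ALPHABET = list("abehimnu ")          # hypothetical character set
confmat = np.array([
    [0.05, 0.70, 0.02, 0.03, 0.05, 0.05, 0.03, 0.02, 0.05],   # position 0
    [0.60, 0.05, 0.05, 0.10, 0.05, 0.05, 0.03, 0.02, 0.05],   # position 1
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.05, 0.05],   # position 2
])

def best_path(confmat, alphabet):
    """Raw reading result: pick the most confident character per position."""
    return "".join(alphabet[i] for i in confmat.argmax(axis=1))

print(best_path(confmat, ALPHABET))   # -> "ban" for this toy matrix
```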
We skip these two blocks of the workflow: we do not start correcting, bringing in the external sources, the external knowledge, only at the point where we already have finished strings. Instead we insert the document context into the workflow at an earlier stage, and at this earlier stage we use what we call a ConfMat. Please get used to this abbreviation, because I will use the word throughout this talk; it stands simply for the confidence matrix, these character confidences per position. We try to evaluate the recognition information for the whole workflow directly from the entire confidence matrix, which in turn means that you have to store it if you want to look at the same text again. So in this application we now store these confidence matrices, rather than strings, as the reading result. Decoding now refers to the process of getting out of this probability information an actual string, text which is readable to humans, and we do it by additionally using the external sources. What is behind all of this is finding an optimal match, an optimal representation so to say, between the query which you would like to dig out of the matrix and what is really in the matrix according to these probabilities. The outcome is a list of ranked alternatives: we usually present several possibilities which can all be seen in the very same matrix. Looked at from different points of view, it can give really different alternatives.

So this is a graphical representation of what these confidence matrices are: the darker, the more likely a certain letter. This is a handwritten string, the name of a small German city close to my home city in the north of Germany. As you see, for instance, here the machine says this is very likely a B, so this is very dark. Around here, on the other hand, nobody can really know what it is; if you do not know the name of the city, you would not be able to spell this out. And that already explains why, for instance, it is not that dark here: the confidence matrix says, okay, there is something around here, but I am not sure. On the other hand, if I give you a list of German cities and ask you to find the particular city which is most likely coded in this matrix, then you will probably end up with the correct answer. So that makes a real, huge difference between free reading, where I know nothing, and decoding against certain queries.

How close are strings, or how close is a string to a matrix? There is a well-known concept in computer science and mathematics called the Levenshtein distance, after a Russian mathematician. Very informally: for the distance you count how many insertions, deletions, or even substitutions of letters you need; I will show you an example in a second. The main advantage of this approach is that there is an extremely efficient, very fast algorithm for computing it. So we can do it very, very often, very fast, and also in parallel on modern computers. The idea is always the same: find the optimal, we call it path, the optimal representation, which is a path through this particular matrix. Optimal means short, in the sense of how far these things are away from each other, or also cheapest, if you think of having to pay something for insertions and deletions; people also talk about costs, and you will read the word cost very often in our presentations.
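For readers who want the classic algorithm spelled out, here is a short, self-contained sketch of the Levenshtein distance computed with dynamic programming; the Wien/Wein example discussed next gives a distance of two.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the number of insertions,
    deletions and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Wien", "Wein"))   # 2: swap the 'i' and the 'e'
```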
Mathematically these are basically weights, whether you call it a distance or a cost, and counting them means you add these costs up. Here is a tiny example; I am not going into the details, and everybody who knows dynamic programming or the Levenshtein distance knows what it is. You can click on this demo in the PDF we will provide. Here you see what I mean: this is a very simple representation of a matrix, and you walk through the matrix and pick, for instance, the way of changing the word Wien into the word Wein, Vienna into wine. It is a German example from one of my colleagues, and I love it for this presentation: the distance between Wien and Wein in German is two. With substitutions it is particularly easy, you just have to swap those two letters. And now the whole story is about the following: if my matrix says an I is at the top-ranked position, how much does it cost me to replace that I by an E? We call these substitutions, or also replacements. And then I add everything up: the W against the W costs zero; this one costs something, maybe one; this one costs something, maybe one in the general case, or some fraction; and the final N again costs zero. You add these up and you find the optimal path, the cheapest one, the closest one, which is hidden in the matrix. So that is the whole story: here is a matrix, here is a path, and here you see the final cost. And looking up these paths through the matrix, finding among the very, very many possibilities how you can travel from the upper-left corner down to the lower-right corner, is extremely fast with what we call dynamic programming.

People very often think in terms of probabilities. Probability is a term in mathematics which is well defined; confidences are not probabilities, so as a mathematician I refuse to say probability. That is why we invented the word confidence. We think about things lying between probability zero, impossible, and probability one, completely sure. Now I have introduced something like a distance, or a cost. That scale starts at zero: I have to do nothing, I read the word Wien and there is the word Wien, nothing to pay. But these costs, these distances, can become arbitrarily large: you can change the word Wien into the word Hamburg, why not, it just costs maybe 10,000 or 20,000, I don't know. So there is an essential difference between this scale and that scale, and we calculate back and forth between them. Understanding both scales is important, because in our application, I guess, we mostly talk about distances, but people also talk about these probabilities. Here are the mathematical functions: one direction is just a power of e, the other is a minus log, so minus log-likelihoods, which many people in computer science talk about. The major trouble is that you have an infinite scale on one side, and on the other side only this tiny finite piece between zero and one. So you have to deal with the question of what to do with all these potentially very large costs, which is practically very important for you, because we make a cut somewhere. But where to make the cut we cannot say in advance; you have to dig this out from your experiments, from your experience, and it changes from data set to data set. So understanding this concept, and deciding for yourself whether you prefer to think in the distance measure or in the probability measure, is in the end up to you.
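A small sketch of how the two scales relate in practice: costs are minus-log confidences (capped at an assumed maximum, since the real cut-off is data-dependent), they add up along a path, and exp(-cost) maps a total cost back to the confidence scale.

```python
import math

MAX_COST = 20.0   # assumed practical cut-off; where to cut depends on the data

def to_cost(confidence: float) -> float:
    """Probability-like scale (0..1] -> distance/cost scale [0..inf): -log, capped."""
    if confidence <= 0.0:
        return MAX_COST              # "impossible" would be infinitely expensive
    return min(-math.log(confidence), MAX_COST)

def to_confidence(cost: float) -> float:
    """Cost scale -> probability-like scale: exp(-cost), the inverse mapping."""
    return math.exp(-cost)

# Costs along a path simply add up, which corresponds to multiplying confidences:
path = [0.9, 0.8, 0.95]                           # per-character confidences
total_cost = sum(to_cost(p) for p in path)
print(total_cost, to_confidence(total_cost))      # ~0.38, ~0.684 = 0.9*0.8*0.95
```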
That is what I mean when I say we have to learn it together; we cannot present it as a ready-to-use solution. So how can we use all this? Of course there is the basic way: we have text, we generate these ConfMats, and we do it per line. That is the whole reason why you have to put in these baselines: the line is the basic structure in which we are thinking. The decoding engine then gets a query: I am asking, is this word hidden in the line? And we dig out the alternative answers, ranked by some confidence, distance, or probability if you wish, whatever you calculate. The point, again, is that confidences are not probabilities; that was the mathematician's part, I had to say it. The practical part is this: confidences and distances do not have an absolute meaning. That is very important for you. If in one experiment there is a distance of five, and in another experiment there is also a distance of five, these two fives can mean completely different things: one five may be extremely large, the other may be tiny. It is a matter of experience and of the particular experiment, so they are essentially incomparable, and you have to configure and tune the whole thing manually; there is no general solution here. The place where you will always find this question is a threshold for something, where to cut the scale.

Now we can have different applications of this basic pattern. The classical one is string search: you put a query string in, you get the distance to the query, and you rank the results. Then of course you run into big-data issues, because you can imagine that you no longer have to search within strings; you have to search within these matrices, and there may be a huge number of them, so the response time may become unacceptable. That is something you should tell us when you start these experiments. Maybe some pre-processing is even necessary; we tried it, and it is not so easy. Our close colleagues from Valencia have a very, very impressive solution, so, thanks Alejandro, I can just put the link here, and UPVLC will tell you this story in the next talk. For this problem we ourselves have no very good solution at the moment.

But we can do more than the usual way of asking things. We can, for instance, come back to transcription, to reading text. Then we put in not just one single word, but an entire dictionary, maybe a dictionary with word frequencies or something like that. In computer science people call that a language model. Don't think about it in terms of grammar and things like that; it is not really like that, it is mostly dictionaries including these word frequencies. Including more, like actual grammar, is really at the edge of research in this area. Decoding then means: compare all dictionary items against the ConfMat line you have, and if you do this for all dictionary items, you find the dictionary item which most likely matches the reading, and you say, okay, that is my reading result. You would be surprised where in the matrix that reading result sits: it is not on top, it is maybe not even among the first 10,000 possibilities, but it is in there. And that is the main advantage of this method: if you only exploit the top line, what we call the best path, you lose all the other information in the matrix. The new idea keeps this information and uses it for decoding as well.
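The following sketch illustrates the idea of decoding one ConfMat line against an entire dictionary. It is a toy stand-in, not the actual CITlab decoder: the gap penalty `indel` and the cost cap are invented values, and a word-frequency term could be added to the cost but is left out here.

```python
import numpy as np

def char_cost(confmat, pos, ch, alphabet_index, max_cost=20.0):
    """Cost of reading character ch at matrix position pos: -log(confidence), capped."""
    if ch not in alphabet_index:
        return max_cost
    p = confmat[pos, alphabet_index[ch]]
    return min(-np.log(p), max_cost) if p > 0 else max_cost

def decode_cost(word, confmat, alphabet_index, indel=5.0):
    """Edit-distance style DP: cheapest alignment of `word` with one ConfMat line.
    `indel` is a made-up gap penalty for skipping a position or a character."""
    T, Q = confmat.shape[0], len(word)
    D = np.zeros((T + 1, Q + 1))
    D[0, :] = np.arange(Q + 1) * indel
    D[:, 0] = np.arange(T + 1) * indel
    for t in range(1, T + 1):
        for q in range(1, Q + 1):
            sub = char_cost(confmat, t - 1, word[q - 1], alphabet_index)
            D[t, q] = min(D[t - 1, q - 1] + sub,   # read the word character here
                          D[t - 1, q] + indel,     # skip a matrix position
                          D[t, q - 1] + indel)     # skip a word character
    return float(D[T, Q])

def decode_with_dictionary(confmat, alphabet, dictionary):
    """Rank every dictionary entry by its cost against the ConfMat line;
    the cheapest entry is the reading result, the rest are ranked alternatives."""
    alphabet_index = {c: i for i, c in enumerate(alphabet)}
    return sorted(dictionary, key=lambda w: decode_cost(w, confmat, alphabet_index))

# Hypothetical usage, reusing a toy ConfMat like the one shown earlier:
# decode_with_dictionary(confmat, ALPHABET, ["Wien", "Wein", "Hamburg"])
```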
So we get reasonable text alternatives, but now the criterion is that they are close both to the ConfMat, so to what has actually been read, and to the language model, which means the answer is also sensible and meaningful in terms of language; it meets both criteria. You can improve the result of your text recognition with a good language model and vice versa. People get very impressive results when they put in a very good dictionary: the machine suddenly reads things which you yourself would not have expected. The only bad case is when both your text recognition and your language model are poor; then, of course, we cannot do anything. So what is the message? It is not really necessary to improve the text recognition results like hell for years. Try a good dictionary; try to give the machine support about the world of language in which it has to read the text. And with this concept you can forget your first try and say, okay, now I have a new dictionary, I have learned something more, there are new languages in it, more Latin words instead of only German or English, and you can give it a new try. You have not lost anything, because all the information still sits in the confidence matrix. You can look at the same text with your English eyes and get a result, with your German eyes and get a result, with your French eyes and get a result, without having a predefined string representation. The old way, fixing a German string first, would never ever enable you to read it with English eyes afterwards; then the information is gone.

Okay. Investigating is the new point which, as I said, we try to introduce here. One of the challenges I mentioned before is, for instance, that these language models have to model the fuzzy, dynamic, incomplete behaviour of natural languages. Also, the queries may be complex, in the sense that we ask the ConfMat very specific things which are not just words: specific combinations, restrictions to character classes, I mean asking only for digits, or only for special letters, or things like that. Or maybe even a restricted vocabulary: tell me which German cities are in there, or which Austrian cities, or first names, or family names, so a vocabulary which reflects my interest at that moment. This is very well known in computer science under the term regular expression. There is no real definition I can present here; a regular expression is a certain syntax for describing such patterns. If you know these things it is easy; if you don't, I will not be able to teach it within two minutes. Just to give you a glance at it: suppose you try to encode a four-digit year, and you know it maybe starts with a one and not with a two, because it is a historical text. Then you have three more of these digit characters. You may know this kind of coding from Excel or other programs. The regular expression would be a one, then one character out of the class zero to nine, the digits, and so on, and for the three digits there is an abbreviation with curly brackets. The syntax of regular expressions is a really mighty, powerful tool; we do not support everything, but we support a certain basis. And now you can ask for these things, for instance.
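The four-digit-year pattern from the talk, written as a standard regular expression and tried here with Python's `re` module (Transkribus supports its own subset of the syntax):

```python
import re

# A "1" followed by three digits, e.g. 1800: [0-9] is the digit class and
# {3} the curly-bracket repetition abbreviation mentioned in the talk.
year = re.compile(r"1[0-9]{3}")

for candidate in ["1800", "2000", "anno 1799", "180"]:
    m = year.search(candidate)
    print(candidate, "->", m.group(0) if m else "no match")
```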
And I guess I put here something which is exactly that one. This is Transkribus, and it points you to the answer, for instance the 2000 here. And now we are shocked: we searched and there is a 2000, but it shows a 1800, and indeed I guess it really is 1800 in the original line; this is just a screenshot from Transkribus. And once again, if you do not wonder about this situation, then you have understood everything right, because the 2000 is just the best path; that is what the string representation pretends is there, but in fact there was a 1800. If I now ask the confidence matrix whether there is maybe a figure of the type one followed by three digits, the confidence matrix says yes, there is; it is just not the best path, but it is in there. And you see, once you have fixed the 2000 as a string, you will never be able to track back to the 1800; it is hidden, it is gone. But if you dig the information out of the confidence matrix, the machine finds it: there is no 2000, there is in fact really a 1800. So that is the idea here. These query terms can become very complicated; I just leave this one here in the slides so you can look at it if you know the syntax. For instance, this one encodes a complete date of this type, and you can write very complicated terms and then get surprising answers from the machinery. Regular expression decoding in Transkribus takes, for instance for 380 pages of the Zurich Staatsarchiv data set, about 40 seconds. You may consider that long or not so long; it is certainly not the end of what we can do. Günter, no, Philip I guess, mentioned that they have applied for new hardware, so at the moment massive parallelization on CPU and GPU is the work of the day, the daily work in a sense. There will be a really huge gain in speed within the next few months, particularly from these graphical processing units. Just to give you another impression, on an ordinary laptop, not this one, the one we had before: the 433 Bentham pages, which we used in many of these collections and contests, take 2 to 3 seconds on average. And if you read a complete page against an 11,000-word dictionary, which I guess is already a reasonable size, it takes 8 to 9 seconds to read and compare the entire page. I do not remember exactly, but I guess it was an English dictionary, also from the Bentham material.

Okay, I guess my time is over, and I should leave you with some questions, not questions in the sense of testing you, but what I would expect from the audience. First: are we working on realistic questions? Don't answer this immediately; try it out first, and then at some point tell us whether this type of query, these new possibilities, are something you would like to use. Second: what is a realistic size of the data corpus? Alejandro will speak about a huge corpus; I spoke about a fairly small corpus, because if you pose a very complicated question, you can imagine yourself that you cannot answer it in realistic time for a whole library or a whole archive. There is a trade-off between the complexity of the data set and the complexity of the question. And third: what is a realistic query response time? If we say, okay, in 30 seconds you have the answer, is that okay, or do you expect more, or does it not matter? As usual I would like to thank my team, which is very important.
So I did not do any of this work myself; it has all been done by my young colleagues, in particular by the one who is right now giving the other talk in parallel; by our sister company, a small SME which works very closely with us and is an MOU partner; and by the project of course. And nevertheless, whom can I see here: Louise and Maria, who prepared this wonderful workshop. I know they do a marvellous job within READ, with the dissemination group as we call it, and I think it is very much time to thank both of you, and also Eva and all the other people, for doing such a marvellous job. And finally, of course, I thank you all for your attention. Thank you.

Okay, have we got any questions for Roger? Günter, do you want to say something about when keyword spotting is coming to Transkribus?

Yes. Actually I should note that the response time is not what you would expect from a search service, but we believe that we can run it as a job, and this means that you are searching in a large collection and you get an answer maybe after an hour, or maybe after a day. But the job is stored, and therefore nothing is lost, and you can always open your old queries and work with them. Actually the last snapshot version already includes keyword spotting, and the next release will include it, so probably next week or maybe in two weeks. And then every document where HTR has been applied can also be searched with keyword spotting.

One question on the matrix, the confidence matrix: many of you do text OCR and get this very large matrix, and then consider using a dictionary. I would like to think that there are some tactics, saying that certain combinations are not possible in European languages, so you could reduce the size of the matrix in order to speed up searches, for example.

We don't do that. We thought about it, but it contradicts a little bit the idea of keeping the capability of asking the matrix in, let us say, different lights. If you restrict your matrix because of what you know about, for instance, German...

The question is how much you can reduce the matrix in order to improve the response time without destroying it with respect to English, German and French.

No, we did not do it like this. We do reduce the size of the matrix, but because the matrix contains plenty of zeros, or almost zeros; so the reduction is in the sense that many possibilities are simply not really there at a certain position. The other thing, at least as far as I understood your question, would require domain knowledge, which is, in a sense, not our job; we do not have it. If you do have it, and at some point there is the requirement to put more in, and you give us a complete set of rules we should observe, coming from European languages, and you agree on it, then we can try; but we do not have these rules at the moment and it is not our part. From a technological point of view I would rather go in the direction of keeping the ConfMat as it is, storing it completely, and then leaving it to the user to put the domain knowledge into the dictionary, into the language model, and apply it there. So what we have is more like a cut on the ConfMat side, as I called it.

Okay, I think because of time we have to leave it there, but I am sure Roger will be happy to talk to people more over lunch. So thank Roger again, and welcome Alejandro to the stage. Okay.
Hello everybody. I'm Alejandro Toselli; I was introduced by Günter briefly. I will talk a bit about keyword spotting in large-scale document collections, similar to Roger's talk. The only thing is that here we follow another kind of approach: rather than computing the spotting confidence on the fly, we first compute an index and then work directly on that index to spot the words. I will give some overview of the motivation of the work, some overview of the probabilistic framework we are using, then describe the system workflow we are employing to produce the index, and I will also show two demonstrators before the final conclusions.

We all know there are massive image collections available all over the world, compiled by libraries, archives and other cultural institutions, and we know most of this material is practically inaccessible, because we do not have transcripts of it; this makes it impossible for the material to be searchable by search engines. If a perfect transcript were available for such material, of course we could access it: we would build a plain-text index to make these documents accessible and look for contextual information in them. But we know that transcribing such material is expensive: we need specialist people with a lot of experience in recognizing this kind of writing in order to have some ground truth, and even using a computer-assisted transcription system is expensive too. Of course, if you use the automatic transcription from a handwritten text recognition system and use that transcript to search the pages, there will be errors, because handwritten text recognition is not error free, and the performance of the search system will be poor. So in this case we use indexing and full-text search based on probabilistic indices, and I will show how we build them on the next slide.

So this is the approach. Imagine we have a written text image, and on top of it we have a kind of probabilistic map; in this case the probabilistic map belongs to the word "matter". Looking at the locations, for example, we have three instances of the word "matter", and we can observe the corresponding high probabilities here; this is the interesting thing. And also, for example, the word "matters" with an S has a low probability, because we are interested in looking directly for "matter". How can we obtain this probability map? We could obtain it using an isolated word classifier, for example, but it is better to take into account the context, all the words surrounding the word we are looking for. For example, the word "it" can help to recognize that the following word is "matters"; if we know the context, we are of course more sure whether this word is "matters" or "matter". However, using this kind of map directly to build the index is prohibitive, because first it takes a lot of computing time, and it also takes a lot of storage space to keep these maps. So what we can do to solve this problem is to use another kind of structure, called a word lattice, which we obtain from the decoding process itself; and directly from this word lattice we compute the relevance probability of each of the words in question and also their possible locations.
The word lattices are obtained directly using the confidence matrix, in the case where we use a recurrent neural network, plus the language model. So the recognition of the words, and the probabilities in this case, take into account the language model plus the optical modelling given by the recurrent neural network or, in the other case, by HMMs, hidden Markov models. In this way we can build an index: for example, we can index all the text of an image and obtain this kind of index, where in the first column we have the word which is indexed, then the relevance probability associated with that word, and also the coordinates in the image of the possible location of that word. Of course, here we have a lot of possible words, not only the ones that really appear in the image but also a lot of similar words, but it is expected that the words not really appearing in the image get lower probability. Another thing to mention is that once we have this kind of index, it is easy to look for words in a fully searchable way: we can look for any word, we can also use regular expressions, and we can combine queries, for example to look for a kind of concept, using boolean expressions to combine words in different ways and make the search more powerful. In this way we can look for concepts, not only for single words; if we have to look in 83,000 images for some kind of concept, and we are smart, we can combine the words in such a way that we find the pages we are looking for.

This is, more or less, the indexing workflow we are applying. We start with the text image, the PAGE XML for example, and the image itself; then we apply the keyword spotting indexing tool. This is composed of the HTR recognition engine, which could be a recurrent neural network or HMMs, plus a word lattice tool. We are not only trying to record the best hypothesis but rather the n-best decoding hypotheses; so we take into account not only the best path but all the paths that can explain a word, we normalize over all the possibilities, and in this way we get better relevance probabilities. After that, the page-level indices are processed in a step we call ingestion. In this stage we can, for example, convert the index to lemmas: instead of looking for all the inflected forms of a word, we look directly for the lemmas. That gives, for example, a smarter way to look for concepts: in Latin there are many different declensions, so we replace all the possible declensions by the corresponding lemmas, and in this way we can look directly for the lemma. Also in this stage we convert all characters to uppercase and remove the diacritics, because they are not useful for typing, to make the interface easier. And finally, the keyword search component is in charge of analyzing the query with a kind of parser; if we use a boolean expression, it separates which operation is to be applied, and so on; and it is also responsible for showing the results with the corresponding confidence scores.

Before we go on, I want to talk a bit about the way we measure the performance of this search system. We use recall-precision performance. As we all know, precision is high when most of the retrieved results are correct, and recall is high when most of the existing correct results are retrieved.
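Before turning to the evaluation, here is a minimal sketch of the kind of probabilistic index and boolean query described above. Everything in it is invented for illustration, not the PRHLT implementation: the page names, probabilities and bounding boxes are made up, lemmatization is omitted, and scoring an AND query by the product of the best per-word probabilities is just one simple choice.

```python
import unicodedata
from collections import defaultdict

def normalize(word: str) -> str:
    """Ingestion-style normalization sketch: uppercase and strip diacritics.
    (Lemmatization would need a language-specific dictionary, omitted here.)"""
    decomposed = unicodedata.normalize("NFKD", word.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# A toy probabilistic index: normalized word -> list of
# (page, relevance probability, bounding box). All entries are invented.
index = defaultdict(list)
index[normalize("Joseph")].append(("page_017", 0.93, (120, 40, 310, 88)))
index[normalize("María")].append(("page_017", 0.88, (330, 40, 520, 88)))
index[normalize("Maria")].append(("page_042", 0.71, (100, 500, 260, 540)))

def query_and(index, words):
    """Boolean AND query: pages where every word occurs, scored by the
    product of the best relevance probabilities per word."""
    per_word = []
    for w in words:
        best = {}
        for page, prob, _box in index.get(normalize(w), []):
            best[page] = max(best.get(page, 0.0), prob)
        per_word.append(best)
    pages = set.intersection(*(set(b) for b in per_word)) if per_word else set()
    scored = {p: 1.0 for p in pages}
    for b in per_word:
        for p in pages:
            scored[p] *= b[p]
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(query_and(index, ["Joseph", "Maria"]))   # -> [('page_017', ~0.82)]
```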
If, for example, for some set of images we have the perfect transcript and search in it, the corresponding recall-precision point will be here: the best possible performance, 100 percent, and the area under this plot is 1. If, for example, we use the automatic handwritten text recognition system with only the one-best hypothesis, it is of course not error free; the precision will be lower, one point will lie here, for example, and the area under the curve is less than one. In the case of the probabilistic index we obtain a whole curve, as we can see here, and the closer this curve gets to that corner, the better the performance of the system. But this also makes the system more flexible: the user can decide the recall and precision at which he wants to obtain the results. If we apply a very high threshold, practically all the retrieved results have a high probability, near 1, but of course we do not get all the possible results; as we lower the threshold, many more results are retrieved, and the recall will probably go towards 1.

Now I will talk a bit about one of the demonstrators where we have applied this indexing approach. The first one was the Passau data set, a collection of historical records roughly from the 14th to the 18th century, about 26,000 images, written in German most of the time. We have done some preliminary experiments on 291 page images: we used 200 of these pages for training and 91 for test. Most of this collection consists of tables. One thing to note is that the text line detection was made under supervision. Also, the lines are very short, so the language model does not have much effect in improving the spotting performance here. This is the recall-precision curve we obtained on this data set, on the 91 test pages; the average precision is around 70 percent, and the mean average precision is practically 67 percent. About 5,000 keywords were evaluated; we do not consider punctuation marks or case. One important thing: here we are not working exactly with word lattices, we are working with character lattices, so the whole index is lexicon-free, we do not use any kind of dictionary, and we can look for any word we want. Of course a word you search for may not be found, but normally such a word does not actually appear in the image. So, for example, you can try this demo at this address and give it some words. Here, for example, we are looking for a combination of spellings; we are not applying lemmatization in this case, we are looking for the word exactly as written, so in the query we can write all the possible ways it could appear, and the interface shows all the possibilities: these are all the pages containing any of these variants. For example, in this query we have, on this page, a hit for the word April in one of these six words, and if we look inside, one can recognize that the word April is indeed here. Another example is Margaret, combining all the ways you can write Margaret in one query; this is very useful for finding information. We can also use a combination, looking directly for pages that contain both Joseph and Maria, or, for example, John and Anna, with the same kind of expression.
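To make the recall-precision trade-off and the thresholding described above concrete, here is a small sketch with invented scores and ground truth; it only illustrates the trade-off and is not the evaluation code used for these experiments.

```python
def precision_recall_at_threshold(scored_results, relevant, threshold):
    """scored_results: {item: relevance probability}; relevant: set of truly
    relevant items. Everything scored >= threshold counts as retrieved."""
    retrieved = {item for item, p in scored_results.items() if p >= threshold}
    if not retrieved:
        return 1.0, 0.0                    # nothing retrieved: recall is 0
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved)
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

scores = {"p1": 0.95, "p2": 0.80, "p3": 0.40, "p4": 0.10}   # invented scores
truth = {"p1", "p3"}                                         # invented ground truth
for th in (0.9, 0.5, 0.05):
    print(th, precision_recall_at_threshold(scores, truth, th))
# High threshold -> high precision, low recall; low threshold -> recall near 1.
```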
Of course it is also possible to look directly for a phrase, for example Anna Maria; we use square brackets, and in this case the system knows you are looking for a sequence, the one word right next to the other. Okay. Another index we have applied is for the Chancery collection; this is another project, of course, with medieval registers of the French royal chancery from the 14th and 15th centuries. The number of volumes is 167, and the number of pages indexed in the last release is about 83,000. The estimated number of running words is in the millions, and the number of different spots in the index is about 28 million. The typical query response time is practically less than 100 milliseconds to look for a word in the whole 83,000 pages. One remark here is that the text line detection and extraction in this collection was made fully automatically. This plot is of course from a lot of experiments: the red curve corresponds to the full index, evaluated directly on 95 pages, while the blue one corresponds only to searches for abbreviations. This collection is partly in Latin, and the other part is in French, of course, and there are many abbreviations. Here is where you can find it, and of course you can try many words. About the abbreviations: as you know, most of this Chancery collection was written in Latin, and there are many abbreviations. The nice thing is that you can search for the full form of a word and the system retrieves all the occurrences, whether abbreviated or not. For example, if you look for the word Benjamin, this is the full form: there are some pages that have the full form, and other pages that have the abbreviated form, and it finds everything; Chevalier, Libres and Cologne behave the same way. The nice thing is that you only have to look directly for the modernized version.

So, to finish this talk: probabilistic indexing has been introduced for searching collections of handwritten text documents. We showed some empirical results on historical collections with, of course, different levels of complexity. Two demonstrators are available, and you can play with them. And the best thing is that abbreviations, and all this kind of difficulty in historical manuscripts, can be overcome: you can directly look for the modernized version, or also train using the modernized version, and the system retrieves all the possible words, abbreviated or not, that match the queries you are looking for. Thank you for your attention.

Thank you. Any questions for Alejandro? Yes: I wonder if you are going to do expansion of the inflection classes, searching for the root. Yes, we do that. To make the lemmatization, of course, we use dictionaries, special dictionaries: we group all the words, all the different inflections of a word, and replace them, using the dictionary, by the root version of that inflected form. So in the case of "matter", the first example: do you then get only "matter", or do you also retrieve "matters"? The way we do that is that all the possible words that are spotted are in the index, and next to each one we put which is the lemma of that word; so afterwards we have two entries, the inflected form and the corresponding lemma, and it is then easy to look either for the inflected form or directly for the lemma. Can you say something about...
...if the machine gets a word wrong, can you correct it and feed those corrections back in so it learns? What do you mean by wrong? When it finds a result and the result is wrong. Ah, yes, yes. We have implemented that; for example, let me see: if you look for another word here, for example in this table, here you have a hit, and you can say whether it is correct or not, if you are sure, and in this way we can collect this feedback. Another way to do it is that you can click anywhere in the page, for example at this point here; a window pops up and shows all the possible words recognized there, for example "enrichment", all the possibilities, and you can choose the correct one, or maybe give some idea of what is written at that position. That is good to hear, that it can learn even more. We had better leave it there, because I think we need to go back upstairs for the next session. But join me in thanking Alejandro and Roger once again.