Okay, we move to the last presentation of this round, an online presentation by Achim Rabus and Alexei Tikhonov, both from the University of Freiburg, which is also a member of the READ-COOP.

All right. Thank you very much. Can you all hear me? I cannot switch on my camera, but I hope you can hear me. So we are talking about smart models in Transkribus that we train for different languages and different scripts, and these are our case studies, quite numerous case studies, that we try to present today.

But first, what are smart models? Smart models, in our understanding, are models that differ from the traditional philologically faithful approach, which usually tries for a one-to-one correspondence between the letters in your manuscript and the letters in your transcription. Smart models, on the other hand, have enhanced functionality. They can resolve abbreviations or transcribe into a different alphabet, as we have seen before with the Devanagari example. They can deal with superscripts, add material, unify orthography, or even change the script directionality: we have some examples from Ottoman Turkish, a right-to-left language, which we convert to left-to-right Latin transcriptions.

Why would you do that? What are the benefits? Why are smart models interesting? First, smart models make search tasks easier; with some normalization already involved, you can use your data more easily for, say, corpus linguistics research. Second, you can transcribe sources for non-philological use and thus reach a wider audience. And the last point is simply to figure out what the limits of HTR technology are today.

All right, so our first case study is Glagolitic, a South Slavic script written predominantly in Croatia.
We have published two models, one for handwritten and one for printed Glagolitic. The smart features: the models transcribe from the Glagolitic alphabet into the Latin alphabet and can deal with ligatures and abbreviations. There are cooperation partners for both models, and I am going to talk about the handwritten model now. It is a rather large model with 170,000 tokens and a character error rate (CER) of 5.7%, and it looks as follows. You can see the Glagolitic handwriting on top; this is a ligature over here, and it has been resolved correctly down here. So these are ligatures in Glagolitic, and we also have abbreviations. As you can see here, for instance, we have one letter, a second letter, a third letter, and the abbreviation sign, and everything written in round brackets has been added by the model. That is a very high-frequency word meaning "blessed" (this is obviously a religious text), and it has been expanded automatically by the smart model.

Okay, our next case study is pre-modern Slavic Cyrillic, and here we can see the difference between a traditional model and a smart model. The line means "the month of October, on the 18th day"; we have an abbreviation, special characters, and Cyrillic rather than Arabic numerals. This is the philologically faithful model, and that is the smart model: abbreviations resolved here, here, and here, and the numbers converted.

Okay, so Alexei will continue with our next case study.

Hello, everyone. I hope everyone can hear me. In the first project phase, between 2020 and 2022, we developed a model for handwritten Russian together with numerous partners. Next. Our focus was primarily on texts from the late 19th and early 20th centuries, that is, Russian before the spelling reform that followed the October Revolution, in 1917-1918. Some characters were abolished.
For example, there were several I and E graphemes, and the yat, for instance, was abolished. Our CER is quite suitable for modern scenarios; historical ones are often more complex, because the older Russian manuscripts are much more individual.

Now we will see some examples of the smart functions in our model. You can see that the yat in the original manuscript is replaced here by the modern E, and likewise the, let us say, Latin-looking dotted I is replaced by the Cyrillic I, which happened, as I said, after the October Revolution. Another example of the smart functions: the hard sign is left out; it is obviously there in the original, but not in the transliteration, because in today's Russian it is not written. We also had to deal with some transcription errors, as we can see here: in line 16 the model also left out the soft sign, so the palatalized L of the original is not marked, and in the same line we got an eight instead of a nine. Thanks.

As for the individuality of manuscripts, we now see a different example from the same period that differs significantly, but where the same smart functions are at work: the Latin-looking dotted I replaced by the Cyrillic I, and here, again, the word-final hard sign, which reflected the pronunciation of the words back then and is not used anymore today.

The second phase of our project started a few months ago. Among other things, we are currently working on a Ukrainian model, concentrating mainly on the period between 1850 and 1950. The first texts we worked with are the notes of the Ukrainian national poet Taras Shevchenko, as seen on the slide. Even in the tragic situation Ukraine is in now, the colleagues from the National Library of Ukraine can support us with manuscripts, for which we are very grateful.
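The orthographic normalizations just described (yat to E, dotted I to Cyrillic I, dropping the word-final hard sign) can also be written out as an explicit post-processing rule set. This is a hypothetical sketch of what the smart model learns implicitly from its training data, not code from the project:

```python
import re

# Pre-reform -> post-1918 character replacements (a partial, illustrative map).
PRE_REFORM_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",   # yat -> e
    "і": "и", "І": "И",   # dotted i -> i
    "ѳ": "ф", "Ѳ": "Ф",   # fita -> f
})

def modernize(text: str) -> str:
    """Convert pre-reform Russian orthography to the modern standard."""
    text = text.translate(PRE_REFORM_MAP)
    # Drop the hard sign only at the end of a word, where it was purely
    # etymological and carries no sound value.
    text = re.sub(r"ъ\b", "", text)
    return re.sub(r"Ъ\b", "", text)

print(modernize("въ мірѣ"))  # -> в мире
```

The interesting part, as noted in the talk, is that a smart model applies such rules without anyone writing them down, and, unlike this sketch, can also handle cases that are not simple one-to-one substitutions.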
We are still in the early stages here, but the first CER results are around 7%. Let us hope that a working model can be developed soon.

Another new language in the second phase of our project is Yiddish. Here our goal is to train a transliteration model from handwritten Hebrew-script Yiddish, which we can see here, into the Latin alphabet, as shown in the example from our ground truth. A peculiarity here is that the manuscript is written from right to left, as we can see, while the transliteration should run from left to right.

Another language with the same peculiarity is Ottoman Turkish, which is our next case. Here we are using data from multiple collaborations. We developed a model with nearly 270,000 tokens, and we have, at the moment, a CER of about 14.7%. However, we are continuing to work on the model; there are some characteristics of the manuscripts that keep the work from going as fast as we would like. Next. Yes. Some of these challenges were already mentioned in the previous talk. Here we can see that some vowels are unsystematically not represented, so to say; as a result, there is no one-to-one correspondence between the number of graphemes in the original and in the transliteration. There were also different writing standards depending, for example, on the region or on the text genre. As a result, there is no consistent picture of an "average" manuscript in Ottoman Turkish; many of the manuscripts are unique, so to say. However, as the work with Transkribus shows, the goal of a generic model is still achievable, and we are working on it.

All right, so our last case study is German shorthand, the so-called Deutsche Einheitskurzschrift. This is still very much work in progress, in cooperation with the Deutsches Tagebucharchiv in Emmendingen near Freiburg. We took some manuscripts from them, and we also created synthetic training data.
There is a tool online, or a tool you can install on your computer, that converts German text written in the Latin alphabet into German shorthand. We used that for training purposes, and these are the preliminary results. We see a manuscript, a diary, and we see the results here, and I think it is definitely usable. Sometimes we get garbage, in red here, but the rest is quite readable, and I am really fond of that because I do not know German shorthand. So we can gain access to this kind of source using Transkribus, which is great in my opinion.

So we come to our evaluation. In order to train models with enhanced functionality, you need a large amount of training data: you can start with, say, 50,000 tokens, and it is nice if you have more than 100,000. The results are usually less deterministic, so you cannot predict that well what will happen. The CER is usually higher than with specialized or philologically faithful models, but the real-world performance is okay, and I recommend not focusing too much on the CER; just look at the results. Language models on top of the HTR+ models seem to make a difference, which makes sense, because we have these one-to-many relations and complex abbreviations, and language models seem to help here. We found that HTR+ seems to be somewhat smarter than PyLaia, especially in very complex cases such as the Ottoman Turkish situation, and we are very much looking forward to the new transformer models.

So Transkribus is pretty smart, in fact, and we like our smart models because we think we can reach a wider, non-specialist audience. If you would like to try training your own smart models, we recommend converting your conventional models: take your ground truth, download it, convert it offline, and then re-import it. That is a viable way to do it, in our opinion. If you have special cases, try using synthetic training data. That is a table with our results.
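The character error rate cited throughout these case studies is, in essence, the edit distance between the recognized text and the ground truth, normalized by the ground-truth length. A minimal sketch for orientation (not the evaluation code Transkribus itself uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(recognized: str, ground_truth: str) -> float:
    """Character error rate: edit distance divided by ground-truth length."""
    return levenshtein(recognized, ground_truth) / len(ground_truth)

print(cer("abcd", "abce"))  # -> 0.25
```

This also illustrates the speakers' caveat: with smart models the "errors" include deliberate divergences such as expanded abbreviations, so a raw CER overstates how bad the real-world output is.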
And we come to our conclusion. I have shown that smart models demonstrate the capabilities and versatility of HTR technology, and you can use this approach for many different scripts and languages. It makes sense, because a non-academic, non-specialist audience can gain access to these kinds of texts, which leads to a democratization of knowledge, which is a good thing in our opinion and was also mentioned in the opening talks, if I remember correctly. We need more training data, and we need more and better ground truth to train smart models. This means that collaboration and recycling are indispensable. And because of that, I appeal to you all to join forces and make Transkribus even smarter. Thank you very much.

Perfect. Thank you very much for this presentation. We have got 10 minutes for questions.

Hello. As an avid user of smart models myself, I was wondering if you are aware that it is now possible to train smart features and traditional features together, by using an abbreviated model plus tagging and training the tags in it. If my question is clear.

It is clear. Yeah, thank you very much. We are aware that this is possible, but we started training our models before this feature was implemented, and we are reporting on results from a couple of years of work. And sure, it would make sense to compare the tagged smart models with the, well, legacy smart models. Thank you.

Any more questions? Is there something in the chat, Lauren? Also not. Yeah, here we go.

Actually, I do not really understand the point of smart models, because you invest a lot of time in training a model to solve a problem that can already be solved in post-processing, for example the transliteration. You invest in high-quality Devanagari recognition and then you just post-process; that works much better than doing the transliteration in the model. And also the abbreviations. Or do I get something wrong here?
Well, you can do something in post-processing, of course, but take, for instance, our Glagolitic use case: the scholars working on Glagolitic automatically transliterate, and it would not make sense to keep the original alphabet in the Transkribus model, because people just do not use it. So we needed to go smart in the first place. And regarding the abbreviations, I actually think this is pretty cool, because it shows some philological intelligence, if I may say so. I am going back in my presentation and trying screen sharing once more. Here we go. We have an abbreviated word, right? In traditional philology, we would call the expansion an editorial emendation or addition, because you need philological intelligence to know that the letters B and N and this little sign have to be expanded with all these letters. The model has learned to do so, and this is pretty cool in our opinion, from a machine-learning point of view but also from a philological one, because you cannot automatically post-process that; you would need to do it manually, since you have to check whether it really is an abbreviation and how to expand it in this specific case. The model is capable of doing that, and in our opinion it does a pretty good job.

Well, thank you for your presentation, Achim, and, I forgot your name, the other colleague. My question is about the metadata you provide with your data sets, if you are sharing them, which is also a question, obviously. How do you tell potential other users, or yourself in the future, what you did with the abbreviations, and how do you justify your choices? What kind of document do you draw up to relay the choices made? I am curious whether you could give us insights into the kind of metadata you provide for your own documents, so that you can understand your actions later on.

That is a good question. Thank you very much. Anemika, is it you?
Well, I talked about collaboration and cooperations: we gathered ground truth from different colleagues, and this is a problem in the Ottoman Turkish case, because there is competing ground truth that does not match that well. But, for instance, in the Glagolitic case, people just gave us their traditional editions, and we reused those for model training. They have an editorial foreword, as for any traditional printed edition, in which they write how they expand the abbreviations and so on. We did not produce the ground truth; we reused it, so this is up to the creators of the original philological transcription or edition. But you are right. I talked about results being non-deterministic: sometimes it is not easy to predict what will happen. For instance, in the Russian case, sometimes the word-final hard sign will appear and sometimes it will not, which, if you are a philologist, is not very satisfying; but if you just want to use the data, it may simply not make a difference.

We have one question online. Sorry, let us do this question first, then yours, and then I think Elisa also had a question.

Thank you for the presentation, very interesting. Just a practical question about working offline: you download, say, the diplomatic transcription and then work offline on it. First of all, how would you download it, and in which format? And after you rework the data and upload it again, how would Transkribus be able to match the picture with the uploaded information?

Yes. Alexei, would you like to take this one?

Yeah, sure. From my point of view, or from my experience working with Transkribus, there are several scenarios for how you could do it.
For example, one way is to download your transliteration or transcription in TXT format, make the changes you need offline, and then upload it again, maybe into the same document, or you create a new document in your collection, and then you start a new training run and create a new model. Does that answer your question?

Just a follow-up. Say that, from the scholarly tradition, you have files that were created for critical editions of a certain manuscript or a certain old text. They are in, I don't know, Word, and of course easily convertible to plain text. But they did not necessarily respect the order of lines; they are just continuous text. Is that usable at all, or would I need to process what has been done and coordinate it line by line with what is found in the image?

Okay, yeah, I see what you mean. We had several examples, or experiments, with such problems: we got, for example, a manuscript and a transcription or transliteration which our partners did, and there was no one-to-one relationship between the lines. There is a function in Transkribus called text-to-image. You first do the layout analysis, then you put your transcription into the first line of your layout, use text-to-image, and hope that it will work. Sometimes it works; in our cases it worked most of the time. It can happen that, for example, two lines from the transcription end up in just one line in Transkribus, so you have to do some post-processing after using the text-to-image function. But in most cases the post-processing takes, in our experience, just a little time, maybe a few minutes, or at most 10 or 20 minutes.
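The export-edit-reimport round trip described here can also be done on the PAGE XML that Transkribus exports, which keeps the line coordinates so the text stays aligned with the image. The following is a hypothetical, minimal sketch: the element names follow the PAGE 2013-07-15 schema, but the snippet builds its own toy document rather than a real export, and the conversion function is just an illustrative placeholder:

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Toy stand-in for an exported PAGE XML file (layout attributes omitted).
PAGE = f"""<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="{NS}">
  <Page imageFilename="scan_0001.jpg">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>мѣсяца октября</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def convert_ground_truth(xml_text: str, convert) -> str:
    """Apply a text conversion to every line while leaving the layout intact."""
    ET.register_namespace("", NS)
    root = ET.fromstring(xml_text)
    for unicode_el in root.iter(f"{{{NS}}}Unicode"):
        unicode_el.text = convert(unicode_el.text or "")
    return ET.tostring(root, encoding="unicode")

# Example conversion: one of the "smart" normalizations (yat -> e).
smart = convert_ground_truth(PAGE, lambda s: s.replace("ѣ", "е"))
print(smart)
```

Because only the text inside the TextEquiv/Unicode elements changes, re-importing the modified file preserves the line-to-image matching, which is exactly what the TXT round trip cannot guarantee on its own.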
Okay, just a quick follow-up on your question: you can download the PAGE XML, modify the XML files, and upload them again, and then you will have the smart text aligned afterwards.

And I think there is one question in the chat, by Patrick: can one revert from the smart transcription to the philological one if need be, or does the smart system override the more faithful transcription?

In our case, we train smart models, and they can only be smart; they cannot be philologically faithful. But you can train one philologically faithful model, then download the ground truth, convert it to smart, upload it, retrain, and use that.

Yeah, maybe a short follow-up: from my point of view, the main thing is that you have your ground truth. In the best case it is the philological one, and out of the philological ground truth you can then create basically every model you want.

Then, I think, we take the last question.

Thank you, it is me again. I have a question about the usability and, actually, the machine readability of those expansions with parentheses. In my experience, if you add markup that is meant for human readability, the machine will not be able to read and search those words. I was wondering if you considered this risk when doing the expansion with parentheses. Also, you cannot simply strip them all, because maybe there are real parentheses; I do not know whether your content might contain real parentheses. That is my concern.

So that refers to the Glagolitic case, right? Yes. As I mentioned before, we recycled the training data. This was a philological book edition, intended for human readers of course, and we reused it, and we like that the model was able to add these parentheses. But you are right: if we were to use the data produced by the model for quantitative research, we would have to eliminate the parentheses.
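Eliminating the editorial brackets for search or quantitative work is a simple filter, at least when, as in the Glagolitic case, the script itself contains no real parentheses. A hypothetical sketch, using "bl(aže)n" as an illustrative rendering of the abbreviated word for "blessed" mentioned earlier:

```python
import re

def strip_expansions(text: str) -> str:
    """Keep the expanded letters but drop the brackets: 'bl(aže)n' -> 'blažen'."""
    return re.sub(r"[()]", "", text)

def collapse_expansions(text: str) -> str:
    """Drop the expansions entirely, recovering the abbreviated
    (diplomatic) reading: 'bl(aže)n' -> 'bln'."""
    return re.sub(r"\([^()]*\)", "", text)

print(strip_expansions("bl(aže)n"))    # -> blažen
print(collapse_expansions("bl(aže)n"))  # -> bln
```

As the questioner points out, this blanket approach breaks down the moment the source text contains genuine parentheses; then the editorial brackets would need a distinct marker (or tagging) to remain separable.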
And we could do that, because we do not have real parentheses in our script in this specific case. But as we can see here once again, it all depends on the ground truth data: if the parentheses are there, they will be reproduced by the model as well.

Then, thank you very much for these insights and this presentation. Unfortunately, we cannot send you the cups right through the internet, because we do not have that magic fireplace that the Weasleys have in Harry Potter. But speaking of cups: coffee break outside, you know the drill. Let's go.