Good. Next up is Sandra Balk from the Leibniz Institute for East and Southeast European Studies in Regensburg, Germany, and she will be talking about the digital edition of historical travel logs and the role that Transkribus plays. Oh, and you brought a colleague. I hope you can introduce him, because I don't have him on my moderation card. Really sorry.

Hello. As mentioned, we are from the Leibniz Institute for East and Southeast European Studies in Regensburg. This is Jakob Merke, and I'm Sandra Balk. We want to tell you a bit more about our new research project: the digital edition of travel logs and the role that Transkribus has played so far. Here's a short overview of what to expect.

In our project, we want to build an infrastructure for the analysis and edition of historical travel logs. This infrastructure should enable researchers to encode and transcribe different texts on the one hand, and to analyze and visualize them on the other. One of our main questions is: how can we encode the materials, that is, the travel logs and other sources, in a way that allows us to answer complex research questions about travel events or travel observations? To be able to do this, we are working on a digital edition as a use case that combines TEI with Semantic Web and Linked Data technologies. This will allow us to model the data more explicitly, and thereby enable analysis and visualization of the material. Specifically, we will work with the unpublished records of Franz Xaver Bronner's journey from Aarau via St. Petersburg to the university in Kazan, and of his way back.

So far, we have transcribed the first two manuscripts with Transkribus and, based on these manually transcribed texts, annotated persons, places and other events with XML markup elements. We have also trained models for handwritten text recognition, which are to be used for the semi-automatic transcription of other, related texts. And this process is what we will talk about in more detail now.

So, since we had quite a large number of manually transcribed pages, we decided to train a model to automatically transcribe the remaining documents. What I noticed quite early on was that some of the models trained on a smaller subset outperformed the models trained on larger parts of the data. So I took a closer look at the facsimiles, and I started curating them.

And how did we curate? I decided to look for any instances where the straight, linear structure of the pages is interrupted; in this case, that means strikethroughs and additions. On the left side, you can see an example of a flawless page; on the right side, marked in yellow, you can see some of the flaws that I recognized. I then experimented by training on different data sets, using pages with different levels of defectiveness. In the end, the model which evaluated best was the one where I excluded only those pages which were heavily flawed.
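To make that sweep concrete, here is a minimal sketch in Python, assuming each ground-truth page has been hand-labelled with a flaw level. The page records and flaw scores are hypothetical, and in the project the actual training and transcription happened inside Transkribus rather than in a script; only the character error rate (CER) metric below is standard.

```python
# Hedged sketch of the curation experiment: sweep a flaw-level threshold,
# keep only pages at or below it, and compare the resulting candidate
# models by character error rate (CER). All page data here is invented.

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Hypothetical ground truth: (page_id, flaw_level), where
# 0 = flawless, 1 = minor strikethroughs/additions, 2 = heavily flawed.
pages = [("p001", 0), ("p002", 2), ("p003", 1), ("p004", 0), ("p005", 1)]

for max_flaw in (0, 1, 2):
    subset = [pid for pid, flaw in pages if flaw <= max_flaw]
    # Train a model on `subset` (in Transkribus), transcribe a held-out
    # validation set, then score the output with cer() as below.
    print(f"threshold {max_flaw}: {len(subset)} training pages")

print(f"example CER: {cer('eine fehlerfreie Seite', 'eine fehlerfrei Seite'):.3f}")
```

Under these assumptions, the curation question reduces to picking the threshold whose model scores the lowest CER on the same held-out pages.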
So why can we improve the model by ruling out pages with strikethroughs and supralinear additions? Supposedly, the answer lies in the layout analysis. As you can see on the left, Transkribus has no problem identifying the lines on the flawless page, while almost every addition messes up the line segmentation, as seen on the right. And little surprisingly, the automated transcription of the flawless page is almost perfect, while the one derived from the bad line segmentation is defective. I marked all the errors in yellow: on the left side, there is only one error. On the right side, you can see whole passages which are complete nonsense, and those parts correspond exactly with the messed-up line segmentation, while the lines in between are still fine.

Okay, and finally, on the last slide, I have named some mistakes I made so you don't have to make them. First, Bronner uses Cyrillic letters on a few pages. So when I tried training on top of a base model, it failed. Since we had enough data to train our own new model, I left it at that. But if you don't have enough data, it is worth considering excluding the pages with a differing script. Second, I think it's worth spending time up front on making sure that the data you upload is technically correct, since it can be very difficult to find the error-causing data afterwards. And third, transferring our model to another, unseen text caused a significant decline in performance. I was able to get the performance back up by adding some hand-transcribed data from the new text. The possible takeaway here is that if you plan on using your model on many different documents, it might be better to transcribe a few pages of all the documents you want to transcribe with the model, instead of manually transcribing many pages of a single document. So, to sum it up: even data of poor quality can lead to good results if the data set is curated well. Thank you for your attention, and if there's still time left, we would welcome discussion and questions.

Thank you very much. Are there any questions from the audience? Because we're making quite good time. I'm repeating the question so that our online guests can hear it too: did you try to automate the curation process?

I did not. Right now I'm thinking about possible ways to automate it, but since we are the humans in the loop, I think it's always a good idea to leverage our insights to improve the models. So maybe automating just for automating's sake is not always the best; but if there's a possibility, let me know if you have any ideas.

Okay, any other questions? So then, I have one for you: have you already discovered the trainable baselines feature? I was wondering whether that would provide any benefit to you.

I've seen that it's possible. I'm not sure if it's a new feature or if I just overlooked it. I've seen that it is there, and I wanted to use it, but I haven't yet.

Yeah, it's quite new. I'm not sure what it will do for your reading order; that will probably depend on how you train your baselines, so on the sequence in which you create the baselines for your ground truth. But it could resolve some of your issues: if you train on enough ground truth, then those in-between lines will probably be recognized a lot better and will mess up your data less. Okay, are there any other questions?
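A brief postscript to the first takeaway above: one simple, hypothetical way to flag pages written in a differing script before training is to measure the share of Cyrillic letters in each ground-truth transcription. This was not part of the project's actual workflow; the sketch below, including the sample pages, is only an illustration.

```python
import unicodedata

def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that belong to the Cyrillic script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for c in letters if "CYRILLIC" in unicodedata.name(c, ""))
    return cyrillic / len(letters)

# Hypothetical page transcriptions: one German, one Russian.
pages = {
    "p010": "Am Morgen reisten wir weiter nach Kasan.",
    "p011": "Поутру мы отправились далее в Казань.",
}

# Flag pages that are mostly Cyrillic so they can be reviewed or
# excluded before training a Latin-script model.
for page_id, text in pages.items():
    ratio = cyrillic_ratio(text)
    verdict = "exclude" if ratio > 0.5 else "keep"
    print(f"{page_id}: cyrillic share {ratio:.2f} -> {verdict}")
```

A threshold of 0.5 would keep mixed pages that contain only occasional Cyrillic words; where to draw that line depends on how tolerant the model is of the foreign script.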