This session is about sharing your model: public models in Transkribus. We have always had some models available, but more for demonstration purposes. We were a bit cautious about coming up with larger models, because of course you read "German Kurrent" and then you expect that it can really read all kinds of German Kurrent, or English writing, or whatever. And for historical documents I think we are still rather far away from having general models which are really powerful and which behave like models for modern documents, or like OCR models for printed documents. However, I think it is important for the community to see such models, and to give you the possibility to share your own. So you will nowadays find something like 25 models already in Transkribus, mostly based on large amounts of documents. I am very grateful that some colleagues are now willing to show you how they produced these larger models and what the background is. It is of course very important to see in which context such models are created. The first one to introduce his model is Achim Rabus from the University of Freiburg, who is working with Slavic documents.

All right, I am a Slavist and I am interested in Church Slavonic. Maybe not very many in the audience are interested in Church Slavonic, but I would like to propose a specific workflow for producing these models. I have two generic models. The first one is for Russian so-called poluustav, which is 15th and 16th century writing. The CER of the model is 3.8%, and I used 170,000 tokens for training. The second one is a really generic model with more or less a similar CER but considerably more training data, almost 400,000 tokens, and this model is capable of transcribing manuscripts written in so-called ustav, which is the oldest writing system, shown down below, as well as manuscripts written in poluustav, as you can see here. So this model can transcribe old manuscripts from the 12th century as well as manuscripts from the 15th and 16th century. It is a kind of generic model. Of course, as I told those of you who attended the large-model workshop, these models introduce some hypercorrect forms, because there was linguistic change between the 12th and the 16th century, and the model learned certain linguistic features from one period of time and tried to apply them to another. This is a drawback of generic models, but I think it's better to have one than to have none. And I would like to propose this so-called recycling approach, which means I didn't produce ground truth manually at all. I just browsed the internet and copied freely available electronic editions. I also asked a couple of colleagues who had produced editions in printed form to send me their Word files, and I used these Word files to train the model. After some pre-processing I copy-pasted the text into Transkribus and ran the model training, and I think this is quite a reasonable approach, because there are so many Word files around. You all know from your own community a more conservative colleague who produces editions in the traditional way, but if we could reuse and recycle these sources, we would all be better off. That is my recommendation for you all. Thank you very much.

Yes, Arshad? I was wondering: most of the time you hear that you need a good standard for the transcriptions you introduce as ground truth. For example, editions tend to modernize the language. Did you have this problem?
Yes, we did. But in my field people tend to produce very faithful, linguistically faithful editions, so there is no normalization and no emendation. Of course we do introduce noise due to heterogeneous editorial principles, but in my experience more data outweighs the noise.

What was your experience with diacritic recognition? Diacritics are difficult, and in part of my training data the editors had resolved to omit all diacritics altogether, so the model learned to ignore them, which is not very good, but that is the case in the first model. The second one has some diacritics, and, well, everything that sits between the lines is in my experience difficult for Transkribus.

I actually have a comment. I think this is a very helpful method, but sometimes you don't have the printed text available electronically; it would still be much easier to create ground truth from the printed edition, but then what you would have to do is export it page by page and use it as ground truth for the manuscripts, right? So in that case maybe the... You're talking about scanning printed editions to produce ground truth, right? Yes, and not just as ground truth: you prepare the text of the printed edition in Transkribus, which is much easier, and then you can use the output of that for manuscript copies of the same text. Sure, that's a sensible workflow, definitely. Except that this means a feature request to extract the text in per-page segments, but... Transkribus can do that, can't it? Transkribus has a feature to export the transcription to Word or XML or whatever. Right, okay.

The next one is Stefan Satthammer, here from the University of Innsbruck, and he will talk about Neo-Latin documents.

Yeah, I will tell you something about the model we trained inside the project Noscemus, "Nova Scientia", early modern scientific literature and Latin. Let me say a few words about the background of this model. The project Noscemus tries to decipher and really advance our understanding of the interrelation of Latin and scientific literature in early modern times. The main outputs of the project are a couple of monographs, a database with about 1,500 texts that are representative in terms of chronological spread, literary forms and scientific disciplines, and a so-called digital sourcebook. We want to make all those 1,500 works in our database machine-readable, and we also want to implement keyword spotting. A quick overview of the path our model has taken so far: as a first step, I tried to create one special or individual model for each century our project deals with. This means I created one model each for the 15th, the 16th, the 17th and the 18th century. Each of these special models is based on at least 300 pages, and those 300 pages are taken from at least four or five different works or books. The works for the training data were selected with the aim of ensuring representativeness in terms of chronological spread and the typefaces used. As a second step, once I had a couple of individual models available, I merged them into one general model called Noscemus General Model Version 1. As an example, here I have the special model for the 17th century. The training data consists of about 70,000 words and about 10,000 lines. The character error rate is 0.31% on the training data and 0.62% on the validation set. Pages from the following five books are included in this model.
You can see I tried to select some books from the beginning of the century, some from the middle and some from the end. In December of last year I had three special models finished and available, and I merged them into the general model. The general model has about 170,000 words and about 30,000 lines in its training data set. The character error rate is 0.66% on the training data and 0.95% on the validation set. Although the model is tailored towards transcribing Neo-Latin texts set in antiqua-based typefaces, my experience is that very good results are also possible with texts in other languages like English, Italian or French. Our model is also able to handle some passages or words in Greek and some passages or words set in German Fraktur. In the transcription guidelines I tried to give the user a maximum of freedom, so standardizations in the transcription process have been kept to a minimum. Normalizations have been implemented only in the following cases: ligatures and abbreviations have been expanded, the long s was transcribed as a normal s, and small caps weren't marked as small caps but were transcribed as majuscules. Special characters like the et sign and the e caudata were kept, again so that the end user has the maximum of freedom. There are a couple of known issues at the moment. There are occasional inconsistencies in the transcription of quotation marks, punctuation marks and diacritics. The error rate for Greek words or passages is still high; the main problem lies in the vast number of different Greek typefaces that were used in early modern times. To a lesser degree, the error rate for passages and words set in German Fraktur is also not the best at the moment, but I'm working on it and I hope I can publish an updated version of our model in a couple of months. Yeah, thanks.

Any questions? Maybe just as an explanation: these are books, printed documents. Did you try to expand the special characters? Sorry? Did you try to expand the special characters? I had an earlier version where I expanded the special characters, but for the final version I decided to keep them, because although many editors wish to standardize the edition they are going to make, they often want to keep especially these special characters like the e caudata and so forth. Any other questions? Okay, thank you.

Now Günther Hackel will show us some public models. Günther is the second Günther in the Transkribus team and is therefore often called George, and many of you have probably already had contact with him. He is one of the programmers, also does a lot of project management, and has been working mainly with documents from the NewsEye project.

Hello. I want to present the NewsEye project and, more importantly, the public models which have been and will be created during this project. The motivation is that historical newspapers are really a great knowledge base for scholars and the general public as well as for humanities researchers, but access to them is limited, and tools with which you can investigate them are missing. That is why this project came up; it runs until April next year. There are partners involved, and one speciality is that we have three types of partners, libraries, computer scientists and also humanities researchers, so that they can immediately try the tools that come out of the project and give feedback right away.
It's not always easy, but I think it's worth having them in this project as well. Here are some examples of the newspaper pages. Our group deals mainly with data generation, and we use Transkribus for this task. On the one hand we have to create ground truth for all these research activities, like layout analysis, text recognition, article separation and named entities, and on the other hand we have to process 1.5 million pages by the end of the second year. As a first public model, the team in Rostock has trained a neural net for layout recognition of newspapers. It's already available via the layout recognition tool in Transkribus, and of course we also trained models for text recognition. Here you can see examples for a trained model, so this was also training for you; I think tomorrow you can all read a German newspaper at breakfast. Let's see, please let me know. We did not take the easiest training data: it comes from all the time periods and all kinds of images, good and bad. These are the statistics of the ground truth samples. We have around 442,000 words in the training set, and there are some interesting statistics in there. If you take the statistic for the German language that a word has 6.5 characters on average, you get almost 3 million characters; and if you know that the character E, for example, makes up about 17% of all characters in that language statistic, then you have almost half a million instances of the letter E in this training set, which is quite a lot, because a rare letter like Q is only represented around 600 times in this set. It's just a fun statistic. We are quite satisfied with the results: as we already heard in the morning session, the colleagues from the linguistics department used this public model for the Alpine journal, someone in Germany took this data and trained a Tesseract model with it, for example, and others are using it right now as well. And hopefully we will get further models, three more to be exact, because the National Library of Finland has both Swedish and Finnish in the project, so we should get three other public models during the course of the project, and the BnF should follow very soon with the French. Here I want to show that you can also create article ground truth with Transkribus; it was really a challenge to handle such a huge page. At least you get a colourful poster for your office. Hopefully we will also have an article model which we can make publicly available, if it works as expected. This is ongoing research and it will take some time until there are really good results, but of course the ground truth is there. And here are some other enhancements we created in this project, from which Transkribus benefits: we made an IOB export as well as an import for the named entities, where you can also use already existing ground truth, and the annotation feature is really being enhanced. Yes, those are more or less the public models created in the NewsEye project. Any questions?

IOB? It's a format for the named entities; it is short for inside, outside, beginning. Just out of curiosity, I have not understood the purpose of these multiple colours inside each other... It's just an illustration of the articles inside this newspaper page. Each different colour is its own article, and it's just to present it in a way a human can view.
For the computer it's just IDs in the PAGE XML, but to let the user also see which articles are separated, we have these different colours for each article. Yeah.

So how does this article separation work? How long does it take? This is something different. The team at the University of Rostock will take this article ground truth and run their own training on it. It's also machine learning, but a different approach. Yeah.

I've got the microphone, so I've got a position of power here. I wondered whether you were using OCR specifically for this, and not HTR, and if so, which kind of OCR was being used. I'd also like to know a bit more about the layout analysis and how that works: whether it's automated for picking out the columns of articles that you're transcribing, and whether you are separating those somehow from one another, or exporting everything off the page altogether, if you follow me. We used OCR, ABBYY FineReader, at the beginning to get the first version of the text, and then corrected this text for the machine learning; that was the first step. The layout recognition was trained on lines, on baselines only. In this project's use case, you first recognize all lines, afterwards you recognize the text, afterwards you recognize the articles, and only after you have recognized the articles do you draw the text regions around them. The focus is not on recognizing text regions at the very beginning, during layout recognition, but only after article separation. So the trained layout model for newspapers concentrates on lines and not on text regions.

Finally, we will hear something from the National Archives of the Netherlands.

Hello everyone. I'm going to present our newest public model. For the last two years, our team in the digitization department has conducted many experiments with the purpose of creating a model that is able to transcribe Dutch handwritten documents from the 17th, 18th and 19th century. After many attempts, research and experiments, our team finally managed to create a new public model that is available via the Transkribus platform, and we call it Iceberg. The Iceberg model comprises more than 1.5 million words and more than 800 different handwriting styles, with around 5 scans per handwriting style. In order to reach the point where we could produce a reliable model like this, we created many datasets with various material. Iceberg is thus a dynamic combination of two models that have been trained, respectively, with VOC documents from the 17th and 18th century and with notarial deeds from the 19th century. Here you can see some image samples from the material that we have used: VOC documents from the 17th century, here as well; VOC documents from the 18th century; and 19th century notarial deeds from the Noord-Holland archive and eight other state archives in the provinces. Our approach was quite simple. First we categorized the available scans into three different datasets. The first dataset contained VOC material from the 17th and 18th century, the second was trained with 19th century notarial deeds from the Noord-Holland archive, and the third with notarial deeds from eight different state archives. For the training process we used almost 6,000 scans in total, which correspond to more than 7,400 physical pages. Every model has been trained with HTR+ and 1,000 epochs. For the test set we additionally used 60 scans, one scan per every 100 pages.
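A minimal sketch of that kind of systematic hold-out split, assuming a flat folder of scan images sorted by file name and taking the one-in-a-hundred interval over scans rather than physical pages (both the folder layout and that reading of the interval are assumptions for illustration, not details from the talk):

```python
from pathlib import Path

def split_scans(scan_dir: str, test_interval: int = 100):
    """Put every `test_interval`-th scan into the test set, the rest into training."""
    scans = sorted(Path(scan_dir).glob("*.jpg"))   # assumed: one .jpg per scan
    test = scans[::test_interval]                  # ~6,000 scans -> ~60 test scans
    test_set = set(test)
    train = [s for s in scans if s not in test_set]
    return train, test

if __name__ == "__main__":
    train, test = split_scans("scans/")            # "scans/" is a placeholder path
    print(f"{len(train)} training scans, {len(test)} held-out test scans")
```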
The results we obtained on the validation set were quite promising, with an average CER of 5.15%. Finally, we combined these sets into one and created the public model Iceberg. Despite the promising results, model testing was essential in order to confirm the model's effectiveness. As you can see in this table, we created six different test datasets, each of which has its own characteristics, such as different handwriting styles, different origins and different periods of creation. The results were actually remarkable, with the lowest CER at 6.45% and the highest at 8%. These results lead us to the conclusion that the public model Iceberg can be very useful for users who want to process archives from the 17th to the 19th century, and that it can also be used as a reliable base model for training on further HTR material. Thank you very much.

I especially like the name Iceberg. What I'm also looking for is, at some point, a model called "Swiss knife", or something like that, which is really capable of reading everything. Any questions? I think you also published a model in August last year. The M3? Yes. Can you tell us the differences between those two? Maybe; Ms Kaiser can help me. I think it contains only one third of the ground truth of this model. I can confirm it works, because I tried it; it's much better. Thank you very much.

One remark from my side: large models are often not better in terms of character error rate, but they are more robust. You can feed more different styles to such a model and it will still provide good results. More specialized models often have better CER rates, but they are not as robust.
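For reference, the character error rate (CER) quoted throughout these presentations is the character-level edit distance between the automatic transcription and the ground truth, divided by the number of ground-truth characters. A minimal sketch of how such a figure can be computed (the strings in the example are made up, not taken from any of the models discussed):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance relative to the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# Made-up example: one substitution and one deletion in a 20-character reference line.
print(f"{cer('Iceberg reads Dutch.', 'Iceberg raads Dutch') * 100:.1f}%")   # 10.0%
```

On this definition, a large model trained on many hands can show a slightly higher CER than a specialized one and still be the more robust choice across styles, which is exactly the closing remark above.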