All right, let's start. Welcome everybody to our workshop on large models. My name is Achim Rabus, I'm a Slavist from Freiburg, Germany, and this is Dirk Alvermann from the University Archive in Greifswald. We will be having a workshop on large models. As you can see, the workshop is not only on large models, it's also a large workshop, and thus we have a very heterogeneous audience, ranging from almost absolute beginners to users who have already trained dozens of large models. This makes our task kind of tricky, but we will do our best to accommodate everyone. We just want to know: who in the audience has never trained a Transkribus model at all? All right. And who has trained more than, say, 20 models? Okay, well, thank you, we'll do our best. Regarding the languages you're interested in: Western European languages and Latin scripts are really predominant, of course, but there are also people in the room who are interested in Arabic, Hebrew, Cyrillic, Greek and Devanagari.

So we'll start with an introduction to the training tool. What do you need to train a model in Transkribus? You need ground truth, which is, as you all might know, high-quality digital images and the corresponding faithful, diplomatic transcription. And then you need the button in your Transkribus software that allows you to train models. If you don't have that button, you should write an email to the Transkribus staff and ask them to unlock this feature.

So how do you use the ground truth to train the model? What do you need to do? You need to select the ground truth for training and validation, and then you need to define the number of training epochs. What's that? In one epoch, the neural network has a look at the data, then tries to guess, without looking, what is correct and what is not. Afterwards, the model looks again, finds out what went wrong, and improves the recognition. The more epochs you have, the better your model gets. Then you hit the start training button and you go get some coffee or go on holiday or whatever, because training large models takes time. A small model with very few epochs can be trained in about half an hour - I trained two models during the morning session today - but a large model will take more than 24 hours.

I suggest we just have a look at the training module in Transkribus right now. Do you need a definition of ground truth? Yes. What is the definition of ground truth? Ground truth means you have high-quality digital images and a faithful transcription; you have corrected your transcription so that it fits perfectly to the digital image. And then you define it: this is the data I want to use for training the model. I think it is better to say that we have an exact transcription of the text. Sure, but this is what we all have to deal with, because we are dealing with mostly historical material; that is still the ground truth. You will have to define it as ground truth. I think we will talk about that later on. All right.

This is Transkribus. You have to click the Tools button, and here you see the Train button. If you don't see it, you will have to write an email to the Transkribus people to enable it. You click this button, and then you see your data on the left side of the window. You take this data and you click on Training.
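To make the settings of that training dialog a bit more concrete, here is a minimal sketch in Python. It is not the Transkribus API; the class and the page names are purely hypothetical, they just mirror what you set by hand in the dialog: training pages, validation pages, model name, language and number of epochs.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRun:
    """Hypothetical mirror of the settings in the Transkribus training dialog."""
    model_name: str
    language: str
    epochs: int = 50                                        # current default; raise it for large models
    train_pages: list = field(default_factory=list)         # ground-truth pages
    validation_pages: list = field(default_factory=list)    # held-out pages

# Example: a small model trained on one volume (names are made up)
run = TrainingRun(
    model_name="Court_Records_v1",
    language="German",
    epochs=200,
    train_pages=[f"doc_1582/p{n:03d}" for n in range(1, 91)],
    validation_pages=[f"doc_1582/p{n:03d}" for n in range(91, 101)],
)
print(f"{len(run.train_pages)} training pages, "
      f"{len(run.validation_pages)} validation pages, {run.epochs} epochs")
```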
You can use a whole folder for training - it goes in there - and then you have to select some pages for validation. Validation data can come from the same manuscript or the same data, or it can be especially selected pages. You can, for instance, do it like this. Then you have to name the model, you have to choose a language, and then you can define the number of training epochs, like here. Experienced users may have noticed that the default value has changed in the last couple of weeks: we used to have 200 epochs as the default, and now it has been reduced to 50. Actually, I don't know the reason why, but from a user's perspective 50 epochs might be enough for a small model; if you want to train a large model, you should increase the number of epochs. 200 might be okay for a normal model, a large model will need 500 or even more than that. So let's move on. You click OK. We won't do that now, but that's the way you do it.

So what happens now? On the Transkribus server the model gets trained, and if you have selected everything carefully, you will end up with a very nicely fitted model like the one on the left side. CER stands for Character Error Rate, and the lower the CER, the better your model is. If your training CER and your validation CER are more or less the same, you have managed to train a well-fitted model. However, if there is a great difference between training CER and validation CER, you have overfitted your model. Overfitting means the neural network adapted itself very well to your training data, which is the data the model has seen all the time, but does not cope very well with your test or validation data, which is the data you actually want it to work on. An overfitted model is not good. What do you do if you have an overfitted model? You add more ground truth for training, or you select your validation data more carefully.

So what can you do with the training parameters? As a rule of thumb, around 10% of the ground truth should be reserved as test or validation data. If you have a very large model, you can use less; if you have very diverse data, you might want to use even more. Diego is the expert on that, and he will talk about it in a minute. Regarding the epochs, as I said before, you can start with around 50 epochs, but you should increase the number of epochs as soon as you have more data.

Looking at a couple of publicly available models, we can have a look at the ratio of training and validation data, and you see it is in the range of around 2, 2.5 to 5%, and these are quite large models. If you have large models, it seems to be best practice not to use too much validation data. Maybe there are thresholds that indicate when the validation data is enough - I don't know, maybe 10,000 or 20,000 tokens. We can talk about that later.

With regard to training epochs, you see this green line in the learning curve, and this green line indicates the point where the model has been trained enough. You see that 400 epochs were a bit too much for this specific model; 340 would have been enough. Regarding the significance and the impact of increasing the number of training epochs, you can have a look at these images: this is one and the same training and validation data, trained on the left with 400 epochs and on the right with 1,000 epochs, and you see you gain around 0.2%.
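The "green line" and the overfitting check can be expressed as a small calculation. The sketch below assumes you have the per-epoch training and validation CER values from the learning curve; the numbers are invented for illustration, not taken from any real Transkribus run.

```python
def best_epoch(val_cer, tolerance=0.1):
    """Return the checkpoint after which validation CER no longer improves
    by more than `tolerance` percentage points - roughly the 'green line'."""
    best, best_at = val_cer[0], 1
    for epoch, cer in enumerate(val_cer, start=1):
        if cer < best - tolerance:
            best, best_at = cer, epoch
    return best_at, best

def overfitting_gap(train_cer, val_cer):
    """A large gap between final training CER and validation CER
    indicates an overfitted model."""
    return val_cer[-1] - train_cer[-1]

# Made-up learning curves (CER in percent), one value per checkpoint
train = [12.0, 7.5, 5.0, 3.2, 2.1, 1.6, 1.3, 1.1]
val   = [13.5, 9.0, 6.4, 5.1, 4.8, 4.7, 4.7, 4.7]

epoch, cer = best_epoch(val)
print(f"validation CER stops improving around checkpoint {epoch} ({cer}%)")
print(f"train/validation gap: {overfitting_gap(train, val):.1f} percentage points")
```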
It gets better, by about 0.2%, which might be worth the effort when dealing with a very large model. If you have a small model, just don't go there.

What does "large" mean? That is the question. We have a couple of publicly available models here and one model that is not publicly available yet, and you see most of these models have more than 300,000 tokens in the training set. Some models have only a few different writers, and one has extremely many, 800. Some of the models have special skills; we will talk later about what that means. The CER varies from 2.5% to almost 9%. The special skills and the number of writers indicate that there is a difference between large models and generic models: not every large model is a generic model. As a rule of thumb, we can say that most large models have more than 200,000 tokens, and if you integrate several writers, and maybe several languages or several script styles, you might be able to produce a generic model. This is more or less our definition of a generic model. Dirk will now talk about what large and generic models should do in theory and practice.

Now, time for examples. What should large or generic models do? This could be a normal use case for a generic model. As you can see, you have different writers, two different writers on one page. And that's not all: you have German Kurrent in this area, and you have Latin cursive script, Latin language, here. That is normal for German sources: in the 16th, 17th and 18th centuries, the writer would change the script together with the language. So you have two languages, German and Latin. On top of that, you have concept writing in the marginalia. That happens often if you don't select your material so carefully that you only get nicely written pages. And then you will have abbreviations all over: one here, another one, the next one. So if you want your model to read different writers, different scripts, different languages and abbreviations, you simply have to train it to do so. That means your ground truth has to contain all these examples. And a model can do it.

Let's start with different languages. Here is an example where you can see very well the difference between the scripts, Latin and German Kurrent. The ground truth; a not-so-strong model trained on this material made not many mistakes, but there is a clear difference between the Latin and the German parts. A very well fitted model, model 3.1, has more or less the same CER on Latin and on German. Ground truth, the not-so-well-fitted model, and the very well-fitted model; the green is the German Kurrent reading result, the violet one is the Latin. So you can achieve the same CER: one model, same CER, both languages.

Abbreviations. I don't know if any of you were at the last Transkribus User Conference; Guntram Leifert declared there that HTR+ is able to read abbreviations of more or less two to three characters, and that it does this very well at the beginning and at the end of words. Here you can see typical abbreviations that occur on every page: videtur, Studiosus, neque, atque, denique, semper, datum with the titulus over the u, and the typical German -er abbreviation, as in "der Teufel", the devil. This works very well with an HTR model trained on these abbreviations: it will get more or less 80% of these abbreviations right in your HTR result.
What, in my opinion, doesn't work very well are in-word abbreviations, complex abbreviations like this one. Everyone who works with Latin texts knows this one; that's difficult for the HTR. "In puncto" as a contraction - contractions are not read very well by the HTR - "omnia", the same thing, or "domino". In my experience it is not worth training these abbreviations; they do not occur very often in the texts. If they do occur very often, you could try to put them into the ground truth and train them as well; maybe you will have success. I know that Achim has trained models where in-word abbreviations are read very well. My opinion is a different one.

Next one, concept writing. This is very tricky. A generic model could read concept writing if it is trained with enough ground truth on this material. If you remember the morning session, we talked about Ottoman Turkish and the problem that not every character in the original matches a single character in the transcript. It's similar with concept writing. Concept writing means that we have words like this one; it is "mith". You cannot actually read these characters in the original, but the system could be trained to read it. In this case it didn't work; it read a totally different thing. The second line, the one which is marked, would be in the ground truth "... Frage zu erkundigen". The transcriber was not as good as the model: the model read "Frage", that's right, and here it read "zu erkundigen"; what's written in the concept is "zu erkundigen". So the model was better than the transcriber, though it still wasn't completely right. In my experience you can train concept writing, but keep in mind that you won't reach a better rate than something like 10, 12, 15% CER on concept writing - up to now. What could help are language models or dictionaries, but we will talk about that later.

Can you say a word about what concept writing means? Concept writing means that in some cases, like in this marginalia, the writer has not much space and tries to write very fast and very economically. So it's more like keywords: some characters simply are not written in the word, and you as a transcriber have to make an interpretation of what you see. Normally HTR shouldn't interpret the text, it should read it. So you have to train it to interpret, which is not its task, normally.

It's important to keep in mind the limits of generic models. There are some weird things going on when using generic models for transcribing previously unseen data. For instance, it might well be possible that hypercorrect forms, hypercorrect transcriptions, occur. What does that mean? For instance, German writers from a specific time write "mith", not in concept writing but in normal writing, and the generic model learns this feature - it kind of learns this word. Then you have another source, and in that source "mit" is written "mit", without the h. But nevertheless it might well be the case that the model transcribes it as "mith", even though there is no visual cue whatsoever for this h. This is because the model has learned that this word has to be written that way. That can very well happen if you use generic models; that's one drawback. Another drawback is that you most likely introduce more noise than with specific models. This is usually due to different transcribers and different origins of the ground truth and things like that, which means that sometimes you might be better off using a specialized model.
In my experience, the following rule of thumb applies: generic models are great for non-computer-savvy users who want to benefit from Transkribus efficiently and fast. So if you're not into digital humanities, if you're not into computers - you obviously are, because you're here, but if your colleagues are not and they want to benefit as well - it might be easier for them to use just one generic model than to have 20 different models for different purposes.

Another limit is, of course, that models are usually language-specific. If you train a model on a specific type of Roman script, it will learn the linguistic features of the particular language. If you try to use this model for another language, it will perform very poorly, even though the palaeographic characteristics of the script written in this other language are quite similar. This is just a property of the typical neural networks we train in Transkribus. But if you want to use a generic model for different languages, you can train it multilingually; that's no problem.

Okay, so this is something that I didn't want to do because it's hard: how to organize test and training material. Many of you have asked about the best way to organize material for training and validation. To be honest, there are lots and lots of individual ideas on how to organize material. With generic models - in my opinion, and that's the only thing I can talk about - the greatest challenge is to create really well-balanced train and test sets. There are two options that I want to present to you, one conventional and one less conventional.

I prefer to separate the material into different folders, to separate training material and validation material before starting anything. Then I prefer to establish an order. This depends a little bit on the material: I used a chronological order for models that should cover, for example, a century, and an order by individual writers for models that cover a long time period with not more than maybe 20, 30, 40 writers. In that case it makes sense to establish an order by writer. The next thing - because that's not enough - is to form corresponding groups of test material and training material. It makes no sense to train the model on the period 1500 to 1510 and to take a validation set from 1560 to 1570. So the sets must correspond in their order. An example: chronologically corresponding groups.

Achim has shown the training tool before, and this brings me to my second piece of advice: use the version management, use the edit status. When you produce transcriptions that are ground truth, set the edit status of the page to Ground Truth, and before you start the training, before you select the train set or the test set, apply the filter "Ground Truth only". That will make sure that only really clean material, 100% correct transcriptions, ends up in the train set and in the test set. The version management is made exactly for this, so use it; it will make life easier with large models.

In this case, I added complete training documents. My training documents are the complete documents, and within these documents you will find pages with edit status Ground Truth and other pages that are not transcribed, and so on. My test sets correspond to the same years, and I add them in the course of the training together with the training material for the same years. So we have material from 1582 up to 1588, and these are corresponding sets of training and test material.
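The idea of corresponding chronological groups can be sketched as a simple grouping step. The sketch below is hypothetical - the page identifiers and years are invented - but it shows the principle: every test group is drawn from the same years as its training material, never from a different period.

```python
from collections import defaultdict

def corresponding_sets(pages):
    """pages: iterable of (year, page_id, is_test) tuples.
    Returns {year: {'train': [...], 'test': [...]}} so that every test group
    has training material from the same year."""
    groups = defaultdict(lambda: {"train": [], "test": []})
    for year, page_id, is_test in pages:
        groups[year]["test" if is_test else "train"].append(page_id)
    return groups

# Hypothetical pages from court records, 1582-1588
pages = [
    (1582, "vol1582_p004", False), (1582, "vol1582_p005", True),
    (1585, "vol1585_p012", False), (1585, "vol1585_p013", False),
    (1588, "vol1588_p020", False), (1588, "vol1588_p021", True),
]

for year, sets in sorted(corresponding_sets(pages).items()):
    print(year, "train:", len(sets["train"]), "test:", len(sets["test"]))
```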
You can group them in the same way for individual writers; it's more or less the same procedure. Okay, the question that has to come at this point is: how do you choose the single pages for the training and test sets? That's not so easy, and I'm sure there are a lot of other individual ideas here as well. If possible, you should choose representative pages. That means: select pages per writer, neighbouring pages of the same writer. In my group, we select more or less two or three pages per writer and make ground truth, so that one page of roughly every third writer goes into the test set and is set to edit status Final, so that I can recognize that it is not ground truth for the training set. Afterwards, I export all the Final pages of such a document and re-import them into Transkribus to form a new document from them, a test document that includes only test pages.

To make it more clear, this is an overview of one of my documents - I don't know which year, 1627. The transcribers make their proposals for me, and the blue ones are ground truth. You can see neighbouring pages, same writers. In this case we selected one, two, three, four, five writers, six writers maybe. The red ones are the Final pages; they will go into the test set. The blue ones will go into the training set. And that's all we do in advance of the training. This is the conventional way to select pages. It's a little bit complicated, isn't it?

As always, there is an easy solution, an easy way, and the easy way is sampling. Maybe you heard what Sebastian said this morning about Compare Samples, the sample compare tool. It has existed for quite a long time, but it's worth recognizing it for what it is: it's genius. In a first stage you can sample, I think, test material. Samples have many advantages. They don't consist of whole pages - that's one advantage - but only of single lines. That means you don't have to transcribe so much for your test set. And they are automatically generated and randomly selected. This is a question some of you asked in advance of this workshop: how can I get a randomly selected test set or train set? You can do it with samples.

There are different ways to create samples and sample sets. You can do it in Transkribus: you only have to start the Sample Compare, and you can add single documents or single pages from your collection to a sample set. Then you can choose the number of lines that you want to get out of this. Here I added, I don't know, a couple of documents, more than 1,000, and told the system to give me 300 lines back. What you have to do first is a layout analysis, otherwise it won't work. This is one way. You will then have a sample set with 300 lines to transcribe and to use as a test set, for example.

You can also do it - and this is my advice - outside the Transkribus platform, probably before importing all your material into Transkribus. You will have it in a file system; I think most of you do. Simply use the file manager in your file system, select every hundredth image and put it into a single folder, import this folder into Transkribus, and so you get more or less something like an extremely varied sample of all your material, including the material that does not go into the training. From this document you can let Transkribus select, I don't know, 500, 1,000, 1,500 lines to create a line sample of all the writers in your material, and it will be randomly selected, so you can't cheat, and the system can't cheat either.
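The "every hundredth image" trick can be done with a few lines of Python before you import anything. A minimal sketch, assuming your scans sit in a folder as JPEG files; the folder names and the file pattern are hypothetical and would need to be adapted to your own material.

```python
import shutil
from pathlib import Path

def build_sample_folder(source_dir, target_dir, every_nth=100):
    """Copy every n-th image from source_dir into target_dir, so the folder
    can be imported into Transkribus as a broadly varied sampling document."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    images = sorted(source.rglob("*.jpg"))      # adjust the pattern to your scans
    selected = images[::every_nth]              # every 100th image by default
    for image in selected:
        shutil.copy2(image, target / image.name)
    return len(selected)

# Hypothetical paths
copied = build_sample_folder("scans/all_volumes", "scans/sample_for_transkribus")
print(f"copied {copied} images")
```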
This is what it looks like at the end. In both cases, what you get is a document that consists only of lines - pages, but each page has only one line. So what you have to do is transcribe 500 lines, and you will have a very representative test set for your material. Everybody should have one. Each line then represents a page or a writer, and the transcription effort - this is a simple calculation - is far less than with conventional test sets: instead of transcribing 20 or 30 lines per page, you simply have to transcribe one line, and that's all.

All right, let's talk about the by-products. Sebastian talked about them in the morning session as well. If you train a model, you get both dictionaries and language models, and the dictionaries are not very good, whereas the language models are. Dictionaries are essentially a list of tokens found in the training data, with or without added frequency information, and dictionaries follow a so-called rule-based approach, which isn't artificial intelligence. Since HTR is AI, this means that HTR and dictionaries don't fit together very well. However, as Sebastian told us in the morning, language models are also instances of AI, and they are created from the ground truth during a training session. I will cite Yoav Goldberg for a definition of language models: language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language model also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words. So the language model is able to make an informed guess about the following word, and because of that, HTR and language models fit together quite well. In our experience, you can reduce the CER by around 1 or 2% by using language models. As a rule of thumb: don't use dictionaries, use language models. If you have a very bad model, dictionaries might help, but if you have a decent model, you can boost it by using the language models.

Maybe it's necessary to say that there are experiences with dictionaries and single-writer models, very specific models, where dictionaries could help, because such a writer has a specific range of words and a specific topic to write about; with large models, dictionaries won't help. So I'll give you some specific examples of dictionaries and large models. This is one of my annual test sets, one year, 24 pages. The blue columns are the pure model without dictionary; the red columns are the HTR result of the model with dictionary. The arrows indicate on which pages the CER became worse after using the dictionary. You can see that on more than half of the pages the character error rate was worse than without the dictionary. You will hardly notice this effect when looking only at the average CER of the model, for example in a simple compare or an advanced compare with the average CER. You will notice it especially if you go into the document and do a kind of advanced compare for single pages to see what really happens. This requires a detailed validation before you decide to apply a dictionary to a whole batch of pages, like 100 or 200 pages. In my experience, dictionaries are not worth using with large models.

Language models are totally different. We have not had much time to test language models; they came with the last version of Transkribus, or with the last snapshot. With language models it's different.
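As a toy illustration of Goldberg's definition quoted above - a model that assigns a probability to the next word given the preceding words - here is a minimal bigram language model built from a few invented lines of ground truth. Transkribus builds its language models internally during training; this sketch only shows the principle, not its implementation.

```python
from collections import Counter, defaultdict

def train_bigram_lm(lines):
    """Count word bigrams in the ground truth and turn them into
    conditional probabilities P(next_word | previous_word)."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

# Invented ground-truth lines, for illustration only
ground_truth = [
    "datum den ersten tag septembris",
    "datum den andern tag octobris",
    "datum den ersten tag decembris",
]
lm = train_bigram_lm(ground_truth)
print(lm["den"])   # {'ersten': 0.666..., 'andern': 0.333...}
# An HTR engine can use such probabilities to prefer a likely word
# over a visually similar but improbable reading.
```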
This is the same model. In this graph you see the test sets grouped by years. For a couple of years I ran the validation of my model with and without the language model. The blue one is without the language model, the red one is with the language model. You can see that the language model almost always leads to an improvement of the CER for the year where it was applied. I applied it to this chronological series of test sets to see if there are peaks in one direction or the other, but I was surprised that the language model works in all the years: in every single year the performance with the language model was better than before. So this could be a way to go with generic models: apply the language model and see what comes out. My experience: half a percent or one percent better on trained material, and on untrained, absolutely unknown material you can get a two to three percent better CER with the language model.

Another one. Language models are not perfect; they can have similar effects as dictionaries, and it depends on the quality of your model. Here you have two different years, 1654 and 1655. In the earlier range my model is very strong and makes no mistakes, or it makes mistakes but not many; in 1655 the model makes more mistakes. In the very well-fitted area of the model there are pages where the language model introduced mistakes that weren't there before; in the not-so-well-trained area the language model is almost always better than the pure model.

I would like to tell you something about the so-called recycling approach, which in my opinion is an efficient way to get from zero to something. Imagine your community is not very large. I'm a Slavist by training; I'm interested in, among other things, Church Slavonic, and I guess there are not very many people in this room who are also interested in Church Slavonic, which is a pity. But as a matter of fact, not much has been done yet with respect to training models for Church Slavonic. What do you do if you do not have many resources to produce large amounts of ground truth manually? I suggest you just browse the internet, look for already available digital editions and reuse, recycle this data: download the images, copy and paste the transcription into Transkribus and use it for model training. You might also want to ask your colleagues whether they would send you Word files and digital images of traditional editions. This is what happens in my field all the time. People are more or less conservative; they use computers as a better kind of typewriter. They have their manuscripts, they use a computer to type the text into a Word file, then they print a book and throw the Word file away. I suggest we recycle these files, originally intended for printed editions, to produce ground truth in an efficient way.

Yes? Is there no problem with copyright? Well, you will have to ask. If there is an internet edition, there is a license; you can read the license, and if it is open source there is no problem at all. And if you talk to the colleagues, they should know everything about intellectual property rights. Basically, you have to ask.

With respect to the recycling approach, you might need to carry out some pre-processing steps. In my field there are a lot of pre-Unicode encodings; it used to be very difficult to reproduce Church Slavonic letters.
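A typical pre-processing step for such recycled editions is mapping a legacy, pre-Unicode font encoding onto proper Unicode before pasting the text into Transkribus. A minimal sketch, assuming a hypothetical legacy font; the mapping table, file names and source encoding are invented, and every real legacy font needs its own table.

```python
# Hypothetical mapping from a legacy 8-bit font encoding to Unicode.
# Real pre-Unicode Church Slavonic or Greek fonts each need their own table.
LEGACY_TO_UNICODE = {
    "w": "\u0461",   # pretend the legacy font used 'w' for Cyrillic omega
    "#": "\u0463",   # ... and '#' for yat
    "+": "\u0466",   # ... and '+' for little yus
}

def convert_legacy(text, table=LEGACY_TO_UNICODE):
    """Replace legacy code points with their Unicode equivalents."""
    return "".join(table.get(ch, ch) for ch in text)

# Hypothetical files and source encoding
with open("edition_legacy.txt", encoding="cp1251") as f:
    converted = convert_legacy(f.read())
with open("edition_unicode.txt", "w", encoding="utf-8") as f:
    f.write(converted)
```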
I think if you come from Greek studies or Devanagari or things like that, or Arabic, you know this issue very well, and you might need to add some pre-processing steps. But in my opinion this is a very efficient way to get from zero to something, and you might be able to greatly reduce the costs and help your community by training models and making them publicly available.

Just to respond to the copyright question: as long as you don't put the data up, you can create a model, and without publishing the training data there is no copyright issue. Yeah, that sounds reasonable. Good, thanks.

I'm wondering, because in cases like this, what you're thinking about is, let's say, a classical edition, and then one question is whether it doesn't create a problem of hypercorrection. You take the traditional edition, and then the model would be sort of trained to read the classical text rather than the actual version that you're trying to transcribe. This is just a question. All right, so you're talking about normalization and emendations and things like that. You see, in my field, in Slavic studies, we used to produce very faithful editions, letter by letter, and so I can use them. But if you have a normalizing edition, the model will learn how to normalize, whether you like it or not.

And the other question: actually, I'm all for using this, and especially when you have editions of the Bible or other classical works, wouldn't it be better to use smaller segments, let's say a per-chapter or even per-page language model or dictionary, in order to also minimize the effort? Actually, in Text2Image you basically do that, because you give it the text. Yeah, well, I think generally the more, the better. If you have more pages that you can copy and paste, the model will get better. If you have a very limited amount of time, you might want to select randomly, of course. Yeah, that's true.

All right, we've talked about that before: if you follow the recycling approach, it might well be the case that you introduce even more noise, and you will have to think about whether you can live with that noise or not. And maybe you actually can live with it: you will lose like 1 or 1.5% of CER, but you will be able to train your model considerably faster than if you got rid of the noise beforehand. It's always a trade-off, but I would like to suggest that you think about it.

Working with base models. Only a few of you have asked about base models, I think because the possibility to train with base models in HTR+ has only existed since September or November of last year, something like that. The colleagues who trained models with the old HTR two years ago know very well how base models work, so this is a kind of refresher, and I want to show you what base models can do with HTR+. As we heard this morning, base models are thought of by the developers as a kind of help for getting started. In my opinion there are at least two other strategic ways to use base models, with some problems involved, but I want to present these three strategic uses to you: as a starter model; for the further development of a model, or even for combining writers - some of you asked how to combine writers from specialized models into a kind of generic model; and, as the last point, as a kind of model booster for large models.

First one. You can use base models as starter models, but this is not really necessary.
I think Dorothee Huff trained two series of models for me - or I asked her to do so - one with a base model plus about 1,500 words, some 10 pages, of new ground truth on top of the base model, and one without. There was a difference between training with and without the base model, but with more than 40 to 45 pages you won't need a base model to get started; that is enough to have a solid model of your own that will work well on a single writer. With generic models it's a little bit different, but the real power of base models lies in a different area.

Base models, as we heard this morning, retain, remember, what they have learned up to a certain point. So every new training you do, with your last model as base model, improves your model. Base models are useful for the continuous development of a model: if you make a series of 10, 15, 20 training sessions to get one good general model, the base model is the way to go. I'm not so sure at the moment, because Sebastian said this morning that it is not guaranteed that the ground truth of the base model will be remembered through all the later training sessions in the series; my experience up to now is that it does remember.

First I'll give you an idea of how to start a training with a base model, for those who have never done it. You simply activate the filter "Ground Truth only" - because you use the version management, we've made that clear. Then you choose the base model; in this case I chose a base model from another project that is quite good. Then you add only the new ground truth, not the whole, I don't know, thousand pages of the base model - only the new ground truth. And then you use the old validation set of the 5.1 model: you can find it in the HTR model data, the validation set is stored there, and you can simply add it from there as the validation set. And then you can start the training.

What I did in this use case: the base model has 150,000 words and eight different writers and was trained with 500 epochs up to a CER of around 4%. In the new training I added only 10,000 words, and only ground truth for two of the eight writers, and I retrained with 500 epochs to make it comparable with my old model. The CER of the new model is 3.8%. That is not as much of an improvement as I expected, but if you do a more specific validation - in this case for the two writers for whom I added ground truth, Balthasar and Engelbrecht; the blue columns are the base model, the old model - you can see that for all the writers, not only the two writers for whom I added ground truth, the results are better than before, thanks to 500 more epochs of training on the base model. I repeated the training, and within the training the model remembers what it had learned before, and so this is a better result for all the writers. So for combining writers: try it, maybe this way you can get a really combined model for more writers.

Model booster. If you want, you can improve strong generic models with the help of the public models that are available in the Transkribus community. This is a technique I tried only with very large models: using a large model as a base model, you can improve your own model. Two pieces of advice before starting: check the properties of the base model, and don't use a base model that doesn't fit your material; and, if possible, try to predict the performance of the base model on your material. I will show you two ways to do this. This use case is on material from the 17th century. Among the public models there are two that could suit my experiment: German Kurrent M1+ from the Transkribus team in Innsbruck, and the German Kurrent model of Tobias Hodel. The first one has 1 million words of ground truth, the second one 1.5 million words. They both cover the 17th century, where my material is situated, and both are trained for German Kurrent. So I tried to make a prediction first. I used the sample compare: I have a sample set with my untrained material, and I started a sample compare for both models. As you can see, the winner is the German Kurrent model from Tobias Hodel; German Kurrent M1+ came out at around 24% on my material. So I chose the second model as the base model for my training. I then added 108,000 words; my own model has an average CER of 7.3%. I retrained my ground truth on the German Kurrent base model and reached a CER of 6.6%. So this is nearly 1% better without adding any new ground truth, not a single page. And if you follow the line for the single years covered by the model, you will see this is not a statistical anomaly: in all the years the resulting model was better than the base model and better than my own model, without doing anything. This is what base model training as a model booster means. Try it: if you have a strong model of your own and a strong base model, and if they fit, try to get more out of them.

Okay, that was base model training. At the end: validation. I have spread bits about validation all over the presentation, so I will simply give an overview of the validation tools. You have three validation tools: Compare Text Versions, Compare, and Sample Compare; you can find them under Tools.

Compare Text Versions is a nice thing. We all love numbers - we are always talking about a CER of 5% or 6% - but to get an impression of how the model performs, it's very helpful to see the text versions and where the model works well and where it doesn't. This page has a word error rate of 12%, as you see, and Compare Text Versions gives you an idea of how to deal with character error rates. The character error rate is calculated from the total number of characters and all the insertions, substitutions and deletions required to get from the HTR result to the ground truth. This means every wrong character, no matter what kind of character, counts as an error. But if you look here, not all of these errors have an effect on the searchability of your text. It depends on what you want to get out of it at the end: if you want a searchable plain text, half of all the errors counted in the character error rate are not important for full-text search. So don't worry too much about the CER; check it in some kind of overview.

I think you know the other possibilities for validation. There is the simple Compare, only for single pages, with a free choice of reference pages: you can compare whatever you want with each other; it doesn't have to be ground truth. This is the same page that you have seen before. In the Compare tab you have a very useful validation tool, the Advanced Compare. Advanced Compare makes sense if you group test sets by years or by writers: you can start a validation for a whole set or a whole document, and you will get the individual values for every single page. So the CER for this page is something like 7%, and you can see that there are a lot of pages in this group under 5% CER, and then there are some pages with abnormal peaks. The way to go is to look at these single pages and see what happens there: is it a single writer, the kind of script, maybe concept writing, Latin writing, whatever, and to analyze where you have to work more on your ground truth or something else. This is how it looks, a little bit larger.

Do you have any insights about how to interpret the relation between the word error rate and the character error rate, and the changing ratio between them? For example on page 9 and page 18: the word error rate is larger - in every case larger than the character error rate, that's clear - but sometimes the ratio is 1 to 10 and sometimes it's 1 to 2. I would have to look at that page to say. This kind of validation helps you to get into the material and to analyze what went wrong. Normally I don't pay much attention to the word error rate; I never work with it. It's not important for me, because it doesn't really indicate anything. It only shows you, it assures you, that even with a CER of 2%, if your word error rate is 5%, 6%, 8%, that means that every tenth word will not be found in a plain-text search - nothing else. It can also be that a word is capitalized when it shouldn't be, or that there is a comma that shouldn't be there, and that doesn't matter for the search. I just want to note that if you click on the result of one page in the Advanced Compare, you also see the compare text versions, so you can look at the text there. So if you have page 8, for example, where the character error rate and the word error rate differ, you can double-click in this upper table and see the text versions of each page. Yeah, that's right.

Okay, sample compare and predictions. In the workshop title it said something about how to make predictions with models. This is what you can do with a sample compare. If you create a sample set, in the different ways I presented to you, you are able to make predictions about the impact or the performance of a model on your material, or maybe of a model that you haven't trained yourself. I showed you this with German Kurrent M1+ and the other German Kurrent model, to make a prediction of how they would work on my material. Another thing is to make predictions, for example, for working with or without dictionaries. If you have a sample set, you can simply run the HTR with language model or the HTR with dictionary on your sample set and then compare the validations, the predictions, of both. In this case I ran my model, a strong model, without the language model: the mean, the prediction, is 8.5% on the unknown material. Then I ran my model with the language model, and the mean is 6%, so 2.5% better than without the language model on unknown material. That's, for example, something for which the sample compare can be used very well: to make predictions for your own material. That's basically it. Thank you very much for your attention.
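As a closing illustration of the error rates discussed in the validation part above - the character error rate as the insertions, substitutions and deletions needed to get from the HTR result to the ground truth, and the word error rate as the same calculation on words - here is a small worked sketch. The example strings are invented, and the edit-distance function is a generic Levenshtein implementation, not the one Transkribus uses internally.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimal insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth, htr_result):
    return edit_distance(list(ground_truth), list(htr_result)) / len(ground_truth)

def wer(ground_truth, htr_result):
    return edit_distance(ground_truth.split(), htr_result.split()) / len(ground_truth.split())

gt  = "datum den ersten tag septembris"   # invented ground-truth line
htr = "datum der ersten tag septembris"   # invented HTR result with one wrong character
print(f"CER: {cer(gt, htr):.1%}, WER: {wer(gt, htr):.1%}")
# One wrong character keeps the CER low, but the whole word counts as wrong,
# which is why the word error rate is always higher, sometimes many times higher.
```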