And first up is Simon Kemper from the National Archives of the Netherlands. He will be talking about transcribing corrosive iron gall ink in the tropics. A very interesting topic, because here you can also see that automatic transcription might not just be useful for things that are hard to read because they are old, but also because they are not in that good a shape. So, let's see what it's about.

Is this thing on? Yes, it is. Thank you for having me today. I will be talking about something I've done in Indonesia and something I'm doing right now in the Netherlands as a key user of linked open data. I'm also involved in entity processing and in helping Liesbeth with the IJsberg model from time to time; you've probably seen her presentations at previous conferences. I will talk a little about what you can do by using damaged material and combining it with material in a good state of preservation, and how that influences the character error rate and other statistics. But I will mainly be talking about how to communicate with users about the reliability of search results. You have all this metadata that you need to give to your users to give them a good impression of what they are searching. Once you start transcribing and digitizing, you create an illusion, which others have talked about: the illusion that you can search for everything in the archive. Obviously, as has been mentioned time and time again, not everything will be digitized; there will hardly be any archive capable of digitizing its entire holdings. Beyond that, transcriptions differ and not everything is transcribed. But the quality of the records matters too, and that is much more difficult to communicate or to calculate. So that will be the main question today.

So you have it listed here: reliability of search results. I keep it deliberately vague, because I want to address the fact that whatever general statistics we use (CER, F1, loss), it becomes more and more appropriate to also start thinking about corpus-specific statistics, based on the particularities of, let's say, the Dutch East India Company archive, which I will be talking about today, but it could really be any archive. For anything with millions of records you could develop statistics and benchmarks specifically for that archive. In the Netherlands we have a very crowded field in 17th- and 18th-century Dutch transcription and natural language processing; more and more parties are doing it. So it pays off to compare these projects with one another and not rely only on the more general means of evaluation, which are already there but not always sufficient. F1 scores, character error rates: we all know they don't tell the full story.

I will be talking about two main institutions. First my own, the National Archives of the Netherlands; you will have heard at least some of what it has been doing over the last couple of years. There are other initiatives now going on at the Huygens Institute, and there are some of their people here too. They deal specifically with the letters and reports sent from Indonesia and other colonies to the Netherlands. That is about five million records, and they are trying to create a very deep search model for it, much deeper than the one we created at the National Archives. We are also dealing with maps: we recently partnered with a project called Allmaps in Amsterdam.
Allmaps is also a partner of the National Library of the Netherlands and of Delft University. We try to use IIIF maps, annotate them automatically and then georeference them automatically. That is something I will come back to at the end. It relates to the corrosion: you might at first think geo-features and corrosion are very different things, but there is a similarity in how you handle them and how you evaluate them. And then there is our triple store, in which we connect all the data we collect, because I will also be talking about entities at the end. Not just entities for making indices, but also the text in which entities are embedded, as a way of evaluating transcriptions. So the other way around from what you usually do: the language processing also becomes a kind of feedback on what you did in previous steps. If you want, you can look at our website, but it's not essential for this presentation, so you can skip scanning that too.

Then the other programme, which I participated in as well. I lived in Indonesia for seven years and was part of this programme, but obviously the main work was done by the great archivists there. Think of Kang Jajang and Kang Harris ('Kang' means brother; in Indonesia you always address your colleagues like that, like family), who are pictured here. They did truly pioneering work. They started digitizing before we did in the Netherlands, before many others did, in cooperation with a Dutch organization, the Corts Foundation. They were among the first, and they did it before there was HTR and before natural language processing could really be applied, so they transcribed everything themselves. They didn't do the entire text, though: they did the marginalia and the headings, those pieces of text that give structure to the archive. You might also want to note that they produced about 1.2 million scans. We have many more now, but their material remains useful, so I hope there will be more collaboration between the National Archives of Indonesia and the National Archives of the Netherlands, for reasons I will come to now. As a general side point, Indonesia has many more records relating to this colonial period than we have in the Netherlands, so it really is a treasure trove.

They were not the first, though; the Corts Foundation and the archives in Indonesia were not the first to try to preserve these records in the tropics. The other effort, not really a programme but more a series of volumes, was published more than a hundred years before that, in the late 19th and early 20th century. Unlike many other volumes published on the VOC, it was really aimed at preserving the archive: they saw the records of the Company archive vanishing in front of them and said, what can we do? Just transcribe everything. It took them about four decades. Preserving in this case meant transcribing the main records that summarise the rest, resulting in only about 12,000 pages of published material (11,460 pages, to be exact), which, taking averages, translates to roughly 14,000 archival folios. Now the question is how to connect this and how to use it as ground truth.
We haven't really done that so far, even though many of the sources they transcribed are highly corroded and extremely difficult to read nowadays, even for people who are really good at palaeography. Some of them are still in a good state, but some of these so-called daily journals, which were digitized in the programme I just mentioned, basically look like barcodes: you only see black lines everywhere and you don't really know what to make of it unless you really focus. Obviously, creating ground truth for this by hand would be very laborious.

So what I did first was create a model for the printed text that is nearly perfect. The overall character error rate of this model, which was trained on microfilms and other damaged printed material, is 1.59%, but if you apply it to a training set of specifically these printed volumes you get 0.0101%. That is almost nothing, and the remaining errors are in the superscripts, so you can really put the output to use. The problem is that the lines differ: the lines of the published material and the ones in the archive are not the same, so you need delimiters. How do you get those? That's the question.

To process the archival material I used an adaptation of Liesbeth's IJsberg model, which has a more varied combination of handwritings from the National Archives. This initial model had everything in a good state of preservation, so I didn't really have to use the corroded material for it. It also helps to structurally take inconsistencies out of the transcriptions; that had already been done for the IJsberg model, we simply did it once more, and we could bring the character error rate down to 3.4%, about three percentage points better than the earlier model. Then we added the corroded material, which is version three. The character error rate first goes up, as you can expect, because it is simply more difficult material, but it also means less overfitting as more of this corroded material goes in, and in the final model we also added faded-out text. The character error rate goes up. But what you see if you do separate test statistics is that the character error rate on the samples from sources in a good state of preservation is extremely low. I think this figure is a bit of an exaggeration; I did some extra tests with tables, and they came out around 1%, between 1% and 1.5%, so I want to give you a slight update here, I don't want to be too optimistic. But it is definitely a decrease from the roughly 3% you saw earlier.

At the same time, the severely corroded material went from basically unreadable, probably around 80%, to 33%, by adding dozens of the transcriptions. Not yet those 14,000 pages, I haven't done that yet, just a couple of dozen of them with manually added delimiters. Now, you might think: what do you do with 30%? But with that 30% you can automatically detect the delimiters, allowing you to process the 14,000 pages that I wasn't able to do by hand. So as long as you can distinguish the lines and create a script for cutting the printed material according to the archival material, you are just one step removed from actually putting those 14,000 pages to use, and that was the aim of this particular test. The presentation can perhaps tell you more about the entire set, but this much is promising: cutting the error rate from three to about one can only be a good move.
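The cutting script described here could look roughly like the sketch below. It is only an illustration, not the speaker's actual pipeline: the pipe character as delimiter, the function names and the PAGE-style baseline dictionaries are all assumptions.

```python
# Sketch of the cutting step: split a printed transcription into archival-line
# segments using inserted delimiters, so each segment can be paired with a
# detected baseline. The "|" delimiter and all names here are illustrative.

def split_printed_text(printed_text: str, delimiter: str = "|") -> list[str]:
    """Cut the printed edition wherever a delimiter marks an archival line break."""
    return [seg.strip() for seg in printed_text.split(delimiter) if seg.strip()]

def pair_with_baselines(segments: list[str], baselines: list[dict]) -> list[tuple[dict, str]]:
    """Pair each detected baseline (e.g. from a PAGE XML export) with its segment.

    Only succeeds when the counts match; a mismatch usually means a missed or
    spurious baseline, so the page is flagged for manual review instead.
    """
    if len(segments) != len(baselines):
        raise ValueError(
            f"{len(baselines)} baselines vs {len(segments)} segments: needs manual review"
        )
    return list(zip(baselines, segments))

# Example: two archival lines marked in the printed edition with "|".
printed = "Aen den Edelen Heer Gouverneur | Generael ende de Raden van Indien"
pairs = pair_with_baselines(split_printed_text(printed), [{"id": "l1"}, {"id": "l2"}])
```

Whether the pairing succeeds page by page is itself a useful signal: pages where the counts disagree are exactly the ones where the baselines or the delimiters still need attention.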
As for the errors that still occur within the well-preserved material: in the material from the National Archives they mainly concern things like superscripts or other abbreviation marks that cross out letters, plus the occasional rather rare scribal error or scribal oddity.

So what about evaluating this? We can do the sample statistics, we can work with character error rates, but can we go deeper than that? Can we get a better clue of what happens when there is a huge hole in a page, or you see the traces of termites, or everything is blackened and the paper is browning? I'm not the first to ask this question. Another member of the READ-COOP, who also worked on the digitization in Indonesia (he did the website, among other things), I think Mark Rowling, had done research on this, I think also sponsored by Transkribus, or perhaps not, but at least in cooperation with those people, looking particularly at these kinds of phenomena. Iron gall ink, as most of you will know, was very commonly used in Europe, but also in the colonies. You might wonder why they didn't use Indian ink, being so close to China. They tried, but they never really managed to replicate that formula, so they mainly used Indian ink for stamps and kept using the same old iron gall ink, with the same old corrosion issues, in their own administration. When this administration moves from Europe to the tropics, high temperature and humidity obviously make the effects of corrosion much worse. So you get burn-through, haloing, lacing (in which the paper cracks and the middle pieces fall out), and you can add many more issues on top of that. But I suppose as archivists or researchers you are familiar with these kinds of problems.

So that research was trying to see what the breaking point was for corrosion. Five minutes. Ten minutes? Five minutes. Okay, we can do the questions in the coffee break. Now, what we found is that it is really the combination of issues, browning, corrosion and holes, that causes HTR to fail completely, to go up to 100%. I did this based on histograms and on blur grades, but it was really just a pioneering attempt at struggling with this issue. Those histograms were not very useful in the end: you can detect colours, but it doesn't really tell you what the HTR will look like. More has been done since then, as we all know from the many publications. It is often image classification, really a matter of adding more GPUs and tinkering with the model, and the work around Transkribus has engaged with that very much as well.

What I'd like to propose today is a slightly alternative method, which obviously only tries to complement the great work that has already been done. You can put marks on images, you can classify them, give them a grade, but what does that tell you about the baselines? What does it tell you about the HTR? That was the main question I asked when I read Marco's piece two years ago. Baselines also tell a story by themselves, not based on image processing. What happens if we really start doing calculations with the baselines themselves? That only became possible thanks to Transkribus: about three months ago the baseline model was released, and I'm really amazed. I don't want to flatter, but I'm quite struck by how effective this baseline model framework is. You can see the text here; I had to cut the slide down because of the 64 MB limit.
Yes, but these baselines can actually run across the holes. That raises a whole new set of issues: what if a baseline runs from that part over there to that other part, across a hole, and the model starts transcribing it? Or what if you attach ground truth to it from a hundred or a hundred and fifty years ago, ground truth that still contains the text of what is now a hole? Doesn't the computer at some point start creating fixes right at the holes, inserting something it cannot possibly calculate, a completely fictitious estimation? So for that, classification and looking at the image might still be useful: for those grey areas, just take them out of the transcription. You take those coordinates and you remove the transcription there altogether, because the new model I created can actually 'interpret' all kinds of things that would very much confuse our users.

Okay, just briefly on the baseline model, as I really need to rush. You see the three models: the more material you add, the higher the loss actually gets, but that's good, because we have less overfitting, so this model is actually much better; you've probably experienced that yourself. As I keep saying, the statistics alone never tell the full story, with baselines just as with characters. But we don't have this issue anymore: we can finally draw proper lines here. And we can try to calculate the lengths and the numbers, the averages per page, and notice the outliers. We can spot the outliers, go to those particular pages and say: hey, we need to correct the baselines here. That way we can improve the model for exactly those preservation states you really care about, and it's easy to try.

Now, entities. We talked about this: we have an F1 of around 90% for our entities. We have about 100,000 tags from this material, mainly from the Netherlands, in seven categories: places, persons, group designations, organisations, textual references, quantities and temporal references. Please ask me about this in the break. You can also look at the text in which an entity is embedded, search for that specifically, and thereby do evaluations on the context of entities, which is different from lines: it's the larger syntax around an entity. Why would you do that? Because those kinds of phrases are frequently repeated. There are entities in sentences that occur on every page of an archive: certain phrases indicate headings, the start of a letter, the end of a letter, the signing of a report. Those are frequent, and they are most often the easiest for a model to learn, because it simply gets more ground truth for those particular phrases and those particular characters. You can, in a way, specialise your evaluation system by focusing on them and on the text around the entities. Also, using the fuzzy search module built by a colleague of mine, we look at Levenshtein distance and at the way these phrases may have variations but still be inherently the same. In the sample sets from Indonesia, for instance, you can look at phrases that stay constantly the same, even though the holes and the corrosion make them hardly readable; you just know that, time and time again, it's the same phrase. And we often have copies of the same documents in the archives in The Hague. So in that way you can start using these key phrases not only for training purposes but also for validation purposes. Three minutes.
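The baseline statistics mentioned above could be sketched like this: per page, count the detected baselines and measure their lengths, then flag pages whose values sit far from the collection average so they can be checked and corrected. The input layout (page id mapped to baseline point lists) and the two-standard-deviation threshold are assumptions for illustration, not the project's actual code.

```python
import math

def baseline_length(points: list[tuple[float, float]]) -> float:
    """Sum of segment lengths along one baseline polyline."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def flag_outlier_pages(pages: dict[str, list[list[tuple[float, float]]]], z: float = 2.0) -> list[str]:
    """Return page ids whose baseline count or mean baseline length lies more
    than z standard deviations from the collection mean."""
    stats = {
        pid: (len(bls), sum(baseline_length(b) for b in bls) / max(len(bls), 1))
        for pid, bls in pages.items()
    }
    flagged = set()
    for idx in (0, 1):  # 0 = number of baselines, 1 = mean baseline length
        values = [s[idx] for s in stats.values()]
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
        flagged |= {pid for pid, s in stats.items() if abs(s[idx] - mean) > z * std}
    return sorted(flagged)
```

Pages flagged this way are candidates for baseline correction and, once corrected, for retraining.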
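And the key-phrase validation just described might be sketched as follows: scan a possibly corroded transcription for recurring formulae (headings, letter openings, signing formulas) with Levenshtein distance, and use their presence as a rough signal of transcription quality. This is not the colleague's actual fuzzy-search module; the phrases listed and the threshold are placeholders, and the sketch assumes the python-Levenshtein package.

```python
import Levenshtein  # pip install python-Levenshtein

# Placeholder formulae; a real list would come from the archive's own recurring phrases.
KEY_PHRASES = [
    "aen den edelen heer gouverneur generael",
    "onder stondt ende was geteeckent",
]

def find_key_phrases(transcription: str, max_ratio: float = 0.3) -> list[tuple[str, float]]:
    """Slide each key phrase over the transcription and report its best normalised
    Levenshtein distance; low values suggest the formula survived corrosion and HTR."""
    text = transcription.lower()
    hits = []
    for phrase in KEY_PHRASES:
        n = len(phrase)
        best = min(
            (Levenshtein.distance(phrase, text[i:i + n]) / n
             for i in range(max(len(text) - n, 0) + 1)),
            default=1.0,
        )
        if best <= max_ratio:
            hits.append((phrase, round(best, 2)))
    return hits
```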
Now, what about material from outside the tropics? Most of you will not have this kind of material. At the National Archives we still have teams and scanning partners working in countries like Indonesia, so we still get this kind of material into our own digital collection. And we do have some of it ourselves, apparently: Brazilian documents, for instance, that are in a very bad state; one of our users discussed that with me just yesterday. Yes, there you are, thank you for the comment yesterday. There will be more, obviously. The National Archives does hold deteriorated documents, but perhaps not in the same state as what you see in the tropics.

But maps, right, maps are quite similar in many ways, because they also have all these geo-features and all kinds of disruptions of the text. So we could use these techniques to improve our automatic annotation of maps, and this is very important. For maps we specifically trained a maps baseline model, and you can see it works wonders, not only with words and characters but also with features like houses or flags. Why do you need those? Because if you have a place name with a little house next to it telling you that the settlement is at that point, then with those place names you can start doing automatic georeferencing together with our partner Allmaps. That's extremely important: you can immediately give these places a representation, put them into a knowledge graph, and combine coordinates with place names, dynamic as they are. So there's a lot to be gained here, and, interestingly for today's discussion, you can get a new level of evaluation by combining things. If you already know the coordinates, you can place these documents in a GIS, like I did here with this page, together with its neighbours, and that complements the transcriptions. And you can also start comparing the coordinates of the annotations themselves: if you have a map from twenty years earlier, you can see what the spelling variations are. So that's another way in which the domain you're working in allows you to do particular kinds of evaluation on your transcriptions. It might be worthwhile for more people to start doing this, and I hope they will, because it's really exciting to see this kind of work on texts from different areas.
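The comparison of annotation coordinates across maps could look roughly like the sketch below. It assumes the annotations have already been georeferenced to WGS84 longitude/latitude (for instance via Allmaps) and uses a simple haversine distance; the data layout and the 2 km cut-off are illustrative, not the project's actual method.

```python
import math

def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def spelling_variants(map_a, map_b, max_km: float = 2.0) -> list[tuple[str, str]]:
    """map_a and map_b are lists of (name, (lon, lat)) annotations from two maps.
    Return pairs of nearby annotations whose names differ: likely spelling variants."""
    variants = []
    for name_a, pos_a in map_a:
        nearby = [(haversine_km(pos_a, pos_b), name_b) for name_b, pos_b in map_b]
        nearby = [(d, n) for d, n in nearby if d <= max_km]
        if nearby:
            _, name_b = min(nearby)
            if name_a.lower() != name_b.lower():
                variants.append((name_a, name_b))
    return variants

# Example: the same settlement annotated on two maps made years apart.
print(spelling_variants([("Batavia", (106.82, -6.13))], [("Batauia", (106.83, -6.14))]))
```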
My last slide; I guess this is enough for you, right? Just on this very last topic, before my time runs out: the mapping, and also the extra transcribing of crossed-out text, leads to questions about Unicode too, about how and where to deal with all the symbology, but that's something for another discussion at another time. Okay, I hope this talk has raised some questions; please ask them.

Can you wait just a second? I'll give you the microphone. We can do one or two questions, so that we spread the lost time across all the presentations.

If you have access to the original documents, can you rescan them, can you illuminate them differently, image them differently, and get more information?

You definitely can. But are we as good as termites? That's the question you can ask yourself. You can create artificial damage; you can replicate damage yourself on archival material that is in a perfectly good state of preservation. So you might ask: do we even need the material from the tropics if we can just do it on our computer? In HTR you actually do that as one of the steps you take. Was that your question, or were you asking something else?

No, I was just thinking: your damage, your black blotches, they're not completely saturated, so you could possibly image them differently, adjust the capture.

We could try that. But basically, I believe that whatever process you use, it's better not to alter the images; that's my experience, and Marco's as well. If you make that part of the actual HTR process you might actually decrease the consistency, so I would not do that. But let me come back to the other point, which was related to what you were saying: in general, the damage that was created in the tropics is very difficult to replicate. People have tried to generate artificial damage, but it's really better to use the tropical material directly.

Maybe I can weigh in here a little about image preprocessing. It can be of value, but it obviously doesn't scale that well, because it needs a bit of individual treatment; not all the damage will look the same. But if you have the time and the resources, you can gain some extra quality with image preprocessing. You can weight certain features of the image more strongly, also with AI preprocessing; there's AI for that too. If you have too low a resolution, for example, you can improve on that with the help of AI.

Yeah, exactly, or with uniform collections for that matter.

Okay, we can do one more question if there is one. Otherwise we can move on. Thank you for this very interesting presentation. And just a second, we're not letting you leave without a mug, because researchers need to keep their strength up and drink lots of coffee and tea, or whatever they like.