Well, thank you everyone. This leads on directly from the last presentation and looks specifically at transcription and machine text recognition of manuscripts. Online we have Sara and Satdeep. I think. Are you two there? There's the three of us. They are online, I guess.

So the process that we've been going through for the last year or so, in fact going back before that, is this: we had two existing OCR systems for Wikisource. First of all we had Google OCR, and then we added Tesseract a couple of years after that. The project in the last year has been to add a new system, Transkribus, to the set of OCR tools available. You can see here an English-language handwritten manuscript, and I don't know if you can quite read the text, but this is using a generic English handwriting model to transcribe it. It's not particularly great, but training a specific model gets us much higher quality than that. Here's another example in Italian. Transkribus has about 85 different models available in different languages and for specific bodies of work, and anyone involved in Wikisource can learn to train an OCR model. I think Sara is going to take over now and talk us through it.

Sorry? Yes. (Hello, everyone. I'm afraid I can't hear you. Would you like your screen to be broadcast now? Yes. Okay, hold on. No. Oh, sorry.) Yes, the question was whether these are all Latin scripts, and they're not. There's a whole bunch of different scripts, and the whole point of the project is that it opens up the possibility of basically any script. There are certain requirements, such as the minimum number of pages you need to train a model. I think we'll switch over to Sara now.

Can you hear me? Can you hear us? Yes, I can hear you. Can you say something, Sara? Hi, so sorry, just give us a little time, we're trying to solve some technical difficulties. Can you hear me now? Okay.

Hello, everyone, and thanks for the introduction. I will explain what Transkribus is. Transkribus is a platform designed to simplify the time-consuming and laborious work with historical documents. The main idea behind the platform is to help researchers and users who work with historical documents, and to give them the opportunity to train AI models to automatically transcribe their documents. The main goal is to help people who can't read old scripts, and to speed up the transcription process so people can concentrate more on the content. Next slide.

What does Transkribus make possible? The main thing is automatic transcription of handwritten and printed documents. Sam will show you what is possible with the OCR integration in Wikisource: you can automatically transcribe an image, handwritten or printed. Right now we have more than 100 public models trained by the Transkribus community, and 11 have been integrated into Wikimedia OCR. It is also possible to train AI models for any language and script without coding skills or special equipment; a laptop and an internet connection are enough to train a model.
This collaboration between Wikisource and READ-COOP, the cooperative behind Transkribus, started thanks to the values we share. As you know, the Wikimedia movement is community-led, mission-driven and non-profit. Transkribus is sustained and developed by READ-COOP, a cooperative, and our motto is "purpose before profit"; we try to support educational and non-profit initiatives as best we can, and for us too the community is very important and has a big voice in developing the platform.

Sam has shown you the available models, but what can you do when no model is available for the language or script of your document? At this point you need to train a model for your documents, so you need to teach a computer how to read them. This is not possible in Wikisource; you need to switch to the Transkribus platform to do that.

I will just give you some technical background. What we are using here is machine learning. Machine learning is a field of artificial intelligence, and it focuses on the development of algorithms and models that enable computers to learn from data without being specifically programmed for every task. Thanks to the training we get a model. But what is a model? The model is the output of the training process, what was learned by the machine learning algorithm. Models learn to identify patterns in the training data and generalize that knowledge, making predictions on new data that was not used during the training phase. In our case, to train a handwritten text recognition model we give some images and the corresponding transcriptions to the machine, the machine learns from our transcriptions, and the result is a model that is able to generalize its knowledge and transcribe new pages.

Before starting the training we need to have our ground truth. Ground truth is a technical term used in machine learning: it is labeled data. (I just wanted to make sure that the slides are also changing, because we were stuck on the last slide. Sorry, I'll move on, I forgot to go to the next slide.) Here is our ground truth. In our case the ground truth is images and the transcriptions we have done manually. During the training we need to split our ground truth into two groups: the training set, which is the images and transcriptions from which the model learns, and the validation set, which is around 10% of the training data. The validation set is the pages on which the model checks its accuracy, so in this way we can tell how well the model performs (a small sketch of this split follows below). Next slide.

Now we will see the steps to train a model. First you need to create an account on the Transkribus platform; creating an account is free. Then you need to create a collection, upload some images, do the transcription and start the training. On the next slide we will see this step by step with a video. Can we switch? This is the interface of Transkribus. You see there is the work desk, where we work, and then the training tab, which we will use later to train your models. Is the video running? I think the video isn't working. Someone needs to play the video on this slide. We are just trying to figure it out. Okay, in the meantime I can explain the process. (There is a question about whether there is a link to the video. Yes, there is one.) This is the interface, and you see we are working in the work desk right now, and then there is the training tab here. The first step is to create a collection.
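To make that 90/10 split concrete, here is a minimal sketch in Python. It is not part of Transkribus, and the function and file names are illustrative; it only shows how a set of ground truth pages can be divided into a training set and a validation set:

```python
import random

def split_ground_truth(pages, validation_fraction=0.1, seed=42):
    """Split ground truth pages into a training set and a validation set.

    `pages` is a list of (image, transcription) pairs. Roughly
    `validation_fraction` of them are held out so the trained model's
    accuracy can be checked on pages it never saw during training.
    """
    pages = list(pages)
    random.Random(seed).shuffle(pages)
    n_validation = max(1, round(len(pages) * validation_fraction))
    return pages[n_validation:], pages[:n_validation]

# 30 ground truth pages -> 27 for training, 3 for validation.
ground_truth = [(f"page_{i:03}.jpg", f"page_{i:03}.txt") for i in range(1, 31)]
training_set, validation_set = split_ground_truth(ground_truth)
print(len(training_set), len(validation_set))  # 27 3
```

In Transkribus itself this assignment can be made automatically during training setup, as described later in the walkthrough.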
In this case I created just a "Wikimania 2023" collection, and then you upload your documents inside the collection. The collections are like folders, and inside a folder you have your documents. Here you can upload your documents, the ones you will use to train your model. You can upload images in JPEG or PNG format, up to 10 megabytes per image, or a PDF. In this case I am uploading 30 images from a Javanese manuscript; I took it from Wikisource. So I select the pages that I want to upload, here you can decide the name to give to your document, and then you can do the upload. We just check that all the images are there and submit the upload. Then there is the jobs tab, where you can check the status of your jobs and how the upload is going. It can take some seconds or a minute, depending on the number of pages you are uploading, and then we can open the document. Here you see all the images have been uploaded.

The first step to do now is to run the layout recognition. The layout recognition detects the text regions and the lines of text, and this is an essential step before transcribing our documents manually to create the ground truth to train the model on. Transkribus works in this way: you first run the layout recognition, Transkribus automatically detects the lines, and you get the lines to transcribe. Then we open a page. This is the automatic layout recognition; you see the regions and the lines. If there are some errors, like here, some small lines that have been detected incorrectly, you can just select them and cancel them. But usually the layout recognition works fine, and then you can start typing your transcription to create the ground truth.

There is also the possibility to add special characters. You go to the virtual keyboard, and here you can specify which special characters you want to add; in this case I have the Javanese Unicode characters. Then you can open the virtual keyboard in the editor and add the special characters manually. Another option, if you already have a transcription of these documents (in our case the transcriptions are already there on Wikisource), is to copy and paste the transcription from Wikisource into Transkribus. The downside here is that, as it is now, you need to do it manually line by line, so it is not possible to copy the entire text: if you try, it all ends up in the first line and the match isn't correct. We are working on it, so we hope that soon it will be possible to copy the entire text and send it automatically to the correct lines (a small pre-splitting sketch follows at the end of this step).

When you have finished the transcription you save your page as ground truth, which means that it is ready and can be used to train your model. Now here we have 30 pages of ground truth; you need at least 25 pages of ground truth to train a model. Now we can go to the training page, and here you just select the collection with your ground truth. You are asked to add some details about the model, like a name, a description, the language and the centuries of the documents. In this first step, when you train a model, the model is private by default, so only you can see it and use it, and this metadata is just for you. But if you want to make your model public, and also make it available to people on Wikisource, it is good to have a detailed description here, just to make people understand which documents you have trained your model on and which documents they can use this model on.
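On the copy-and-paste step above: until whole-page pasting is supported, one workaround is to split an existing Wikisource transcription into lines beforehand, so each line can be pasted into the matching line field. This is a small illustrative sketch, not a Transkribus feature, and the helper name is hypothetical:

```python
def split_transcription_into_lines(full_transcription, n_detected_lines):
    """Split a full-page transcription into individual lines for pasting.

    Warns when the number of text lines does not match the number of lines
    the layout recognition detected, since line-by-line pasting only works
    when the two line up.
    """
    lines = [line.strip() for line in full_transcription.splitlines() if line.strip()]
    if len(lines) != n_detected_lines:
        print(f"Warning: {len(lines)} text lines vs {n_detected_lines} detected lines")
    return lines

page_text = """First line of the existing Wikisource transcription.
Second line of the transcription.
Third line of the transcription."""
for number, line in enumerate(split_transcription_into_lines(page_text, 3), start=1):
    print(number, line)
```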
Then you are asked to select the training data. Here you can select whole documents or single pages, depending on which pages you have worked on. The next step is to select the validation data, so the pages on which the model will test its accuracy. Our recommendation is to automatically assign 10% of the training data to the validation data; in this way it is selected automatically and you are sure that the validation data is a real sample of your ground truth.

On the last page you see a recap and also some advanced settings. If it is your first training, leave the number of epochs and the early stopping as they are. You can also add a base model, meaning a model that has been trained previously by someone else; it can help to improve the knowledge of your model, but pay attention that a base model needs to have been trained on a similar hand and language. Then you can start training your model. You go to the jobs table and you see the status of the training. The training can take from a couple of hours to a couple of days, and you will receive an email when the training is finished.

At this point you can go to the model manager and see your trained model, but also all the public models that are available. We can go to the Balinese model, for instance. Here you see this is a public model, you see who created it and how many words it was trained on, and especially this number, in this case 0.47%: this is the character error rate, the performance of the model. It tells you how many errors the model will make on your pages. In this case the character error rate is very low, under 1%, and it means that the model will transcribe roughly 99 out of every 100 characters correctly and make only one error.

If we can go back to the slides, I will just show you what to expect when you train a model and how much ground truth you need. (We are just getting the slides back, one moment.) The amount of ground truth for printed text is lower than for handwritten documents, and usually handwritten text recognition is better than standard OCR for historical prints and newspapers, because it is specifically trained on that type of script. For printed text you just need around 5,000 words, so about 20 transcribed pages, and the character error rate you can reach is between 0.5% and 2%. If you want to train a model for handwritten documents, the ground truth, so the number of transcriptions, should increase. If you have just one hand and simple writing, you need 10,000 words, and the expected character error rate is between 2% and 4%. If there are several hands seen during training, you need several tens of thousands of words, so around 10,000 words for each hand, and the character error rate will be between 4% and 6%. And if you have many hands and you want to create a very general model that is capable of dealing with hands of the same period, including hands not seen during the training, you need more than 100,000 words, and the character error rate will be between 6% and 8%. So the model can transcribe new hands, but the number of errors increases. Next slide.

When you have trained your model and you want to share this model with Wikisource, first you need to send an email to the Transkribus team and ask to make your model public. We will get back to you asking some questions about your model, like who needs to be credited, and for a short description so people understand what this model can do. Then, in the Wikimedia OCR tool, you can make a request to add this new model from Transkribus to the OCR tool, and in this way it will be integrated into Wikisource.
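The character error rate figures above are essentially an edit distance: the number of character insertions, deletions and substitutions needed to turn the model's output into the correct transcription, divided by the length of the correct transcription. Here is a minimal sketch in Python of that calculation (for illustration only; it is not the exact evaluation code Transkribus uses):

```python
def character_error_rate(reference, hypothesis):
    """CER: edit distance between the correct transcription (reference)
    and the model output (hypothesis), divided by the reference length.
    A CER of 0.01 means roughly one wrong character per 100 characters."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance (Levenshtein).
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[m][n] / m if m else 0.0

# One wrong character in an eleven-character word: CER of about 9%.
print(character_error_rate("transkribus", "transcribus"))  # 0.0909...
```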
Now I pass the stage to Satdeep for the last slide. I think Sam can share what the progress is on the Balinese model, and I can talk about the next steps on the Wikisource side. We have only a couple of minutes.

So the Balinese model, as Sara was saying, has quite high accuracy, and it is now available in the OCR advanced options form, which is at ocr.wmcloud.org. Very soon it will be available on Wikisource as well, via the normal OCR dropdown options. There is a little bit more work to be done to make that available, and that is the same work that will need to be done for any future model that is trained. We are not making everything available through the Wikisource user interface yet; for each Wikisource we are setting one default model, just for the time being. I think at some point there is going to be an easier-to-use interface, more documentation and more training, so that people can choose whichever is the most suitable model. Especially on the multilingual Wikisource that is going to be necessary, but on other Wikisources it is going to be pretty useful as well. Is there anything to add there?

I'll add something about the Learning Partners Network. We do realise that this process, for people who are new to Transkribus and to working with manuscripts, can be overwhelming and includes a lot of different moving pieces. To support communities and new languages in this process, we are launching a Learning Partners Network as part of Wikisource Loves Manuscripts, and we are accepting applications through a needs assessment survey; we are asking people to fill that in if they are interested in working with manuscripts. We will be working with a smaller cohort out of the interested people and supporting them, in conversation with PPIM, whose work was talked about before. We will be having in-depth workshops with Transkribus and with PPIM, and if there are any funds that you need, we will be supporting the entire project from the beginning to the end. We are hoping that by the end of the year we will have four to five communities which have proposed their projects, and we will then be able to kick them off, apply for funding, and start them in the next year. Next slide. I don't think we have time for it. Thank you so much.

I don't think we've got time for questions, is that right? Or do we have time? We might as well. The question was: is the output going to include hOCR? I'm pretty sure not hOCR specifically, but there is a different format for mapping word coordinates, PAGE XML, and yes, that is available. We haven't done any technical work on connecting that up or on what it will mean, but it is a possibility, and I can imagine all sorts of things, like writing that back into the DjVu files. There is definitely heaps of scope for further development there.

Thank you for this talk, it's very interesting, because I've worked on OCR projects for many, many years and I've been using ABBYY FineReader, versions 11, 14 and 15. How does Transkribus's performance compare to ABBYY OCR's performance? Has anyone done any comparisons? Good question. Sara, maybe you have some background on that? I think there are some papers making comparisons; I think an archive in Norway or Sweden tried to compare different OCR and HTR tools, so if you write me an email I will send you some details about this paper. Usually handwritten text recognition is in general better than OCR, because it can be trained on the script; it takes more time to train beforehand, but then you get better output.
And does Transkribus support IPA, the International Phonetic Alphabet? For example, if you try to train it on IPA handwriting, does it work? No, I don't think that's possible, no. What if the document itself is in IPA, if someone has written something in IPA and then they try to train a model on it? Sometimes machine learning is capable of learning such things, but I don't have any experience with IPA, so we would need to do some tests before saying it's really possible. I think we're out of time now. There's a question from the chat asking how to get transcriptions of medieval French texts, if that's something that can be answered. I think it is possible; I don't know if there's an existing model, but it should be possible, I think. But yes, get in touch and we'll figure it out.