I'd like to invite Tom Derrick from the British Library, which is a member of the READ-COOP, and Nicole Merkel-Hilf from Heidelberg University Library to the stage. I hope it's okay to say that you are a private member of the READ-COOP, and we move on with, I think, a presentation about Indic scripts.

Thank you for the kind introduction, and good morning to you all. Tom and I are happy to have the opportunity to present two projects from two institutions with one common goal, and that is text recognition for non-Latin, and to be more specific, South Asian scripts. With our two projects we show where the challenges lie in text recognition for this material, and what we have achieved with Transkribus so far. Tom is digital curator for the Two Centuries of Indian Print project at the British Library and deals with 19th-century printed books in Bangla script and language, but he is also working with printed material in the Urdu language, written in the Perso-Arabic calligraphic script Nastaliq. My project, Naval Kishore Press digital at Heidelberg University Library, also deals with 19th-century printed books, but focuses on the Devanagari script, which is used for various South Asian languages.

So basically we are dealing with three different scripts, and they have the following features. Bangla and Devanagari are written from left to right and are so-called abugida scripts. That is a segmental writing system in which consonant-vowel sequences are written as units. Each unit is based on a consonant letter, and vowel notation is added. The inherent vowel, which is usually a short a, may be changed by adding a vowel mark or diacritic. Both scripts make frequent use of ligatures, which can be formed by combining two or more consonants. The total number of ligatures existing in Bangla and Devanagari amounts to several hundred, and this is definitely a challenge for text recognition. The Nastaliq script is used within South Asia primarily for the Urdu language. It is written from right to left and is one of the main calligraphic script types for Persian and Urdu. As usual with Arabic scripts, vowels are not written, except perhaps as diacritics, and the characters have three different forms depending on their position within the word: initial, middle, or final.

So after this short excursion on South Asian writing systems, I turn to the Heidelberg Naval Kishore Press project, abbreviated NKP. The Naval Kishore Press was founded in the North Indian city of Lucknow in 1858 and grew into one of India's most important publishing houses. The Press's portfolio covered books in Hindi, Urdu, Persian, Arabic, and Sanskrit, on subjects as diverse as religion, education, medicine, school books, and Sanskrit literature, as well as translations of English classics. The Library of the Centre for Asian and Transcultural Studies at Heidelberg University holds, with approximately 2,200 titles, a representative cross-section of the Press's publications. Some books in the collection show paper decay and text loss. The books were printed with lead type, but there are also numerous volumes that were printed using the lithographic method. This was a preferred printing technique in India in the late 19th century because it was inexpensive. There are many bilingual books in the collection, so for example we have classical Sanskrit texts with Hindi commentaries, or texts in Braj Bhasha, a literary language that was common before the standardization of modern Hindi.
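To make the abugida point concrete, here is a small Python sketch (an illustration added alongside the talk, not part of either project) showing how a Devanagari consonant-vowel unit and a conjunct ligature are assembled from several Unicode code points; a single printed glyph can therefore correspond to a multi-character transcription, which is part of what makes ground truth and recognition harder for these scripts.

```python
# Illustrative sketch: how Devanagari consonant-vowel units and conjuncts are
# composed from Unicode code points. The inherent vowel of a consonant is
# replaced by a dependent vowel sign, and conjunct ligatures are formed with
# the virama ("vowel killer") between consonants.

KA = "\u0915"        # क  consonant ka (inherent short 'a')
SSA = "\u0937"       # ष  consonant ssa
AA_SIGN = "\u093E"   # ा  dependent vowel sign aa
VIRAMA = "\u094D"    # ्  virama, suppresses the inherent vowel

kaa = KA + AA_SIGN          # का  "kā": one visual unit, two code points
ksa = KA + VIRAMA + SSA     # क्ष  "kṣa": a conjunct ligature, three code points

for unit in (kaa, ksa):
    print(unit, [hex(ord(c)) for c in unit], "->", len(unit), "code points")
```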
Commercial OCR software, which does exist, usually does not provide good recognition results for bilingual or multilingual texts, because it usually relies on built-in dictionaries. So when we started our project with Transkribus in 2018, the aim was to digitize selected works from the Devanagari section of the collection, provide searchable full texts, and, if possible, make ground-truth data available for reuse. Our typical workflow with Transkribus includes layout analysis, creation of ground truth, training of HTR+ models for recognition, and last but not least, the application of the models to the texts.

The Devanagari books in the NKP collection have two typical layout variants. The first is a very simple book layout consisting only of page number, header, and text body, where we either use the Transkribus standard model for the layout analysis or our P2PaLA model trained on 60 pages. The second book format, of which you see a sample page on my slide, is the so-called pothi format, which is based on Indian manuscripts. For this format we also trained a model with P2PaLA on the basis of approximately 50 pages. The model was trained to recognize all text regions, which are usually chapter numbers, page numbers, header, and text body, and it also recognizes the baselines. For the ground truth we used existing transcriptions from an earlier phase of the project, but extensive post-correction was necessary. And for our Transkribus experiments, which I will discuss later, we created the ground truth ourselves.

Our generic HTR+ model for Devanagari, which we now use for the NKP collection as well as other Devanagari texts outside the collection, was trained on the basis of 200 pages and 26,000 words of ground truth. It covers five different type fonts and achieves an impressive character error rate of only 2.3% on the validation set. Because of the good recognition results, we have made Devanagari Mixed M1A available as a public model on the Transkribus site; it has already been tested by the community and we got quite good feedback.

So currently we are experimenting with Transkribus in two directions. On the one hand, we are training HTR+ models for the recognition of texts printed using the lithographic method, and this is our first attempt with, strictly speaking, handwritten material. Our first model is based on approximately 17,000 words of ground truth and has been trained to recognize the hands of two different writers. This model already achieves a character error rate of 9.9% on the validation set, and we are now creating more ground truth to improve its performance. Since scholars in Indology often work with transliterations, our second experiment is training models for Devanagari texts based on transliterated ground truth. This means that the original text on the facsimile is written in Devanagari script, but the ground truth used for training is produced in Latin transliteration. We are lucky here, because there is an internationally accepted transliteration scheme for the Devanagari script, the so-called International Alphabet of Sanskrit Transliteration (IAST), which is widely used by scholars. So we relied on this transliteration scheme for the creation of the ground truth. Here we are collaborating with the Department for the Study of Religion at the University of Toronto, who provided us with roughly 20 pages of ground truth, on the basis of which we trained the first model.
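As a rough illustration of what the transliterated ground truth involves, the sketch below maps a handful of Devanagari signs to IAST. The mapping and helper function are a toy excerpt invented for this illustration; the project itself followed the full IAST scheme, not this fragment.

```python
# Minimal illustrative Devanagari -> IAST sketch (not the project's actual
# tooling). Only a handful of signs are covered; a real pipeline would use a
# complete transliteration scheme or an existing library.

IAST = {
    "\u0915": "ka",   # क
    "\u0924": "ta",   # त
    "\u092E": "ma",   # म
    "\u093E": "ā",    # ा  vowel sign aa (replaces the inherent 'a')
    "\u093F": "i",    # ि  vowel sign i
    "\u094D": "",     # ्  virama: drop the inherent vowel
}

def to_iast(text: str) -> str:
    out = []
    for ch in text:
        rep = IAST.get(ch, ch)
        # A vowel sign or virama modifies the preceding consonant,
        # so strip that consonant's inherent 'a' first.
        if ch in ("\u093E", "\u093F", "\u094D") and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1]
        out.append(rep)
    return "".join(out)

print(to_iast("\u0915\u093E\u092E"))   # काम -> kāma
```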
It came out with a character error rate of only 4%, which I found really amazing, because I didn't think it would work. So these two experiments have produced good initial results, and we will definitely stay tuned and try to improve the performance of our models.

So what do our users, scholars in South Asian studies, get out of it? So far we have performed text recognition for around 12,000 pages. These have been exported in ALTO XML format and ingested into Heidelberg University Library's web presentation. Scholars can search the text in Devanagari script or Latin transliteration, and a word or phrase found by searching the full text is highlighted in the facsimile and in the OCR text at line level. I tried to visualize this with a screenshot. Furthermore, users can download a high-quality OCR PDF of the facsimile, where the text is also fully searchable in Devanagari script and Latin transliteration. That much on Naval Kishore Press digital, and I now hand over to you, Tom.

So, as Nicole mentioned, I'm from the British Library, where I've been working with Transkribus to recognize text from our South Asian printed books collection. These are books in Indic scripts that have been digitized through a project called Two Centuries of Indian Print, and they cover a diverse range of topics and genres, everything from religion, history, translations of Western texts, poetry, drama, etc., and they are intended to enhance the study of book history for South Asian scholars. We have digitized a couple of thousand books, which are mostly available through our online catalogue. Most of the books are in the Bengali language and script, but we also have books in Assamese, Sylheti, and Urdu.

In terms of the features of this collection as it relates to OCR, there are around 30 different publishers behind these books. The majority of the books were published in the mid to late 19th century, and many of these publishers adopted different typographical standards, which manifests in the books as types of different sizes and fonts. But mostly they are not too difficult for OCR. The majority of the pages are, as you can see, single-column blocks of text, so the layout is not enormously difficult; however, there are title pages, often with a large title font, where it is a little more difficult for Transkribus to always draw the baseline correctly. There are some, but very few, multilingual pages with English and Bengali, but the majority of the books we have been using are written completely in Bengali script. And the print quality is very good, in fact; not much to say on that, other than that some of the books have faint print and there is some show-through on occasion, which can affect the OCR output. On the whole, not as problematic a collection as Nicole was dealing with in Heidelberg.

So, to take you through the Transkribus workflow that we adopted for our printed Bengali books: actually, before we used Transkribus we experimented with some other OCR tools. Through a couple of competitions we evaluated the OCR accuracy of Google's Cloud Vision tool and also Tesseract.
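The highlighting Nicole describes relies on the word coordinates that ALTO stores alongside the recognized text. The following sketch, with a hypothetical file name and deliberately ignoring namespace details, shows how the bounding boxes of matching words could be read from an ALTO file; the actual Heidelberg viewer code is not part of the talk.

```python
# Sketch of mapping a search hit back to the facsimile using word coordinates
# stored in ALTO. ALTO <String> elements carry the recognised CONTENT plus
# HPOS/VPOS/WIDTH/HEIGHT coordinates.

import xml.etree.ElementTree as ET

def find_word_boxes(alto_path: str, query: str):
    """Return (content, x, y, w, h) for every word containing the query."""
    hits = []
    for _, elem in ET.iterparse(alto_path):
        if elem.tag.endswith("String"):            # ignore namespace/version prefix
            content = elem.get("CONTENT", "")
            if query in content:
                hits.append((content,
                             float(elem.get("HPOS", 0)), float(elem.get("VPOS", 0)),
                             float(elem.get("WIDTH", 0)), float(elem.get("HEIGHT", 0))))
    return hits

# e.g. draw these boxes as overlays on the page image in a IIIF viewer:
# for word, x, y, w, h in find_word_boxes("page_0001.xml", "rāma"): ...
```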
It's not completely fair to compare the accuracy of those tools to what we're seeing with Transkribus, because we gave them far less training data at the time, and the results from Google were actually pretty good. But at around the same time we started using Transkribus, and we liked the fact that we could train it specifically on our own material, whereas Google's Cloud Vision was drawing on all the synthetic data it works with, which is often much more modern than the historical books we're working with. So we settled on Transkribus, as you'll be pleased to hear.

We could have created models for each of the different typographical styles we were seeing from the publishers, but we adopted a more generalist model, drawing on a sample of pages from our collection that represented the different types. Through three cycles of training, first fifty pages, then one hundred, we saw very good improvement in character accuracy, and eventually, after one hundred and fifty pages of ground-truth training, we were content with the 5.5% character error rate, so we used that model to automate the rest of our Bengali books. I should also say that we are very thankful for the support of Jadavpur University in Kolkata, who produced the ground-truth transcriptions for us. The model was trained on just over nineteen thousand tokens. We verified the accuracy manually as well and found it to be consistent with the results that Transkribus was giving us.

We also discovered that for the majority of the books the layout analysis was very good and baseline detection was excellent, but for about two hundred books it was not: it was drawing multiple baselines. We ended up training a P2PaLA model on just baselines, and that vastly improved the accuracy on the books we were originally struggling with.

From a project management point of view, very briefly: we were a team working remotely in India and in the UK. We managed this workflow using a Google Sheet tracking the status of each book through the Transkribus workflow, and this linked to a Power BI dashboard which updated in real time and gave us a very nice visual snapshot of our progress. For example, we could see how many books had needed layout correction, and we were able to forecast how much longer the work would take us. So far we have automated OCR for Bengali books on more than a hundred thousand pages, with about another fifty thousand to go. We're exporting these as PAGE XML and using a stylesheet conversion into ALTO XML version 2, which is the format required for ingest into the British Library's repository and to work with our IIIF image viewer.

So that's the Bengali project, and now we're trying to do something similar with our Urdu printed books, which are again predominantly 19th century. First of all, we didn't have the available language resource to create the ground truth, so we tried Heidelberg's open model for Urdu. We were only really able to give the results a cursory look-over, but although they looked reasonably good, we realized we would need to produce more ground truth. So we've reached out to the crowd: we posted a message on the Urdu listserv at Columbia University, and about eight or nine people responded saying they would be interested in volunteering to help create ground truth for the Urdu books. So we're using Transkribus Lite to run this exercise.
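The figures quoted throughout the talk (2.3%, 5.5%, 9.9%) are character error rates. For the kind of manual spot check Tom mentions, here is a minimal CER sketch using the standard edit-distance definition; it is not Transkribus's own evaluation code.

```python
# Minimal character error rate (CER) sketch for spot-checking model output
# against a reference transcription (standard Levenshtein definition).

def cer(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings, divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (r != h)))        # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("transkribus", "transcribus"))   # one substitution -> ~0.09
```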
We have assigned one collection per volunteer for transcription, so that each volunteer is not tempted to work on someone else's material and correct it, and potentially get into fights about that. We do the baseline detection ourselves, and then the volunteers simply transcribe and mark the pages as done when finished. Early feedback from the volunteers is that they are finding the Transkribus Lite interface extremely straightforward; there are no difficulties from a technical point of view. The questions they have are normally around the transcription conventions. The majority of the work has been done by one person, so we're trying to give them as much as possible. They've created 44 pages, so we're nearly getting to the point where we would like to train our first model, just to get a baseline understanding of our accuracy, and then we can forecast how much more ground truth we will need.

And then there are just a few other initiatives at the British Library that we are currently working on, or have been working on, that might be of relevance to this audience. One particular project we're very excited about is called Convert-a-Card, where we're working with digitised catalogue cards for our Urdu and Chinese books, and we would like to convert these digital scans into online catalogue records. My colleague Georgia trained a P2PaLA model which proved very accurate at recognising shelfmark, title, and author, and Addy, who is sat here, also trained a text model which worked extremely well. So we would like to use this to extract these aspects from the XML, the shelfmark, title, and author, and then use that to query OCLC WorldCat, retrieve the matching records, and eventually be able to populate our catalogue records with that kind of data. That's something we're exploring at the moment.

Addy has also been involved over the years with a project working with Arabic scientific manuscripts. These have been digitised through the Qatar Digital Library, and they achieved a good character accuracy based on just over a hundred pages of ground-truth training. They're currently looking at options for creating more ground truth and the possibilities around that, and there are datasets from the created ground truth available on the British Library's research repository.

We see Transkribus at the British Library as more and more an integral part of the service we're offering for HTR and OCR, and we are increasingly integrating it into our post-digitisation workflows. Transkribus can read manuscripts and non-Latin scripts as well as be trained on bespoke structural recognition. It really does plug a gap that we currently have. We use ABBYY as well for printed material, but that doesn't support Indic scripts, so Transkribus has proved invaluable in this respect. And with that in mind, we're spreading the word to our colleagues in collection areas who would be interested in using it on their own material. We've run several training events with hands-on sessions that have proved very popular with our own colleagues, and also when we've been running events in India.

And so, to briefly conclude our talk, I think we're just running over slightly, but it will just be another minute. We think, or I think, because I've been the one using it, that Transkribus Lite is a pretty good option for crowdsourcing, especially if you're working with a small group of volunteers; I can't comment on how it is with a larger group.
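A hypothetical sketch of the Convert-a-Card extraction step described above: reading structure-tagged regions out of a PAGE XML export and collecting their text. The field names (title, author, shelfmark) and the `custom` attribute convention follow common Transkribus exports, but the project's actual tagging scheme and pipeline are assumptions here.

```python
# Hypothetical sketch: pull the text of structure-tagged regions (e.g.
# shelfmark, title, author) out of a Transkribus PAGE XML export.
# Assumes line-level transcriptions and structure types recorded in the
# region's "custom" attribute, e.g. 'structure {type:title;}'.

import re
import xml.etree.ElementTree as ET

def tagged_regions(page_xml: str) -> dict:
    """Map structure type (e.g. 'title') -> concatenated region text."""
    fields = {}
    for region in ET.parse(page_xml).iter():
        if not region.tag.endswith("TextRegion"):
            continue
        match = re.search(r"structure\s*\{type:([^;}]+)", region.get("custom", ""))
        if not match:
            continue
        text = []
        for line in region:                       # direct children: TextLine etc.
            if line.tag.endswith("TextLine"):
                for equiv in line:
                    if equiv.tag.endswith("TextEquiv"):
                        for uni in equiv:
                            if uni.tag.endswith("Unicode") and uni.text:
                                text.append(uni.text)
        fields[match.group(1).strip()] = " ".join(text).strip()
    return fields

# fields = tagged_regions("card_0001.xml")
# fields.get("title"), fields.get("author"), fields.get("shelfmark") could then
# be used to build a query against OCLC WorldCat.
```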
P2PaLA has been kind of hit and miss for me, extremely effective in some cases and not in others, but it has proved very useful so far for digitised catalogue cards. Oh, and Nicole wanted me to mention as well our appreciation that Transkribus can be trained on our own material, without built-in dictionaries. There are unique challenges to non-Latin scripts, as our talk has hopefully shown today, with the form of the characters, the scripts, diacritics, ligatures, etc., and the fact that other OCR tools often do not really support them, or are not very good at reading Indic scripts. Material challenges, the variety of printed fonts, the physical deterioration that we deal with in archival documents, all play a part. But we hope that our projects are proving that these challenges can be overcome, and often without too much training: a surprisingly small amount of ground-truth training has yielded very good results. So we would like to conclude by saying that we think Transkribus has been, and will be, a great solution for printed texts in South Asian languages. That concludes our talk, so thank you.

Well, thank you very much for these insights. Nicole is joining us on stage again, so we're ready for your questions. Here we go. Microphone is coming.

Thanks for the presentation from both of you. This is more a question for Tom, so I apologise for that. But just off the back of the workshop yesterday about data sharing and acknowledging contributions from the crowd and members of the public, I just wondered whether the British Library has any thoughts about how you're crediting people who responded to the Columbia University call, especially if it's this one person who's transcribed 44 pages. Especially given the kind of British context of libraries outsourcing work in the past, using quite dubious means, which the National Library of Scotland was also culpable of.

Yeah, particularly as the books that we're dealing with found their way to the British Library through a form of effectively colonial legal deposit during the British occupation of India. So this is a question I quite often face, and I think it's extremely important to recognise the efforts of everybody who is contributing to this ground truth, and we will certainly do that. There are practical ways to do that, obviously through events and blog posts, and also, where we share our datasets through our research repository, there is an option to credit the people who have contributed. I think that's actually a very useful way of crediting them for research, because that can prove to be used in citation and so on. We also had to be quite clear upfront that this was expected to be a volunteer-led exercise. We've undertaken other initiatives with Wikisource in India as well, where again we had volunteers correcting OCR text. There we were able to offer prizes in that particular scenario, but there was always some kind of restriction, so the prizes could not be offered outside of India, and therefore the people taking part had to be based in India for that competition. With the Urdu material we're still thinking through other ways in which we can fairly acknowledge the efforts of the people who are involved. It's definitely important to do something.

In the meantime we have another question online; it goes in the same direction and is also for Tom: whether Urdu and Arabic text could be translated within the tool, so that text analysis could be done too. So textual, sorry, textual analysis.
Well, based on the Naskh script, I believe it could be. It could be; I don't have more context. I think with the Urdu text, the digitized Urdu text is often in the Arabic Naskh script, so that's all I can really say on that subject. Maybe we're going to follow up, but thank you. Okay, do you have some more questions?

I do have a question that's actually for both of the speakers. I wanted to ask Nicole if you could speak just a little bit more about how you took the ALTO XML and then made it available via the site you mentioned at Heidelberg, the FID4SA repository I think it was. And the connection to the other project is that I'm wondering what your thoughts are on when you create Urdu text. I know you're not there yet, but once you have that text created, what will happen as you begin to export it into XML and then try to put that into readers, because text directionality seems to make a difference.

I can say something quite important I forgot to mention during my talk, which is that our books, although they are online, are not yet searchable, because we have produced the OCR for 100,000 pages in ALTO but we are not ready as an institution to ingest that into our new repository; I'm not sure I mentioned that. So we haven't done anything with our ALTO other than create it, and it's sitting there in networked folders waiting to be ingested, but it's obviously an important format in terms of recognizing the positioning of the words on the page for full-text search eventually. We did try a P2PaLA model on our title pages to see whether we could tag the publisher, title, date, etc., which would then have come through into the ALTO, but that didn't work as a P2PaLA model. We are required to use specifically version 2 of ALTO with our systems currently; we can't use the latest versions at the moment. I'm not sure if that answers your question. Is that fine?

I cannot add much from my side, because I export the ALTO XML files from Transkribus and then hand them over to our IT department, and they do the rest, but they really do; I cannot explain it, I'm sorry.

Yeah, I would just add as well that we will make all of our ALTO available as raw datasets on our repository. So it's quite important for us, which is why I mentioned the baseline P2PaLA model that we created, obviously to try and preserve the reading order. Eventually we anticipate some researchers wanting to undertake natural language processing tasks and looking for contextual analysis in the text. I suppose regardless of the format we are producing, that would be useful, but we will probably export from Transkribus not only as ALTO but in other formats as well, and we've been looking at other libraries, like the National Library of Scotland, etc., at the types of formats they make available through their datasets; that's useful to try to make sure that we're being consistent, I suppose, across the industry.

We have one more question here at the front. Is that still... Yeah.

Yes, I had a similar question about ALTO to what you just said, but maybe someone else knows the answer. I heard it yesterday as well that people export in PAGE and then convert it to ALTO, and I was wondering, is it always needed, and what are the benefits?

I had to do it because Transkribus only exports ALTO version 4 and we needed version 2, so I had to convert either from ALTO to ALTO or from PAGE to ALTO, so I just found a stylesheet conversion to do that. But it does add an extra step into the process once you've exported.
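The extra conversion step just described can be done with a single XSLT transform over each exported file. Below is a minimal sketch using lxml; the stylesheet filename is a placeholder, and the stylesheet actually used at the British Library is not named in the talk.

```python
# Minimal sketch: apply an XSLT stylesheet to turn a Transkribus export
# (PAGE, or a newer ALTO version) into the ALTO v2 required for ingest.
# "page2alto_v2.xsl" is a placeholder name for whatever stylesheet is used.

from pathlib import Path
from lxml import etree

transform = etree.XSLT(etree.parse("page2alto_v2.xsl"))   # placeholder stylesheet

for source in Path("export").glob("*.xml"):
    result = transform(etree.parse(str(source)))
    out = Path("alto_v2") / source.name
    out.parent.mkdir(exist_ok=True)
    out.write_bytes(etree.tostring(result, xml_declaration=True, encoding="UTF-8"))
```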
I think I did ask the Transkribus team whether it would be possible to export earlier versions of ALTO, but I don't think it is. We would have to ask the developers, as always with these things. Yeah, it wasn't possible at the time.

Right, one more question. Yeah, thank you for your presentation. I was wondering, regarding crowdsourcing transcriptions, do you offer the transcribers strict directions on how to transcribe, or do they do it freely and then you curate the result? Thank you.

We offered some advice, some standard advice around the transcription conventions, and I also found the guidance on the Transkribus website quite useful, which provides that kind of advice, so that's what I was sending them. A typical question that comes through is where there may be a huge gap within a sentence on the printed page, and how to treat that: do they put several spaces in order to reproduce the gap? As we're doing this we're building up these questions, so that if we do it again we have almost a cheat sheet to give to people.

Okay, then, if it's fine for everybody, I would end this presentation here, as we're out of time, and of course you also get the cups. Sorry for holding you up. Here you go.