The great thing about working in the READ project is not just meeting all these wonderful people within the project and getting to know all the technical side, but also that every time you step into a room at a conference you don't know what is expecting you, because as soon as you start talking about automatic text recognition or layout analysis, people come to you with ideas about what could be done with their material. So it's really like every day you're at the beginning of a new project, and you have no idea that it's the beginning of a new project. And that is also a bit the idea of this morning's session: that you get ideas about what people are doing with the tools of READ and with Transkribus, and start thinking about what can be done better and how scholarly researchers can proceed. So without further ado, it's my pleasure to introduce, sorry, I need my laptop with me, Deborah Cornel. She started at the University of North Texas, worked at UCLA, at ChipChop Media, which is a great page to visit, and at the Art Institute of California, and since 2015 she has been the head of Digital Services, so she is in the middle of digital services, at the William & Mary Libraries. She is going to talk about the Georgian Papers Programme, which uses Transkribus extensively. Thanks.

Thank you. Thanks for having me, good morning. The Georgian Papers Programme is a partnership between the Royal Collection Trust, which is basically the umbrella organization for the British Royal Archives, and King's College London, joined by the primary United States partners, the Omohundro Institute of Early American History and Culture and the educational institution William & Mary, both based in Virginia. The goal, or the hope, of the project is to transform, through digitization, metadata, transcription and academic engagement, the understanding of eighteenth-century North America and Georgian Britain and its monarchy at a time of profound cultural, economic and social change which created the modern nation. The partnership between the U.S. and Britain exists largely because George III was King of England during the American Revolution; we gained our independence from Britain, so there is a good area of study on both sides of the Atlantic.

By engagement I mean that it is not just a digitization, metadata or transcription project. There are people at King's College London and William & Mary trying to build course programming around it, and they are actively running fellowships where people go in and investigate the Royal Archives and then produce symposia and papers out of that. What you see here: the top link is about the Georgian Papers Programme itself; the middle link is the Royal Collection site, where they are publishing the digitized material and their finding aids, so you will find PDFs of what they have digitized already; and the bottom link is the crowdsourcing transcription site that William & Mary is building and hopes to have available by the end of 2017.

William & Mary Libraries is the project lead for the transcription portion of the project. The content of the papers is approximately 350,000 pages or images, covering letters, diaries, ledgers, account books, menus, recipes, writings and receipts. The content comes from a range of different hands, written by different individuals, and the quality of the penmanship is quite varied.
Given the amount of content we have, we, along with King's College, thought it would be a great opportunity to explore Transkribus as a way to get this transcription done, alongside crowdsourced transcription, which we are building on an Omeka plus MediaWiki site and hope to have ready by the end of 2017. So King's College and we put forward Transkribus as a tool for the programme, and we began using it a year ago.

This is an example of some of the, I would say, average documents we have. We do have a good number in secretaries' or clerks' hands, which are easily read. We also have documents that are just horrors to look at, because they are heavily marked up, crossed out and scribbled over like that. But the vast majority of what we have seen so far is this type of quality.

The Georgian Papers transcription philosophy, and what we have been using Transkribus for, is to create diplomatic transcriptions: capturing the text faithfully but not being concerned with the structure, so no heavy TEI markup. The end product is really meant as a baseline or raw transcription, much like in the Bentham project. These raw transcriptions will be made available along with the digital records and the images at some point, but they will also be used further: the transcriptions go to King's Digital Lab at King's College, which will do more metadata and data analysis to derive subject terms, name authorities, and markup of places and events, to see what else can be done with them. They will also be made available to people in the digital humanities who want to do text analysis and that type of work.

So what we have been using it for is transcription. The Royal Archives is doing the scanning and the initial metadata, and then they send us the images and the metadata; it is a phased project. The initial content we got was the George III essays, some in good hands, some in horrific hands, with lots of markup and crossing out. We got about 3,500 pages transcribed. We used student transcribers; we pay them to come in and do transcription. And we have trained a model called the George III model, which is at around a 16% character error rate, but we have also applied the Bentham-based English M1 model to the material and come out with a 5% error rate, which we find very encouraging, knowing that the content we fed in was not as high quality as it could have been; it was quite a range. Going forward, we now know to send the really poor quality content to the crowdsourcing and use the higher quality content with Transkribus.

What we are looking to do in 2018 is to begin more transcription and start testing the text-to-image matching, and hopefully keep improving the George III model while continuing to experiment with the Bentham model. I think Tobias recommended maybe combining the models to see what that can do, whether we can enrich them that way. Since they cover basically the same time period, the eighteenth and nineteenth centuries, it might become a good universal model for English hands.
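A brief aside on what an error rate like this measures: the character error rate is the edit distance (insertions, deletions and substitutions) between the recognized text and a ground-truth transcription, divided by the length of the ground truth. A minimal sketch in Python, purely illustrative and not the evaluation code used inside Transkribus:

```python
# Minimal character error rate (CER) sketch: Levenshtein distance between
# the recognized text and the ground-truth transcription, divided by the
# length of the ground truth. Illustrative only.

def edit_distance(a: str, b: str) -> int:
    # Dynamic-programming Levenshtein distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def cer(recognized: str, ground_truth: str) -> float:
    return edit_distance(recognized, ground_truth) / max(len(ground_truth), 1)

# One substituted character in a 22-character line gives a CER of about 4.5%.
print(cer("To the King of Engl4nd", "To the King of England"))
```

In those terms, a 5% character error rate means roughly one wrong, missing or extra character in every twenty.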
The one area we are keenly interested in exploring, having analyzed the content as it came out in the first year, is that quite a large portion of the Georgian papers are ledger books or tabular data: anything from household establishment lists to menus to the accounting of what the Royal Household bought, just a daily accounting of the different households, and receipts and bills. I know Transkribus is just starting to get into tabular material. We are hoping to work on this with computer science at William & Mary; we found one faculty member and a PhD student who are interested in doing machine learning and layout analysis. Literally two weeks ago they started looking at Transkribus and our content to see whether tabular layout would be a good project for them.

We have also explored the OCR capabilities of Transkribus on some previously transcribed letters that were printed, I think, in the 1920s, and it has done really well with those. So we are trying to use OCR to capture that text and then match it up to the manuscript images, to see if that can train the model better.

What we are pursuing now, besides carrying on with Transkribus, building our George III model and working with the Bentham model, is funding proposals within the U.S. to the National Endowment for the Humanities, to support Transkribus workshops (there are a couple of other institutions in the U.S. that are experimenting with it), but also to support the Georgian Papers transcription and hopefully to fund more computer science people to help us along with the humanities side. Many thanks.

That even leaves us enough time. Thank you also for crossing the ocean to be here. No, thanks for inviting me. Questions? There are always two or three minutes for questions, so you can ask what is on your mind.

How many people are actually working on the project? You have such a distributed structure. Just on the transcription, or on the entire project? The entire project. Let me see: one, two, three, four, five. On the transcription portion we currently have seven people: me as staff, two history graduate students, and I think four undergraduate transcribers, plus the computer science people if they get interested; that is just the transcription side. Then the Omohundro Institute has one lead for the project, there is one academic lead for the campus, and they have numerous fellows and graduate students. Then you have King's College, or rather King's Digital Lab, which does the metadata analysis; their entire lab is actually charged with building the ultimate platform for this, a platform that can hold the transcriptions, the images and the enriched metadata, but also give academics a portal to actually manipulate and work with the material. So you have four or five people there. Then you have the academic side of King's College, with a project lead, an academic lead, and anybody else who is interested in the project. And then you have the Royal Archives staff: a project lead, a digitization specialist hired for the project, which is basically a five-year position, and two cataloguer and metadata people they have hired, plus the normal cataloguers who are part of the Royal Archives. Part of this collection has never been described or catalogued before, so they are going back to step one to arrange it completely. Those are just the committed members; there are also fellows and other partners who change in and out. We have faculty members who are interested in the crowdsourcing and are trying to build their academic courses around doing the transcription.
We also have lots of students who are interested in going into archives and libraries, and a lot of students who are simply interested in reading historical documents, or in learning a foreign language through historical documents. So they come in and learn how to do transcription; it is a great way in for that. Thanks for the question. Anything else? Other questions? Then thanks again. No, thank you.

Next up is Karin Thun from the Georg August University in Göttingen. She is a fellow medievalist, which is always a pleasure. And while I am trying to set up your presentation, I need to say that your model for incunabula is one of my favorite ones, because it really shows, first, what we can do with old prints, and second, how to work with lots and lots of abbreviations. I hope you are going to talk about this. Thanks.

Thank you. I am currently working on researching fifteenth-century instructions for liturgical singing. I will first describe my motivation for working with Transkribus, then present the sources, the main print source and the additional manuscript sources, and then go into detail about my working process with Transkribus. When I first read about Transkribus, I was trying to look up passages concerning singing in the 1605 edition of the writings of Johannes Trithemius. To handle the more than 1,200 pages of the edition, I made use of the OCR function of a PDF viewer. The result, as you can see, was quite garbled, and what you see here is one of the better parts. But it enabled me to look up words like 'pfalmum' for 'psalmum', or 'mufica' for 'musica', or 'cantare'. In spite of the unreliable OCR result, I enjoyed having the scanned images of the original print at hand, and I wished to have more and earlier sources available in this way. Then I started preparing an edition of the Liber Ordinarius of the Bursfelde Congregation, and I wanted to make searchable PDF files of the sources available. I was looking for an easy way to connect my transcription with the scanned images, and I remembered reading that Transkribus would be able to do this. My main interest at this point was to get searchable PDF files; that my transcription would provide the HTR engine with training material was a nice side effect.

Before I describe the transcription process, let me first characterize the sources. The central source is an incunable print, the Ordinarius divinorum nigrorum sancti Benedicti de observantia Bursfeldensi, printed at Marienthal in 1474/75 and consisting of 183 pages. That is the one which has been transcribed in full and which will be presented here. Additionally, there are at least 27 manuscript sources in different types of writing; some seem to be more or less complete, some are only short excerpts, and some are not in Latin but in German. Characteristic of the incunable print is its heavily abbreviated writing; at a rough guess, about 80% of the words are abbreviated. Typical kinds of abbreviations are special characters that replace a whole word, like 'et', which you see down there, or that replace specific sequences of letters, like -us (the first word here reads 'prologus'), -rum or -rem (here you have 'divinorum', where the last sign stands for -rum), con- or cum-, pro- and pre-, or -tur (as in 'representatur', where the last sign stands for -tur). There can also be a stroke above a word indicating a missing n or m, as here in the first word, 'Bursfeldensi', or in the second one, 'tympano'. And then there are, of course, the abbreviations proper.
Here I have just selected 'divinorum', 'benedicti' and 'dominorum'. There are many more, but this is just to show what kinds of abbreviations are used. A usual OCR process would not be able to cope with this, even if it were able to read the single letters. So there was no choice but to transcribe the text manually.

To facilitate the reading of the abbreviations, an early modern print version of the text was transcribed first. This transcription was then adjusted to the earlier print version: the classical Latin of the later print was altered to the medieval Latin of the earlier print by find and replace, especially 'ae' to 'e', and in syllables like -tio- the 't' was replaced with 'c'; some other frequently occurring words were also changed with find and replace. Then I adapted the line breaks for every page, and all of this can be done in any text editor. Next, the page layout had to be marked in the graphical interface of Transkribus. At first I had to mark not only the text regions but also all the lines; after much of the work was already done, I discovered that an automatic detection of the lines had become possible. Now the adjusted text can be copied into the text editor of Transkribus. Proofreading is necessary to eliminate the differences between the earlier and the later text version as well as my own transcription errors; it was facilitated by the highlighting of the matching lines in the graphical editor and the text editor, as you see over there. After all the pages had been edited in this way, the work was exported in the different formats supported by Transkribus, among others the searchable PDF files I wanted to have.

So, my workflow in detail: (1) upload the scanned images; (2) change the classical Latin into medieval Latin; (3) prepare five to ten images, marking text regions and lines; (4) adapt the line breaks of one page; (5) copy the adjusted text into the Transkribus text editor; (6) proofread; and repeat steps three, or four, to six until more than 100 pages are transcribed. The method described here is quick and easy for sources where you already have the text, for example if you have already transcribed a similar version. As I was mainly interested in getting searchable PDF files and already had the text, it was the method of my choice.
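As a side note, the find-and-replace normalization described here could also be scripted. Below is a minimal sketch in Python; the file name and the replacement rules are only examples of the kinds of substitutions mentioned above (ae to e, t to c before i plus a vowel), not the actual list used for the Ordinarius:

```python
# Sketch of normalizing a classical-Latin transcription towards medieval
# orthography before pasting it into the Transkribus text editor.
# The file name and the replacement rules are illustrative only.
import re

FREQUENT_WORDS = {
    "caeremoniae": "cerimonie",  # hypothetical example of a frequent word
}

def normalize(line: str) -> str:
    for old, new in FREQUENT_WORDS.items():
        line = line.replace(old, new)
    # Classical 'ae' becomes medieval 'e'.
    line = line.replace("ae", "e").replace("Ae", "E")
    # 't' before 'i' plus a vowel becomes 'c', e.g. natio -> nacio.
    line = re.sub(r"t(i[aeou])", r"c\1", line)
    return line

# Apply the rules line by line so that adapted line breaks are preserved.
with open("later_print_transcription.txt", encoding="utf-8") as f:
    for line in f:
        print(normalize(line.rstrip("\n")))
```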
The method of dealing with abbreviations recommended by the Transkribus team was different: abbreviated words should be transcribed literally, possibly using special characters for the abbreviation signs, then tagging the abbreviations and giving the expansion as an attribute. The Transkribus team was so kind as to offer to redo the transcription of the incunable in this way themselves, using my already transcribed text as a model. I think the following part could be better described by Günter Mühlberger or one of the technical staff, but I will explain it as well as I understood it myself. The Transkribus team first used their own transcription method as training material for the HTR engine, but were not really satisfied with the outcome. They then used my quick-and-dirty transcription as training material and found that the engine was able to cope with the abbreviations, at least with the more frequent ones. I present to you a part of the automatic transcription of one of the pages. If you are familiar with this kind of source, you see that not only are the single letters read correctly, but also most abbreviations are expanded properly.

Errors occur with three different kinds of difficulties. First, uppercase letters: those are not so frequent in the text, so there might not have been enough training material. Additionally, the uppercase letters in this source are in most cases marked with a red stroke, which might alter their appearance enough to make recognition difficult for the HTR engine. You see here an uppercase letter falsely identified as an F. Second, some of the abbreviations are incorrectly expanded; those may be the ones that were represented less frequently in the training material, or that are ambiguous in their abbreviation signs, but I have found no fitting example on this page, which shows how good it is. Third, for some words the abbreviations consist of nothing more than one or two letters and can only be read if you know the incipit of the chant or the prayer; here you see 'Pater noster' and 'per eundem Dominum'.

Seen as a whole, the effects of working with Transkribus have been such that I am planning, and have already started, to use it for the rest of the sources, as far as they are available for scanning. I appreciate the discipline it gives to the transcription process as well as the searchable PDF files it produces, and I am looking forward to the next improvements of the HTR engine. Thank you for your attention.

Thank you very much. Questions? Have you tried building a model where you had already expanded the abbreviations into normal words, and one where you fed them in as abbreviations with what they truly mean given in an attribute, and if so, which worked better? As for the model, you have to ask someone else, because I only made the transcription and then had email contact with Günter Mühlberger, and he answered me with what I just told you. Yes, we did build a model and were surprised that it didn't bring the results we expected, because we put a lot of work into actually transcribing the special characters; there are some special characters actually included. It might also be a case of frequency, so not enough training data, because the character set is larger. On the other hand, it's funny that it learns the abbreviations so nicely. So, yeah, I don't have a proper answer; I sometimes discuss this with the people in Rostock, who surely have a proper answer. Normally the training signal of the transcription should match the image: the number of characters in the transcription should correspond to the number of characters in the image. So you have, for example, an abbreviation sign after a letter, and you transcribe the expanded form. There is some limitation in our software, so you cannot learn arbitrarily long expansions of abbreviations, but one or two characters is possible. It was a surprise for us too that it works, and I still do not like it. But I think, to your question: when you have a transcription which better fits the image, then I would expect that to work better. So please transcribe what you see. It works well for you, but it's, yeah, weird software.

Okay, other questions regarding abbreviations or incunabula? Did you say something about the amount of training data you provided, how many pages for the incunable? I contacted Günter Mühlberger when I was at over 100 pages. I don't know the exact amount, and I kept working because I just needed a working text, so I just transcribed and transcribed until I was done; I don't know at which point they started to train the model. Okay, it should be recorded with the model, so we could look up which one it was. I'm not sure if this is mainly a question for you or more a question on abbreviations generally.
But if you expand the abbreviations out and write the full text, does that in any way affect the model itself, or does the model only look at the words that are transcribed on the page when you train it? So does it make a difference if you go through and expand out all your abbreviations? Or, if all you want is a text that has the abbreviations in it, a literal transcription, will it be fine if you just transcribe exactly what is there? So you mean abbreviating by tagging, does that have any effect on the model? So far, tagging does not influence the model, because we just ignore any markup. I'm thinking about something Mühlberger said yesterday: in keyword spotting he tried searching for the word 'Innsbruck'. He wrote it with two n's and a ck, as it is spelled today, but it also found the earlier writing style with one n and without the c, because it looks alike enough. So if you have a word where an m, or just one letter, is missing, it will still look like the word with the m, if the word is long enough, or something like that. Okay. Okay, thanks a lot.

Next are Karoline Lemke and Paul Onasch. Karoline will introduce herself; she works on the Barlach letters edition. This is one of the first editions that used the HTR from Rostock, also due to the proximity in space. Thanks for being here.

Hi. Yes, Paul Onasch and I are presenting the Barlach 2020 edition. It is a project for a printed edition of the letters written by Ernst Barlach, and we are cooperating with several Barlach institutions and with CITlab from Rostock, as you have heard. Ernst Barlach is commonly considered an Expressionist. He had approximately 400 different correspondence partners, and, as Barlach states in this quote to Lucy Moeller van den Bruck, he was quite fond of writing, since he basically wrote every day. Nevertheless, he was quite fond of having a hot punch or a smoke too. These are our cooperating partners.

More interesting is who Ernst Barlach was. He was a doctor's son, born on the 2nd of January 1870 in Wedel near Hamburg. He was an acclaimed sculptor, printmaker and writer, and he died on the 24th of October 1938 in Rostock. Barlach became famous for his sculptures of so-called simple people, such as beggars, farmers or old people, as well as for his memorials against the war, for example, as you can see here, the war memorial in Magdeburg from 1928/29. With the political rise of the Nazis, his art was classified as degenerate art by the new government, and as a consequence of this label every war memorial by Barlach was dismantled by the Nazis between 1934 and 1938.

Starting our work on the transcription of Barlach's letters, we planned to generate transcripts with two independent editors, based on the first edition of Barlach's letters by Friedrich Dross, the published correspondence with Reinhard Piper from 1998, and the love letters to Marga Böhmer, a publication from 2012. Those publications cover approximately 1,700 of the now known 2,100 Barlach letters, and they basically serve as a control for us while we are generating transcripts from the original handwriting. Since the first edition from 1968/69, almost 400 new, unpublished letters have been discovered. Transkribus is now generating transcripts of those letters, which we then check in a second step.
When we apply Transkribus to generate transcripts, we use the algorithm-based handwriting recognition by CITlab, developed at the University of Rostock. CITlab draws on a dictionary especially configured for us; it is based on the two volumes of the edition of Barlach's letters by Friedrich Dross. After the automatic baseline recognition, one of our students makes the necessary corrections in case the baselines are too short or anything like that. Then she starts the recognition run and proofreads the text for any misreadings. As you can see here, for example, in the first line, 'Deinen freundlichen Brief', the program cannot read the e and puts in a c instead. Or in the sixth line, 'schmeichelhaft': Barlach actually wrote 'schmeichelhaft' without the e, and the program cannot interpret the missing letter.

We have training sets of rather small units, around 50 pages each, which is about 20 letters. In total we have already processed 300 documents from the unknown letters, and we can say the character error rate is clearly improving. As you can see here, it is decreasing, and that is a good sign for us when we have to do the comparison of the transcripts.

So while Transkribus is improving, Barlach's handwriting still poses some difficulties. As you can see here, these two letters look as if they were written by two different people, but actually both are Barlach. So it is quite fun to read if you are not familiar with the handwriting. The left picture is a young Barlach, who was quite fond of writing in very small, tight lines; the older Barlach is much easier to read, so you will recognize him immediately if you find one of his letters in any archive in Germany. It might look like two different persons, and indeed Transkribus has problems interpreting the characters of the older handwriting if it is only trained on the newer letters.

Further difficulties of Barlach's handwriting are the graphically incomplete representation of letters: he seems to write as he speaks, and therefore, for example, a lot of German e's are missing at the ends of words. Transkribus also has a tough time recognizing the line breaks when Barlach starts a word in one line and finishes it in the next one; the program frequently reads it as two separate words. Barlach is also very fond of writing in several directions, as you can see in the image on the left: turning the page in a circle, one might find oneself reading a letter upside down. This corresponds with difficulties in recognizing the text regions and baselines, and it is a problem especially for a purely automatic acquisition of larger text corpora.

The advantages are: we only have to use one editor to transcribe a letter; we can generate a large corpus of transcriptions, up to 2,100 letters; and we are able to independently define the ground truth, with which the program then trains itself. And of course we have the possibility to prepare a digital edition, which will be a likely follow-up to the Barlach 2020 project. Thank you.

Any questions? I just need to ask about one sentence. You said the engine is training itself with this information; is that correct? Yes, we can... yes, for now. Can you say it like that? I think so, yeah. Yes, the idea is that when you transcribe something, you can change its status to Ground Truth. And then you can train: you can say, train on everything which is marked as Ground Truth.
And then your transcripts are automatically enhanced, and you train again. So this is basically just-in-time retraining of the model? Yes, that's about it.

I have a question about the vocabulary. What is your experience: did you check it with a dictionary and without a dictionary? What do you gain, what do you lose by using dictionaries? What do we gain? I think we started without the dictionary and added it halfway through, but I actually don't know from the training results. Was there a big difference? I can't recall, actually. Yes, a bit. A bit. A little bit. And there is the problem of hypercorrection: the engine adds, for example, an e where it was left out by Barlach. This is one of the problems if you are dealing with early modern texts, because there is no standardized vocabulary to use. So adding a vocabulary helps in one way, because some of the words will be recognized correctly, but a lot of the words will be hypercorrected, so that spellings like the missing letters we saw before will just be normalized. So that means you would say there are some advantages in using a vocabulary with documents from the 20th century? Or we need to ask Tobias; yes, he is the expert on vocabularies. But I think we should keep the problem of vocabularies in mind, because it is not self-evident that one should use them. Okay, thanks a lot.
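To make the hypercorrection issue concrete: a decoder that simply snaps each recognized word to the closest entry of a modern word list will silently repair exactly the historical or idiosyncratic spellings an edition wants to keep. A minimal illustrative sketch in Python; the word list is invented, and this is not how the CITlab dictionary integration actually works:

```python
# Illustrative sketch of dictionary "hypercorrection": snapping each
# recognized word to the closest entry of a modern word list silently
# normalizes idiosyncratic spellings, e.g. a final letter that Barlach
# simply left out. Not the CITlab/Transkribus decoder.
from difflib import get_close_matches

MODERN_WORD_LIST = ["schmeichelhaft", "freundlichen", "Brief", "Dein"]

def snap_to_dictionary(word: str) -> str:
    match = get_close_matches(word, MODERN_WORD_LIST, n=1, cutoff=0.8)
    return match[0] if match else word

print(snap_to_dictionary("schmeichelhaf"))  # -> "schmeichelhaft" (hypercorrected)
print(snap_to_dictionary("Dein"))           # -> "Dein" (unchanged)
```

Real dictionary-assisted decoding weighs the word list against the optical evidence rather than snapping blindly, but the trade-off described above, better recognition of common words versus normalization of unusual spellings, is the same.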