This is the Transcription Practice Session. My name is Tobias Fodl, and I have the pleasure of moderating this session. We have three talks, limited to only ten minutes each, but I think we will still have time for some questions right after the talks. First up is Anton Tratter with the Ambraser Heldenbuch, a document that has been preserved here, right here, for several hundred years.

I will tell you today about the project "Ambraser Heldenbuch: Transcription and Scientific Data Set". The project was funded by the Austrian Academy of Sciences within its go!digital programme. It was hosted at the University of Innsbruck, started in January 2017, and ended officially in December 2019. First I will tell you a bit about the Ambraser Heldenbuch, the codex we transcribed. It is a late medieval codex, commissioned by Emperor Maximilian I in 1504, and it contains 25 literary texts. These are medieval German texts, such as the Nibelungenlied. Fifteen of these texts survive uniquely in the Ambraser Heldenbuch, which makes this codex one of the most important sources for medieval German literature. Among these fifteen texts are works like Erec and Kudrun, which rank among the most important ones. The Ambraser Heldenbuch is special in another respect as well: these texts originally date from the 12th and 13th centuries, making them Middle High German texts, but as the Ambraser Heldenbuch was written down up to 300 years later, they survive only in an Early New High German version. That means that for these fifteen texts the original version no longer exists; we only have these later adaptations. The Ambraser Heldenbuch comprises about 243 parchment folios; as you can see, the pages are quite large, and there is a lot of text, around 500,000 words.
Another important aspect that makes the Ambraser Heldenbuch so special is that a single scribe, Hans Ried, wrote down all these texts, and it took him twelve years. He was a toll clerk for Emperor Maximilian, and besides his day job as a toll clerk he also wrote the Ambraser Heldenbuch. All these aspects make the Ambraser Heldenbuch a very important codex for German philology, and yet there has never been a complete transcription. There have been transcriptions of single texts, as well as editions, but most of these editions are reconstructions back into the original Middle High German; since no original versions survive, we want to focus on the only text that actually exists, not on a model of how the text might have looked in the past. This is the beginning of the Nibelungenlied as it appears in the Ambraser Heldenbuch, and this is what our transcription looks like. Because only one scribe wrote the codex, and his hand is very clear and consistent across all the pages, it was quite easy to identify the individual characters he used, as they do not vary much. We identified the special characters and the different letter variants he used; you can see those letters marked in red, letters which do not exist in the standard alphabet but which we wanted to differentiate from the standard ones. In line one, in the first and second words, there are two different variants of the letter s; in total we identified four different variants of the letter s. At the end of line one there is a different variant of the letter n that he uses at the ends of words, and likewise we encoded the four superscripts Hans Ried used in the Ambraser Heldenbuch. So it was quite easy to capture these letters, and we call this an allographic transcription, because we differentiate between different forms of the same letter.

While transcribing, we simultaneously annotated the text, focusing on structural elements rather than on the content of the Ambraser Heldenbuch, because we wanted to develop a data set that can be used either for linguistic research or as the basis for a critical edition of a text from the Ambraser Heldenbuch. So we focused on tagging verses and stanzas, and, as you can see at the beginning of the text, the incipit; we also annotated the incipits, and the initials, like here at the beginning. These elements could be identified quite clearly, so there is not much room for interpretation; if we had had to annotate content or anything like that, it would have been too much and would not have been feasible. Next I want to show you what we can do once the text is transcribed and annotated. This is a prototype of a web interface; thanks to the linking in Transkribus, it was straightforward to link the lines of text with the images. Here you see the beginning of the Nibelungenlied again, with the transcription on the left-hand side and the image on the right-hand side. The image and the text are bidirectionally linked: if you click on a line right here on the image, the corresponding text will be highlighted on the left-hand side as well, so you can click through the whole text and navigate comfortably. The elements mentioned before, like initials and incipits, are also highlighted in the transcription on the left-hand side, to imitate the layout of the manuscript a bit.
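The line-by-line text–image linking described here rests on the coordinates that Transkribus stores for every text line in its PAGE-XML export. A minimal sketch of extracting that data, assuming a PAGE-style export; the fragment, coordinates, and function name below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A trimmed PAGE-XML fragment of the kind Transkribus exports
# (element names follow the 2013-07-15 PAGE schema; the coordinates
# and filename here are invented for illustration).
PAGE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="folio_095r.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>Uns ist in alten maeren</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def lines_with_coords(xml_text):
    """Map each line id to its transcription and image polygon --
    exactly the data a viewer needs for two-way highlighting."""
    root = ET.fromstring(xml_text)
    result = {}
    for line in root.iterfind(".//p:TextLine", NS):
        text = line.findtext("p:TextEquiv/p:Unicode", default="", namespaces=NS)
        points = line.find("p:Coords", NS).get("points")
        polygon = [tuple(map(int, pt.split(","))) for pt in points.split()]
        result[line.get("id")] = {"text": text, "polygon": polygon}
    return result
```

With a mapping like this, clicking a polygon in the image viewer or a line in the transcription pane resolves to the same line id, which is what makes the bidirectional navigation possible.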
Another aspect mentioned before is that verses and stanzas are annotated, which makes it easy to restructure the text so that it no longer follows the line breaks of the Ambraser Heldenbuch but instead shows the verses and stanzas. In the Ambraser Heldenbuch the text is not structured that way: a manuscript line does not stand for a verse; instead, the verses are separated by dots and strokes, like here, and you have to get familiar with that. But for someone who is simply interested in the text, wants to compare it with an edition, and wants to see which stanza a passage belongs to, we group the stanzas and number them, so you can work through the texts this way as well. It is for someone who is interested in the text, wants some information about it, and wants to see how this version of the text is written down in the Ambraser Heldenbuch. This is just a prototype, and there will be more features, such as searching through the texts if you are interested in particular words, but we are still in development. This is a first example of how the transcription made with Transkribus can be used. That is all I wanted to present about our project; I don't know if we have time for questions.

Thank you very much for a nice presentation. My basic question is: is this not checked by anybody, by any person? Is what you showed an automatic result from Transkribus? Could you clarify how you used Transkribus to generate this text? Have you checked or corrected the text we are seeing now? What is the error rate, and so on?

No, we did not use the HTR; we transcribed everything by hand, because it should be a database usable for critical editions, so it has to be 100% correct. And because of our process, where we transcribe and annotate simultaneously, it is easier and faster to transcribe all texts by hand.
So all texts were first transcribed and then corrected by a second person, and the annotation as well, so we have had two rounds; now we are starting a third round, in which the text will be corrected a second time, so the accuracy should be very close to 100%. Thanks.

One more question? That's not the case. So let's move on to our second speaker. We have David Brown from Trinity College. He will talk about the Beyond 2022 project, which is fascinating, since it shows us that cultural heritage preservation through digitisation can also help by recreating what has been lost.

In the Public Record Office in the Four Courts, seven centuries of Ireland's history are carefully preserved for future generations. Every day, clerks and record keepers, bathed in sunlight from the Record Treasury's glass ceiling and iconic windows, painstakingly care for this priceless archive. But on June 30th, 1922, an explosion sparks a massive fire. As smoke billows across the city, the collective memory of a society dating back 700 years is erased, seemingly forever. Now, as the centenary of the Four Courts blaze approaches, we have a unique opportunity to reconnect with this lost national archive. Beyond 2022 is a groundbreaking international collaboration to bring the Public Record Office of Ireland back to life. Working from the architect's original plans, the project is building a Virtual Record Treasury, a digital reconstruction of the spectacular six-storey Victorian archive. Rare photographs and documentary sources from before the tragic fire show us how the building was once rendered, faced in cut granite with ten arched windows running along each side. Inside, ironwork galleries, recreated here from pen-and-ink drawings, rose up through five levels. In the central atrium, a staircase led from the stone-paved floor to the ornamental roof, with its huge skylight running the length of the building.
The galleries beneath contained millions of historical records, filling 100,000 square feet of shelving accumulated over seven centuries. These records touched on almost every aspect of life in Ireland: birth, marriage and death, land, taxes, crime and bankruptcy. By recreating and refilling these empty shelves, Beyond 2022 and its archival partners in Dublin, London and Belfast are working to create a public resource with global reach and impact. When complete, you will enter the beautiful Victorian reading room, order a historical document, and cross into the Virtual Record Treasury to access the very shelf where it was once stored. From there, you will be brought to surviving records or substitute copies identified in archives and libraries around the world. Ambitious and far-reaching, Ireland's Virtual Record Treasury restores a vital link in Irish history, returning lost records to the world and enabling a new understanding of Ireland's past.

So that gives you an idea of what we're trying to do. As for the destruction itself, nobody really knows who did it. We had a civil war, and there were two sides. One side barricaded themselves into the record office and mined it with explosives, and the other side shelled them with artillery borrowed from the British Navy, which was nearby in Dublin Bay. So nobody was held responsible, although everybody participated in the destruction. We then developed a quite unique political system whereby both sides in the civil war basically became the dominant parties in our parliament, and they still are. So there has never really been a great incentive to engage with this. We're probably the only country that blew up its own archive; there are lots of destroyed archives around the place. We've also funded a seminar series where we bring people in to discuss the circumstances in which they lost their archives. We had people from Timbuktu over about six months ago, and somebody from Baghdad last month.
One of our outputs is a toolkit describing how we've gone about finding surrogate and substitute items, and how we can help other people put their lost archives back together. I think, Gunther, you mentioned that part of the Austrian National Archive was lost in the 20th century as well, so there's probably a little bit of crossover there. So how do you commemorate a civil war? That's one of the challenges we're facing in Ireland this year. We've already done the commemorations for our independence and the various nationalist commemorations leading up to it. But the civil war itself is still quite alive in Ireland, and with partition it remains alive. We have a committee for a decade of centenaries, and the task they faced was how to do something that is positive and neutral at the same time. And that's when we hit on cultural heritage. This is from the committee itself that funded us: the project seeks to recreate through virtual reality the archive that was lost in 1922. We recreate the archive using copies and surrogates of items made prior to 1922, including transcriptions of original documents, calendars, printed copies, indexes and scholars' notes. While incomparable with the loss of human life, this cultural loss was one of the great tragedies of 1922, and an all-Ireland and international project of this stature and visibility would be a lasting and meaningful legacy, democratising access to seven centuries of Irish history. So we are under no pressure. The project is led by us in Trinity College, and our main archival partners are the National Archives in the UK, the Public Record Office of Northern Ireland, and the National Archives of Ireland. And, surprisingly, this is the first time those three bodies have cooperated on a project in a formal way. They all know each other perfectly well and go to the same conferences, but this is the first time there has actually been a formal cooperation between them.
So that was basically a big win for us from the very beginning. We now have about 40 additional archival partners, and their collections range from maybe four or five items of interest to us, perhaps just some replacement deeds, up to one organisation with an entire warehouse containing 30,000 archival boxes of supposedly lost material. One of the surprising things we found is how little of this material has been accessed in the hundred years since the archive was destroyed. It has become quite a trope in Irish historiography to say: well, there are no archives, so I'm not going to do any archival research; this is my opinion; there you go; you can't prove me wrong one way or the other. So we're hoping that as a result of this project we will have a new era in Irish history, where people can actually look at evidence, which they currently don't, and produce rather more robust historical research than we've been getting for the past 20 or 30 years. The National Archives of Ireland had the original plans that we used to reconstruct the building, and we also have quite a lot of photographic evidence of what it looked like, including news footage from Pathé. This seems to be one of the first instances of a military action being timed for the newsreel cameras, because the destruction of the Four Courts was filmed in its entirety. It wasn't easy to arrange for this destruction to take place with cameras ready to go in 1922, so some kind of PR was going on at the time. We've constructed a wireframe model of the building, and that has been rendered using the photographic evidence. It's fully three-dimensional, so using an Oculus headset, or on your own desktop, you can walk around the building, simply browse the shelves, and pick things off the shelves. The historians and educators love this, and the archivists absolutely hate it.
The idea of being able to walk around an archive, picking things up and throwing them over your shoulder: they do not like this at all. But this is the modern technological world; if we can do it, we do it. Throughout the 19th century the Keeper produced an annual report, including a diagram showing how the shelves were being filled up between 1864 and 1922, so we know where in the building everything was. We've also very recently located the original accession books from the Public Record Office: when something was received, it was marked in, a description was given, and its bay, its shelf, and its item number on the shelf were recorded. So at item level we now know exactly what was in the building, and this has two uses for us. First, we can now search much more efficiently: we know the titles of the items, so we can simply search, say, the catalogue of the National Archives of the UK for an item with the same or a similar name and the same date range; we know that's probably it, a visual inspection confirms it, and we're good to go. Second, we now know what not to include, because people are coming forward with their collections saying, I have this, I have that, do you think that's interesting? We say, of course it's interesting, but it wasn't in the building, and once we start spreading ourselves out, we will quickly become overwhelmed with things we simply can't process. Most of the rediscovered records will be represented by metadata, and we'll get most of that from the archival partners we're working with. Some of them had very large and dispersed collections, including collections of newspapers and pamphlets in addition to the manuscript material. We'll probably end up scanning around 500,000 pages, and we'll get other digital assets from other resources.
One of the things that has come up since we started the project is that it isn't just physical archives that are at risk, but digital archives too. There are lots of dead or half-dead digital humanities projects knocking around with a lot of archival material scanned and embedded in them, projects that are either no longer in use or very close to falling into disuse. We calculated that the average digital humanities project has a life of about five to ten years unless resources are put into redeveloping it. If you have a digital humanities project written for Windows 7, it's not going to work on a 5G phone; it has to be redeveloped. So we're now taking in quite a lot of data from place-name projects, personal-name projects, old digital humanities efforts, and major projects in the UK to scan parliamentary papers. We're ingesting all of this in a somewhat scrappy way, but we hope to have a robust metadata model that will enable us to search across all of these old projects, which we are effectively trying to save in the same way as we're trying to reconstruct the old physical archive. At every stage, when we meet a new partner, we demonstrate Transkribus with a live demo. Usually we get a page from them in advance and make sure we have a model that works, and then we process the page in front of them. It's all very well for us: a lot of us are used to seeing this being done, and you go, 93%, I suppose that's okay. But when you do this in front of somebody for the very first time, and you take their page and hand them the text afterwards, they're stunned. And whatever the character error rate is in what we're processing, we've found that the error rate for getting funding for these additional projects is pretty much zero.
Lots and lots of projects are coming through because we're demonstrating Transkribus on other people's material in front of their funders, saying: well, scanning on its own may be a bit of a dead end, but here's what you can do afterwards. It's the first time a lot of these people have actually seen this being done, so it is a game changer for the funding environment, never mind what we do ourselves. We convert all of our Transkribus output into Mirador, the IIIF viewer, and the manuscript document is made searchable by reusing the Transkribus XML file with a Solr-based search that we've developed. We've populated this with Ireland-specific personal-name and place-name ontologies, including historical names and places. Again, we got these from old projects, so it's another example of reusing existing digital assets and giving them a new purpose. With IIIF we can seamlessly cross-search across a range of sources: the manuscripts, the text from the manuscripts, print, or the metadata we're getting from catalogue descriptions. The multiplying effect of the cross-search is what's actually amazing, because we can surface things that aren't even in our IIIF manifests. We can take a document that's in the catalogue of the British Library, and if it mentions the same person as somebody mentioned in the text of one of our documents, the two will come up side by side and you can compare the two pieces of information. This image indicates the intensity of what we've found so far: the darker shaded bays are where we've found quite a lot of replacement items, and the white ones are where we haven't found anything yet.
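The cross-search idea described here amounts to joining records from different sources on a shared name authority, so that a catalogue-only entry and a manuscript transcription mentioning the same person surface together. A minimal sketch of that join, with invented data, identifiers, and function names, not the project's actual Solr setup:

```python
from collections import defaultdict

# Toy records from two different sources; "persons" holds names already
# normalised against a shared personal-name authority (all invented).
RECORDS = [
    {"source": "manuscript", "id": "MS-042", "persons": ["James Butler"],
     "text": "...grant by James Butler, Earl of Ormond..."},
    {"source": "catalogue", "id": "BL-C-117", "persons": ["James Butler"],
     "text": "Letter book of James Butler (catalogue description only)"},
    {"source": "catalogue", "id": "NAI-9", "persons": ["Luke Gardiner"],
     "text": "Rental of the Gardiner estate"},
]

def build_person_index(records):
    """Invert the record list into person -> list of matching records."""
    index = defaultdict(list)
    for rec in records:
        for person in rec["persons"]:
            index[person].append(rec)
    return index

def cross_search(index, person):
    """Return every record, from any source, that mentions the person."""
    return index.get(person, [])

index = build_person_index(RECORDS)
hits = cross_search(index, "James Butler")
# hits pairs the manuscript with the catalogue-only description
```

In the real system the join field would be a Solr facet backed by the name ontologies, but the principle, one shared entity key across heterogeneous sources, is the same.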
But we still have until June 2022 to find it all, and we're also using the name Beyond 2022 to signal another project goal: getting funding beyond 2022 to digitise the rest of the material we won't get around to this time. Our basic workflow is to trawl for replacement records, reassemble them into the original series, for which we use ISAD(G), make the series searchable, and make the search operable over the entire virtual archive. Our key deliverable is a knowledge base for Irish history, so we're taking a semantic web approach. We have ontologies of personal and place names, we have geospatial data, and we have very strong expertise in natural language processing, so we're able to include sentiment and various other structures that we pull out of the language within the ontology. This, in addition to the actual shelf records, the images, and their text, is our key deliverable. We want the user to be able to ask natural-language questions of the archive, to simulate as much as possible what a person would do if they walked into an archive, went to the counter, and said: I am looking for, or, I am interested in. The archivist would have an intelligent answer for that question, and that's what we're trying to do here: build intelligence into the search, probably using a chatbot we've been playing with, that mimics as closely as possible the whole experience of walking into the archive, although obviously we will then let you walk around the shelves as well. The semantic knowledge base is the heart of what we want to build: being able to search across time and to search the history of a place. The semantic knowledge base will, we hope, open up a new era for historical research, enabling work to be done much more quickly and with a much greater level of thoroughness.
And, obviously, it will draw on a much greater quantity of primary sources than has been available up to now, or certainly for the past 100 years. It will eventually also include contemporary collections of documents that were not in the building; this goes beyond the core Beyond 2022 remit but is complementary to it, including records both in Ireland and records from the Irish diaspora around the world, which are even greater in quantity. And that, basically, is Beyond 2022, so I'm happy to take questions.

Thanks a lot. We have time for one or two questions. What kind of ontology are you using for your keywords? Do you know that?

Do I know that? No, the ontology professor knows that. The way we work is that we're half historians and half computer scientists: we have a core team of four computer scientists. One specialises in ontological research, one in the semantic web, one in the virtual reality experience, and then we have one unfortunate person who basically does whatever stupid thing we ask them to do. That's all run out of what's called the ADAPT Centre, and they also have a pool of around 70 or 80 other developers with individual specialisms; that's how we operate. Really, all we do is say what we want; they ask us what we're giving them and tell us the format they want it in, and that's how we split up the work. So I would have to refer you to our colleagues in computer science for that, the same way as they would have to refer you to me if a computer scientist had a question about 17th-century historical documents. So sorry I can't answer it. Thank you.

Thanks a lot for the talk. The final presentation of this session will be given by Henna Lloyd. She's from the University of Toronto, and she's going to talk about the DEEDS project, so we're moving back to the Middle Ages.

So today I'm going to be speaking about our current work with the DEEDS project.
What we're hoping to do in the future is to build from an existing corpus to train Transkribus better for abbreviated Latin, which is our main focus, as well as to deal with hyphenation problems. But our main focus is the abbreviated Latin, because abbreviations can make up as much as 50% of an individual charter. Currently we're working with an expanded model which has a less than ideal CER, so we're hoping that changing our process will improve our results with the HTR. We're working with English Latin charters dating approximately from the Conquest to the 15th century. A resource for all of these charters is the Documents of Early England Data Set, DEEDS, an online corpus of charters and a really excellent resource. The majority of these have transcriptions attached, as well as some original images, but largely transcriptions, almost 50,000 of them. We have three models in Transkribus so far, based on almost 300 folios. The first model is based on charters from the cartulary of the Knights of St. John of Jerusalem Prima, dated 1442 to 1447; that is this script here, as you can see. The second is fairly similar but still a distinct script: the early charters of the cathedral church of St. Paul, from 1241. Our third model is based on a combination of ground truth from both of these cartularies. As you can see, our test results have been less than ideal: the CER for each individual model has been around 11 to 14%. As an experiment, we also tried testing data from each cartulary on the entirely distinct model trained for the other cartulary, which, as you can see, had terrible results; that was just a test we were running. The combined model had better results on that cross-material, averaging around 17 to 19%, although sometimes as low as 8.65%.
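The CER figures quoted here are just the character-level edit distance between the HTR output and the ground truth, divided by the length of the ground truth. A minimal sketch of that calculation (the function names and the sample abbreviation are my own illustration, not part of Transkribus):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits to turn the hypothesis into the
    reference, normalised by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

This also shows why expanded abbreviations hurt the metric: if the ground truth spells out a long expansion while the image shows only a few strokes, every missing character of the expansion counts as an error against the reference.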
So this is hopeful for the future: a combined model may actually elicit better results than the individual models do on each other's material, although there is still clearly quite a way to go. Our greatest challenge is the abbreviations. Latin charters, as many of you are aware, are heavily abbreviated, and the HTR struggles to transcribe them properly because of the dissimilarity between the actual image on the page and the often lengthy expansion. We had previously expanded all of the abbreviations, but now we're going to work on a new approach, because this just doesn't seem to be working for us. For example, as you can see, these three to five letter words are actually expanded quite substantially, and the character-based HTR often simply can't process that. Other, more minor issues we've encountered include character confusion, which I suspect almost everyone experiences: vowels are commonly confused for each other, especially e, o and a; n is often substituted for m, and t for c; and the letter v is very similar to the abbreviation for et, so that's another challenge we're experiencing, although these are relatively minor. Other character confusions matter less, such as v for w and j for i, which can increase our CER but don't actually affect the meaning, so they sometimes make the CER a hair misleading, though not substantially. There's also the challenge of line breaks: the model quite frequently struggles to recognise where a word is split over two lines. With recent updates, the model seems to have some understanding of what a line break within a word is, because it will put the little hyphen in, but it still never does it correctly, so that's a minor issue we're hoping to overcome.

Question: concerning the training set of 270 pages, as a rough estimate, how many different writers were included?
We're hoping to expand to a substantially larger number of writers, but currently these two models are probably each based on only one writer. For each cartulary, the individual charters would be collated and then all retranscribed by a single scribe, so it's a relatively small number of hands at the moment. We are hoping to expand to a much greater number, ideally enough to encompass any English Latin charters from this period, but currently it's essentially just two hands.