This is our second day of the Transkribus User Conference, and we're starting with a session called Transkribus in Practice. Our next speaker is David Brown, from Trinity College Dublin.

Good morning. We did a little project with Google about four years ago, and one of the developers told us that their ideal archive was a little black box, a big dog and a human: the archive is in the little black box, the big dog is there to make sure that nobody touches the little black box, and the human is there to feed the dog. So this is our little black box. The Public Record Office of Ireland opened in 1867, after a long campaign by historians and civil servants to have a single repository for all Irish state papers. The collection had grown over the years in a series of fairly unsatisfactory buildings scattered across the city, and it was basically inaccessible to most people, especially as few people even knew the records existed. The building that opened as part of Dublin's Four Courts complex was a state-of-the-art facility, with modern climate control and a brand-new copying machine, which in 1867 was the GPU of its day. Over the next fifty years all of Ireland's important records were brought to this building, creating one of the best archives in Europe, with records stretching back to the Middle Ages.

On the 30th of June 1922, in the middle of the Civil War, Ireland being newly independent from Britain, to celebrate our independence we completely destroyed the archive. There were only maybe a hundred pieces of paper left. Since then, the most prominent complaint of Irish historians has been the lack of evidence with which to conduct historical research. The loss of the Public Record Office was catastrophic, but during its short life a small number of the records had been copied, and before it existed they had been copied in other, smaller repositories, either by historians or by government agencies. Many records dating from long before 1867 had been very well used and existed in more than one copy. The Beyond 2022 project at Trinity aims to reconstruct as much of the lost archive as possible from copies and surrogates, and to place them in a new virtual record treasury that has been built from the original architect's plans, so what you are looking at there is a digital model of the original building. Thanks to the Deputy Keeper, who produced a report every year saying what records had been received and where they were put in the archive, we can actually pinpoint the shelf in the original building and put our virtual copies onto the correct shelf, and there is a three-dimensional avatar that will walk you around the building and take you to the record, if you have a lot of time; otherwise you can obviously just pull it up on the screen. Just before the Record Office was destroyed, a complete catalogue of its contents was published in 1919, and from this we have been able to create a database of all the lost records. These entries are linked to metadata that describes the replacement copy, its current location and a digital image where available; otherwise it points you to the repository and the reference where you can find it. And that shows how the database works: that is the copy we found in Marsh's Library, and you can see at the bottom there it says it was copied before the great fire.
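As an editorial aside for readers of this transcript: the catalogue database David describes, lost items linked to their surviving surrogate copies, holding repositories and images, could be modelled roughly as below. This is a minimal illustrative sketch in Python; every class and field name is invented here and is not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Surrogate:
    """A surviving copy of a destroyed record (illustrative only)."""
    repository: str                   # e.g. "Marsh's Library"
    reference: str                    # shelfmark or catalogue reference there
    image_url: Optional[str] = None   # digital image where available

@dataclass
class LostRecord:
    """One entry from the 1919 catalogue of the destroyed archive."""
    catalogue_entry: str
    title: str
    original_shelf: str               # shelf in the 1867 building, per the annual reports
    surrogates: List[Surrogate] = field(default_factory=list)

example = LostRecord(
    catalogue_entry="(example) entry 123",
    title="(example) Chancery roll copy",
    original_shelf="(example) Bay 4, Shelf 2",
    surrogates=[Surrogate("Marsh's Library", "(example) MS Z4")],
)
print(example.surrogates[0].repository)
```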
So the project itself marks a new coming together of all of the largest collections of Irish records, which has not happened before. Our project board includes The National Archives of the UK, the Public Record Office of Northern Ireland, our own National Archives of Ireland, and the Irish Manuscripts Commission, which has been involved in producing a lot of transcriptions since it was established in 1928. As the project has developed, we have found copies and surrogates in archives large and small. For the really old records, this particular set being the medieval chancery rolls, we have found thirteen repositories holding quite a large selection of copies. As our search goes wider and we get to more general records, this gives you an idea: we have currently found copies of Irish records in over sixty repositories, all certified as having been copied in the Public Record Office before it was destroyed, and that does not include all the ones we are pretty sure were originally Irish records.

The Huntington, which is a very relaxing place to work, is one such repository. The Huntington purchased enormous quantities of early modern manuscripts for its own collections, particularly related to British and Irish history, but they also hold a very large volume of other manuscripts from all over the world, so if you are not familiar with what they have, it is worth going to have a look, because you may be surprised. They were extremely wealthy buyers who attended the auctions in Europe and bought some of the most important things in the room; that seems to be the way they operated. Not everything is in the catalogue, though, which is one of the drawbacks, and it is actually worth a trip. One of their holdings is a collection of transcripts from a predecessor of the Public Record Office, the Bermingham Tower in Dublin Castle, produced by the Irish Record Commission, which was established in the early 19th century to transcribe and calendar as many early Irish state papers as it could find. It fell apart around 1830, with problems over money and problems with people getting paid; having worked together for about twenty years they had produced a lot, but they hated each other. So the copies, of which we reckon there were at least 80,000 pages, were dispersed. Some of them found their way back into the Irish Public Record Office, but a lot of them were sold into the collection of Sir Thomas Phillipps, the huge collector of manuscripts, and when his library was sold in the 1920s and 1930s, places like the Huntington ended up owning quite a lot of these volumes. They are all very, very similar: there was a common agency, so, as you would expect, all of the copyists were trained in the same way, and the hands are all very, very similar. What we basically have now is a meta-collection, a collection of collections, and we find very similar examples of these calendars in repositories pretty much everywhere we go. The last one there is one that was kept in the Public Record Office itself; it was in an outlying room and survived the fire. The rest all come from booksellers, auctions and various other sources.

So we have finally used Transkribus, because as you can see this material is perfect for it: if Transkribus had been developed by us, we would have designed it to do exactly this job. We get character error rates of less than 3% for all of these, so we are basically using this as a full OCR system, and we have a couple of tools to speed up the work.
The first is a tool for matching hands to models; we did a blog post about it in September and it is on GitHub, and it will search all of our available models for the best match. We know there were only about thirty clerks involved, and we have ten models already, so it is a quick job to check whether we already have the right model. If we do not, an existing model will usually still give us about 95% accuracy anyway, because the clerks were all trained to write in exactly the same way. It also helped us with a volume that was not signed: we did not want to guess, but through the Transkribus model we were able to show that it was actually written by the same person we had trained on before, and it is just another volume of that series that got broken up at auction. So we find the tools useful both for handwriting recognition and for speeding up the basic task of finding the right model to use.

The second tool we developed because of the sheer scale of the material. Across the fifty volumes so far we reckon there are at least twenty million words, possibly thirty million; we simply do not know how much stuff these people produced. We have written an export routine that puts everything into IIIF, because our partners, the three national archives, work in IIIF and wanted us to work in IIIF, so we are using a IIIF viewer. We are probably going to produce views with the text on one side and highlighting on the image, the way you were looking at yesterday, so it is basically the same job. And using IIIF means we can have different types of documents up on the screen at the same time: the one on the left is a publication that was done by the Irish Manuscripts Commission about a year ago, and it is in the same series as the manuscript next to it, part of the same Irish calendar. The content and structure of both are just calendars of the same big long series, so the structure of the models is actually the same; they are just from different parts of the country. Because we can treat them as one collection, we can search across any type of document, whether it was originally published in print or whether it comes from a manuscript.

We have been involved with entity disambiguation for about twelve years, and we use entity disambiguation and entity extraction, so we do a lot of indexing. Here you see two examples; this is Charles Coote, and if you do not know much Irish history, Charles Coote was a murderous madman who killed hundreds and hundreds of people in Connacht in the seventeenth century. The painting of him actually shows him wearing Oliver Cromwell's clothes: they put his head on after taking Cromwell's head off. But we can disambiguate names in the text against a proper index, and we can do this in quite a broad way: we have name and place ontologies that we use, and we also have matching routines that use edit-distance algorithms to pull together words where we do not really know what they are. We can do the same thing with place names, matching the old place names with modern place names, so ultimately you can go to that location on a map and have a look at it. We can also use disambiguation to explore content.
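As a concrete illustration of the edit-distance matching just described, here is a minimal Python sketch using the standard library's difflib. The gazetteer, spellings and cutoff are invented for illustration and are not Beyond 2022's actual data or code.

```python
import difflib

# Invented toy gazetteer of modern place names (illustration only).
modern_places = ["Dublin", "Drogheda", "Kilkenny", "Athlone"]

def match_place(historical_form: str, cutoff: float = 0.75):
    """Return the closest modern place name for a historical spelling,
    or None if nothing is similar enough. difflib's ratio behaves like a
    normalised edit-distance score."""
    hits = difflib.get_close_matches(historical_form, modern_places, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_place("Droghedagh"))  # close spelling: matched to "Drogheda"
print(match_place("Tredagh"))     # drifted too far: None, a variant list is still needed
```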
When we start with an entity, we can pull together all of its contexts: somewhere someone lived, somewhere they died, somewhere they got married, somewhere mentioned in a legal judgment, and various other contexts, and we can automatically produce network graphs of these associations. We find that this is going to be an excellent tool for historical research as well as for things like genealogy, and we have even found some use for it in environmental research: you can search on terms for various weather events and put together graphs which link those to historical events. For example, if there was an unusually wet year, was there a lot of unrest? We find it very useful, and it is coming along. So ultimately our virtual repository will hopefully be a replacement for the original repository, and it will become a new national collection. We do not have one reaching back further than 1922 at the moment, so this will push it back over 800 years, hopefully within the next three or four years. So it will be useful.

That sounds amazing; my background is in medieval history, so searching for lost documents is really, really familiar. Questions or comments for David?

I would like to ask: when you planned this project, which kinds of experience from other countries did you draw on to find solutions?

It actually came from an Irish project that we did in 2012 called the Down Survey. That is one of those maps the Cromwellians made when they were seizing Irish land after the conquest in the 1650s: they mapped the entire country, producing large-scale cadastral maps that basically showed every field, linked to a set of records that showed who owned every single part of the land. That was assumed to have been almost completely lost, but there are a couple of volumes in the Bibliothèque nationale that had been taken by pirates on the way back to England, and some other bits and pieces, and we got together as many of those as we could. In the course of that project we actually found more, scattered across five different repositories. So what we thought was that if we could pull together this one important set of lost archives, maybe we could put together the rest of it. That is really where the idea for the project came from, so it is not a European experience, it comes from an Irish project.

Thanks a lot for a great talk; just a brief question about the IIIF implementation you have been doing. So this is used for display: do you first do an export of the PAGE XML, or can you use the PAGE XML directly to display the transcriptions in your IIIF viewer? Because IIIF is going to be a topic for Transkribus. We go directly from Transkribus to IIIF. So you are running a IIIF server for the images and pulling them in? Okay, that is great; we need to polish that, and I am sure we will be back to you soon on this. Great, thanks a lot. Thank you, David, again.

Hello all. Yes, thank you everybody; it has been really good to listen to all the interesting presentations about the projects you are actively working on. I am going to talk about prospects and biases regarding some 16th-century Swedish material. When I talk about Swedish documents, I mean documents that are written in Swedish and that relate to the history of Finland or Sweden, as Finland was part of the Swedish realm during the 16th century. And I am trying to make a case for the time period that I am an expert in, as a 16th-century historian.
I have seen many biases that affect the choice of research topics, and I am not talking about political biases, but about the fact that the sources that are most easily available are often the sources that are studied most. Here is a picture I often use when I teach or talk about the way documents survive oblivion: first, of course, there are the historical processes and the partial documentation of historical events in the first place, but what is quite important, and what affects our historical research whether we want it or not, is also the way the selection and archiving of sources is done in organizations like archives, universities and libraries. Studying 16th-century documents, I have become very much aware of this. When it comes to the medieval period in Finland, we have wonderful medieval documents, and those have been quite well catalogued and studied. When we go to the 17th century, we have the Swedish administration, the general administration, taking form, and it is a wonderful, massive administration with lots of archives, well documented and well catalogued. When it comes to the 16th century, it is a mess: it is a rough time, many of the church organizations vanished or were in difficult circumstances, and the lay organizations were changing. The problem, then, is that the material from the 16th century is not easily available; it is scattered, it is written on very fragile papers that are in decay, and some of it is very difficult to read nowadays. So my aim has been to do something about it: to bring the poor 16th-century documents into focus, in the hope of making them more understandable and more readable, and of bringing them to the core of present-day and future research.

Here is an example of one of the collections we are starting with. It is called Acta Historica and it is in the National Archives of Finland; similar kinds of documents are also held on the Swedish side, as after the independence of Finland the documentation was in a way divided between Sweden and Finland, so one has to visit many archives to study them, unless we digitize them better. Acta Historica in the National Archives of Finland includes some 2,000 pages of various documents, mostly from the 16th century, related to administrative decisions and orders. The problem with them is that they are letters from the king, letters to the king, copies of letters, copies of complaints, copies of private letters, and they are written in many different hands, so quite a lot of work will be needed. However, we are relying on previous, very un-digital work done by my predecessors and colleagues, as there are earlier transcriptions, and we are trying to fit those existing transcriptions to the scanned documents, in the hope that we can then make the system learn to read the variety of hands found in the 16th-century documents. Even here, the best scribes are the easiest to read, and the king had the best scribes; what you see here is a letter from King John III, and I suspect more work must be done to make the poorer handwriting of local judges and officials readable as well. This is just an example and the work is very much in progress, but the plan is to build a larger 16th-century source project around this, so that we will benefit from and promote Transkribus and READ, but also otherwise build more solid, catalogued and indexed knowledge about the 16th-century Nordic past. Thank you.
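As an aside on the workflow just described, fitting existing transcriptions to scanned pages: a first rough pairing of old transcription lines with HTR output lines can be sketched in a few lines of Python with the standard library's difflib. The example strings are invented, and this is not the project's actual tooling; Transkribus also has its own text-to-image matching for this job.

```python
import difflib

# Invented example lines (illustration only).
edition_lines = [
    "Wij Johan then tridie med Gudz nade Sweriges konung",
    "gore witterligit att wij haffue vnt och effterlatit",
]
htr_lines = [
    "Wij Johan thenn tridie mz Gudz nade Sueriges konung",
    "göre witterligitt at wij haffue unt oc effterlatit",
]

def pair_lines(edition, htr, cutoff=0.6):
    """For each rough HTR line, pick the most similar line from the existing
    transcription, so the old text can be attached to the line region as
    training data for a new model."""
    pairs = []
    for h in htr:
        scored = [(difflib.SequenceMatcher(None, h, e).ratio(), e) for e in edition]
        score, best = max(scored)
        pairs.append((h, best if score >= cutoff else None))
    return pairs

for htr_line, edition_line in pair_lines(edition_lines, htr_lines):
    print(htr_line, "->", edition_line)
```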
Thank you, Anu. Comments or questions? If not, the next presentation is given by Marek Słoń, Michał Słoń and Adam Zapała. I am so sorry about the pronunciation, but I know our names.

Marek Słoń, from Poland, from Warsaw. Our task is to index tax registers of the 16th century. The project of reconstructing the settlement network and spatial structure of the state was begun already in the 19th century, about 140 years ago, and more than 60 years ago a series with precise rules of reconstruction and presentation was started; in two years we have to complete it. The main result of the work is a map of 16th-century Poland with all the settlements existing at that time, with their names of that time and with some data about each village: legal status, ownership, and also localization, which is not so evident in every case. Our main source is the tax registers from the 16th century, so one settlement is mentioned about ten times, in different years of course, and sometimes just once or twice. Generally speaking, for each settlement there should be the name of the village and some tax categories: economic objects, people, land in use, and how much was taxed. We transferred this map, printed in several volumes over the last 50 years, into a GIS database, and now we would like to provide the map with something like footnotes, because our reconstruction, made over these 60 years, is based on this source, but the user has had no access to the sources we used. So in our database we have scans of the manuscripts, tables with data from them, and the map, all connected together. In the database we have about 25,000 settlements, the main series of our source is about 70,000 pages of manuscript written by several hundred hands, and not only hands but types of writing, and we think there are about 250,000 mentions of individual settlements in this series. So you can see that it is definitely a task for Transkribus, and we decided to check how we can use this tool to help us.

We created a model based first on one style, and then we chose samples of the same style in other hands. But as you can see in these photos, our material has some features which do not fit Transkribus perfectly. First of all, we have several languages: sometimes Polish, sometimes Latin, and sometimes both Polish and Latin on one page. What is more, we have a lot of irregular abbreviations, which is why I was asking about that yesterday. But the biggest problem is that the headlines are in a different style from the rest of the text. So we made the model, but unfortunately this HTR model is not perfect: our CER is 25%. We did not make a huge model, it is only 11,000 words, so it may improve if we make it bigger, but for now it is definitely insufficient for us. Here I have an example: mostly the words which occur frequently Transkribus reads well, but the biggest problem is with these headlines, which are crucial for us because they are the names of the towns. Our problem is that these headlines are written in a totally different way from the rest of the text, and here you have an example of how Transkribus read the words in a headline. That is why we have to find some tool to improve this. The problem is that since we trained the model on our documents, it is very inaccurate for the headlines that are most interesting to us, and one solution would be to create a different model specifically for the headlines, to increase accuracy in transcribing them.
This would require us to filter our dataset so as to feed it only the headlines, and doing that by hand would be an extreme amount of work, so I am hoping to create a tool that will do this automatically. I am building a network that is supposed to classify those headlines, so that we can create a dataset to train a model only for the headlines. Maybe the biggest difference from most Transkribus users is that they work with the Transkribus client, whereas to create my classifier I need to work directly with the API. So basically my work is the AI part: I take the lines from the layout analysis that Transkribus produces, filter them, and keep only the headlines. This is how it is supposed to work: I get the separate lines, and then I need to put all the normal lines in one class, which we do not use, and all the headlines in the second class, which we use to create this new model that is supposed to be more accurate for the headlines. As for the approach, I will use a convolutional network, very similar to the one Transkribus uses for layout analysis, and I will use it for classification. I will not go into the details of convolutional networks, since that would be a very long presentation, but one of the main challenges I am facing in creating this classifier is that I have inputs of very different sizes, because as you know lines in these historical documents can be of very different lengths, and image classifiers usually expect a normalized input size. One possible solution might be a system of stacked autoencoders that generate fixed-size sparse representations to feed to my classifier. So, to conclude: in the end we would like to index these 70,000 pages and connect this index with a digital map, and for this purpose we need Transkribus, but with some additions from us. Thank you.

Thank you for your excellent presentations. Any questions or comments?

I would like to ask two questions. First, regarding the abbreviations: did you expand them in the transcriptions, and how did you do that?

The regular ones we expanded, but the irregular ones we left as they are, I mean with only the letters which are actually there. Yesterday we spoke a lot about this; there are several tools which can help us somehow, for example the tagging, but we have to test it, and we do not yet have an answer on how to do it.

And the second question: Polish orthography at that time was probably quite different from today's, and I have the feeling that Transkribus is sometimes more oriented towards a transcription where you write the letters as they used to be rather than a modern way of editing. How do you deal with that? Are you adapting your transcription to modern Polish orthography, or is it in the old Polish orthography?

The problem is that there was no stable orthography at the time, so it is impossible to use one. The second point, which matters for us, is the geographic names: we use a vocabulary, we have a list of these names with the source versions of each item, and we would like to combine Transkribus with our GIS database, so that, for example, we can use our list of names taken from the sources as a check. Thank you very much.
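Returning to the headline classifier sketched a moment ago: one way around the variable line-width problem, instead of the stacked autoencoders mentioned in the talk, is adaptive pooling, which maps any input width to a fixed-size feature map. The following PyTorch snippet is a minimal illustration under that assumption, not the presenters' implementation; layer sizes and names are invented.

```python
import torch
from torch import nn

class HeadlineClassifier(nn.Module):
    """Binary classifier for line images: headline vs. ordinary body line.
    AdaptiveAvgPool2d yields a fixed-size feature map, so crops of different
    widths can be classified without a separate normalisation step."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 16)),      # fixed 4x16 map whatever the width
        )
        self.classify = nn.Linear(32 * 4 * 16, 2)  # classes: body line, headline

    def forward(self, x):                        # x: (batch, 1, height, width)
        return self.classify(self.features(x).flatten(1))

model = HeadlineClassifier()
for width in (300, 750):                         # line crops, height normalised to 64 px
    line = torch.randn(1, 1, 64, width)
    print(model(line).shape)                     # torch.Size([1, 2]) in both cases
```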
Thank you very much. What you were just explaining, that you use a dictionary of the variants of these names in order to find them with Transkribus: the question is whether it can also be used for OCR correction, because I am doing something similar. When we find variants of place names, many times I have to decide whether something is an OCR error or not, and if it is an OCR error, correct it accordingly. And do you also use these variants to add to your database, to the dictionary?

This solution is not ready yet, so I cannot answer exactly, but I hope we can use the list of names, because we have a list of names: for each name from the 16th century we have the source versions, normalized versions and present-day versions, and we would like to use keyword spotting to find all forms in a given part of these registers. I am not sure; perhaps in the discussions over the next days we will manage to answer, because it is work in progress. Thank you, very interesting.

I want to ask: you have a list of place names; have you connected this to some kind of national ontology, like municipalities and places? Have you tried to map the place entities at some kind of historical level, do you understand what I mean?

For settlements it is quite simple, because this network was based on the autonomous villages, so there is no ontological problem with settlements, or only in two or three percent of cases. There is also an ontological project, an ontology of historical geography in Poland, but the big questions are with districts, municipalities and so on, and not for a given point in time, because what we are doing now is one point in time, the 16th century, so we have no problem with ontological questions about settlements. Thank you.

Thank you, Adam. Next is Chris from The National Archives in the UK.

I will start in the customary fashion. I am from the National Archives of the United Kingdom; we are the British government's official archive, and we have got about 200 kilometres of shelving. I am going to talk about some of the projects we have recently got involved in. The principal collection we have been using Transkribus on is a project with PROB 11, which are wills, or last testaments, from the Prerogative Court of Canterbury. It is a very large collection: the Prerogative Court of Canterbury proved the wills of people who owned property over a certain value in the counties of England and in London, and also proved the wills of people who held property in multiple dioceses, and the wills and testaments of people who owned property overseas. The collection runs from the late 14th century through to the mid 19th century; there are around 2,000 physical volumes, which are copies of the original wills, and about a million wills in all, with an estimated ten people per will, as in ten people mentioned in each person's will on average. What we thought would make them useful to work with is that they are written pretty consistently throughout their run in English, with just the very early ones in Latin, and they are also written pretty consistently in the same kind of handwriting. They are very formal in their structure; they are legal documents, so there are forms of words that have to be used consistently throughout most of the run. The other useful thing is that digital copies are available for all of them, digitised from microfilm, which has caused problems that I will come to in a minute. But what is exciting about them is that we have currently got them catalogued with a name, a date of death, the date the will itself was proved by the court, an
occupation and a place name, but by transcribing them we would be able to extract huge amounts of detail. This is a network map; we have not generated it computationally, but it just shows the number of connections in one will. I have a larger image here: this is the will of Catherine Sturdis, who died in the early 18th century, and you can see the vast array of connections, of people and place names, that comes off just one will. If we can transcribe them, they should give us a wonderful insight into people's networks over a very long period.

So we started doing some work on it, and we encountered some issues as well as successes, so I will try to give an overview of what we have done so far and what we are going to do in the future. As I have said, the images are not of the greatest quality; that is probably one of the better ones, and I think it shows the kind of material we have, with a lot of information on the page, and that creates problems when you push it through the platform. In addition there is the segmentation: these large letters at the top of the page get segmented as lines, and the recognition of them creates a lot of odd results; and these ruled lines, which are used to fill out the space so that nothing else can be written in, get segmented as lines as well, so again that causes problems, and it is a very regular feature. There have also been some issues with the characters themselves, where the program recognises one character as two separate characters, which leads to slightly odd spellings: this M and E on the far left have each been recognised as two letters, and I think there is a break between the strokes of the M, so the M becomes something else there. This one here is a pretty standard phrase in a will, "In the name of God": the M there has been split into two characters, so one is an error and the second one is an error as well, which gives us some really interesting words; again, there is a gap between the ascenders there. But, compared to some other projects where we have not done quite as much work on training models, it is an interesting result.

We used training data of 26 images, around 16,000 words with about 3,000 distinct words, and then ran it on test data of 10 images, which, by reference to the training data, is about 7,000 words with 877 distinct words, 388 of which were new words. That has been interesting: these are some of the new words that have worked quite well, some of them quite complicated, and it gives us an idea of the stuff we might find in the text as well. Some of the other words, these names, had not come up in the images we used for training but came out quite well otherwise; some of them are new, correctly read words. This slide shows some of the slightly strange words we have been getting, and you can see why: in that case the line segmentation has dripped down into the succeeding line, and in this case two lines of text have been taken together as one, so we get quite a long transcription. And this one, I do not know quite what it is, but obviously that again is a cosmetic feature: it is basically the mark at the very bottom of the page where the microfilm is dark, and we pick up text there and end up with this
sort of nonsense. So, as an example, the transcription is not going to win prizes, but it is readable in parts. What we have had success with, as others have presented, is keyword searching: this is a search for "nephew", obviously a very common thing to find in wills, family relations, and the keyword spotting has worked quite well. I have been asked by one of my colleagues, who has done a bit more work on this, to ask whether we can get the page and the line that a keyword hit refers to. This has also allowed us to do some interesting things to try and improve the transcription. We have been using Python to look at pairs of words and ask whether the second one should be different, trying to correct spelling errors by looking at common bigrams throughout the data we have used, and we have built a handful of scripts to deal with that. We were looking particularly, as you can see here, at combinations we find regularly: we have a dictionary of forenames and so on, and also other words, so for instance after "said nephew" we find in the data that the next word is a forename, and so we can start to build a model which corrects the spelling, and you can see that it comes to a solution for the issue we have had.

I am aware of the time, so I would like to go through what we are planning to do in the near future. We want to speak to people, not just historians, and have a look at what we can do with the information that the transcription of the wills will hopefully create, and how we can solve these problems, hopefully bringing a network of people together to facilitate more discussion. We also have a postdoctoral fellow who is working on mixing HTR with crowdsourcing, the idea being that we create ground truth and do QA on the HTR, but have a platform that lets the crowd help improve it, as other projects have done in years past. So with that, I would like to thank you very much.

Any questions or comments? Did you work with dictionaries? Did you create a dictionary yourselves? I think we are going to work with English dictionaries. The new version will create an internal dictionary with each training, so with the new model for HTR+ you will also get a dictionary based on the training data, and you can then use that.
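As an aside on the bigram-style correction Chris describes: a toy version of the idea, snapping the word after a kinship term to the nearest known forename, can be sketched as follows. The forename list, trigger words and threshold are invented for illustration; the National Archives' actual Python scripts will differ.

```python
import difflib

# Invented toy forename dictionary (illustration only).
forenames = ["John", "Thomas", "Katherine", "Elizabeth", "William"]

def correct_after_kinship(tokens, triggers=("nephew", "niece", "son", "daughter")):
    """If the previous word is a kinship term, the next word is very likely a
    forename, so snap a garbled HTR reading to the closest entry in the list."""
    out = list(tokens)
    for i in range(1, len(out)):
        if out[i - 1].lower() in triggers:
            hit = difflib.get_close_matches(out[i], forenames, n=1, cutoff=0.75)
            if hit:
                out[i] = hit[0]
    return out

print(correct_after_kinship(["my", "said", "nephew", "Thomns", "Smith"]))
# -> ['my', 'said', 'nephew', 'Thomas', 'Smith']
```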
Any other questions? Okay, if not, thank you, Chris. The next paper in this session is given by Olga Beloborodova and her colleagues.

Good morning everyone. My name is Olga Beloborodova, and together with my two colleagues I am here to present our upcoming Transkribus-based project called CATCH 2020. Before I move on to the project itself, a few words about our team back home. Our research group is called the Antwerp Centre for Digital Humanities and Literary Criticism, ACDC for short, and we are basically a team of scholars based at the University of Antwerp working in digital humanities, digital scholarly editing and, more specifically, the textual genesis of modernist works and how to analyze it computationally. In other words, we work with modern manuscripts, which are often very messy and extremely difficult to read, as I am sure many of you know. Our approach to the Transkribus technology is from a genetic critical point of view, for the purpose of making a genetic critical edition, and that is something we should bear in mind. Part of the ACDC research group is the Centre for Manuscript Genetics, and over the years the CMG has developed expertise in digital scholarly editing by working mainly on the manuscripts of Samuel Beckett. One of the projects it runs is the Beckett Digital Manuscript Project, which is a digital genetic edition of Samuel Beckett's works, and it is precisely the experience we have gathered while making that edition that we would like to use in our cooperation with Transkribus.

So what is CATCH 2020? The acronym stands for Computer Assisted Transcription of Complex Handwriting. The project is funded by the Research Foundation Flanders and by DARIAH Belgium, an organisation that supports digital humanities initiatives in Belgium. Let us quickly discuss the acronym. I am sure you are all familiar with the concept of complex handwriting, and we will encounter a few examples in the presentation. But what do we mean by computer assisted?
What we certainly do not mean is that computer assisted means automated; I do not believe in fully automated, at least at this point in time. Why? The answer is simple: as genetic editors, it is our task to produce accurate transcriptions and to enrich them with annotations and markup. We know that the Transkribus technology currently achieves a very impressive character error rate of only 5% with good training data and a relatively neat hand, whereas we are more interested in the really messy hands of modern writers. But regardless of that, for the purposes of scholarly editing even a 95% success rate is just not good enough; even a 99.9% success rate could not possibly skip the proofreading stage, because it would still generate more than one mistake per page. Because the editor's role in an edition is so crucial, we propose to put the editor in the driving seat of the technology. In other words, instead of letting the machine do the bulk of the work and the human proofreader correct it, we want the trained human editor to be assisted by Transkribus functionalities at just the right times in her workflow. This is by no means our own idea: we met with the READ teams from Innsbruck and Valencia in Antwerp just two months ago, and in the impressive developments that Verónica Romero shared with us, they apply the same logic in their work. While they focus on computationally supporting the transcription process, we hope to support the specific workflows of digital scholarly editors, or even more specifically of genetic editors who work on complex modernist handwriting.

So the research question that underlies CATCH 2020 is the following: how can the core functions of Transkribus best be integrated into the workflows of digital scholarly editors? The core functions we are talking about are layout analysis and handling, and text recognition. But there is also another one which is not yet part of the Transkribus desktop client: the matching of facsimile and transcript, for cases in which you already have a transcript, the editor's own work; there is a tool available for this, although it is not part of the Transkribus interface. Our plan is to implement a web-based virtual research environment that supports digital scholarly editors in producing text-genetic digital editions at all stages of their workflow. In the best case this development is not a new standalone web interface, but an add-on or a module of the existing or currently developed Transkribus web interface. The target group is scholars with training in editorial theory and a particular focus on genetic criticism, but with rather little technological know-how. We want to help editors organize their facsimiles, design their annotation guidelines, transcribe their documents, annotate them with further information, map their transcripts to the facsimiles, and publish their results, all in one workflow. At specific points within this workflow the editor has the option to get computational support by calling the Transkribus REST API, where tasks are executed, results are sent back, and everything is processed seamlessly within that environment. We think that collaboration is key here: it is better to build on and merge existing developments than to start from scratch and reinvent the wheel. Tools for most of these problems have already been developed in institutions across the world, and our aim is to incorporate those solutions into our workflow and add features to them to cater to our specific needs. So what kind of needs are we talking about?
Let us quickly go over a few examples. An important issue for a digital genetic edition is to be able to link the author's revision campaigns, encoded in TEI, with the Transkribus layout analysis, so we want to incorporate TEI editing more into the Transkribus workflow. For example, since P5 version 2 the TEI allows us to capture both reading and writing order; Transkribus currently only allows us to annotate the reading order. The writing order is determined by the editor, this is a crucial part of their work, and we would like to bring manual and computational annotation into one workflow: we want to capture in which chronological order the page was filled, in the same environment in which we do automatic layout detection and HTR. The following example visualizes what we mean by the writing order; it comes from James Joyce's drafts for Ulysses, and although it might look complicated, it is actually quite simple for Joyce. So this is how the page was filled, based on the editor's decision: this is the writing order. The editor will draw zones based on that decision, so it will look something like this. That is what we envisage for the workflow; what Transkribus has is just the reading order, in a sort of sequential way.

A few more examples of needs typical for a genetic critical edition, very briefly. One of them is detecting highlighted text, whether underlined, deleted, superscripted or overwritten. Here we have an example, again from Joyce: we have underlining in blue and crossed-out text. These are important, very important events in the writing process, and they should be registered somehow in the workflow. Another problem is detecting interlinear additions; this example is from Samuel Beckett's draft for Film, note the neat handwriting. A related example is detecting substitutions, by combining interlinear additions with deletions in the line below; here we have an example from Virginia Woolf. Another issue for a genetic edition is genre-based layout detection: obviously we are not just dealing with prose, we also have drama and poetry. Here is an example from Beckett's Endgame with a typical drama layout, which is another feature we would like to incorporate. Something genetic editors deal with a lot is metamarks. Keyword spotting has been mentioned a lot and we think it is a very useful feature, but we would like to use it on metamarks; here are some examples of metamarks by James Joyce. Again, another issue is version comparison: very often you have words that are illegible in one version but legible in the version that precedes it. Here is an example from The Sorrow of Belgium by Hugo Claus. Last but not least, and probably most crucially: we would like to reach out to the community for feedback. Many of us are dealing with similar issues, so let us join forces and collaborate in order to solve them in the best possible way. Please do get in touch with us, and let us work together. Thank you for your attention.

Hi. You said the main task of the genetic editor is to transcribe the text and enrich it with markup and annotations, and that right now the first and most important task is to transcribe the text. But transcription is just one way to make the text easier to read. If we have a clearly written manuscript, it is not necessary: a user can simply read the text as written. And if we have, for example, an automatic transcription, even with some errors, that is then revised a little and given annotations and so on,
it could also be a good genetic edition. Thank you.

Well, two things here. What you said about legible texts that do not even need transcription: I perfectly agree with you; if there is no need for transcription, then we should not transcribe. What I am talking about is texts by modernist writers that really do need to be transcribed, and not just because of the handwriting. I think the main task of the editor is not just to provide an accurate edition, which is essential, but also to comment on it. So the hermeneutic aspect, the interpretative aspect: that is why we call it genetic criticism. The genetic part is doing exactly that, coming up with a perfect text, if that is at all possible, but the interpretation part is extremely important. On the accuracy: I think the purpose of a genetic edition, or any edition, is exactly that, to be absolutely accurate, because our editions are used for scholarly work. If we provide an edition that is not 100% accurate, then we basically do not deliver what we should deliver, so an edition should at least aspire to be 100% accurate. And then the question is what is more sensible: do we do HTR revised by a human editor, or, in some cases, is it just more sensible for the human editor to do the transcription work, because he or she is so familiar with the hand? Take Beckett, for instance, the writer I work on: at one point you just learn to read his handwriting, because you have done it for so many years, and you provide a more accurate transcription from the start. So HTR is a very useful tool and we are definitely going to incorporate it into our work, that is the whole idea, but sometimes you have to ask yourself which is more sensible: to use HTR and revise, or the other way around, to transcribe manually. It is very individual, and that is the call you have to make case by case. I hope I have answered your question.

You are quite right, we are all individuals, and my experience with the Beckett edition is that starting with an HTR transcript is a real benefit, and I can then correct the HTR. We have started off with rough transcripts, let us say from PhD students, and in our experience we then have to do a further process as well. So my sense is that the idea is not to say do it one way around or the other way around; the idea is to get both approaches optimized in one environment, in other words to optimize additional layers on top of existing web interface developments that support manual intervention for our specific needs. We are not a big archive that needs 99.9% accuracy over millions of pages; we have specific needs, and that is the idea. I think we also believe exactly what was just said, that having a base transcript produced by HTR is the best way to start, but embedded neatly in our workflow: basically what we want to do is to make sure that the editor can intervene much more easily, in the same interface, always.

Thank you, and thank you all for this first session.