Layout analysis works; we have tested it out. HTR is sufficient, but it could possibly reach even better quality. The goals are that we will make the documents searchable and probably do text extraction, so that we can provide the text to users who want to do some text mining or whatever else they want to do. HTR+, in that regard, is a game changer. The model is based on 800 pages, so quite a small model, spread over the centuries, with pages from basically every year, and it resulted in a CER of around 7%. That's with only 800 pages; I am still curious what the 150,000 pages will look like. If we apply this model, built for the minutes in Zurich, to the documents of the National Archives, we get a CER that is slightly higher, something around 9%. But that means that German Kurrent can be recognized without caring whether the same scribe has been part of the training set. So there is the possibility that we can build general models. That means with HTR+ we have the hammer that nails down more or less everything. It goes even beyond that.
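The character error rate (CER) quoted throughout is the standard Levenshtein-based measure: the edit distance between the reference transcription and the HTR output, divided by the length of the reference. A minimal sketch, assuming nothing about the evaluation tooling actually used in the project (the function names and example strings are invented):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Two substitutions over 23 reference characters: a CER of roughly 8.7%.
print(cer("zwanzig Gulden erhalten", "zwanzig Guldem erhalden"))
```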
Some days ago, I received a model from Passau that was trained by Eva on about 1,100 pages, and I applied it to our documents. The result is also already quite good. The documents in Passau, which are birth registers, as Eva will talk about in a minute, are completely different, but they are also in German Kurrent. The result is quite astounding. So we can say this problem of recognizing German Kurrent is basically solved; we are in the same ballpark as with OCR for Gothic script, for example. That means that by the end of the READ project we will be able to provide a model for German Kurrent. We are currently assembling documents in Gothic handwriting up to the 16th century, and I'm pretty optimistic that the results will be similar to the HTR model here and give us a CER of around 10% for documents that are not part of the training set. And in the long run, within the next five years or so, I'm thinking that we will be able to recognize charters of the 12th century and basically everything for which there is ground truth. All thanks to HTR+ and the colleagues who will introduce you to HTR+ this afternoon.

This was the big scale. On a small scale, we started applying layout analysis and HTR basically with the goal to do keyword spotting. Keyword spotting will be a subject from a technical and from a content perspective tomorrow. We use it for registers of charters that were produced, handwritten, in the 19th century, indexes of letters also from the 19th century, and old card indexes and finding aids, mostly also from the 19th or 18th century. Even with HTR models that are not perfectly suitable, you are able to search the material via KWS, the keyword spotting, and the results are so good that we use them internally. So if somebody approaches us from the outside and says, I'm interested in, for example, this register of charters, but I don't know what time span I'm really interested in, then with place names, for example, you can use keyword spotting to give them an idea of the documents in which they will be able to find something. Keyword spotting even works on totally problematic documents. You can see here a very old scan, probably from the 70s; it's from the state archives of Bern, an associated project I'm working with. They provided us with, I think, some thousand pages of registers of births, deaths and marriages. We don't have suitable HTR models for them at all, as you can see below; honestly, I've seen better results from Transkribus. But nonetheless it is possible, by using the keyword spotting, to find the spots. Of course, we need some more time to go through all the false positives, but we have a rate, calculated and estimated, of around 95% of all instances being found even with this bad HTR result.
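The actual keyword spotting in Transkribus works on the recognition engine's confidence data, but the basic idea, finding query terms even in noisy HTR output and accepting some false positives, can be illustrated with a much cruder sketch based on plain fuzzy string matching (all names, thresholds and sample text below are invented for illustration):

```python
import difflib

def keyword_hits(query: str, pages: dict[str, str], min_ratio: float = 0.8):
    """Return (page_id, token, similarity) for tokens that resemble the query.

    A crude stand-in for keyword spotting: even if the HTR text is noisy,
    near matches above the threshold are reported as candidate hits.
    """
    hits = []
    q = query.lower()
    for page_id, text in pages.items():
        for token in text.split():
            ratio = difflib.SequenceMatcher(None, q, token.lower()).ratio()
            if ratio >= min_ratio:
                hits.append((page_id, token, round(ratio, 2)))
    return sorted(hits, key=lambda h: h[2], reverse=True)

# Hypothetical noisy HTR output for two register pages.
pages = {
    "page_001": "gestorben an Auszehrung im Dorfe Hauzenberg",
    "page_002": "getauft zu Hauzenberq am dritten Tage",
}
print(keyword_hits("Hauzenberg", pages))
```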
Does anyone have any questions? Yeah. The T2I, you mentioned the T2I... Ah, yeah, right at the beginning. Is it your own? Sorry? Is it your own? It's from the University of Rostock, that's CITlab; it also handles abbreviations, not only in the text but on the page. So the T2I was prepared and implemented there. Okay, can you say a little bit about what the ground truth looked like? The initial ground truth has been birth documents, transcribed by students, and the only thing that we did to prepare it was to transform it to TEI XML; from there, the TXT files have been provided. So, if you would like to work with the T2I, the Text2Image tool, what we usually need is one text file per page, and then the alignment process can be started. Any other questions? Yeah. What is your approach to the abbreviations in all the documents, the medieval abbreviations? I'm one of the proponents of the idea that you should not expand abbreviations, so try to stay as close to the original as possible. Most of the Latin abbreviations you will be able to encode in Unicode, and for the German ones there are some ways around it. So you basically have to, not invent, but try the nearest possible ways to work with them. You're free to. Well, for me as a person preparing editions, it would be very uncomfortable to leave the abbreviations unexpanded. Is there any way of expanding them while transcribing? How do you approach this? Do you have a solution for making it comfortable for the editors? Yes and no. The technology of HTR tries to match the letters that are in the document one to one; basically the input and the output should correspond one to one. If we automatically have them expanded, this works to a certain degree, and possibly we have a beautiful example of that, but in the end we don't know why it works and how good the quality is going to be. So there I'd say, even as an editor who provides expanded abbreviations, you can start from the abbreviated version and then have it expanded. I'm leading a small edition project in Zurich where we are trying to do a very close transcription of what we are seeing, so no expansion, no automatic expansion. But what you can do with some small Python tools is look for signs of abbreviation and then tag them automatically. That works pretty nicely; I can show you this very quickly.
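The speaker does not show his script, but the kind of "small Python tool" described, scanning a close transcription for abbreviation signs and tagging them automatically, could be as simple as one regular-expression pass; the particular marker characters and the inline tag used here are assumptions for illustration, not the project's actual conventions:

```python
import re

# Characters that often signal an abbreviation in a close transcription,
# e.g. a combining overline or the Tironian et sign. This set is only an
# assumption for the sake of the example.
ABBREV_SIGNS = re.compile(r"\w*[\u0304\u0305\u204A]\w*")

def tag_abbreviations(line: str) -> str:
    """Wrap tokens containing an abbreviation sign in a simple inline tag."""
    return ABBREV_SIGNS.sub(lambda m: f"<abbrev>{m.group(0)}</abbrev>", line)

print(tag_abbreviations("Item dns\u0305 Johannes \u204A vxor eius"))
# -> Item <abbrev>dns̅</abbrev> Johannes <abbrev>⁊</abbrev> vxor eius
```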
Thank you. Okay, we have time for one more question. I'm an archivist from the Netherlands, also working with Transkribus. I'm wondering how you plan to present the results of the transcriptions on your website, to the public. Yes, that's the problem. For the moment it's TEI XML exports; the TEI XML export is currently being improved, and this is very useful. In the background we're also using the TXT files just to search through them, so probably it will go into indexes and such. So you're not planning on implementing keyword searching on your website, for example? Not directly. The keyword searching itself is too slow to use in real time. For example, if you're doing a search here on 80 pages, this will take you 30 seconds, and there's no way you can make people wait 30 seconds on a web page without losing them. So keyword searching needs an indexing process as a second step, which is currently being implemented in Transkribus. Then it will be possible to use the REST API and work with it directly. As of today it's not possible, but it's going to be possible in the future. OK, thank you. OK, we'll leave it there. Thanks very much, Tobias. Our next speaker is Eva. Eva also works on the READ project with us, and she's from the diocesan archives of Passau. So take it away, Eva.

Thank you. What we started off with in the project is scans of around 800,000 images of register books. That means birth, death and marriage records, which exist in a lot of archives because they hold data about people. The first run through the images is how we found out that, of course, the handwriting developed from something in the 16th and 17th century, over more tabular styles and hand-drawn tables, to printed forms in all different layouts. So the first thing we did, coming in from the archival perspective (by the way, I'm not an archivist but a computer person, so language and communication was sometimes a bit challenging, but we worked our way through), was to scrutinize the data set and come up with a subset focusing on the period between 1847 and 1878, and we found that some 26,500 images fall into that time period. This is basically the data set that we're working with. We have prepared ground truth, and we used an archival approach to that: we selected scribes and traced the writers according to the secondary literature we have. We know which priest was working in which period of time, so everything, even from a historical perspective, is sound here. The best result, the best figure, that we can get on this data set so far is a character error rate of 6.98%, and this is really astonishing. Unfortunately we have not yet tested this completely across our data set; that is still work to be done. This is, again, thanks to the Rostock group that we're getting these results now. In the same time frame, of course, and this would be the second level of testing, although we have not yet worked our way through producing ground truth here, there are the marriage records, with a smaller number of images, and the baptism records with a lot more images.

So basically there are two sorts of user requests that come to the archives. One is the big, big group of family historians who say: well, I cannot read the old handwriting yet, I want to know who my ancestors are, and, if I have the chance, I also want to learn as much as I can about them, the illnesses that they died from, where they lived, what they were engaged with, whatever is there in the record. For that, we're throwing in the technical term of information extraction, and I put up a rough overview slide here; our colleague will give you further insight this afternoon. The main principle is: from the archive's side, what you need is the images, the transcription read more or less perfectly, all the layouting plus the text recognition, what we call the processing. And what you get out in the end is basically XML records that you can work with; in our case, it would be records like these. An example here, and let me just ask: who can read the text, who can read 18th-century German handwriting? If you can, thank you. Who is familiar with illnesses of that period? Well, basically what I'm asking is: if you look at the record that we get out of there, it shows the automatic reading that we had, and this dates from around May, so it is outdated and incorrect here, but what I want to demonstrate is the output that we're working with. It should correctly read the name of a disease, something like tuberculosis, a consumption; it was not curable back then, and it appears in our records.
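As a very rough illustration of what "XML records you can work with" might look like once a register row has been recognized, here is a sketch that serializes the fields of one death-register entry; the field names and record layout are my own assumptions, not the project's actual schema:

```python
import xml.etree.ElementTree as ET

def record_to_xml(row: dict) -> str:
    """Serialize one recognized register row into a small XML record."""
    rec = ET.Element("record", type="death")
    for field in ("name", "date", "place", "cause_of_death"):
        ET.SubElement(rec, field).text = row.get(field, "")
    return ET.tostring(rec, encoding="unicode")

# Hypothetical HTR output for one row of a death register.
row = {
    "name": "Maria Huber",
    "date": "1851-03-14",
    "place": "Passau",
    "cause_of_death": "Auszehrung",  # consumption
}
print(record_to_xml(row))
```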
Having that in the output, of course, opens avenues to any sort of historian: going into the history of medicine, into the spread of diseases, into whatever combination you can find of locations and ages, so the use case is really broad. Then we have the second group of researchers, basically the scholarly researchers. Their interest is something along the lines of: I'm doing research on the development and spread of illnesses, child mortality, the migration of families, life expectancy, and I always put in the time frame that we're focusing on. So here the use case is mainly keyword spotting, and that is what Tobias showed earlier: even if the transcription is not perfect in the sense that the palaeographers would call perfect, you can still get good results. So again the setting: you have the images and the ground truth, you do the layout analysis, the processing, the HTR, and you end up with a list of hits for a given keyword. The aim here is not to get the full record but only those parts that are of interest to you.

One of the interfaces, one of the viewing and display options, and this is the basis for the presentation we already talked about, is within Transkribus, if you want to drill down from a user perspective. Sorry to show the complex Java tool here, but the beauty of it is how you get the results: the system tells you how much the hit it presents to you is worth, what the quality is. The numbers on the right-hand side, next to the highlights, give you a sense of the probability, the confidence of the hit. There are several avenues in the project; the university in Valencia did a different thing, which opens up further avenues, and they will be talking about that at the platform demonstration here. The web interface does have capabilities to bring that in and to allow you to search there as well, but it's not instant. As a historian, as a researcher, you can wait for the results: you set up your query, come back the next day, and you get the full result. If time is not an issue, it will serve you really, really well, even on large data sets. That is scheduled to be part of the web interface in testing phase 1.

I still want to show you some results and a comparison of what HTR+ can do for us, because we have done testing on both sides. The HTR is what is currently available in the system to those who have been granted access; for our case it produced around 17 to 20% character error rate, and it was trained around the May-June time frame this year. The first test results with the new HTR+ in the system are the roughly 7% that I already mentioned, though the full tests are still to be done. These statistics are based on 1,200 pages of ground truth, so not the full data set yet, not the 26,000 pages, but most of it, and the results are on scribes, on hands, that have been trained. You can start with the default parameters in the training and you get a curve like the very first one, the red curve here; if you continue training, so if you let it run again and wait for several hours, the results will improve. You might notice that there are several spikes in there. They should not be there: those are pages where parts of the transcription are missing or where some other irregularity is happening, so these are the ones flagged for investigation.
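Spotting such spikes amounts to nothing more than listing the pages whose error rate sits far above the rest of the curve. A minimal sketch, assuming per-page CER values are already available (the page ids, values and threshold are invented):

```python
import statistics

def flag_outlier_pages(cer_per_page: dict[str, float], factor: float = 2.0) -> list[str]:
    """Return page ids whose CER exceeds `factor` times the median CER."""
    median = statistics.median(cer_per_page.values())
    return [page for page, cer in cer_per_page.items() if cer > factor * median]

# Hypothetical per-page CER values (fractions, not percent).
cer_per_page = {"p01": 0.07, "p02": 0.06, "p03": 0.41, "p04": 0.08, "p05": 0.07}
print(flag_outlier_pages(cer_per_page))  # -> ['p03'], a page worth re-checking
```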
The second statistic that I can show you is another, green curve. The red curve is what we saw on the slide before; with the green one, well, the spikes are even higher. There are discussions still to be had with the HTR+ experts, and we have gone through some of them. What we do believe is that, because the system is trained differently, those are real outliers, and we need to investigate those three pages again. In general, the green line is below the red line, so a lot of improvement is already possible, and still more if you tune the parameters. Of course, the line is not below all the time, because the training process is different; it's a black-box thing from the user perspective. The feeling we get, though, is that the system is a lot better, with the exception of some of those outliers.

The last three slides I want to show you are an evaluation on 50 pages from scribes that are definitely not in the ground truth set we had before. These slides show how the error rate compares to the transcription accuracy: if you compare the ground truth version with the HTR run on top of it, you get a per-page statistic. Now the red and green have a slightly different meaning here: green is the word error rate and red is the character error rate. This is, again, the old trained HTR, and what strikes you is that the axis on the left-hand side runs up to 80%, so the columns are rather high; the average character error rate we get out of that is 23-something percent. If we compare this with the HTR+ system that I was able to train a couple of days ago, the number of character errors drops by 2%. This is still far from where we want to go, but it really shows that improvement is possible. What you will also notice in the slide title is that this is without a dictionary. This is an interesting part that still needs further investigation into what we want to achieve: given dictionaries, like for the diseases, where we have a database with a lot of them as recorded, and for the names, which we know from our databases, we could just overlay those and hope that the columns drop or diminish as well. As a last slide: if you apply the training dictionary, an interesting phenomenon appears, because the training dictionary contains only the words the system learned in training, and those are not necessarily the words that you need to recognize in the data set itself. So here, if you pay close attention, the character error rate is up at 23% again. If we add a perfect dictionary, we do expect it to become lower. But, ladies and gentlemen, this is the end of what I can bring you.
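The "overlay a dictionary" idea can be sketched very simply: after recognition, snap each word to the closest entry of a domain lexicon (diseases, names) when the match is strong enough, and otherwise leave it alone. The lexicon entries and the cutoff below are invented for the example:

```python
import difflib

DISEASES = ["Auszehrung", "Lungensucht", "Wassersucht", "Typhus"]

def dictionary_correct(tokens: list[str], lexicon: list[str], cutoff: float = 0.85) -> list[str]:
    """Replace tokens by their closest lexicon entry when the match is strong enough."""
    corrected = []
    for token in tokens:
        match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return corrected

# Hypothetical HTR output: 'Waßersucht' gets snapped to 'Wassersucht',
# while words far from any lexicon entry are left untouched.
print(dictionary_correct(["gestorben", "an", "Waßersucht"], DISEASES))
```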
OK, we have time for maybe one or two questions. There's a microphone coming to you. Thank you, this is just a comment. I know that the city archives in Copenhagen have been working with some death certificates, and they have also been working on the various types of disease, trying to figure out which categories they fall into. So perhaps you could go to them and see whether they have anything useful. Some of the disease names may be in Danish, of course, which may not be helpful for you, but the Latin words might be. OK, thank you very much. Any other questions? I wondered if you're trying to apply natural language processing to any of the data that you generate, and, particularly with dates, what sort of success you have had. Well, we have not gone into those experiments here, but of course applying different dictionaries, or dictionaries per column, especially for dates, and combining things might bring improvements; that is beyond the scope of the project for now. Thank you. OK, we had better leave it there for questions, so thanks very much, Eva, and I'll invite our next presenters to the stage.

Next up we have another couple of people who work on the READ project: Lowy Heavonin and Maria Cardio from the National Archives of Finland, and they're going to talk about their work together. So take it away. I hope you can hear me; I'm responsible for the project at the National Archives of Finland. We have over 200 kilometres of records, and we also have quite a large digital archive with 76 million images at the moment; the archives have 32,000 visits annually and over a million visits to our digital collections. Today we're going to tell you about our project with the collection of renovated court records. It's one of the largest collections of the National Archives; the collection starts in 1623, when the Court of Appeal was established, and these renovated court records are basically transcribed records from the lower courts, which produced the records for the Court of Appeal. At the moment we have around 800,000 pages digitised from the collection, only the so-called notification records, but the rest of the 19th-century and early 20th-century collection will be digitised by FamilySearch during the next year, so we will get around 5 million pages more. Our aim in the READ project is to provide the court records in machine-readable form in addition to the digital images. To achieve this, we produce two sets of ground truth, that is, text written out by specialists and by non-specialists.
The first batch was written last year by a community of family researchers working as volunteers. They produced around 300 pages of ground truth, and with that we produced a model for the court records with a character error rate of around 9%, which is quite workable and actually quite good, as none of the ground truth producers were specialists in palaeography or even in the court records. With this training material we got quite good, readable text, at least for keyword searches and so forth. This year we've ordered around 700 pages more of ground truth from a private company, and we've used this to produce new models, currently being tested with HTR+, and the first results seem to indicate that we can get the character error rate below 3%, maybe even around 2%. For comparison, the character error rate of the most common OCR engines on Finnish government records from the 1990s is around 80%, so with 18th-century handwritten text we are getting results in the same range as, or better than, those for 19th- and 20th-century typewritten documents.

This is an example of our first HTR model, which was produced with the family researchers' ground truth and has an error rate of around 12%, so it is a very early model. The main problems were that certain letters, such as n and m, are quite hard in 19th-century written Swedish, and the numbers as well. I'm sorry that these examples are all in Swedish, but anyone who knows German can probably make out some of it. Overall this is quite a good result compared to the first one; the main remaining problems are the double letter n and the letters with long ascenders and descenders. In comparison, here is a new model we finished just yesterday evening with the new HTR+. This was done with about 200 pages, and at first glance I think it is quite good, maybe around a 5-8% error rate. The main problems are the personal names; for example, here there is something read as "Toliwis" that should be "Fabias". But the good thing is that the keyword spotting that was spoken about earlier does recognise the correct place name, with 71% confidence for the correct reading, so I think we could use the keyword spotting to search for proper names and use this model for the basic text.

About next year, the future, and what we are doing with the court records: during 2019 we will be processing the 800,000 pages of court records with HTR, and we are going to provide the recognised collections through our digital services. At the moment we are building a new interface where we can show text and image side by side, and also an interface where our customers can do keyword spotting through our services. After that we are going to have 5 million pages more of this collection in digital form, so we will start a new project and hopefully process those 5 million pages in the following years in cooperation with the READ COOP. Thanks for listening, and we are happy to answer questions.

Thanks very much, we have time for a few questions. Are you going to combine the keyword spotting on your website with full-text search on your website? Well, basically we are building a frame where it looks to our customers as if they are using our services, when they are actually using services from the READ COOP. But yes, it will be possible to do both kinds of searches; the keyword searching will mostly be done by indexing, so we are going to index the whole material and provide keyword searches that way. So will a search again take 30 seconds? No, it should be instant, depending on your internet connection.
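The reason an indexed search can be instant while on-the-fly keyword spotting takes tens of seconds is that all the matching work is done once, up front, when the index is built. A toy sketch of such an inverted index over recognised text (the page ids and text are invented):

```python
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every lower-cased token to the set of pages it occurs on."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token.strip(".,;:")].add(page_id)
    return index

# Hypothetical recognised text for two court-record pages.
pages = {
    "court_1745_p12": "Anders Andersson anklagad för stöld",
    "court_1745_p13": "vittnet Anders Persson från Åbo",
}
index = build_index(pages)
print(sorted(index["anders"]))  # both pages, found without rescanning the text
```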
Any other questions or comments? Thank you very much. I'm fairly new to this area, so I've heard the terms CER and WER, and you get one number for those scores. Have you looked into the actual subtypes of error, what actually causes the error rate to go up in your corpus? Have you looked at letter combinations, for instance? You said a double n created a problem, and I could think of other combinations of letters that might be difficult for the program to recognise. Yes, we are currently investigating the script used during the 19th century: t's and f's are hard, o's and a's are hard letters, typically because their strokes are often separated, and the numbers and the personal names also cause problems. The main reason the personal names are hard is that we don't have all the names in our ground truth material, so they are problematic, especially with the court records, where there are tons of place names and names of estates or farms that do not appear very often. Would you agree that a combination of, say, statistically based transcription and some kind of script-based post-processing of the automatic transcription would be a way forward, where you look for patterns or faults or mistakes that you might be able to solve by means of a rule? For instance, where you run into a possible problem, you could base the solution on, say, the likelihood of a double n in your corpus, because you've already identified that. Would that help, would you be able to reduce your character error rate or your word error rate in that fashion? Yes, I think that might be a solution. On the other hand, we are just an archive, and we aim to make the text downloadable for customers and scholars so that they can do the error correction themselves, and if we are going to do something across that many pages, I think we will be satisfied with an error rate below a certain level. Where the place names are concerned it is problematic, but we can try to use, for example, a dictionary of place names or personal names and see whether that gives a smaller error rate; that is what we are currently investigating, how we should build such a model for the HTR. Thank you.
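The questioner's suggestion, deciding between known confusions such as "m" versus "nn" on the basis of corpus frequencies, could be prototyped along these lines; the confusion pairs, the frequency table and the margin are purely illustrative:

```python
from collections import Counter

# Known confusion pairs for this hand, e.g. 'm' sometimes comes out where 'nn' was written.
CONFUSIONS = [("m", "nn"), ("f", "t")]

def rule_correct(word: str, corpus_freq: Counter, margin: int = 5) -> str:
    """Prefer a variant produced by a known confusion if it is much more frequent."""
    best = word
    for wrong, right in CONFUSIONS:
        candidate = word.replace(wrong, right)
        if candidate != word and corpus_freq[candidate] > margin * max(corpus_freq[word], 1):
            best = candidate
    return best

# Hypothetical frequencies gathered from pages that have already been corrected.
corpus_freq = Counter({"kvinnan": 120, "kviman": 0, "husbonden": 80})
print(rule_correct("kviman", corpus_freq))  # -> 'kvinnan', the corpus-supported reading
```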
Anyway, it's 95% more than we have right now. Are there any other questions? You've just mentioned place names and gazetteers. The problem with using gazetteers for 19th-century place names and personal names is that they are often written slightly differently and there is no standardisation. Are you contemplating historical gazetteers, or working together with institutions in that field, or something like that? I think we could work, for example, with the Institute for the Languages of Finland. On the other hand, I think we will skip the use of language dictionaries, because studies with modern OCR for Finnish show that you shouldn't use a dictionary: you get a lot of wrong results, Finnish analyses applied to Swedish text, when using dictionaries in the software, and it might be the same with HTR; we will have to investigate that. Thank you. We'll leave it there for the questions and move on to the next speaker, so thanks very much.

I'll welcome the final speaker in the session to the stage: it's Mark Pote from the Amsterdam City Archives. I'm going to present the results so far of the whole tool that we have developed in the last number of months. This project is called in Dutch "Crowd leert computer lezen", which means the crowd teaches the computer to read, so we want to use the crowd to read historical documents. For this we use the notarial archives in Amsterdam; we have been working with this archive for a couple of years, and here are some basic figures. At the moment we have 6 million scans, which is roughly 30% of the archive, and we want to have it completely searchable, completely transcribed, in a couple of years. The material runs from the late 16th century until the early 20th century, but we are focusing on the early modern period, the 16th to 18th century. In the last couple of years we have been working with traditional indexing of names and locations, and we do this with a crowdsourcing tool called Vele Handen, which was built some years ago by the company Picturae, and a lot of archives have used it because there is a large crowd on the platform. Until now we have been indexing with this. I'm just going to skip this slide, but the total number of scans we have at the moment in the city archives is almost 35 million. This is what we have done so far: we have a group of about 500 volunteers who read the documents, say what kind of document it is, mark where there are names and locations, and add some remarks about what the deed is about, and we have a basic search engine covering about 250,000 documents, so that's about 1 million scans, which is 5 to 10% of the archive. But of course we want to do the complete archive. We have been working on this for two years now and we only did 5 to 10%, so we don't want to go on for 20 years or more; we decided to speed things up, so that in 2025 we have the complete archive searchable, whether by names or full transcriptions or a combination of both; we'll see, depending on the technology, I think. To do this we decided to also work with HTR, but the tools that are available now were not suitable for us, because we wanted to use the crowd: we have more than 500 volunteers actively working on this archive already, and we want to keep them in the same environment. That's why we decided to use the same Vele Handen ("many hands") platform for the HTR transcriptions as well. This is just a little bit of what we have done so far: there was a set of five notaries, 18th century, already transcribed, just to see what happens and what kind of models we get. The best models we have so far, based on one notary, have a character error rate of between 6 and 7%, but what we actually want is almost 100% correct transcriptions in our database. So the whole idea was to set up two projects.
In the first project, on data entry only, the crowd makes the transcriptions, starting with 18 notaries: the crowd indexes or transcribes some 100 pages, and then we start making models. If a model is good, we can go to the second project, where the same crowd gets the HTR results and improves the transcriptions until they are almost 100% correct, just to see how this works. I want to go to the live version of the website, which has just been soft-launched, to show why we use this. We have all sorts of things that are interesting for crowdsourcing: there is a credit system, so every time you transcribe a document you get some credits, and afterwards we organise all sorts of events and you can buy books with your credits, to keep people active. They also really like the events and coming together: they are all working from home, and two or three times a year they come together in the archive to see how things are going, and they become friends. There is a forum where people can ask questions or post funny things they find in the archive. And as you can see, we have been working for only one week and we already have some results transcribed; there are two notaries for which we already have more than enough scans transcribed, so in a few weeks this will be done by the crowd and we can build our models. I would like to show a little bit of what we did, which is basically building the transcription workflow in Vele Handen, with all the tools we have there. So there are all sorts of possibilities: this scan is too hard for me, give it to somebody else; there is something odd on the scan; or I want to see the next scan in the same series so I can read a little bit further, just to know what this is about. Basically, people start transcribing here, and then a second person always gets to check it. Let me show you quickly how the correction works: if you are a more experienced crowd user, which can be a volunteer or somebody working in the archive, you get the same transcription, the same document, and you simply decide whether the volunteer has transcribed it correctly, and then finally we have a document that is ready to use for the HTR model. I think that, since we have all the volunteers there and the whole crowd community as well, this is a way to get a lot of transcriptions and to start building really big models for these 17th-century deeds. So that is what I wanted to show you. Any questions?

Thank you, Mark. I've got a crowdsourcing project in London, so I was really happy to see that. Have we got any questions in the audience? Very impressive. It looks like you are just at the beginning of testing the new system, and the example you've shown is of input used to create ground truth. Have you tested with the users then receiving the HTR output and editing it? I'm going to talk a little later about our own experience; our users have found correcting the output to date quite frustrating, although that was with the old system. Well, that is actually the second project: there we will have HTR-made transcriptions, but for now we don't have that yet. Of course, we do have the second stage within this project, which means that you check what other users have done, and that is working perfectly, so we can see how this will eventually go. Of course, it will depend on the character error rates whether this is satisfying or not: if you have only a 3-4% error rate, it will be fine for the volunteers; if it is a lot more, it will be frustrating, because you have to change everything. Thank you.
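One obvious thing to measure in such a two-stage setup is how much the checking pass actually changes. A small sketch over stored (first transcription, checked transcription) pairs, with a data format that is assumed here rather than taken from Vele Handen, could report both a document-level and a rough word-level correction rate:

```python
def correction_stats(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Share of documents changed by the checker, and share of words changed overall."""
    docs_changed = sum(1 for first, checked in pairs if first != checked)
    words_total = sum(len(first.split()) for first, _ in pairs)
    words_changed = sum(
        sum(a != b for a, b in zip(first.split(), checked.split()))
        for first, checked in pairs
    )
    return docs_changed / len(pairs), words_changed / max(words_total, 1)

# Hypothetical (first transcription, checked transcription) pairs.
pairs = [
    ("compareerde voor mij notaris", "compareerde voor mij notaris"),
    ("Jan Janszoon verclaerde te hebben", "Jan Jansz verclaerde te hebben"),
]
print(correction_stats(pairs))  # -> (0.5, 0.1): half the documents touched, one word in ten changed
```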
Any other questions? What is the current error rate for your manual transcriptions with the two-stage process, and how many words or characters are corrected by the second stage? On the left side you have the first transcription by some user, and on the right side there is the correction. The question is how many things are corrected overall? We started just this week, so of what you can see here, which is more than 100 documents, 37% were corrected. If we talk about the indexing project we are doing now, we did around 250,000 deeds with the crowd, so almost 1 million scans, in two years. But the question was how many errors have been corrected in this one? At the moment I cannot say; it would be interesting to compare that to the automatic transcription, and some volunteers are more accurate than others as well. I think we had better leave it there for the questions, so thank you to Mark and to all of our presenters. Thanks everyone.