He is at a library in Wolfenbüttel and he is a great TEI, can you say, programmer. TEI 2016 is a great place, if you're using TEI, to check out what you can do with it. And thanks for talking about the TEI. Very good, there we go.

The Wiener Zeitung is the oldest newspaper still in existence today; it is still issued today and was founded in 1703. At the Austrian Academy of Sciences we have just started a project to make available, for a start, 420 issues from the 18th century, spread more or less equally across those years as far as they are available. The one thing we obviously need to do is to get a large amount of text into a state that is actually readable. The images are available through the Austrian National Library, the ÖNB, as part of their ANNO project, Austrian Newspapers Online, and they have already done some basic OCR. But as most people know, with Fraktur that just doesn't work. To give you a few examples of what the pages look like, I have collected six different layouts of the front page, and as you see, that already makes for quite some differences. And it just goes on like this: there are no two issues that are exactly alike when it comes to layout. Manually transcribing all of it is obviously a no-go; you can't do this. At the moment we have about 100 issues in our pipeline, and that already leaves us with about 2,500 pages. For our first set we plan to make available about 7,500 to 12,500 pages in 400 issues; that depends a bit on what we can actually get through the pipeline.

As we started out, we of course came across the usual difficulties. As you can see at the top left, bad baseline detection, stemming from really bad print definition and bad lines. If you have speckled images you get a lot of noise that is either recognised as text or, usually, as separators. And so on. So we came up with a workflow of pre-processing, processing and post-processing that currently looks like this. We gather the images and do some basic automatic deskewing, which is necessary because the images we get from the Austrian National Library are in basically every state you can think of. I think I found one image that was actually straight, and I think that was an accident. So we do some automatic deskewing just to get the pages straight. Actually, the results improved by more than 50% just by deskewing. Then we upload to Transkribus and create the documents. For the layout recognition we use ABBYY at the moment, because it currently gives the best results when it comes to correctly detecting the reading order: two columns, and especially mixed content with two columns, then a single column, then two columns again, and so on. The results of the layout recognition by ABBYY are actually quite good at the moment, so we just stuck with that. In some cases the OCR/HTR fails; then we just start again and see how far we get when we process page-wise. Then we make some corrections on the text, which is currently done by Innsbruck University Innovations; then we retrain the model and see where it goes. Currently our model is actually quite good. The current model, which admittedly comes with a few difficulties, is the last one we trained on a set of, as you can see, 340,000 words, and it actually gives us quite a good character error rate. So we are quite pleased at the moment, and we hope we can improve that later on when we increase the size of the training set a bit further.
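As an aside on the deskewing step mentioned above: the talk does not say which tool is used, but a minimal sketch of the idea, assuming an OpenCV-based projection of the ink pixels (all names and the approach itself are an illustration, not the project's actual preprocessing), could look like this:

```python
# Minimal deskewing sketch (assumption: OpenCV; not the project's actual tool).
# Estimates the page skew from the ink pixels and rotates the scan to compensate.
import cv2
import numpy as np

def deskew(image_path: str, output_path: str) -> float:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarise so that ink becomes white (255) and we can collect its coordinates.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # OpenCV reports the rectangle angle in a limited range; map it to a small
    # correction angle (the exact convention differs between OpenCV versions).
    if angle < -45:
        angle = -(90 + angle)
    elif angle > 45:
        angle = 90 - angle
    else:
        angle = -angle
    h, w = img.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, matrix, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_CONSTANT, borderValue=255)
    cv2.imwrite(output_path, rotated)
    return angle
```

A real pipeline would need more robust handling (ignoring marginal noise, page borders and so on), but the basic idea of straightening the scan before layout analysis is the same.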
But as you can see, there are a number of difficulties involved. We have a lot of tables that are just part of the page. They may be combined with lists and normal running text, and of course if you have some French text in there, which happens frequently, it changes from the broken script to an Antiqua one. So there are a few challenges for the recognition, but at the moment it works quite well. The picture you saw earlier, this is what we get out of it after some minor work on the tables. This is our result and I think it is pretty good for starters. We need to see how far we can get with processing tables especially, which is one of the few things where I think I still need to have a chat with the developers. But apart from that, we are quite pleased. And if you want to follow our progress, you can always visit our project homepage at the Austrian Academy of Sciences, where you can see our current status in our reporting tool, which just polls the API and looks up the status of every single page, so you get a small waymark of where we currently are and where we want to go. That's it from my side. I hope it wasn't too long or boring. If you have questions, just shoot. Thank you very much.

One of the most important points: you have a character error rate on the test set of about 4% at the moment. For printed text, honestly, that is not very good. And you have 51,000 lines of training material, so the idea that even more training material would help doesn't seem so reasonable; you already have a huge amount of training data. So where are the difficulties? We have seen numbers here for handwritten text recognition that were better than those 4%, and I'm surprised that with printed text it is still so bad.

Well, one of the difficulties is the fact that there is a lot of mixture. We have Fraktur and we have Antiqua, just mixed up more or less randomly. And the fact is, as you see, we have at least six major different kinds of layout, which usually means different typefaces used by the printers. So most likely a bigger training set is going to be better, or you have different models for the different typefaces.

If I may say so, these numbers are too pessimistic. For running text, the average is below 1%. The main problem comes from all these punctuation marks. The test set has a lot of these table things inside, and of course the error rate tool measures very exactly: if one space is too much or there is any other problem, it is counted as an error. For running text it is much better.

Especially, I think, white spaces are one major point, because quite often, you can't see it here very well, but quite often in the running text there actually is no space between two words. They just didn't manage to fit the text into the line in the column, so they pushed it together as far as possible, at times really completely leaving out any white space between the words. We can read it, but when we transcribe it we add white spaces in between, which can lead to, let's say, up to 10 errors for one line, which makes the error rate go up really badly. And as you see there is a lot of this tabular material, with for example figures like 2.5, 2.8 and so on, which also cause quite a lot of trouble, because not all of those are available as a Unicode character, so you have to transcribe them differently, which then again may give you three errors for just one character. So the CER definitely is too pessimistic.
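To make concrete why the measured CER overstates the problem, here is a purely illustrative sketch (not the project's evaluation tool) of how a character error rate is usually computed, and how word boundaries restored by the transcriber count against the recognised text:

```python
# Sketch of a character error rate (CER) computation and why missing
# whitespace in the print inflates it. The example strings are invented.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# The transcriber adds the word boundaries the printer squeezed out,
# so every restored space counts as one error against the recognised line.
recognised   = "dieNachrichtausWien"       # text as it appears in the print
ground_truth = "die Nachricht aus Wien"    # transcription with spaces restored
print(f"CER: {cer(ground_truth, recognised):.1%}")  # 3 missing spaces -> ~13.6%
```

Three "errors" in a line like this come purely from whitespace the compositor left out, which is exactly the effect described above.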
And as I said, we have to see how far we can go with maybe developing different models for different typefaces, because they can be really, really different. Any questions?

How big is the team? How is the project staffed? What is the scope of this undertaking?

Well, at the moment we are trying, in the first two years, to make available about 450 issues from the 18th century, basically five issues per year of those available to us from the Austrian National Library. The team is rather small; it is usually people who are on different projects and just spend half of their working time on this one. So it really is small, and we do not have too much time when it comes to correcting, which is the reason why at the moment we spend the money that is available and give it to IUI to actually correct what we get back, improve the training, and then see how good we can get, because later on we won't be able to do a lot of correction. So we try to push the character error rate down as far as possible. Any questions?

How do you actually deal with the front pages? Did you just cut them out, or did the recognition work on them?

It usually works, okay. The thing is...

Because usually you have less training material for them and they are full of graphical elements.

What is really complicated and usually gives a lot of problems is a combination of a graphical element and text, which usually is the case in the logo. So the upper three don't pose too much of a problem. The lower three are usually quite problematic, because the newspaper's name in the logo is usually not recognised automatically: the layout recognition assumes it is some kind of graphic and just ignores it, so there is no baseline and the HTR can't work on it. Which again makes the CER a lot worse, because in that case we have, I think, 25 wrong characters, because they haven't been recognised at all while we add them to the transcription, so there again are quite a lot of errors. But overall, when it comes to the running text on the front page, it is recognised with the usual results.

I just wondered how you managed to get such good results on the table, because the table you showed was amazing, and after everything we heard yesterday about the problems with tables I was just amazed to see how good the results were. So what are you doing that seems to work so well?

That involves some tweaking, to be honest. Basically, what we do is add a custom structure type to everything that we dub a row in the table, and in order to get the columns correct we add a pipe after each cell. So that is something that does not work automatically. The text recognition itself is quite okay, as far as that is possible given what is in there. So the text is okay, but we add some markup to get to the cells and the columns.

Great, last question. Thank you very much.

Next up is Alexander Djerkov from the University Library of Belgrade. He is the head of that library and has published about smart and sustainable libraries. Thank you for talking about your experiences.

Thank you very much. So hi, everyone, I'm very glad to be here and thank you truly for having me. As a matter of fact, as you can see, I'm coming down with the flu. Unfortunately, the colleague who was supposed to come here was in an even worse condition. So I drove here, and I'll try, and I hope that my voice will last long enough. I have been spraying my throat all day long, so we'll see how it works.
This is not much of a presentation, really; some of the other projects are much more advanced. More or less we just wanted to show up and say that we are still around, that we are putting some effort into this, and to encourage other people to join this project. We find it very interesting and very important, especially since we are adding Cyrillic letters to this whole world of transcribing. So we thought this part of the world should also participate; the more we try, the more we see how it really works. Amazingly, it does work, right? So, the University Library of Belgrade is your partner; thank you for that. We have put in some library resources and we have also managed somehow to get our national Ministry of Culture interested, though not on the scale that we really wanted. I will mention two other projects later for which we have used some of their funds, and for next year I am hoping that I will be able personally to persuade the Ministry of Culture that this is a very big thing. It goes all over the region: for a script like Cyrillic, we have to take care of our history, we have to preserve this for Europe. And sooner or later, as Bulgaria became a member of the European Union, you now have Cyrillic letters on your currency. But it would be more fun if we add different scripts.

So the general importance is obvious. As to what brings us to this project, let me make one comment. By the end of the sixties, Robert Escarpit started to study the decline of reading, saying that we were approaching a world where people would not read. But what he should have said was that people would not read traditional texts as they used to, because the new sociological findings show that the population of the Earth has never been reading more than nowadays. We all read, everyone reads, every single moment of our lives we read; but what do we read? So what I see as the general importance of this wonderful project and of Transkribus is to make more really meaningful texts and resources available to people; otherwise people will read the underwear that has text printed all over it. Later on I'll make one more comment of that kind.

So those two projects with the Ministry of Culture are about how to read Cyrillic and how to bridge the gap between old manuscripts and the present. Let me just mention that practically nobody in Serbia owns a keyboard with Cyrillic letters; that speaks for itself, right? This ground-breaking effort in bringing new technologies into real life in a Cyrillic cultural environment makes us at the University Library of Belgrade proud that in the last five years we have been a tiny, small, but regional leader in digitisation, that we have brought many different things to our population, and that we have trained a lot of librarians and a lot of people who know how to do this. Let me just say that before we started doing that, four million digital objects had been made, and none of them properly. So it is an important thing to spread the word around and to train people how to use these things. Sometimes it is very tiny little details, but they will corrupt the sustainability of those objects in the future. So we think this is important too, and we are trying to learn.
And as you can see, I myself, being a professor at the University of Belgrade teaching Serbian literature, am neither trained, skilled nor educated for this, but I think that in order to preserve our past we have to make an effort to look to the future. We have assembled one working model, thank you very much, we have four more in our pipeline, and we have trained thirteen Transkribus users in different roles, seven transcribers and four editors, and in the next few months a few more will be added to this number. The focus is obvious, but I should mention this digitisation academy. It is a very fancy name, of course, and it is not a big thing at all, but it is a focus group and we have assembled a lot of people around it, so people even in very tiny libraries in Serbia are starting to get interested. On the other hand, there is a project by the Ministry of Culture to establish a real digitisation academy in Serbia, and that will be another major thing in the next five years. So if we can bring those two digitisation academies together, the small one that assembles people and this huge, very fancy project with a lot of funding, that will bring a new flavour to this.

My colleagues said to me, I have no experience myself, that this is a very easy tool to use and that they actually had a lot of fun. Not fun without effort, but the fun that comes when your effort shows a result; that, to me, is real fun. They said that they have trained a few transcribers from various institutions, so the programme is spreading, and that they work with them closely. Of course, we had to choose corpora and to evaluate which ones would be better. Right now we are doing what we can manage, but I hope that we will be sending you something even more interesting, because we hold a huge number of very old manuscripts; it is the third best collection in Serbia, at our library. We have picked two writers who were writing their notes in different languages, one in Serbian, French and German, the other in Serbian, German and English, and I think it will be interesting to see how that works. We have been preparing those for years.

And having mentioned Serbian, I should also add, as a follow-up to your comment about a possible future project on monolingual translation: we have a huge cultural gap in Serbia, because the language changed so immensely at the beginning of the 19th century that none of my students can read anything written in the 18th or 17th century without two years of specialist training. So as a follow-up to this: if we could make such monolingual translations available for them, that would be a real revolution. Not only to bridge between languages, but also within the same language that has changed so immensely, has been so transformed, that no one can read it nowadays. And German Switch was one of those who described the Balkan Peninsula, a very important figure in the history of science, and he left beautiful manuscripts, so we could include him here; he is one of those it would be very interesting for us to work on. We will find new authors and new text corpora, and as a matter of fact we have already found some, and we keep running into new situations, just like this one.
This professor, who is a very old man, is an artist, but he kept his diaries over the years, over long periods, and produced something like fifteen volumes of unique and really important material, something comparable to what we saw earlier; something we can relate to. So you see, it is not only about what happened two, three, four, five centuries ago; this also covers the current situation. New material is coming into existence all around us.

And this is the moment to make one comment, and I hope I will not disturb you with it. I myself am attracted by this idea of ground truth. If you have been reading philosophy for thirty years of your life and have seen how many times it has been so severely disputed that there is any truth around us at all, and then you come somewhere and you see all these people handling ground truth just as such, it is a beautiful thing, you know. I can't restrain myself from imagining bringing Derrida here among us to say something about truth. So hopefully, if we meet again, I will put myself in the position to impersonate one of those post-modern-era figures and try to comment on what happens to ground truth in a technical world and in these procedures. But at the same time, speaking about new materials, we should be aware that people have writing all over themselves nowadays. Can you imagine, just for a second, just to entertain you, can you imagine, in two centuries, when someone comes up with a collection of pictures of the bodies of people who lived inscribed with everything, and asks what it was that was written on them? So we do not know what will happen with projects like this; when you start asking yourself about the ground truth, you might end up in the future doing very, very different things and entertaining yourself. This is our real priority, and I keep bothering him, asking him: please come to Belgrade, really, we need you, and bring some of your people over; that is important for us. We will make more models, and we say: keep translating, keep transcribing and keep enjoying.

Thank you very much for this presentation, and for what ground truth means from a philosophical standpoint. Now let Günter show what I could not. Okay.

Yeah, so, thanks. Thank you. Actually, we were very happy that we were able to train Cyrillic as well, another alphabet, and the funny thing is that, as you can see on the learning curve, it had some problems, but it came down to 8% on the training set and to a certain percentage on the test set, and this was just 6,700 words. So let's have a look at the writing. 6,700 words is really not much. These are, I think, documents from the university where students made notes from lectures. You see, it is a very regular writing, and therefore the results are good, but nevertheless there is even some Latin script inside, like here. So there was already some challenge in coping with two different alphabets in one model, but the results are nice, and you can expect that if you double the number of pages or the number of words, so with 15-20 thousand words, it would probably come down to something like 8% or even below. Thank you. I'm really looking forward to your next presentation about ground truth and discipline.
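On the point about two alphabets in one model: a minimal illustration of why mixing Cyrillic and Latin is largely a question of the output alphabet is that the character set of such a model is simply collected from the training transcripts. This is only a conceptual sketch with invented example lines, not the actual HTR training code:

```python
# Sketch: the model's output alphabet is just the set of characters seen in
# the training transcripts, so Cyrillic and Latin letters can share one model.
training_lines = [
    "Предавање из историје књижевности",   # Cyrillic lecture notes (invented)
    "cf. Goethe, Faust I",                  # occasional Latin-script insertion
]
alphabet = sorted({ch for line in training_lines for ch in line})
print(f"{len(alphabet)} output classes:", "".join(alphabet))
```

The price is that rare characters from the minority script get very few training examples, which is why mixed-script lines remain a challenge at small training-set sizes.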
Thank you very much. Last but not least is our own Eva Lang from the diocesan archives of Passau. She brings together two of her passions in READ. One is history: she works as a history guide in Passau, leading people through the old and the new city. I once had the pleasure myself; it is really great, so if you are in Passau, look her up. And the second one: she works as a computer scientist in industry and at the university, and now at the diocesan archives, and she will talk about table processing, obviously one of the challenging topics around in Vienna, and today especially. Thanks.

Well, thank you, Tobias, for the nice introduction. I want to start with a quote that I received from people around my archive and that I kept hearing all around. Basically, what we keep hearing is that church registry books present information in table format, which is highly intuitive to the reader and easy to understand, so automatic processing and automatic recognition must be easy. Well, talk to the technical people and they tell you not to set your expectations that high; but we will get there. So I will show you what we have done as a manual process and then give an overview of where the automatic processing currently stands and where that is going to take us.

The standard workflow for us is: select the data you want, do segmentation and layout, transcribe, and do some quality assessment. We have heard that all over. The perspective I want to add for table processing is that you basically need a table template. So, define your template; this is part of our data selection process. In the segmentation and layout step you have the table layout versus text regions, and I'll come back to that in a bit, plus baselines and template matching. The transcription you can do either by hand or using HTR once you have a good model for your data. And for quality assessment, well, the technical world gives you different answers on the visual control the tools currently allow.

So let's step into what we have. We have a full corpus of about 800,000 pages, out of which we trimmed our data selection down to 1,579 scanned images. In those images we found, or rather we estimate, roughly 700 different scribes. Most often it is the parish priest whom we assume to be the record keeper, but as this is the past, we don't really know who kept the books, and sometimes the second in line in the parish put in his hand. So things get messy: you don't have one writer, not a nice collection like the Bentham papers with just one style; we don't have that. We chose this time frame because the handwriting was becoming more and more standardised, so the perception was that things resemble each other a lot more than if we went through the whole collection from 1600-and-something to 1900-and-something.
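Before going into the printed forms themselves, here is a purely conceptual sketch of what such a table template amounts to: a print category plus a header/column variant plus the column layout. The real templates are defined in Transkribus and PAGE XML; every name, column label and width below is invented for illustration (loosely following the death-record fields described next):

```python
# Hypothetical representation of a table template for matching; not the
# archive's actual data model, just the "category + column layout" idea.
from dataclasses import dataclass, field

@dataclass
class TableTemplate:
    category: str                                     # print category of the form
    variant: str                                      # header/column variant
    columns: list = field(default_factory=list)       # column names, left to right
    rel_widths: list = field(default_factory=list)    # relative widths, summing to ~1.0

death_record = TableTemplate(
    category="death-records-10col",
    variant="header-variant-A",
    columns=["Name", "Stand", "Wohnort", "Alter", "Sterbetag",
             "Begräbnistag", "Todesursache", "Priester", "Anmerkung", "Gericht"],
    rel_widths=[0.14, 0.10, 0.12, 0.06, 0.09, 0.09, 0.14, 0.10, 0.10, 0.06],
)
```

Grouping pages by which such template they follow is what makes the later automatic matching of columns against the printed ruling lines feasible.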
The printed forms in the books are similar in content. To elaborate on that a little: the church imposed a rule that the parishes needed to keep something like a standardised form for death records, mainly showing who died, what their occupation was, where they had lived, how old they were, what the burial date was, what the death date was, what the cause of death was, who the priest was, some annotations, and sometimes you have the court in there. So some pieces of that content information were given by order of the church and others were just interpretations. That is why we ended up, in that corpus of roughly 27,000 pages, with 11 different print categories all in all; if you take into account the printed variants of the headers and the columns, you end up with subcategories, something like 88 different templates, so it depends on how you define difference. I will show you some of these in a bit, but let me first walk you through the manual process we have been applying.

For segmentation, the first and easy thing, before the table editor was workable, usable and easy for us to use, was to draw text regions for basically anything that is considered a table cell. You will see in a moment how that differs once you apply the table editor, where you only segment the whole table area and then cut the lines to make a rectangular table layout. If you look at the threefold column, the writing continues over the graphical lines; that is a display issue in the UI, so for ease of transcription we decided to turn those table structures into text regions which cover the text line, basically what the technical people call the bounding box of the line. Then you set the baselines, which is a manual process that we outsourced to a subcontractor; I will talk in a bit about the automatic processing, which really gives us a lot of hope and which was changed, or rather became available, last week. And then you use the text editor to transcribe what you can read there.

Now, jumping into the automatic part: the technical partners within READ told us there are different technical steps to be carried out before we can fully automate the process, but we are on a good way, and in fact we have a workflow from image to something like an Excel export, which the colleagues in Vienna and in Grenoble set up for us. It works like this: you first define the templates, and I have brought four of these here. They all fall into the ten-column category, but you see the headers are different, and it was a little bit of trouble for Florian and his team in Vienna doing the table matching; you see that middle bit, and the binding varies depending on how much of the page you see, so that caused a little difficulty, for which we found a way to fix it: technically you end up with eleven columns here, with one having a very variable width. Once you have defined those table templates using the table editor, you can match them against your collection. The Vienna colleagues have a tool which can do that quite easily; it is not within Transkribus, so you need a separate tool here, and basically it gives you a rough alignment of the red template against the graphical lines in the print. In a further processing step that rough alignment is refined so that the red lines actually snap onto the graphical lines; so basically the dark black lines are now the overlay of the final result on the graphical lines.
In the next step you use the line finder, which was provided by our colleagues in Rostock and which really makes us very happy, because on the tables we tested it on this looks gorgeous. So here you only have the columns from the table template and the lines in there. Next you need to detect the rows; this is input from the colleagues in Grenoble, to make sure that you are looking at the books at record level. We want the information in a horizontal manner, so that every row belongs to a personal record entry. Once you have finished that task, you run the HTR on it, which you trained on the ground truth you provided, and you get out something like this. For those of you who can read the text: if you look at the last line, there is no word in German with just three n's; we do have something with three f's, but not in that century, so it should read "Kran" instead of "Kr" with a triple n. Those are just a few errors which we think we can tackle once the correct dictionary is applied in the process.

Checking on the quality: this is just an output of the text comparison tool which Transkribus offers, and what I find a bit confusing is that it shows you the word errors; the green portions are the correct ground truth and the red portions are the wrongly read words. What I need to add as information: the results here were trained on four scribes, and the data you see here is a fifth scribe, so the system did not know that hand. We put in as training data, let me do the math correctly, four times thirty is 120, minus a bit, so something like a little over 100 training pages, and tested that on 12 test pages; that was the construction of the network. We then applied it to a batch of 30 pages from this fifth scribe, and these are the results when applying the dictionary of the four scribes. What we get to see here is basically: the better you know the hands, the better it gets, but already, as long as the data is similar, things are somewhat workable for us. 50,000 words. 50,000 words, okay, thank you.

So basically, as a summary: the table editor is a useful tool for segmentation. What our transcribers give us as feedback is that handling the transcription is more user-friendly if you still use text regions, but that might just be a perception of the technical tools. The HTR quality is good and keeps improving when it is trained on the scribe you are trying to evaluate, when a more proper dictionary is used, and when more training material is supplied; currently we are looking at character error rates of 13 to 17 percent. Okay, that was the quick excursion into table processing. Thank you very much. Questions? In the back.

Two questions. The first one is: if you have got table templates, how do you know which one to apply to a specific table? And the second is: how much of the work you have shown here is actually done within Transkribus; is the table editor available, is the template application mechanism available?

To your first question: I went through the effort of visually grouping every single page of those 27,000 pages and categorising them, coming up with those 88 very different, unique things, which were grouped into 11 hyper-categories. What I found in that process, which was basically creating ground truth for table matching, is that most of the images in a batch correspond to one table category and only a few to a second one. So, well, experiments need to show how many of those we actually need in the end, but we have a clear idea; we know which category to expect.
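As a purely illustrative aside, and explicitly not the measure being developed in the project, one naive way to decide which of several templates fits a page would be to compare the template's expected column separators with the vertical ruling lines detected on the scan. All numbers below are invented:

```python
# Illustrative scoring of how well a table template fits a page: compare the
# template's expected column separator positions (fractions of table width)
# with the vertical ruling lines detected on the scan.
def template_fit(template_separators, detected_separators, tolerance=0.02):
    """Both arguments are lists of x-positions normalised to [0, 1].
    Returns the fraction of expected separators with a detected line nearby."""
    if not template_separators:
        return 0.0
    hits = sum(
        any(abs(expected - found) <= tolerance for found in detected_separators)
        for expected in template_separators
    )
    return hits / len(template_separators)

def best_template(templates, detected_separators):
    """templates: dict mapping template name -> list of separator positions."""
    return max(templates, key=lambda name: template_fit(templates[name], detected_separators))

# Hypothetical example with two ten-column variants:
candidates = {
    "10col-variant-A": [0.10, 0.20, 0.30, 0.42, 0.54, 0.63, 0.72, 0.81, 0.90],
    "10col-variant-B": [0.12, 0.25, 0.37, 0.48, 0.58, 0.67, 0.76, 0.85, 0.93],
}
found = [0.11, 0.20, 0.31, 0.41, 0.55, 0.63, 0.71, 0.82, 0.89]
print(best_template(candidates, found))  # -> "10col-variant-A"
```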
To take that a little bit further, what we want to achieve as well is something like the system giving us the right template to apply, so Florian is working on a measure for how to evaluate which template fits best. To your second question: the template creation is something you can do in Transkribus. For the application of the template to your Transkribus document you need to use the nomacs editor, which is a product developed by the CVL lab here in Vienna, so that is one part we take outside. But in the streamlined workflow we have, if you know how to use Python, there are command-line tools available which tie the whole process together, and all of that is available on GitHub. So once you have the systems installed, and the tools are all open-source tools, you can run that on your machine. It is available, but it is not yet offered in the graphical user interface.

Maybe just a short comment: I think this is very much experimenting and testing, so we are of course waiting until the matching and processing of the tables is stable and shows results which are really satisfying, and then of course it will be part of the usual workflow in Transkribus. But the images I showed you in the second part were actually produced automatically; all of this was done by the system, with no manual work in there. August, do you want to say something about nomacs? There is a tutorial video, it is in English, it is a good video, and we actually have all the data and software available, so you should use it. Thank you for all the work in this area of the development. Thanks. Thanks a lot, Eva, and thanks a lot to all of you.