On the panel today we have four speakers. Three of them are from the READ project, and then we have one person representing the user perspective, a Transkribus user. I'm going to introduce them briefly, but they're mainly going to introduce themselves, and I've asked them all to start off by talking for just a couple of minutes about what their role is on the project, how they use Transkribus or READ project technology, and why it's important in their work. I've also asked them the question of where they see Transkribus in five years' time. So on the panel we have, first of all, Roger Labahn from the University of Rostock, Philip Schofield from University College London, Melissa Terras from the University of Edinburgh and Debbie Cornell from William & Mary Libraries. We'll start off with their introductions. Let's start with Debbie, if that's okay. Debbie Cornell, William & Mary Libraries. I gave a presentation this morning, so you can understand how we're using it. I think I've gained enough information here to go back and explore more of the tabular data issues, so that's great, and the text-to-image work. We're looking to explore more of the complex relationships in the material we're having to transcribe. As for five years' time, I'm hoping more of those capabilities are built into Transkribus itself instead of our having to build them outside the tool, and we're also hoping to secure more funding for it and bring it to wider use in the US for our projects. Thank you. Hi there. Hi. My name is Melissa Terras. As of two weeks ago I am at the University of Edinburgh.
Before that I was at University College London, so I've just moved, and I've been part of the Transcribe Bentham project since it started. I was at that first meeting about ten years ago when we decided that we were actually going to do it, and I've been part of that project ever since. I do have a background in this: my PhD was in image processing and artificial intelligence applied to handwriting recognition, so I've always had one foot in the computational sciences. My new role is Professor of Digital Cultural Heritage at the University of Edinburgh. I work with big libraries, archives and museums, and with small libraries, archives and museums, on a variety of digital tools and techniques that they can use. And I'm fascinated to be part of the Transkribus team: they're working on the report now, working with the Bentham team on what they're doing, helping us on the user issues and the user interface issues, but also on the links to libraries and archives and the institutional relationships that we can build up with people. In five years' time I'd like to see a whole range of things happening. I'd like to see more institutions involved. I'd like to see mechanisms for people to be able to run their mass digitisation of manuscripts through this kind of software, to make it usable and useful for researchers. But I'm also fascinated by a meta-question, which is how this is going to change historical research within five years. We're changing the parameters of the types of searches, and the types of access to documents that people can have, from doing it themselves, one by one, to changing the scale. I want to think about what that means for doing historical research with digital resources. I have questions about the data that's getting fed in, especially image collections, but also about the transcripts that then get generated, and what we can be doing to collate these so that other people can have access to them.
And about how this fits in with the open data movement: open access to data sets, and licensing them in ways that other people can take them on. I see Transkribus as part of this wider movement within libraries and archives towards digitisation, making their content available and accessible to as wide an audience as possible to do powerful historical research. Hello, Philip Schofield. I'm still at University College London. I'm Professor of the History of Legal and Political Thought in the Faculty of Laws, but I'm an historian, and I'm also general editor of the new edition of the collected works of Jeremy Bentham, so I spend a lot of time doing textual editing. Bentham lived from 1748 to 1832. The Bentham Project was established in 1959, and I've not been there since 1959, but I have been there a long time, since 1984. There are about 100,000 pages of Bentham's manuscripts in the UCL collection and another 15,000 to 20,000 in the British Library; those are the two major collections. Bentham himself published 40 to 50 books during his lifetime and would destroy the manuscripts on which those books were based, so the manuscripts are in addition to the printed texts. To date we have published 33 volumes of the edition: 12 of the correspondence, the rest of Bentham's works. My back-of-an-envelope calculation is that there will be 80 volumes altogether in the completed edition. So this is a major, quite unique project. Well, maybe not totally unique: there is the Marx-Engels edition, which is probably comparable. But this is a large arts and humanities project with a massive amount of material to transcribe. We worked out that, at the rate we were going with transcription, it would take us to the end of this century to finish.
And so, with Melissa and other colleagues, about seven or eight years ago, not quite ten, we established Transcribe Bentham as a crowdsourcing initiative, a scholarly crowdsourcing initiative in which we asked volunteers to transcribe Bentham's manuscripts. This was also linked to the digitisation of the Bentham papers, which allowed the uniting, in digital form, of the collections from the British Library and UCL, a collection separated at Bentham's death. So that's the first time they have been brought together since he died. I think part of the success of Transcribe Bentham was that it was linked to the edition; there was some further point to doing it apart from simply transcribing the manuscripts. Also, before we set up Transcribe Bentham, we had a very well-constructed database of the Bentham papers, so we already had our metadata in place: up to 15 fields of information for each and every one of the 68,000 folios in the UCL collection, those folios translating into some 100,000 pages. Our volunteers have now transcribed 19,000 pages using our crowdsourcing website, and those transcripts are now feeding into the editorial process as we take them forward to the critical edition. What I would like to see is a complete transcript produced, and you can see where HTR would be most welcome there: instead of waiting 10 or 15 years for a transcript, we might get it tomorrow, which is what we all want. We don't want to be utopian rather than realistic, but still. Then, as a starting point for improving the transcripts, HTR-generated transcripts might be put out to volunteers, who could then correct them and feed even better transcripts into our editorial process in due course. We've also started to use the OCR facility in Transkribus, with excellent results: taking late 18th-century print editions and very quickly putting them into the editorial template which we use to send them to the press.
So what I would like to see developed is a workflow which starts with images, goes to HTR transcripts, then to crowdsourced input, into an editorial process, and comes out at the other end as text which is ready to be sent to the typesetters to be put into a critical edition. But also making available a website for the many manuscripts we don't publish, for various reasons, the main one being that what we publish are the core Bentham works; there are a lot of additional manuscripts, such as plans, notes and earlier drafts, which are still of interest to scholars. Having those available on a web platform for people to search and look at would also allow those really interested in the critical edition to see what we've done in the edition compared with the raw starting point of the manuscript. So I think what we have is a set of materials of great historical importance, still of philosophical relevance, which are of great interest across a wide variety of disciplines, and making them more widely available will also feed into better scholarship. Thanks for that. My name is Roger Labahn and I'm a mathematician at the University of Rostock, and I guess most of you are already fed up with my introduction, because what we are doing here is not mentioned very often. In Transkribus and the READ project I'm leading the CITlab team, and the CITlab team in Rostock produces at least one component for the HTR, for the text recognition, and also for line detection, for the decoding, for the keyword search. So we are a technology partner in the READ project. Our own work with Transkribus is quite different from what most of you are doing with it, because we are using it as a destination for our software, in a sense, not using it as users.
But we are using it, and this also gives me the opportunity to mention it, because we ran into some discussions about whether we would like application and testing to run via Transkribus, and the answer is yes, because setting up a single project without Transkribus means a lot of overhead work for us, for instance to handle data and so on. We are more than glad that Transkribus takes that burden from us, so we would love to have a workflow where we develop new technology and you use it via Transkribus, with all the data transfer and the presentation and so on handled via Transkribus. We are also using Transkribus for ground truth, of course: the data you are delivering to Transkribus, as far as we are allowed to use it, is more than welcome for testing new algorithms and so on, because we are far from being able to produce ground truth on our own. So this is really a very valuable contribution in one of the areas which Günter always has in mind when he speaks about the various aspects of Transkribus. Where do I see Transkribus in five years? First of all, I would like to point out that I would still love to have Transkribus accompanied by ongoing READ projects. Maybe we have READ two and READ three, or whatever it will be called, some continued READ project, and it would be great for us if we were on the team, on the board, then as well. Then, something more along the lines of what Philip said, I guess: I would love to see Transkribus keep developing. While we are now at version 1.3, I believe in five years we should be at 3.1, and that's probably also the expectation which sets a benchmark for those who do very valuable work for us in setting up this whole working environment.
Then, apart from these things, I would love to see Transkribus on top, at least in Europe, maybe in the world, as the state of the art in technology for automatic text recognition and everything connected to that. I would love to have it acknowledged and appreciated as a data source of ground truth for technology development, as I pointed out earlier: for many people working on technology development in this area, getting good ground truth is really a very tough question. Of course I also see the different demands here in the audience. We had frequent discussions about that: for instance, scientific demands, and having a virtual research environment, which is the basis of the ongoing READ project application and was the goal at that time. So it would be great if this virtual research environment for scientific purposes were well established and widely used; this meets what Melissa said. On the other hand, I see that there are completely different demands from archives and libraries, people who have mass digitisation in mind, mass processing, rather than a perhaps higher-quality digitisation of a smaller collection. So I would really love to see Transkribus find its place in this wide range of demands, meeting the requirements of archives and libraries as public institutions, and also what we already know from commercial usage, like big publishing houses and so on, who can of course also use the technology. So there's a wide range of different applications, and I wish for Transkribus, in the next five years, to find a good, well-established place across all of these demands. Thanks, Roger.
I wonder if we can pick up on that last point about Transkribus having multiple applications and users. Are there any strategies or ideas about how to balance this? We've got archives and libraries, we have computer scientists, maybe commercial users as well in the future. How do we balance the needs of these different groups? We have to keep the core driver of the project, which is helping academic researchers; that was a core element of it, and it would be a shame to lose that. But at the same time there's a hope that it will be sustainable at some point, which means there has to be a revenue stream, and this is the hardest balance with these digital projects: how you keep something going, especially if the funding stops. So there has to be a balance, but within that, this isn't a startup like Google or Facebook. It is a product which was designed around a particular academic research task, so we shouldn't lose sight of the fact that we have communities engaged with it, that we are hoping to open up historical manuscripts for people to research on, and yes, there's a relationship with institutions, and yes, there will be a commercial relationship with other people. I'm sure that when some of the bigger genealogy companies see the kind of successes which are coming out of this, it will be very attractive to them, and there has to be a mechanism for balance, to make sure that the original core community that this was built to serve can still be served, in a way that gives us the opportunities and the resources to do this kind of thing: to keep the machines up and running, to keep people's salaries being paid, and to keep the whole thing ticking over. So it is about balance in the future. I mean, as a historian, my desideratum would be to have one place where everybody's transcripts would be deposited.
So I'm not reinventing the wheel, especially if it's an unfamiliar hand, and I can see what everybody else has done. We've all tended to be very private about our work, and one of the great things about the Bentham Project is that anybody can have our transcripts, because we're there to promote Bentham scholarship. Maybe, if there is a republic of scholarship, that's what we should be aiming for: one place where we deposit everything we do, no matter how good or how bad, and then let other people look at it and improve upon it. I used to think that the role of archives was to put things into neat order and stop people looking at them, because that meant getting things off lots of shelves and taking them out of order. Fortunately, over the last few years things have completely changed, and there is so much more emphasis now on making the stuff available. So it seems to me that it may not be a balance of conflicting interests: we've all got an interest in making all this, our historical past, available to whoever wants to look at it. Of course that's in the name of the professional historians, but there's also that larger interest of people generally in understanding their past, and that's why we need an attractive web interface which is easy for people to use and get into. I'd also like to say that in the US, open source is really big. The only competition you have is private vendors doing the work for you, or crowdsourced transcription, and going through a very lengthy process of working out for yourself how to do a critical edition, or an academic edition, however you want to term it: you have to set up a tool or a full process yourself. Transkribus isn't there yet, but it has the capability of getting there, of covering that entire process, like you're saying, from the start of digitisation all the way through to publication.
That is what has been key in getting people interested so far, in the conversations I've had: the fact that it's open source. Institutions in the US are very interested; if they can get grant funding, or apply their own institutional funding to initiatives like that, they are very interested. How good are we at the moment at sharing the resulting data sets that come out of our work as individuals? Do we have a central repository, a data repository, that we're telling people to use? There is one for our data, is that right? So you're using something? There is a central repository, but it's just not a public one, so we don't currently have a mechanism to ask people whether we can share their data; that's on the to-do list. Because that's part of the open source movement as well. I guess we need to show people the way, in some respects, that this is how we do it, using freely available things, whether it's GitHub or Zenodo, with supporting documentation and information that people can take and reuse, so that we don't reinvent any wheels, acting almost like these open source vendors, but in a way which is friendly to our user reality. And I know that there are issues with copyright and with permissions, and it gets complicated, and people can be a bit guarded about what they are working on until they've published it themselves. But on some level we can provide mechanisms, or point to existing mechanisms that we can build on, that will let more people get at this, so that we can actually build up these repositories quite easily if we just get on with it. I'm sorry, I'm just making more work for us. I wonder if anyone from the audience has any questions or comments about this idea, about how we should work with Transkribus in the next five years, or how you see it developing in the next five years?
The impression I've got, and the impression I'm taking away from the conference, is that people are going down the road of starting to use it now, and that it's going to be very successful. The issue I think is emerging is one of scalability, in that the amount of images and text that you're storing is still very, very small, but this will balloon exponentially in a very short period of time, and so will the data processing requirements of the system. So I'm wondering what the plans are to make the platform itself scalable. That might be one for Günter. We'll get you another chair. So, the plans for making it scalable: storage is actually not an issue, so 50 terabytes are reserved and 100 are applied for. More of an issue is training resources: for training models we have access to GPU servers. Currently, and this may be the chance to say it, last year we had, I think, 11 or 14 hours of unexpected downtime for 2016, and today we also had unexpected downtime, as you probably experienced; for several hours nothing was working, so your question is really fair. The hardest bottleneck currently is to handle the ingest of the files. That is more a technical detail and can be resolved by using more servers for it; they are already available, so that's not an issue. On a larger scale, we know that we need to distribute computing power and also storage. So it makes sense to think of a central system, but with distributed storage and distributed computing. I think that's a very, very important thing, and it would also fit very, very well with the whole structure here: there are many universities with excellent resources, and it would make very much sense to reuse, or to use, small parts of these resources within the network. So yes, we are aware of that. At this time of the day I'm not worried, but it's correct: things will grow exponentially, hopefully, and then I think it will work.
Are there any other questions or comments for the panel? I have a question: this conference is all about the users, so what do we think the priorities are for improving the user experience for a typical Transkribus user? I was encouraged by the web interface, and the development of that would make life easier. We had our transcribers working directly in Transkribus, which is a bit of a learning curve for them, but as we weren't doing that much markup or editorial work, once students got the hang of it, it went pretty quickly. I think the other thing is just more of the tools built in. So, like text-to-image, I think it's there; I don't know if I have to request it, but being able to do that whole process on our own, and to be queued up to know when it would be processed. And I think just more documentation. With our project, we have a philosophy of being very much open source, and hopefully we'll get to the point of publishing, but also of making available on our transcription site the work we're doing: more slideshows, videos or just articles saying this is the step we're testing, these are the steps we took to get there. That's basically how I've learned from so many other people, and I feel like that's the way to give back. More behind-the-scenes material, that's what I'm trying to say. Anything else about the priority for improving the user experience? Of course, the web interface. I mean, I think the priority for me, taking a typical story, if there is such a thing, is that you can get access quickly to a document, instead of having to travel halfway around the world to the library that stores it, and then have a simple way of getting a transcript and exporting it into a Word document.
And then being able to do keyword spotting across a large group of manuscripts. That's what we do as historians: we read through loads of stuff and eventually find something we want to look at in more detail. Having that in a simple web interface is absolutely necessary, and will be the make or break of READ, because that's where the audience is. I think, from a lot of people's perspective as well, having something where they can put their documents out for processing, or send them out to people with a particular expertise who might be willing to help. With our Bentham material, there's a certain amount of it in Russian, and we found somebody who was interested in transcribing that material for us. Creating those sorts of opportunities for people, in a way that's easy to understand, easy to grasp, and intuitive, is really, I think, crucial to the success of this. It also links to funding, because if you want more money from the EU, you have to show how popular your system is: the more people are using it, the easier it is to secure funding. Just to emphasise what has been said before about explaining to the users what's going on: one of the things I learned during these two days was that there's more demand for learning about what, as you say, is behind the scenes. We just got into a tough discussion about the meaning of the parameters, their interpretation, for instance. It was a funny discussion, but it showed me that much more is necessary than just having a slider which changes a number and some other figures: what does it mean, and what is the effect of all of it?
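The "slider" under discussion is, in essence, a confidence threshold applied to keyword-spotting hits. A minimal sketch of that idea, with all names and data invented for illustration (this is not the actual Transkribus API):

```python
# Keyword-spotting engines typically return candidate matches with a
# confidence score; the user-facing slider is simply a cut-off on it.
# Hits, field names and scores below are hypothetical examples.

def filter_hits(hits, threshold):
    """Keep only keyword-spotting hits at or above the confidence threshold."""
    return [h for h in hits if h["confidence"] >= threshold]

hits = [
    {"page": 3, "word": "panopticon", "confidence": 0.92},
    {"page": 7, "word": "panopticon", "confidence": 0.58},
    {"page": 9, "word": "panoptique", "confidence": 0.31},
]

# Raising the threshold yields fewer, more reliable matches;
# lowering it surfaces more candidates at the cost of false positives.
print(len(filter_hits(hits, 0.5)))  # 2
print(len(filter_hits(hits, 0.9)))  # 1
```

This is exactly the trade-off the panellists describe: enlarging the parameter means more recall but more noise, shrinking it means higher precision but missed instances.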
We have started thinking about and discussing possible answers to that input: some more help, even from the scientific side, maybe slides, maybe also videos, in the sense of explaining what happens, or what do I have to think about, or what do I have to do with the results. For instance, in the keyword search, is it a good idea to enlarge the parameter or make it smaller, or whatever? Because I guess if we leave the users to help themselves, and it produces a lot of, well, bad user interaction or bad user satisfaction, then the next morning the bad emails to the platform, to Günter for instance, will be there. I guess even for us technology people it seems really necessary to put more of what's behind the scenes into slides, into talks, into videos. That's what I take home. This might be one more thing, sorry, go on. If an alert or email could go out whenever there's a version update, that would help: there have been many times when you're using it and something's not working correctly, and you just go, okay, I've learned to check whether it's an update. Yeah, you never know when we're going to update it. There's a question there. Just one thing, adding in some ways to what Roger was saying, but more generally: it still seems that this is all relatively experimental, with people playing around, and there are quite a lot of people. I've spoken to people here who've tried this or that, and I think, oh, that's interesting, or they can say it works or it doesn't work, or you might want to experiment with this. Whether it's the technical side or the user side, there's a lot of knowledge in this room, and everyone being together for the past few days has been hugely helpful.
But if there were a way to try to make that available within the interface, or, I guess, just make it more available, and maybe create a network of users, or places where people can share their experiences and their ideas. Because a lot of the time you do feel a little bit like you're on your own, playing with something you really don't understand. Coming from a historical archive background, I have no idea what the technology is doing. So if it gives me a bad result, I don't know: is it me? Is it the technology? I think there's a lot of experience and knowledge that perhaps could be shared a little better. Something like a forum would be a good idea. Or simply an email list for Transkribus users, a list where users could raise issues and other people could pitch in and help. I don't know how much traffic there would be, probably quite low, but very helpful, which is what you want, ideally, from an email forum. It's worth trying. Yes, so you'll be getting even more emails from me. Here's something I might raise with Melissa. We've had a lot of success in Europe, and most people in the room are from Europe. Have you got any suggestions of ways we can promote the project outside Europe? Because you've got a lot of international connections. I always think it's best to show people, so I'm thinking about demonstrating to people what they really want to do. I'm very hooked into the Russian digital humanities community; it's a really long story over a beer, but I'm just back from Siberia again, a couple of weeks ago, and they're digitising quite a lot of stuff over there too. So there are issues about the culture of the canon, about the Western canon, how much has been digitised, and how much we have to do to make sure that mass digitisation happens not only in the usual places.
I don't know how well this would work on Cyrillic, and whether any models have already been built on Cyrillic. Yeah, we had the model this morning from the University Library of Belgrade. I can't remember the character error rate, but it's okay. It's okay, yeah. So it's always about finding the right people to make contact with, and it would be worth talking to a few core groups. There's a group called GO::DH, Global Outlook::Digital Humanities, which is about the internationalisation of digital humanities, and they are working specifically with people who are not North American, not European. So there is interest just now. There's a big digitisation of Arabic manuscripts happening in Sudan. There's quite a lot of work happening in Namibia, quite a lot of work happening across Russia, and a lot of interest in South America, and that's the type of user group, of scholars who are using digital tools, that could be ambassadors; it's probably about finding a way to reach them. It might be worth noting that Digital Humanities, the big conference on the use of computing in humanistic research, is going to be in Mexico next year: DH2018 in Mexico City. It might be worth doing another workshop there, because RedHD, the association for Latin American digital humanities, will be there, and there will be a big representation of people from across South America. I think it's about finding the 'in's, and I'm sure people are trying to work at a national level in Mexico; the people partnering with the DH conference next year are the kind of people we should be talking to about these products.
I'm just riffing now as I think of things, but there are ways in which we could get attached to various international groups of people, and I think the digital humanities community is probably a good place to start, because there are so many emerging groups of scholars that have already got networks established. Summer in Mexico sounds nice anyway. Roger, my understanding is that if we start expanding into new areas for doing HTR, languages that don't have a Western alphabet are more difficult, so things like Arabic, Hebrew or Cyrillic text are harder to recognise. Is that true, and can you speak a bit about working with different alphabets? Yes, it's certainly... let's say it's partially true, because from my core technological point of view, or even a mathematical point of view, it's all neutral, right? We just have codes representing characters, and we all know about the notion of Unicode, so what we restrict ourselves to is representing the characters by their Unicode code points, and what the character really is, we don't care. Every Unicode letter, for instance, also carries a reading direction, whether it runs right-to-left or left-to-right, but we don't care; this is really not important. And if you have signs composed of various accents, let's say, or other parts of a letter, then we map these to a single character code and work with that. But of course there is also a very practical component, and I believe that if you have an alphabet composed of very different letters, for instance, or very difficult letters, I can imagine that the recognition procedure itself has trouble mapping the image to the proper code. If the writing style distinguishes characters by very tiny sub-elements, it's probably harder. So, in a sense, the mathematical background does not care about the meaning of the characters, but we are aware that there are more difficulties than just
the mathematics alone. So, as usual, the answer is neither yes nor no, but somewhere in between; it's again a sliding scale. Thanks. Are there any other questions or comments? I want to ask you all what your favourite tool was from the tool pitch, and why. Roger, your answer is obvious. I mean, my group is of course represented with the keyword search and with the text-to-image, and of course I'm all for the text-to-image. If I'm honest, looking down the list of tools, I would have to say the text-to-image, and I'm also sort of interested in the e-learning app. Basically, for our project, a lot of the academics have been saying you want to teach students to do transcription and to read documents, so coming up with something like that might be really useful on that side. But I like the text-to-image and layout. For me it's keyword spotting, because I think that would have the most transformational effect on the historical research that I'm involved in, where you're just trying to find instances of something within a whole body of documentation. We can use this to search for the needle in the haystack; that's what this is all about at the end of the day. If we have mass digitisation, especially mass digitisation of manuscript material, and we get this right, we can find the needle in the haystack quickly. That's going to cut down the time you have to spend on little queries, looking at original manuscripts, so for me that is the one that would have the biggest impact on the historical research I'm interested in. I'll just say a word for the ScanTent, I guess, because I really like it. We had a student visiting the Bentham Project from the Czech Republic, without easy access to certain books, and there were a couple of books she particularly wanted to read, and it took her about an hour to do about 600 pages: putting the book in, then taking it out and turning the pages, on and on and on. It's really
really great fun, actually. And I think Louise and others are thinking about holding a ScanTent competition, seeing how many pages of archived material you can put through in a day. Yes, we've got an idea: because it's International Archives Day in June every year, we have the idea that we're going to have a scanathon using a ScanTent in three countries and see who can scan the most material in a day. After all, we can combine what we're doing with fun, and Bentham would be very pleased, because that's just what the greatest happiness is all about. We'll bring the panel to a close shortly. Has anyone got any last questions or comments they want to make before we go? Okay, well, I would like to ask you to join me in thanking the panellists for speaking. Thank you so much. Stay where you are, because we're going to hear from Günter, who's going to make an announcement and close the conference. Yes, thank you. Thank you also for this great discussion, and, once again, thank you for coming, and also an invitation to the second Transkribus user day in a year or so; we will organise it for sure, and I hope to see you again. I don't know if it will be here; probably somewhere else, maybe in the mountains, let's see. I think it was really a great confirmation that we are going in a direction that is appreciated by you, and this gives us, of course, big motivation to go on. We know, of course, that a lot of requirements are out there, but on the other hand we have the feeling that many requirements are similar, and we are getting a good picture of what is really necessary and what would help you in your work. I like very much the open atmosphere, also in doing all these test projects with you. Don't be too impatient with us; sometimes it takes a while for us to answer the emails, but I hope you understand this, and I'm looking forward to seeing you again. Thanks for coming.