to introduce to you, and although she probably doesn't need an introduction, I think all of you have read something written by her, or have been in contact with her in some way without knowing it. She is really the mind behind the Transkribus User Conference, and the mind behind all the dissemination activities for the research in this whole Bentham world that has spread from University College London to the rest of us. She's going to talk about keyword spotting in practice now, and I'm thrilled to have her as the last speaker of this conference. Thanks.

Hi everyone, can everyone hear me okay? Okay, so yeah, thanks for the introduction, and thanks everyone for staying to the end of the conference. My presentation isn't too long, so you'll hopefully get away on time. As I said, one of my jobs is to do a lot of dissemination for the READ project, so I organise this conference and I do a lot of events and writing about READ, but in my other hat I work on the Bentham Project, and I'm a historian at University College London. So that's what I'm going to be talking about today, along with some of our experiments with handwriting recognition.

Just a summary of my presentation: I'll first of all give an introduction to the Bentham Project, then talk about our crowdsourcing project, Transcribe Bentham. I'll talk about how we've been working with HTR over the past five years, and finally show our new toy, which is our keyword spotting tool, and then talk some more about the future as well: what we hope to do with HTR and how we hope to improve.

So, Bentham has been mentioned a few times already at this conference, but some of you might not know who he is. Bentham was an English philosopher and reformer who lived from 1748 to 1832, and he's most well known for coming up with the philosophy of utilitarianism: the idea of maximising pleasure and minimising pain for the greatest number of people in society. He also invented the panopticon prison design, the idea that surveillance in a prison can encourage prisoners to be on their best behaviour. And he's famous as well for a slightly strange reason: he requested that his body be preserved and publicly displayed after he died. So this is Bentham's body, which is called his auto-icon, and you can see it in a corridor at University College London if you ever visit.

The mission of the Bentham Project is to make the definitive scholarly edition of Bentham's published writings, and of his unpublished manuscripts as well. University College London set up what they called the Bentham Committee back in 1959, and in 1961 the first general editor of the edition was appointed, after which the researchers who worked on the Bentham edition came to be known as the Bentham Project. We have the current general editor of the Bentham edition here at the front, Professor Philip Schofield.

Bentham left behind quite a big collection of manuscripts, which are held at University College London and at the British Library: about 75,000 folios in total, 60,000 at UCL and about 15,000 at the British Library. So far, in nearly 60 years, the Bentham Project has edited and published 33 volumes of Bentham's writings, but we expect the complete edition to run to at least 80 volumes, so unfortunately we're not even halfway there yet. This is why we need new technologies and HTR to really speed things up.
So we started to think about this back in 2010, when we received a one-year grant to prepare and launch a crowdsourcing project called Transcribe Bentham. This is an online initiative: members of the public come to our website and we ask them to transcribe pages from digital images of Bentham's manuscripts. What started as an experiment has developed into a really well-established initiative: since 2010, volunteers have transcribed about 20,000 pages of Bentham's writings at quite a high level of accuracy, and for that we owe them a huge amount of thanks.

Crowdsourcing transcription in this way, working with volunteers, has proved to have three main advantages for the Bentham Project, relating to preservation, scholarship and public access. First of all, it means we've digitised all of Bentham's papers; they now amount to around 95,000 images from both University College London and the British Library, and obviously the transcripts that our volunteers produce help to make these images more usable and easier for people to read. It's also important for our scholarship: we use the transcripts produced by volunteers to give us a head start in our editorial work, which means we don't have to transcribe everything from scratch, and the transcripts are also useful for other researchers and students interested in Bentham. And the crowdsourcing side is a really good example of public engagement with history: it allows us to spread the word about Bentham beyond academics, and the public can become involved in a research project like ours, so it's very rewarding for both sides. Volunteers' names appear in the acknowledgements of the volumes of the edition to which they've contributed transcripts, and we've already started to use some of their work in publications such as the Bentham cookbook. This is a real collection of recipes that Bentham wrote for prisoners in his panopticon prison; as you can imagine, they're not the most appetising of recipes.

This is a screenshot from our crowdsourcing website, which is called the Transcription Desk. It's built on a MediaWiki framework, so it works in a similar way to Wikipedia. Transcribers come, they register for an account, they can explore the manuscripts on the site, choose a page to transcribe and start transcribing. What we ask them to produce is a diplomatic transcript, an exact replica of everything that's on a particular page. This is a challenging task for two main reasons. The first is Bentham's handwriting. This is a particularly bad example, so I wouldn't give this one to a new volunteer, but Bentham's writing was never neat and it got increasingly worse as he got older, so volunteers always have to make sense of, as on this page, the frequent additions, changes and deletions that are going on. And secondly, we ask volunteers to use TEI, the Text Encoding Initiative, to mark up their transcripts, so on our Transcription Desk we have a toolbar where they can click to add TEI tags marking the features of the manuscripts, like marginalia, paragraphs and additions.
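To make the markup side concrete, here is a minimal sketch of what a diplomatically transcribed line might look like. The tag names follow common TEI practice, but the exact schema and the example text are illustrative assumptions, not Transcribe Bentham's actual guidelines; the snippet uses Python's standard library to show how such markup can be processed.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical TEI-encoded line: the volunteer records an
# insertion (<add>) and a crossed-out word (<del>) exactly as they
# appear on the manuscript page.
tei_line = (
    '<line>Punishment <del>is</del> <add place="above">ought to be</add> '
    'proportioned to the offence</line>'
)

root = ET.fromstring(tei_line)

def final_reading(elem):
    """Reconstruct the 'final' reading: keep additions, drop deletions."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "del":              # skip deleted text
            parts.append(final_reading(child))
        parts.append(child.tail or "")      # text following the element
    return "".join(parts)

print(final_reading(root))
# -> "Punishment  ought to be proportioned to the offence"
# (note the double space left where the deletion was dropped)
```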
Once a volunteer has transcribed a particular page, they send us a message saying that the page is complete; we then check each page that's submitted to us, and the page eventually gets saved and uploaded to the UCL library repository, where anyone can view it. We also spend a lot of time supporting our volunteers in other ways, as Mark was talking about yesterday: setting challenges for them, running events, writing user documentation and answering all their questions.

So, to talk a bit about our users before we get to HTR. In common with many crowdsourcing projects, Transcribe Bentham is not really dependent on a crowd; it's really a small group of active users who do really good work. Our website gets about 300 to 350 unique views every month, but only a handful of those people actually register for an account and transcribe something. Since Transcribe Bentham began eight years ago there have been about 660 users who have transcribed something at least once, but it's about four percent of those 660 who are really our core users, who we call our super transcribers. They're the ones who have done most of the work: about 31 people have transcribed 19,000 pages, which is about 95 percent of all pages transcribed on the Transcription Desk, so it's a phenomenal amount of work for a really small group of people. Levels of participation amongst the super transcribers vary: some of them come and go every few months or even every few years, and there tend to be between three and five people participating every week; 15 of our super transcribers have contributed something in the past year.

What's important about these statistics is that they show we're dependent on quite a small number of people who have made a huge achievement, and we need to support them. Our existing super transcribers love transcribing and they do it very well, so we need to make sure that we keep them happy, but we also want to encourage new people to take part, people who might be scared off by Bentham's handwriting. We know from the surveys that we do that the difficulty of reading Bentham's handwriting is a real barrier to participation, and we hope we can simplify the experience of reading Bentham's handwriting with HTR.

We started working with HTR in Transkribus in 2013, as one of the partners in the tranScriptorium project, which was the forerunner project to READ. At that point I think it's fair to say that we were quite uncertain about the capabilities of the technology, and the technology has obviously advanced quite a lot in the past five years. What was decided was that we would focus on training a model to recognise some of the easier writing in the Bentham collection; there are a lot of pages in the collection that were actually written by Bentham's secretaries, whose handwriting is very neat. On this task we collaborated with Alejandro and his colleagues at the Pattern Recognition and Human Language Technology research centre in Valencia, who were part of tranScriptorium and are now part of READ as well. They helped us create nearly 900 pages of training data, images and transcripts, from the Bentham collection, and they processed them. Under tranScriptorium they were working with hidden Markov models rather than neural networks, and the best result that we managed was an HTR model with a character error rate of about 18 percent.
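For anyone unfamiliar with the metric: character error rate (CER) is the minimum number of character-level edits (insertions, deletions and substitutions) needed to turn the reference transcript into the automatic one, divided by the length of the reference. A minimal sketch, with toy example text rather than anything from the collection:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# Toy example: 2 errors in a 21-character reference -> ~9.5% CER.
print(f"{cer('the panopticon prison', 'the panoptican prisen'):.3f}")
```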
This was a promising start, but we hoped that better results would be possible. When we came to the READ project we moved on to working with the CITlab team at the University of Rostock and the neural network technology that was integrated into Transkribus. We used the same set of training data, the ground truth relating to pages written by Bentham's secretaries, so quite neat handwriting written in English, and this time the result was a model with a character error rate of just three percent on the test set. We also created a matching dictionary based on the Bowring edition of Bentham's collected works, and both the dictionary and this model are publicly available to all Transkribus users under the title English Writing M1. So if you see that in Transkribus, that's the Bentham model trained on the writing of his secretaries.

A character error rate of just three percent obviously seems really impressive, but the complexities of the Bentham collection mean that we cannot get that kind of accuracy on every page. The model was trained on what we consider to be some of the easiest writing in the collection, and it therefore struggles to process and transcribe more difficult writing: Bentham's own handwriting, writing by other secretaries whose hands are more difficult, and Bentham's correspondence as well, since the collection contains a lot of incoming correspondence from all the people Bentham was writing to, which also needs to be recognised. If we take just a random page from the collection and try to recognise it with this model, we get a character error rate of probably between five and 20 percent. You can see an example here of an automatic transcript; it is Bentham's handwriting, but this is Bentham on a good day, so it's relatively neat for Bentham, and on this page the character error rate is about 8.9 percent.

A real strength of the English Writing M1 model is that it can generate good transcripts of other collections of similar handwriting, especially from the same period, the 18th and 19th centuries. At last year's Transkribus user conference, Deborah Cornell from the Georgian Papers Programme talked about how they were using this model to transcribe papers written by the English king George III, and many Transkribus users have also used the Bentham model as a base model when they're training new models. This basically means that you take advantage of what the system has already learned from learning to read the writing of Bentham's secretaries, and it gives you a head start on your own recognition (there's a small sketch of this idea below).

Our next challenge was to try to improve the recognition of Bentham's most difficult handwriting, and here the advances in layout analysis, again coming from the CITlab team at Rostock, were key for us: they meant we could now automatically detect the lines in all of the Bentham images to a really high level of accuracy, and so create ground truth more quickly in Transkribus. We created a few hundred pages of training data in Transkribus manually, by uploading images, segmenting them automatically and then transcribing each page; we basically copied and pasted in lines from our existing transcripts. Our first milestone was a model trained on 57,000 transcribed words, which had a character error rate of 26 percent on the test set, so you can see it's really challenging handwriting.
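The "base model" idea mentioned above is ordinary transfer learning: instead of starting from random weights, training continues from an existing model's weights. The following is an assumption-laden illustration in PyTorch, not the actual Transkribus/CITlab training code; the network, the file name and the data are all invented stand-ins for how such a fine-tuning step is typically shaped.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an HTR line recogniser; real networks are far
# larger, but the fine-tuning recipe is the same.
class TinyHTR(nn.Module):
    def __init__(self, n_chars=80, feat=64, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat, hidden, num_layers=2, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                 # x: (time, batch, features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)

model = TinyHTR()

# Transfer learning: start from the weights of an existing base model
# rather than random initialisation (file name illustrative only):
#   model.load_state_dict(torch.load("english_writing_m1.pt"))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR: fine-tune
ctc = nn.CTCLoss()  # HTR models are commonly trained with a CTC objective

# One dummy fine-tuning step on invented data, just to show the shape of it.
x = torch.randn(100, 4, 64)                # 100 time frames, batch of 4
targets = torch.randint(1, 81, (4, 20))    # 20-character target per line
loss = ctc(model(x), targets,
           torch.full((4,), 100, dtype=torch.long),   # input lengths
           torch.full((4,), 20, dtype=torch.long))    # target lengths
loss.backward()
optimizer.step()
print(float(loss))
```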
We then decided to experiment with text-to-image matching, which is again something coming from CITlab at the University of Rostock. This is where an algorithm automatically goes through and tries to match images to existing transcripts that you have, to automatically create training data for text recognition. We have lots of transcripts, so we thought we'd try it, but the results contained a lot of errors, so we decided that it was quicker to just transcribe manually ourselves, and we carried on doing that. The next model we had was trained on about 81,000 words, which is around 350 pages from the collection; it's trained on some of Bentham's worst handwriting, and the character error rate on the test set is again about 17 percent. This is an example here; as I said, this is some of Bentham's worst handwriting, with the automated transcript below, and you can see in the transcript that it's making up words that aren't really there and struggling to read things that are crossed out, all these sorts of complications. On this page the character error rate is nearly 34 percent, so really quite high.

This error rate is obviously still too high for us to get a useful automated transcription of Bentham's most difficult handwriting that could be used for scholarly editing. As we were hearing this morning, you really need quite an accurate transcript for scholarly editing, and with something that has a character error rate around 35 percent it's probably going to be way too time consuming, and annoying to be honest, to correct all the errors; since we have experts working on Bentham, it might be quicker and less frustrating just to transcribe from scratch. But having worked with Transkribus for many years now, we know that the technology is improving all the time, so we're hoping that with more pages of training data, and with HTR+ as well, our results will improve; the future is always bright.

So although we cannot yet achieve a complete transcription of Bentham's handwriting, what we have been able to do is start to experiment with keyword spotting, and we're really excited about this. As we've seen throughout the conference, keyword spotting can work well even with HTR models that have quite high error rates; even up to a 30 percent character error rate it can still find the words it's looking for. This is because it uses the statistical models trained for text recognition to search through the probability values assigned to characters. Basically, this means that the engine returns what it thinks is the best guess for the word you're searching for, but also a long list of other possible guesses that might match your word. So with keyword spotting you can find words in a collection even if those words have been mistranscribed in the HTR's first best guess.
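A toy illustration of that last point, with invented data: think of each page as carrying several scored word hypotheses, a crude stand-in for the probabilistic word index that is built from the character probabilities of the HTR model. A query can then match a hypothesis that never made it into the one-best transcript.

```python
# Invented example data: each page holds several scored word hypotheses,
# not just the single best reading.
word_index = {
    "box_116/page_004": [("democracy", 0.91), ("democrat", 0.12)],
    "box_116/page_017": [("demonology", 0.55), ("democracy", 0.48)],
    "box_149/page_002": [("decency", 0.77)],
}

def spot(query, index, min_confidence=0.4):
    """Return (page, score) pairs where the query word is hypothesised
    with at least the requested confidence - even on pages where the
    one-best transcript chose a different word."""
    hits = [(page, score)
            for page, hyps in index.items()
            for word, score in hyps
            if word == query and score >= min_confidence]
    return sorted(hits, key=lambda h: -h[1])

print(spot("democracy", word_index))
# -> [('box_116/page_004', 0.91), ('box_116/page_017', 0.48)]
# Note the second hit: the one-best reading there was "demonology", but
# the index still records "democracy" as a plausible alternative.
```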
For this task we again partnered with Alejandro and his colleagues at the Pattern Recognition and Human Language Technology centre in Valencia. What we did first of all was give the Valencia team all of our images, about 95,000 pages from both UCL and the British Library, and we also gave them access to the training data that we had created in Transkribus, both the easy pages and the more difficult pages. The breadth of the Bentham collection meant that the images all had to be standardised in various ways before they could be processed for keyword spotting, and the Valencia team did a huge amount of work on this: distinguishing between handwritten, printed and blank pages; getting rid of duplicate images; sorting things into different hands and different languages, since there are several languages in the collection; making all the file names consistent; and changing the resolution of the images, because the resolution of the UCL images was different from that of the British Library images. So there was a lot of pre-processing necessary before all the images could be searched.

Once the images were cleaned, they also had to undergo layout analysis, which Alejandro mentioned in his presentation. This was done in Transkribus in batch mode using the Rostock segmentation tool. Once we had segmented the 95,000 images we did a check on a random sample and agreed, thankfully, that the segmentation was accurate enough, so we didn't have to correct everything; and by segmenting the entire collection, it means you can search every page.

The Valencia team then started work on the ground truth that we had created in Transkribus and, as Alejandro explained, they processed it with their neural network HTR and probabilistic word indexing toolkit to train new models for us. They tested the resulting models on different test sets from the collection, and this is just an example of two different test sets that they used for the UCL collection. The first is the character error rate for testing the recognition on an easy test set, a test set of pages written by the secretaries, where the character error rate is around six percent, so quite low; and then there's a hard test set, pages written by Bentham himself, where the character error rate is higher, about 15 percent. These results are comparable to the results we were getting for recognition with the models from Rostock, for both the easy writing and the difficult writing, and they're also slightly better than the results on the other collections that Valencia are working with, which Alejandro mentioned: the medieval chancery records and the records of the Spanish playwright Lope de Vega.

Based on these results, the Valencia team have created a web interface for us for keyword spotting across the entirety of the Bentham papers. You can go to this website and search through 90,000 images; this figure was reached after blank and printed pages were removed from the search results. The accuracy of the spotted words depends on what you're searching for and on the difficulty of the handwriting, but it is possible to find what you're looking for even in some of Bentham's scrawl, and there could be as many as 25 million words waiting to be found. And obviously you can search pages that have never been transcribed, which is really exciting for us.

So I'll do a demonstration for you now, hopefully. This is the website, and you can find it if you just go to the Transcribe Bentham site; there's a link to it from there. This is the home page: you basically type what you want to search for in the box up here, so I'm going to search for the word 'democracy'. And here's what they call the confidence slider, where you can choose how confident you want the system to be about the results it's giving you. If you set the confidence at a low number, you're basically saying: give me as many results as possible, even ones you're not sure about. If you set the confidence at a high level, you're saying that you only want the engine to return results that it is quite sure are the right word. Then you just click search and away it goes; it searches really quickly, as you can see, and it tells you how many boxes it has found relevant results in.
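In other words, the slider is just a threshold on the hit confidence scores, trading recall against precision. A standalone sketch with invented scores:

```python
# The confidence slider is a threshold on the hit scores: lower it for
# recall (more hits, more noise), raise it for precision. Invented data.
hits = [("p1", 0.95), ("p2", 0.81), ("p3", 0.52), ("p4", 0.31), ("p5", 0.09)]

for threshold in (0.2, 0.5, 0.8):
    kept = [page for page, score in hits if score >= threshold]
    print(f"confidence >= {threshold:.0%}: {len(kept)} hits -> {kept}")
# confidence >= 20%: 4 hits -> ['p1', 'p2', 'p3', 'p4']
# confidence >= 50%: 3 hits -> ['p1', 'p2', 'p3']
# confidence >= 80%: 2 hits -> ['p1', 'p2']
```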
It basically categorises the results by box, which is how the Bentham papers are organised in the archive. So let me open up a few boxes and look at some results. Here you get the individual page, and you can see who actually wrote it; this is a page by Bentham. You can see the level of confidence with which the machine thinks it has found the word, and then, yeah, you can see here it's found the word 'democracy'. Another example: this is a bit of a nasty page, and here it's found the word in some marginalia. And this one is a page that the catalogue says is Bentham, but I think it's actually a secretarial hand, and it has again found the word. So it's really cool, it's really quick, and a lot of people have been experimenting with it so far.

We have some analytics on the site, so I've been able to watch what people do there, which has been very satisfying. We launched the site on the 15th of October, just a few weeks ago, and since that time we've had 140 unique users from 25 countries making searches on the site. Some of these searches have been more surprising than others: we had a user from Italy searching for the word 'Naples', and we had users from America and Austria searching for 'Jesus'. These are all real.

When I publicised the site, I really wanted the Transcribe Bentham volunteers to have their say in what you can find as well, so I set up a Google spreadsheet where people could record the searches they were doing and the accuracy of the results they were finding. Here are a few examples. One person searched for the word 'legislature'; Bentham was a legal philosopher, so this is obviously one of his favourite words, and if you search at 80 percent confidence you can find more than a thousand instances of it. The second example was a search for the word 'muzzy'. Bentham liked to make up words, and this is one of the words he coined, meaning uncertain or confused, but unfortunately we weren't able to find it with keyword spotting at 80 percent confidence. This mainly reflects the training data: Bentham obviously used 'legislature' much more than he used 'muzzy'. The last example comes from a Transcribe Bentham volunteer, Jill Haig, who got in touch with me and said that she was interested in finding out whether Bentham had talked about the Peterloo massacre of 1819. This is a famous event in the UK, when cavalry charged into a crowd of people protesting and demanding democratic reform; there's a film out about it in the UK as well, so it's a topic of current discussion. She searched for the term 'Peterloo' in the papers but didn't find any reference to it; she then tried various combinations of the words 'Manchester' and 'massacre' and was able to find 11 instances where Bentham was talking about this event. That shows how you can be inventive with your searches to find what you're looking for, and thanks to Jill for that.
So, use cases for keyword spotting: this is already a fantastic resource that will be helpful to anyone interested in Bentham's philosophy. It can help us at the Bentham Project to find things that we're interested in, and many things that we haven't read before; it will allow people to quickly investigate Bentham's concepts and his correspondence as well; and we also hope that it will be useful for our volunteers in Transcribe Bentham, who can find subjects that they're interested in transcribing.

We've already done a huge amount of work so far, but we're not going to stop, so here are a few suggestions for the future directions we want to go in. Firstly, we want to connect the Valencia keyword spotting tool to Transkribus, which will allow us to use this technology for searching in the Transkribus expert client, just like with the Rostock technology; and we can also connect the Valencia keyword spotting tool to the other digital Bentham resources that we have, like the online catalogue and the Transcribe Bentham website. Secondly, we want to improve the accuracy of our existing models: as soon as I get home I'm going to be testing out HTR+, and we also want to build up more specific training sets. At the moment more than half of our existing training data is easy handwriting from the collection, so we need more of Bentham's difficult handwriting, and we also need to think about languages; there are around 2,000 pages in the collection written in French. If we have more focused sets and more focused models, we hope to get better results. And lastly, we want to integrate HTR directly into the Transcribe Bentham crowdsourcing platform. The idea is that users will be able to check and correct automated transcripts; if they don't want to do that, they can just continue transcribing as usual, and if they get stuck on a particular word they can ask the HTR to suggest what it thinks that word might be. We believe this is good for our existing users, who are expert transcribers, but it could also be good for new users who are daunted by Bentham's handwriting and need some help to get started. We hope that this new version of Transcribe Bentham with HTR will promote greater user engagement, increase the transcription rate, and therefore bring us closer to finishing the publication of Bentham's collected works.

After the end of the READ project, the future of Transcribe Bentham and our experiments with HTR will all be dependent on us securing funding, but we're hopeful that we can join the READ-COOP and continue taking part and cooperating with all our colleagues. As Bentham once said, many hands make light work. And I can't finish without thanking all the people who made everything I've talked about possible: my thanks go to my colleagues at the Bentham Project, all the volunteers at Transcribe Bentham who do so much hard work, the teams in Valencia, Rostock and Innsbruck, and our other READ colleagues as well. So I'll stop there and say thank you very much.

Thanks a lot.
A question, unfortunately regarding the naming of the model English Writing M1, and speaking as an outsider: when I first saw that name, I thought of very early English, so Old English, and the 'M' might be Mercian. There obviously seems to be the intention to have many models publicly available, so the question goes more broadly to the Transkribus team as a whole: are there any plans for the publicly available models to have a more logical naming convention?

The name English Writing M1, yeah, it was chosen before my time, but when you train your own model with Transkribus you can choose the name, so at the moment all users are free to choose the name they wish. But Günter, do you want to say any more about standardising the way we name models?

I was the one who chose the name, back when we first came up with the public version, so sorry for that. I'm not sure we will have a naming convention, because of my experience with naming conventions, but what we need is really that people can easily make their models available, as I said yesterday. There was a suggestion from Nico that the default mode should be that a model is public, and that you should have to decide to make it not public. I'm not sure if this is really the way, but definitely it should be easy to publish models, and as I said it is also very important to have some metadata available, and of course to have a look at the sample pages, although sometimes this won't be necessary. Are there other questions?

Yes, just one question: how do you communicate the limitations to the users, what keyword spotting actually does, and that there may be things that are not found which are nevertheless in the manuscripts?

For the users of Transcribe Bentham, through blog posts and the newsletter: all the active volunteers are signed up to a newsletter where I basically tell them every month what's going on in the project, and they also contact me about what they've been doing in their searches; for example, that's where I got the examples that were in the paper. Is that what you mean?

No, I meant the random users, like those visiting your web page.

Sure. This website was made by the people in Valencia, and it's really just a prototype tool to show off the technology, so there are user guidelines on there, and they give a brief explanation of how the technology works and make clear that it's not going to solve all your problems and find every single word. And on our own website, as I said, I'm keeping people updated on our experiments with keyword spotting and HTR, how the accuracy is improving, and what's still to do.

I think we're heading towards the end. Thanks a lot, Louise; over the last two days you were key in organising this event. But in the end, I think it's up to you, Günter, to take the mic now.

Yes, so my list of thank-yous is also very long. Thanks to all who organised this, with Eva and of course the teams: Maria, William, Markus, Fabian and so on. It was really great to be here again, it was great to see you, and I think we learned a lot again. For us it was really amazing to see what people are doing with the technology, what they are doing now with the API, which was really new this year, and of course with the graphical user interface, with all the interesting projects that are coming up. I can say that every time I see a new document, a new challenge, a new collection and a new way to use it, I find the project fascinating; it's really fascinating to see the treasures of the archives, how researchers are dealing with them and which questions are important for them.

Concerning the Bentham project, there is also a very big thank you to the Bentham Project, and to Philip and Louise, because they were also part of it from the first, and one objective was to move the Transcribe Bentham transcription project in the direction of our technology. It turned out that if you touch a running project it's really a major challenge, and the work is still done in MediaWiki, but my feeling is that nowadays the switch is really close; maybe in 2019 we will see this coming into being. And for the rest, we got so much input from them and from their models. So thanks again, and I hope to see you, or at least some of you, next year; maybe here, maybe in another place, we will see, but definitely we will have another user conference, and I'm already looking forward to it. Thank you.