This session is entitled "Transkribus in Practice: Reports from Users". We are very grateful to our Transkribus users for coming and sharing their work. Each person will speak for about ten minutes, and then we'll have up to five minutes of questions before we have to move on. It is a great pleasure to begin with Maria Vainio-Kurtakko from the Society of Swedish Literature in Finland; her paper has been written in cooperation with Elisabeth Stubb. Maria is going to do a presentation, so thank you, Maria.

Hello everybody, can you hear me? Good. It's very nice to be here; thank you for the invitation. Since there are only ten minutes, I will turn straight to the presentation. This is going to be an account of our experience with Transkribus, but it is also going to be a kind of wish list for the future, though I think the Transkribus team has perhaps already been thinking about some of the things we have been thinking about.

Now, a short introduction. Elisabeth and I come from a project called the Letters of Albert Edelfelt. If you had lived in the 1880s you would have known him: he was a very famous Finnish painter of the plein air school, from the era when it was said that the light comes from the north. Edelfelt, like many others, travelled to Paris to study, and he ended up becoming almost French. From Paris he wrote to his mother, about thirty years of correspondence. He knew everybody and he wrote about everyone; he didn't write so much about his paintings, he wrote about people and the things that were happening. So this correspondence is interesting: it is a cosmopolitan young man's view of European politics and culture in the late nineteenth century. We publish the letters online with free access, and in collaboration with the Finnish National Gallery we combine the letters with Edelfelt's works of art, from the first sketch to the oil canvas. So we have 7,000 pages of letters and 7,000 images of artworks. Before the time
of Transkribus we made summaries of events, as we call them, but now we are adding full text, so you can imagine our joy when Transkribus emerged on the horizon. The aim for our organization, the Society of Swedish Literature in Finland, is of course that we wrap up the Edelfelt project with full text, but we have a larger scope: we ask ourselves how Transkribus could be integrated on a larger scale into the working processes of our archives and our publishing units, to benefit our customers. So we really use Edelfelt, his letters and his paintings, as a guinea pig, and a good enough guinea pig, we think.

First there is the question of handwriting. Our archive consists of private material, mostly private letters, so it is quite different from the material which our National Archives is now going to test with Transkribus. The handwriting varies very, very much, and the particular challenges in using Transkribus for our kind of material, in comparison to official documents, are the large variety of handwriting, topics and vocabulary. In the case of Edelfelt, his correspondence with his mother lasted over thirty years, as I said. The young teenager you can see out there on the left, that is a teenager's letter, and then I picked some letters from further forward in time, so you can see that his handwriting changes. We thought that for this reason too he is a good guinea pig.

Then, segmentation and lines. Another common feature of his letters, and of private material in general, is that he and his contemporaries wrote in the margins, as you can see here, and sometimes even all across the page, as an additional layer on what was previously written. Our experience is that the automatic segmentation and line detection does not discover the vertical lines. Obviously the Transkribus program has tools to manually correct the missing lines, and it also has tools to correct the reading order if it gets mixed up between the lines in the text. But if you are counting on automatic
segmentation and readability, these kinds of manuscripts are problematic: it takes time, and you need people to check and correct them, at least as I see it.

Then, when we get our material, there seems, at least for the moment, to be a last level of difficulty that the machine transcription has to conquer. Based on the handwriting, the program can transcribe it into plausible words; as a reader you can see and understand how the program has "thought". The transcription is good, but it is not correct. Here, as it is now, you need the human eye to get it right: one needs an understanding of the language and the context. Transkribus has collection-specific dictionaries, but as it is now they are a matter of help on a word-to-word basis. What we still need is a sense of language that is not limited to seeing the words, but takes in whole sentences.

And an additional issue, not to make it any easier: as is often the case in private material, people write in different languages. Edelfelt and his contemporaries were educated Finns, so they used different languages. They wrote in Swedish, which was their native tongue, and then they added in some Finnish words; some wrote much French, much German, Danish. So here is also a challenge for the program.

Now a little statistics. In the beginning the results of the models we trained were not very satisfactory. They could still be better, but gradually the models have improved. There are three reasons for the improvement. First, as we proceed with the transcriptions, we add more material to the models. Secondly, we got the tip to increase the number of training epochs: we changed from the standard 20 epochs to 175 epochs and got a considerable improvement on the training set. And thirdly, during the summer there was an update to the Transkribus program, and we could detect better results in the models trained after the update. So we are looking forward very much to getting the new HTR+ update;
that is what we are really waiting for. Our three last models have achieved the following results: the character error rate on our training set has sunk under 5 per cent, and that's very good, but on our test set it continues to stay over 10 per cent. In general, we use every 25th page as a test page.

Now for the future, and our wishes. We look at this on a larger scale than our own Edelfelt project. What we would like to see are models for the nineteenth century, and for every other century of course, so that everyone working with nineteenth-century material, or material from another time period, could contribute to a model. Then, when you start a project, you could take a model from the shelf, use it for training, and ultimately feed the new knowledge back into the larger model. These models would need to be language- and culture-specific. By this we mean that we cannot necessarily use one model for the whole of Europe and the whole of the nineteenth century. In Scandinavia, at least, early nineteenth-century handwriting is very much like that of the eighteenth century and does not at all fit into the same model as Edelfelt and his contemporaries; it would have to be split up into an early model, a mid model and a late model. The same thing of course applies to every other time period and every other region, in their own ways, but this would be very, very helpful and would contribute on a larger scale.

And then a sense of language, so that the program would understand sentences and context and could base its recognition on that too. That would make things very much easier, and Elisabeth and I and everyone else wouldn't have to go in and correct things afterwards, as we now do in our publication projects. It all of course depends on what you are going to do with the text, but as we are in a publication project with very, very high standards, we really have to manually correct everything that is wrong. So, this was our wish list and our experiences;
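The bookkeeping Maria describes, a character error rate tracked separately on the training pages and on a held-out set of every 25th page, can be sketched in a few lines of Python. This is an illustrative reimplementation, not Transkribus code; the function names and the page-split helper are invented for the example.

```python
def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edit distance between the HTR output and the
    # ground-truth transcription, divided by the ground-truth length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def split_pages(pages, step=25):
    # Hold out every `step`-th page as a test page, as in the talk;
    # the remaining pages form the training set.
    test = pages[step - 1::step]
    train = [p for i, p in enumerate(pages) if (i + 1) % step != 0]
    return train, test
```

A CER of under 0.05 on the training pages but over 0.10 on the held-out pages, as reported in the talk, is the usual sign that the model fits the hands it has seen better than the ones it has not.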
if you have any questions, I will be happy to answer them.

We have a couple of minutes for questions. Any questions? No? I think that must have been perfectly clear, and it's good to know humans still have a role; perhaps Transkribus is not completely superseding all of us. So thank you very much indeed, Maria, and I'm sure people will have comments for you. Next I'd like to invite Marie-Laure Massot, who is working on the archive of Michel Foucault; Marie-Laure is from the École normale supérieure.

First of all I would like to thank you for inviting me; indeed it is a great opportunity to present the work carried out by the team of the Foucault Reading Notes project. My name is Sandra Besson; I am an academic engineer at the French National Centre for Scientific Research, the famous CNRS, and I work at the École normale supérieure in Paris. My job is to assist and support researchers in their scientific activities in the fields of philosophy and historical studies: I contribute, as an editorial specialist, to long-term print and digital publication projects. I am currently part of the tools, study and annotation team of the Foucault Reading Notes project.

Here is the outline of my presentation. First I would like to introduce the Foucault Reading Notes project, called the FFL project ("Foucault Fiches de Lecture"): its partners, its objectives, and the digital tools that the project is in the process of developing to explore the corpus. Then I will share with you the results of the tests we ran to automatically transcribe Foucault's writing with Transkribus.

So what is the FFL project about? The Department of Manuscripts at the National Library of France, the BnF, has kept the Foucault archive since 1984. The aim is to preserve the archives, of course, but also to write a new inventory and to digitize part of the sources to make them available to researchers. In this context the FFL project started last year; we are exploring and making
available online a large set of Michel Foucault's reading notes. The project is coordinated by Michel Senellart, professor of philosophy at the ENS, and is supported by the ANR; it benefits from the partnership of the ENS in Paris and the BnF. The objectives of the project are to digitize, describe and enrich more than 10,000 of Foucault's reading notes relating to the preparation of his books, courses and lectures. But the objective of the project is also to allow a new approach to his work, based on the analysis of the philosopher's reading practices.

But how can such a large corpus be analyzed? What does Transkribus bring to this type of source? The ANR project on Foucault's reading notes aims at exploring the corpus, but also at providing new digital tools designed to enable collaborative work on the data, in order to help in understanding the creative disorder of the corpus. The FFL platform that you can see on the screen has been designed to provide annotation categories and description fields, so that researchers can create multiple pathways between the documents. In addition to that, since a large part of the investigation consists in identifying Foucault's sources, the platform makes it possible to create named entities and to enrich them with biographical and bibliographical information that the users can search and retrieve from BnF data.

But how to identify all these sources? The method consists in transcribing the manuscripts automatically, in order to index the corpus and process the transcripts with named entity recognition tools. The FFL platform will also provide new ways of exploring the archives, as networks of categories for example. On this network you can see how the reading notes, in blue, are linked to concepts, the red circles, and authors, the green circles. Thus the researchers will navigate from box to box, jumping from one reading note to another, from concept to concept or from author to author, making their personal pathway into the archives. So, an international collaboration with the READ project has
been started just last year, for the automatic transcription of Foucault's manuscripts, starting from his reading notes. The collaboration was very easy: we simply registered on the website, contacted the READ project team, and followed the instructions in the document on how to use Transkribus in ten steps. It allowed us to install the software and work with the documents after uploading them to the Transkribus server. Learning to use the software takes some time, but the Transkribus guide and the team are actively involved in this learning process.

So, for the test, we created training data for Foucault's handwriting in two steps. In an initial test, the team trained an automatic text recognition model by using Transkribus to segment and transcribe 200 digitized images of Foucault's writing. This model was capable of generating transcriptions with a character error rate of 15 per cent, which was already quite good for us. We then extended our training set to 600 images of Foucault's writing, and this larger set resulted in a significant improvement of the results: we now have an HTR model which can transcribe Foucault's handwriting with a character error rate of only 8 per cent. That means that 92 per cent of the characters are correctly transcribed by Transkribus.

But what are the main difficulties we have had to face? As you can see on the screen, deciphering Foucault's writing is really challenging, and these are not the worst reading notes that you see here. These are reading notes with many abbreviations, often ad hoc and sometimes equivocal: they can mean different words in different contexts. Sometimes Foucault also wrote several letters in the same way. So for the Transkribus test we identified the illegible words with the TEI encoding element "gap", and likewise the struck-through, misshapen or erased words, so that the software can skip them in its training. It should be noted that it takes a few hours to learn how to use Transkribus efficiently in a project. Besides, the interface remains to
be improved, to facilitate the reading of the images and the typing of the digital text.

Despite these difficulties, the results of the Transkribus tests are very encouraging for our project. As you can see on the screen, these are the results of the automatic transcription of Foucault's reading notes by the HTR engine after the two stages of training. While it took us more than 40 minutes to manually transcribe a single reading note, a few minutes are enough to automatically transcribe several boxes of notes. The automatic transcriptions of the notes must of course be corrected, and there are also certain differences in the degree of automatic character recognition between the different boxes: for example, the software is sometimes sensitive to the transparency of the paper, and sometimes it also reads the characters from the back of the page. Still, the results of the tests remain very positive: the automatic transcription allows faster manual transcription and helps in the deciphering of certain of Foucault's words that are difficult to read. Sometimes Transkribus can read better than we do.

Thus our team on the FFL project has trained a model to recognize the philosopher's writing with around 90 per cent accuracy, and we can increase this accuracy rate by feeding the corrected transcripts back into Transkribus. These automatic transcripts will benefit the project's objectives: to analyze, and to provide online access to, the large collection of Foucault's reading notes. So, thanks to Transkribus, we will produce automatic transcripts for all the digitized images, and export and correct them in the Omeka platform of the project; the results can potentially be fed back into the Transkribus software in order to improve the process and create a link between the two platforms. To conclude, collaborating with the READ and Transkribus project has really given a new dimension to the FFL project on Foucault's reading notes, by expanding the transcription and exploration possibilities. The
Transkribus software will enable Foucault specialists to embrace the larger corpus of his archive. You can find complete reports on the tests, in French and very soon in English, on the blog of the project. Thank you.

Thank you for your excellent presentation. Are there any questions?

I was wondering if you could go back to the slide with the network analysis example and expand a bit on it: how do you use Transkribus in combination with network analysis?

Right now we don't have a combination; that was just to show you the ambition of the project. But perhaps Transkribus will help us to extract some named entities, to find some names in the whole corpus that we have. Right now we are in the process of seeing what we can do. I'm sorry, I'm not a computer scientist, and the project is quite new, you know; we have some different ways to explore the corpus, but so far they are not really linked to Transkribus.

What kind of sources did you use for the training data?

For the first step of the training it was the transcription made by the PhD students, so it was quite easy, but not really easy, because we had to expand the abbreviations and things like that. For the second part of the test it was me and a postdoc student, and getting used to the software and training Transkribus took us quite a while, you know. But we think the results are very good, so we are happy to have taken this time.

Do you have any sense of the cost?

I don't know about the cost, because it's a research project, so we don't pay the people. I think it's more a question of time: it took us about 40 minutes to transcribe a single reading note manually, and with Transkribus it just takes a few minutes. So I think it's more in terms of time than in terms of cost for our project.

Time is money. Thank you very much indeed for that really fascinating
presentation. So thank you to our speakers. Our final speaker is Colin Greenstreet, for MarineLives. Can we turn this off? I want to use the microphone.

So, my name is Colin Greenstreet. I'm not an academic; I started studying social sciences and then did a business degree, so that is how I approach HTR and projects. I'm going to talk about two projects: one is MarineLives, which started six years ago as a spin-off from a hackathon at The National Archives, and secondly a project that I and other people are involved in structuring, called the Signs of Literacy initiative. In the six years since the creation of MarineLives, which is a collaborative crowdsourcing, or expert-sourcing, project, we've also run a special hackathon, we're in the process of shaping a Signs of Literacy machine learning database, nicknamed after "Signs of Literacy in Manuscripts", and we've created a small-scale glossary; so we've been fairly active. We are in the process of forming a charity, a charitable incorporated organization, which is going to be the umbrella home of these initiatives and which, importantly, has very much a data science and technology orientation. We're focusing on manuscripts; our interest is essentially the sixteenth through to the eighteenth century.

Technology-wise, starting with MarineLives: basically the technology was pretty basic. We used scripts and WordPress, we have a wiki, and we used facilitated teams with Skype. I'm an ex-management consultant; my inspiration for crowdsourcing was to take the idea of an engagement manager as a facilitator, take the idea of associates for the volunteers, and run small teams internationally with the data on the web, at least as far as possible. Accordingly, over the last couple of years we have been experimenting: we've moved to a Semantic MediaWiki, we did some early experimentation with named entity recognition, and now we've essentially increased our ambitions in terms of volumes of full-text transcription and
other things. So we are exploring HTR, keyword spotting, natural language processing and text mining, and also another idea of ours, which I'll get on to. We are indebted to our many volunteers: a range of academics, postdocs and public historians, genealogists; we've even had some Masters students among them.

So, MarineLives. These are the typical documents we're dealing with: the English High Court of Admiralty, so legal documents, like a number of documents other people have talked about. I'm using the term metadata very loosely. These are depositions; this is a five-page deposition, but they run between half a page and maybe twenty pages. Typically at the beginning you have a preamble with the name of the person, the occupation, the date, the place of residence, and then towards the end you have a sign-off. Now, many of our people were not literate, so you've got signatures, but you've also got initials and you've got marks.

We were an early adopter of Transkribus, and then we walked away from it: three years ago we found the character error rates we were getting, on a relatively small ground truth set, were really not very good. But we've come back to it in the last year, and what I want to do is to share our experience and where we intend to go with Transkribus. Firstly, for the type of work we're doing with volunteers, as opposed to postdocs and academic projects, the interface was not easy for the occasional volunteer user. You can overcome those interface challenges with good facilitative training and supervision, but in the real world a volunteer always has a choice: do you use Transkribus to get a first pass and then improve it, or do you do it from scratch? The motivation of most of our volunteers has been to learn to read seventeenth-century script, not to create a high-quality academic result for other people to work with; we'll come back to that in a minute. Our facilitators, who are essentially the editors, when they're editing work that has been generated via Transkribus and then via our volunteers,
typically won't use the Transkribus output: they'll go straight to the manuscript page and they'll correct from that.

So, thoughts on Transkribus from our perspective. I emphasize collaborative crowdsourcing; we're much more similar to the Stadsarchief in Amsterdam, and I'm very interested in where Mark is going with that; the interface looks very attractive. If you're working with pedagogically oriented volunteers who are not doing large volumes, you really need to build Transkribus carefully into your work processes, and you need to recruit volunteers differently if you're using Transkribus: they need to know that that is what they will be doing. It's a slightly different experience; you sell it to people differently. And I'll be very interested in how Mark gets on: Mark's volunteers at the Stadsarchief were recruited to do an amazing job on those wonderful manuscripts, and now, I believe, he intends to use the same volunteers but with a slightly different orientation. They're going to be correcting HTR material, and that is a very different activity, with very different rewards and incentives. With high-volume volunteers, and by high volume I mean somebody who is going to do twenty pages or more, I think it is all viable using Transkribus, but I'm very attracted to the idea of a different interface.

Some general thoughts. I think it is interesting for READ and Transkribus, as they go into the future with the new cooperative, to think about the culture, governance and data structures of other large image-oriented initiatives, and I flag up the IIIF Consortium, which I've been working with, and also the Pelagios Recogito system. Although Transkribus has been set up for non-IT users like myself, I nevertheless think, in terms of incentivizing people, that it would actually be useful to actively share code and algorithms for reuse, rather than just the HTR engine, which you never actually see. I think it would be useful to publish timely public statistics for manuscripts, by period and by writing style, and to show which portion of
the wonderful database in Transkribus is actually closed or open. We make all our data open to everybody. And also to actually publish those error rates: I find I have to come to conferences to actually see the data, and I'm surprised it is not out there for everybody to see and to add to. As we think about our future project, Signs of Literacy, we would really like to understand what sort of development time is going to be available to sustain the current platform and new material.

Conscious of the time, what we're doing in 2019: we are going to greatly increase our ground truth base. We have got six million words; we're not going to put all six million words in, but we can get a much bigger ground truth base, and we want to see whether we can get to a character error rate of five per cent, which on seventeenth-century material, and this is pretty hard to read, would be wonderful. And you need to get to that level if it is to be worth correcting to an editable quality, as we have to do. Keyword spotting is much more interesting for us: we probably then wouldn't make the transcriptions available, we would hide them, but we would use keyword spotting essentially for the searchability of the images. And then finally, as I said, we're very keen on exploring the interface which is being created by Picturae with the Stadsarchief in Amsterdam.

Now, briefly, the Signs of Literacy initiative; there's a little plug here, as I'm looking for technology partners. Our initiative is to create a data set of a million manuscript pages over a five-year period, in which we are going to annotate the sign-offs and mark the initials and signatures, from a research perspective on historical literacy. This is about exploring historical literacy on a scale which has not been done before; the actual approach, which is contested, has been around for forty years. But in addition
to answering research questions in historical literacy, what we're really driving at is working with archives to develop tools and processes for increasingly automated metadata extraction and linkage from digitized manuscript pages, and we want to stimulate the development of machine learning capability in multiple archive centres of excellence. We're partnering with The National Archives in the United Kingdom, we collaborate also with the Swedish National Archives, and we would like to add further archives: we're interested in a Dutch partner, a French partner and a Spanish partner, and we hope we can help foster interest in machine learning amongst historians. And I'm pleased to say, reporting from being in California on Monday, that Kaggle, which is a subsidiary of Google, had already committed to run a machine learning competition with us, and they have now decided they would like to work with us over the period of the project on a series of competitions.

We're structuring the project as, you could call it, collective intelligence. We do not anticipate that in five years we will have gone over to fully automated metadata extraction and synthesis; we imagine we'll always have humans interacting with various automated techniques, and what we're doing is to set up those processes from a human-computer interface point of view as well as from a purely technical point of view.

I'll just finish up. IIIF we think is very important; we want that in the middle. We would like to explore with Transkribus how we might work together, and also with Kaggle and with Google. The sorts of things we're going to be using machine learning for are feature extraction: blots, smudges, deletions. They all tell you something about the penmanship; they may tell you something about the possible literacy of the writer. We're also going to be using shape, consistency of characters, et cetera. So, from historians, the question of the purpose of it has remained, and it is a bit
of an elevator pitch answer, but I've said, well, you know, hang on to all this data; what if we just want to go down to the individual signature or mark? Again, there's nothing in what we're doing that stops you doing that: it is very much for historians and linguists to work with this material. We are building an infrastructure for people to use, and again, one of the great things about IIIF is that you can create manifests for pages and you can actually specify the type of material that you want to look at. And here you've got an example of an anchor, various anchors used as marks by seamen, which also occur in our material.

So with that I think I will wind up. Specifically with Transkribus, for this new project we're very interested in keeping the surrounding text, in segmentation, linearization and interlinear text, and in keyword spotting, less so in the HTR itself. And we're also interested in the scanner on the phone; there are some archivists who are appalled at people not using their own massive cradles and everything, but we wish to work with archivists, and also with citizen archivists who go into the small record office and use their phones to take the photo. It doesn't have to be a high-quality photo; it just needs to form the basis for extracting some data. If you want to know any more about the project you can contact me; I'm the community organizer of Signs of Literacy, and I'm working with Dr Mark Hailwood, who is a social historian and a specialist in literacy. Thank you very much.

Thank you, Colin. Any questions or comments?

Thank you. I was wondering, for the Signs of Literacy initiative, is there a particular historical time period that you're interested in?

Yes. Again, I should emphasize we're at the beginning of this project, so if new partners come in, whether they are historians wishing to use the content, or technologists, or archivists wishing to think about tools and work on the processes, we're open for discussion.
At the moment we're framing it as the mid-sixteenth century through to the end of the eighteenth century, so essentially early modern.

Thank you very much for this presentation, but I'm going to add one more thing to your list: how do you plan to keep volunteers interested in that job? Because it may become more and more boring, you know, to correct the machine transcription.

And that is a very good question, and I think that is part of what we're going to test. I very carefully said what has motivated them so far. We have worked over the last six years with volunteers, recruited via Twitter; we've also collaborated with Bath Spa University, and recently with the University of Warwick. We've used different models to find people and engage them, and I think we do understand what motivates people to work on basic transcription. We don't yet understand what might motivate people to work on such a project, so the answer is: I don't know yet, and I think it will be an issue.

So let's thank Colin, and thank you all for a very important set of presentations.
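The IIIF manifests Colin mentions, which let you point a viewer at exactly the set of pages you want to look at, are just structured JSON. Here is a minimal, illustrative sketch of a Presentation API 3.0 manifest skeleton in Python; the base URL and page data are placeholders, not the project's real endpoints, and a production manifest would also need image annotations on each canvas.

```python
def make_manifest(base_url, label, canvases):
    # Minimal IIIF Presentation 3.0 manifest skeleton. `canvases` is a
    # list of (name, width, height) tuples, one per manuscript page.
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": [
            {
                "id": f"{base_url}/canvas/{i}",
                "type": "Canvas",
                "label": {"en": [name]},
                "width": width,
                "height": height,
            }
            for i, (name, width, height) in enumerate(canvases, start=1)
        ],
    }
```

Serialized with `json.dumps`, such a document can be handed to any IIIF viewer, which is what makes it easy to publish a curated subset of images, for example only the pages carrying seamen's marks.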