Welcome to this event, entitled Developing Computational Skills for Digital Collections: a New Programming Historian Series. We're going to showcase the results of a recent project, a partnership between the Programming Historian, the National Archives and Jisc, made possible by generous funding from both TNA and Jisc, for which we were and are very thankful. My name is Peter Webster, of Webster Research and Consulting, and it fell to me to be project manager for the project and your chair for today. For those with visual impairments: I'm in my mid to late 40s, wearing glasses, with mostly gray hair and a beard that's following suit, and I'm wearing a light-coloured check shirt. The project has developed a set of new tutorial articles to be delivered as part of the Programming Historian, exemplifying best-practice approaches to the interrogation of digital collections, with a specific focus on large-scale born-digital content. The project aims to help address a noticeable skills deficit among researchers in the analysis and use of these kinds of digital collections held by cultural institutions, and we expect the first of these (there will be seven in all) to be published in the next few weeks. Our aims today are threefold: first, to introduce the project that gave rise to the series; second, to present three of the forthcoming tutorials themselves; and finally, to reflect a little on the project itself, on the challenges of creating training materials of this nature and its implications for the future. Without any further ado, then, I move straight on to the first part of our session, which is a reflection on the aims of the project from the two organisations that commissioned it. We're going to hear from Paola Marchione, who is head of product, content and discovery, at Jisc, but first we'll hear from Joe Pugh, who is Digital Development Manager at the National Archives. Joe, over to you.

Thank you very much, Peter. Hi, I'm Joe Pugh. As Pete said, I'm Digital Development Manager at the National Archives. I'm a dweeby-looking brunette in glasses, and very briefly I want to tackle the question of why we're doing this. I work in the Archive Sector Development team.
I'm the person responsible for producing the National Archives' digital capacity-building strategy, Plugged In, Powered Up, and in the research phase for that, when we were speaking to archives, we found that they had a digital problem at both the supply level and the demand level. So when we're thinking about digital collections and digital preservation, there's an issue about what archives are receiving: the pipeline from their parent organisations is often rather fragmented or blocked, they have trouble receiving the records that they need, and it's very challenging looking after them. But they also have a demand problem, which is to say there are not many people knocking on their door saying, tell us about your really exciting digital collections, we want to use them. And because they have this demand problem, it becomes more challenging for them to solve the supply problem: it's very difficult for them to lobby for more digital preservation capacity because they can't demonstrate an audience need. We put a lot of resource into solving the supply problem; we commissioned work with the Digital Preservation Coalition, for example, on a set of tutorials aimed at archivists in order to help them increase their digital skills. But I was also very interested in solving this demand problem, in doing something about it. Now obviously an intervention like this is not going to magically transform humanities researchers, but I was very interested in what was the most effective thing that we could do on a limited timescale and budget, and there wasn't any doubt in my mind that working with the Programming Historian was going to maximise the impact of whatever we did, because I think the Programming Historian is absolutely one of the most outstanding digital humanities resources I've seen produced over the last 20 years. It's something that I would use myself, and indeed have done, which I can't with hand on heart say about every interesting-looking digital humanities resource; it clearly fulfils a genuine need, and it has done for me: it's made me a better programmer. I wish it had made me a better historian too, but you can't have everything. The strength of the platform for us was partly that there was an audience there already; as far as I was concerned it was the place to put these things, and we weren't going to have to worry about building that audience later, which is often an afterthought with these things anyway (have you got the time, the money and the energy to actually do that?). It's also that I feel very strongly that the content on the Programming Historian is pitched at just the right level: the amount of hand-holding involved is neither too much nor too little, and it genuinely takes you right through the process of doing these things, such that if you're unfamiliar with them you're able to get the thing built and adapt it for your own research need in many cases. And I also think the quality of what's there is very high. So one of the things that we did in the course of this was very much to stand back. Paola and other colleagues will say a bit more about this, but the Programming Historian is a peer-review platform.
Having decided what our broad area of interest was, which was obviously that we were interested in increasing the use of digital collections and wanted a call that would focus on that, we very much stood back and let that process work itself out; it would have been very inappropriate for us to get involved in a peer-review process. The high quality of the content that's there makes us very confident that the resources that are going to be produced will fulfil the need that we set out at the beginning, that they're going to be of high quality, and that they're going to be useful. And indeed you're going to hear later on, from at least one of the speakers, about a resource that I will probably be using myself within the next few months, as soon as it goes live. So that's the rationale. I don't want to take you through the details of our strategy, not that it isn't interesting, but that is the through-line of why we thought it was important to carry out this work, and why we thought that the Programming Historian was absolutely the right partner to conduct it with. I'll pass over now. Thanks very much, and thanks to everybody who's worked on this project, particularly Peter, who has been an amazing project manager.

Thank you very much indeed, Joe. Paola, do you want to come in next?

Yes, thank you very much, Peter and Joe. My name is Paola Marchione, I work at Jisc, and I'm a head of product in a team called Content and Discovery. For those of you who appreciate a little visual clue: I am in my 50s, I've got short, pixie-ish, bleached-blonde or white hair depending on when you see me, I wear glasses and I've got a black sleeveless blouse. Apologies if I reiterate a little of what Joe said in terms of our reflections and the reasons why we were so keen to be part of this project; I guess that shows good alignment. For those of you who are not too familiar with Jisc: Jisc is the UK digital, data and technology agency for the tertiary sector. In my team we work primarily on higher education, but we cover further education, skills, research and innovation. In Content and Discovery, and traditionally, Jisc has done a huge amount of work to promote the creation, use and interrogation of digital collections, so we focus very much on datasets that come from libraries, special collections and archives, and within our research and innovation strategy there is very much an emphasis on the deployment, use and reuse of all the types of assets that make up the research estate, and we see content and collections, particularly in digital form, as being very much part of that. As Joe has pointed to, and as we've heard a lot in this conference, and certainly something we hear a lot when I and colleagues speak to our HE members, both the libraries, the practitioners and the scholars, and the archives, there is this gap between the huge amount of content that is now available (of course it's never enough; there is always more that we can digitise and make available) and the level of skills that are now needed to interrogate and make the most of that content. I've heard more than once the expression 'the DH long tail', the digital humanities long tail:
there are a lot of practitioners and scholars out there who might not necessarily call themselves digital humanists, but who are having to deal with datasets in their historical and humanities scholarship and workflows. So this whole area around skills was very important to us, and when we started talking to the National Archives and to Joe about what potential interventions we could make, I think after a half-hour conversation it was very obvious that the Programming Historian was one of our key stakeholders. Rather than trying to get together and create new resources ourselves, it became very obvious that we should speak and work with those at the coalface, those whose job it already is to support skills development. We really liked their approach, the fact that it works as an open-access, peer-reviewed journal, as Joe referred to, and I personally really liked the international ambition, breadth and scope, the fact that articles are translated into different languages. So I've only got praise for the team, and I'm just going to end by thanking James Baker and Adam Crymble, who were the first contacts at the Programming Historian, but of course also the whole team, the authors, and Peter Webster, who as Joe said has been an excellent project manager. I'm looking forward to hearing from the actual people who've done the work. Thank you.

Thank you very much indeed, Paola, and thank you, Joe; you've set us up perfectly for the session this morning. What we have next are short presentations of three of the seven tutorial articles that make up this project. Before we plunge into them, just a reminder: do please add your questions to the Q&A function, including things that may already have occurred to you in relation to what Joe and Paola have had to say. Our first presentation of a tutorial has two authors. John Reed is Associate Professor of Spatial Data Science at the Centre for Advanced Spatial Analysis at University College London, and we'll hear first from Jenny Williams, who is a doctoral researcher in the same centre at UCL. Their title today is Comparing Automated Document Classification Using Word Embeddings to Expert-Assigned Dewey Decimal Classifications. So, Jenny, the floor is yours.

Thank you, Peter. As Peter said, my name is Jenny Williams. I'm five foot tall, have fair hair, I wear glasses, I'm in my late 50s and I'm wearing a white top. Yes, I am a PhD student in CASA at UCL, and my research is in collaboration with the British Library. I'm using EThOS, an e-theses online repository containing more than half a million PhD theses from the UK covering the last 100 years. I can't possibly read them all, so I'm using natural language processing to try to identify communities of what we're calling early innovative activity, and we're looking to reveal what researchers are actually writing about, using the titles and abstracts of their theses. Thank you; over to John.
Hi, I'm John. I am down to my last linen shirt, given the weather we've been having this past week, so I'm in black to contrast with Jenny. I also have glasses, though nowhere near as dramatic and cool as hers, and I'm six foot two and blonde. I will get started now with the presentation.

My experience in research has been that the best way to really learn how something works is to try to teach it, and one of the things I've been wanting to do is to really understand what Jenny has been doing, so I took the opportunity to write a tutorial for the Programming Historian to, in a sense, enrich my own understanding of Jenny's work. What we hope to achieve with this tutorial, of which I'm giving a high-level overview, is two things: one is to briefly explain how natural language processing, and in particular word embeddings, open up new ways of engaging with archives; the second is to test how that approach measures up against the expertise of a human on document classification tasks. One thing I want to be really clear about is that we see this as the kind of thing that might augment existing practice; I hope no one comes away thinking that we're under any delusions that our approach can somehow replace the expertise of librarians and archivists.

As Jenny mentioned, our research makes use of the British Library's E-Theses Online Service, or EThOS, dataset. Technically it's metadata, because it stores data about theses: things like the title, the author, the date of submission and so on. The vast majority of doctoral theses have metadata records in EThOS; the first is in Latin, from about 1812. The data quality, as you might imagine, varies widely, because obviously EThOS has not been a resource since 1812: some universities have been going back and entering lots of, if you will, old data into the British Library's data store, while others have not been quite so progressive. The main way most people interact with this is through the web interface (I have a screenshot of my own PhD thesis up there), but you can also download the data in bulk, and that's what we've been working with. EThOS has already been used for manual keyword and topic selection: people have used it to look into novel chemical compounds, post-colonial relationships between supervisors and their students, career progression, and dementia research. But we're interested in engaging with the corpus as a whole rather than selecting small subsets, and if we want to do things like categorise dissertations, things get a lot trickier, because there is clear evidence of changing practices in the completeness measures shown on the screen. The DDC, which stands for Dewey Decimal Classification, is fairly consistent throughout, though, so that has become our baseline reference point. The nice thing about the DDC is that it's an expert label; 'expert label' is the term of art from machine learning, where it is part of the training process. For instance, Google's image search and street view platforms didn't spontaneously learn how to recognise different types of birds or crosswalks: they were trained by experts, which is to say that if you have filled in one of those image-based captchas to verify that you're a human being, then you were in a sense part of that army of experts. So we selected four disciplines using the DDC that we thought would present a range of challenges for an automated approach.
What we would be looking for, bearing in mind that the DDC itself is of course not perfect, is fairly good alignment between how we classify things in an automated fashion and how experts classify things one at a time. Our sample is 48,000 records, and that was fed into the pipeline. What is a pipeline? It's three stages of data processing: cleaning, learning and analysis. In the cleaning stage we standardise the documents that we want to process further. In the learning stage the computer learns about the corpus, and I mean that in a statistical sense. And in the analysis phase we take what the computer has learned and convert it back into the kinds of insight that we humans can comprehend and benefit from. I'm going to briefly talk through these on the slides and then go back to illustrate what those outputs look like. The purpose of the cleaning stage is really to radically reduce the overall vocabulary by standardising word forms, finding meaningful terms of phrase, and stripping out high- and low-frequency words; I'm happy to talk through any of those in more detail in the Q&A. Then we move on to the learning phase, which is really driven by the learning of context, and to a computer, context means the words that surround a particular target. So in a historical corpus we might see, for example, 'Queen X ruled for 25 years' and 'King Y ruled for four years', and from this the computer would begin to see that 'queen' and 'king' are used in similar contexts and therefore that there is some kind of relationship; across a large corpus we might also find that 'she' is used in the context of queens and 'he' in the context of kings. Using the idea put forward by Firth back in 1957, that 'you shall know a word by the company it keeps', the computer uses a process called word embeddings to ensure that those contextual relationships are preserved. It was really word embeddings, published in 2013, that were the breakthrough that has led to the rapid advances in natural language processing over the past decade. But even word embeddings are still too much, in their raw form, for the kinds of analysis we want to do, so we use a technique called UMAP to reduce the dimensionality, which distils the information down still further; after that we can cluster the documents, and finally try to measure how well we did.

Now I just want to show you what that process looks like: that is how it works, so let's see what it looks like. This table shows bits of three randomly selected documents before and after cleaning, to give you a sense of how this works. You can see that things like capitalisation, punctuation and, to a lesser extent, plurals and tenses have gone, together with high- and low-frequency words such as 'a' and 'the'. You'll also see some underscores, which are where the computer has found statistical associations between pairs or triplets of words across multiple documents. Then what the word embedding process does is to convert that cleaned text into a vector, that is, a row of data; in this case each word is converted to a row of 100 numbers.
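To make the cleaning and learning stages a little more concrete, here is a minimal, hedged sketch in R using the word2vec package. This is not the authors' actual code or toolchain (readers should take that from the published lesson); the file name and parameter choices are illustrative assumptions only.

```r
library(word2vec)   # CRAN wrapper around the original word2vec implementation

# Hypothetical input: one EThOS title-plus-abstract per line of a text file
abstracts <- readLines("ethos_abstracts.txt")

# Minimal cleaning: lower-case, strip punctuation and digits, collapse whitespace.
# The lesson's pipeline goes further (lemmatisation, detecting multi-word terms,
# dropping very frequent and very rare words); that is omitted here.
clean <- tolower(abstracts)
clean <- gsub("[^a-z ]", " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))

# Learning stage: train 100-dimensional word embeddings on the cleaned corpus,
# so that words used in similar contexts receive similar vectors
model <- word2vec(x = clean, type = "skip-gram", dim = 100,
                  window = 5, iter = 10, min_count = 5)

# Inspect the nearest neighbours of a term, as in the 'accelerator' example
predict(model, newdata = "accelerator", type = "nearest", top_n = 5)
```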
This 'embedding' (that is the technical term) is done using something called a neural network, and what that process tries to ensure is that words used in similar contexts end up with similar numeric representations. Here again are some examples. On the left-hand side we have a term taken from the corpus; the next three columns are the first three of the hundred dimensions that we calculated using the word embedding algorithm (there are 97 more columns of numbers, which I'm not going to show for obvious reasons). These numbers don't mean anything to us, but they do mean something to the computer. Finally, on the right-hand side, we see terms that are near, meaning they have similar embeddings to the term on the left. Since this is scientific data, 'accelerator' is near to terms like 'beam', 'CERN' and other things to do with high-energy physics, while 'national health service' is, unsurprisingly, near to 'NHS', 'public sector' and 'public health'. We have given the computer no guidance as to how to determine these relationships, no hints; this is just what emerges from the contexts in which each word is used. Then, and it's a bit mind-bending, it turns out that we can take the vectors for each of the words in a document, average them together, and get a useful representation of the document, so each document can be represented by a hundred-column vector. We still can't plot or understand that, so we use UMAP to project those hundred dimensions down to just two, which is what the visualisation shows: each of those dots represents a document, coloured according to its original DDC classification. The documents are, for the most part, well separated by DDC, which is a really promising sign, but there's so much data here that it's almost hard to make sense of what's going on, and you might already be picking up on the fact that there are some social sciences over in the sciences at the top level; obviously some economists are thinking, yes, we belong with science, with biology and physics on the right-hand side. Another way to represent the performance of our work is this confusion matrix. If every document was assigned to a cluster that matched one DDC exactly, we'd only have numbers on the diagonal, which are the bold numbers; anything that's not in bold is a case of our process and the librarian having a difference of opinion. On the whole, at the top level we're about 98 per cent accurate, and at the next level down about 90 per cent accurate. We might assume those off-diagonal cases are mistakes, because they're not how the expert classified the documents, but it's worth asking whether a human, under any number of practical pressures and constraints, is always going to get things right; I certainly wouldn't be the person to tell you which subdivision of physics or medieval literature a text was most appropriate for. Here we're just looking at word clouds, a standard technique covered in another Programming Historian tutorial, to pull out keywords from the misclassified documents, and what is clear from these two examples is that it's entirely debatable whether the computer is, in a sense, wrong: the forty-two physics PhDs that are clustered with the social sciences look a lot more like the interests of social science, with economics looking at the natural environment, identity, energy generation. So I think this points to the really interesting potential for computers to augment a classification process, not to replace it.
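Continuing the hedged R sketch from above (again, not the authors' published code), this is roughly how the document-averaging, UMAP projection, clustering and confusion-matrix steps could look; ddc_label is a hypothetical vector of expert-assigned DDC classes, one per record.

```r
library(uwot)   # UMAP dimensionality reduction

emb <- as.matrix(model)   # one 100-number row per word in the vocabulary

# Average the word vectors in each cleaned document to get a 100-dimensional
# document vector
doc_vector <- function(doc) {
  words <- intersect(strsplit(doc, " ")[[1]], rownames(emb))
  colMeans(emb[words, , drop = FALSE])
}
doc_vecs <- t(vapply(clean, doc_vector, numeric(ncol(emb))))

# Project the 100 dimensions down to 2 so the corpus can be plotted,
# colouring each point by its expert-assigned DDC class
coords <- umap(doc_vecs, n_components = 2)
plot(coords, pch = 20, col = factor(ddc_label))

# Cluster the documents (a simple k-means into four groups here) and compare
# the clusters with the expert labels in a confusion matrix
clusters <- kmeans(doc_vecs, centers = 4)$cluster
table(cluster = clusters, ddc = ddc_label)
```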
So, wrapping up fairly quickly: where do we see this kind of approach being useful to collections and communities? Obviously born-digital archives. In our discussions with the BL about text mining, they've talked about being increasingly supplied with, for instance, personal hard drives as part of a bequest; I don't know about you, but the thought of somebody trying to make sense of my backup drives is quite terrifying. The same goes for more structured documents, and these sorts of techniques can be applied to any corpus. We don't necessarily need the DDC; we happen to have a dataset where we can test and show the utility, but in principle these techniques offer ways to browse a corpus that is evolving over time, where we don't have a sense of what kind of structure is in there, and give us ways to find natural groupings, at least from the computer's standpoint, within that corpus. And I think I might be over time, so I should probably wrap up there. Thank you.

Thank you very much, John. As it happens you have another minute or two if there's something you particularly want to add; if not, we'll move on.

Well, just one last thing: I think it's really exciting that some research was done recently using this same technique, but in a more specialised way, looking at materials science, where they found that they were able to anticipate discoveries in materials science by two to three years, by finding associations, in this case between terms, that would then be picked up later as a discovery. Obviously the payoff in the physical sciences of getting a year ahead of other labs could be enormous, but I think it also opens up more collaborative ways of peering into the future to understand the evolution of a discipline like geography, or anything else. And that's really where we link back to the sorts of things that Jenny is doing in her really exciting PhD.

Thank you ever so much, John, and thank you, Jenny, as it were in the background, or indeed foreground, of that presentation. It's fascinating, and I can see all sorts of applications, not least in my own work, for which I'm grateful. If you have questions for John and Jenny, do please put them in the Q&A and we will pick them up a little later. But now we move on to the second of our presentations, which is from Chahan Vidal-Gorène, who is president and founder of Calfa, which specialises in text detection and automated analysis technologies for manuscripts in Oriental languages; he is also a doctoral student, as well as a teacher in digital humanities, at the École Nationale des Chartes in Paris. The title for today is AI for Automated Transcription of Historical Documents. So, Chahan, if I can ask you to share your slides, the floor is yours.

Thank you very much, Peter, and thank you very much to the organisers for this round table and for the invitation to present the lesson I have made for the Programming Historian. I take the opportunity to thank the whole Jisc and Programming Historian team for their work, help and assistance in writing the lesson.
Indeed, my presentation will focus on something quite well known now, I think: automatic transcription of historical documents, which is a key step in any digital humanities project and which conditions access to collections in digital libraries. But my presentation will especially focus on under-resourced collections, under-resourced languages and so on, because the problem there is quite different from that of common languages, and what I have tried to highlight in my tutorial for the Programming Historian is the key things to have in mind when creating an AI model for such documents: basically, what the good practices are. Everybody here knows all about OCR and HTR, I think, so there is nothing to say about that: thanks to AI it is very easy to reach very good accuracy, very good recognition rates, when you can provide big databases to train your system, or when you focus on specialised models instead of generic models; for instance, we have very good models for Latin and medieval French at the École Nationale des Chartes. The common pipeline with such a deep learning approach is that when you are not satisfied by the results, we generally say: OK, it doesn't work, we should add more data; and if it still doesn't work, we should add more data; and so on. This pipeline is simply not sustainable when dealing with under-resourced cases: because of the language specificity, because of a lack of specialists to annotate documents, because existing databases are not relevant for your problem, because you have very specific specifications, because you have a lack of material, a lack of time to transcribe documents manually, and so on. The reason is in fact quite simple: it lies in how OCR and HTR work. A good recognition system has to identify regions of interest in a document; that is the first step, the layout analysis, and in my example here the main text is in green, the marginalia in red and the tables in blue. If that is accurate, then the system detects the lines inside these regions, and after the detection of lines, a model recognises the text; sometimes the recognition is assisted by a language model. This pipeline is very efficient for common scripts and layouts, and as you can see, for each step we have to create dedicated models: a model for the layout analysis, a model for line detection, a model for text recognition, and sometimes a model for bringing in some context with language information. So at least three models, and it already takes a lot of time to gather and annotate enough data for one model; three models is simply not so simple for different cases. Here are some examples of situations in which we are in an under-resourced case: because of the scan quality or state of conservation (on the left it is a microfilm); because of a complex layout, so that even for a printed document, in the middle, the complex layout is difficult to overcome; or because we have a mix of marginalia and main text, as in the two examples on the right with Greek and Arabic scripts. For each situation we have to create a dedicated model to reach good recognition rates. For that, we have some platforms to help researchers and institutions create data to train new models; you certainly know Transkribus and eScriptorium in particular. But, as I said, for a lot of cases, and in particular for some languages and scripts, we face a lack of data or even a lack of specialists, and these platforms generally require a significant amount of data to be accurate with new kinds of documents.
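Chahan's lesson works through a web platform rather than code, but to make the recognition step of the pipeline concrete, and to keep the same language as the other sketches here, this is a minimal, hedged illustration using the general-purpose R tesseract package. It is emphatically not the specialised, fine-tuned models he describes; the file name is a hypothetical placeholder.

```r
library(tesseract)   # off-the-shelf OCR engine, not the specialised HTR models described above

# Hypothetical page image from a digitised volume
page <- "patrologia_page.png"

# Plain recognition with the stock English model; Greek would need the relevant
# trained data installed first, e.g. tesseract_download("ell")
engine <- tesseract("eng")
cat(ocr(page, engine = engine))

# Word-level output with confidence scores and bounding boxes, roughly the raw
# material that line detection and proofreading then work from
head(ocr_data(page, engine = engine))
```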
Generally, they also require knowledge of machine learning, and another problem is that state-of-the-art architectures are largely designed for Latin scripts, recognising basically characters in documents; when you are working with Arabic documents, or very cursive scripts, you would prefer to recognise words instead of characters, and intuitively that will be more accurate, and generally it is. Finally, creating data: what does that actually mean? Maybe it is the main question for such a task. To create relevant models, we have to define a very precise set of specifications: what do I need, what do I want as an output, for which purpose? Do I want to do a critical edition of my manuscripts, in which case I will do a diplomatic transcription? Or do I want to allow keyword search in a digital library, in which case I will try to read all the abbreviated words in my manuscript, and so on? I think this is the main problem to overcome. In this example I have trained a very accurate model; when I say very accurate, there is absolutely no mistake in the transcription, so the character error rate is zero per cent. But even if you don't read Armenian, you can see that we have some problems: the columns are not detected, so we have a uniform block of text; words are not separated; abbreviations are not read; and so on. So even though I have a very good recognition model, the text is completely unusable, so it is not accurate. My lesson is therefore focused on defining needs for a transcription project. My case today is the Patrologia Graeca, in which we have to manage a complex layout and very poor-quality printed documents. I have chosen this document because I think it is perhaps more readable for everybody than Armenian or Arabic manuscripts, but in fact the problem is exactly the same. The poor quality of print of these documents leads to a very high level of ambiguity in the recognition of characters or words, so we have to make some choices in the Unicode transcription: we will, for instance, not try to detect all the diacritics independently, but rather recognise each different combination of character and diacritic; and in my example we only need to recognise and detect the red column, not the blue one and the green one. So we have two things to do: create a specialised model to detect the Greek text, and a specialised recognition model to overcome these difficulties. Another difficulty here is curved lines, so we choose a baseline approach that allows us to follow the curvature and to extract the line with polygons instead of bounding boxes. The lesson uses the Calfa Vision platform, which is quite similar to the other platforms I have mentioned, but this platform includes in particular real-time AI fine-tuning: the generic models available online on the platform are automatically specialised to your specifications as the project progresses; basically, we just have to proofread the automatic analysis to let the models fine-tune themselves on your task. On top of that, the platform uses neural architectures designed more for under-resourced cases; it does not rely on a single approach, but on a specific, specialised approach for each kind of document or script. And what we observe, with a very small amount of data, only ten pages, is that the character error rate is around five per cent: that means that out of 100 characters, only about five are wrong.
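As a quick aside on how that headline figure is computed: character error rate is simply the edit distance between the reference transcription and the automatic one, divided by the length of the reference. A minimal sketch in base R, with made-up strings, might look like this.

```r
# Character error rate: Levenshtein edit distance (insertions, deletions,
# substitutions) between reference and hypothesis, divided by reference length
cer <- function(reference, hypothesis) {
  as.numeric(adist(reference, hypothesis)) / nchar(reference)
}

# Toy example with two wrong characters, giving a CER of roughly five per cent
cer("the quick brown fox jumps over the lazy dog",
    "the quick brawn fox jumps over the lozy dog")
```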
The detection of the Greek regions is also relevant: after about ten proofread images it reached approximately 85 per cent accuracy, but I will spare you the details. The recognition of the Latin script is not as good, but because we are only interested in the Greek, it does not matter, and the detection of the lines, as you can see, is almost 100 per cent. So we can move forward very quickly to transcribe these documents of the Patrologia Graeca. Basically, that means that on the platform we have generic models that have been trained on Oriental documents, printed documents, handwritten documents, archives, old manuscripts and so on, in different languages, and these models are specialised according to your specifications: it is what we call fine-tuning. This is the result we reach at the end of the tutorial. The mistakes are in yellow, and you can see that even where a character is not well printed, or the quality of the scan is not very good, the recognition of the main text is done well. The next step would be to add some data on top of these ten pages in order to bring the character error rate below, say, four or three per cent, but if you intend to do keyword search in such a document, this accuracy is already enough. To summarise: to overcome the difficulties raised by under-resourced documents, fine-tuning is one of the keys to success, provided we are able to define precise needs. It is generally not a good idea to begin by creating data for the sake of creating data, because that forces you to create more and more data, a huge amount of data, without being sure of reaching a very good model in the end. We have seen that we can reach a satisfying model with a very limited dataset; ten images is nothing in deep learning. And that is the conclusion of the tutorial: define a very good set of specifications to define your needs, and then create an AI model. This kind of approach opens new perspectives for under-resourced collections in digital libraries. The data generated for the lesson are available on GitHub and the models are available online. Here are some references to illustrate what I have said in different scenarios, with Armenian, Georgian and Arabic manuscripts, and also for managing Latin abbreviations or different sets of specifications, to see what can be done with such an approach. Thank you for your attention.

Thank you very much indeed, Chahan; that was fascinating, and it looks to be a really very important piece of equipment, as it were, in the scholarly toolkit. Do please, everyone, if you have questions for Chahan or for anyone else who has spoken already, add them using the Q&A function. The third of our three presentations of the new tutorials is from Yann Ryan, who is a postdoctoral researcher at the University of Helsinki in Finland. The title is Building Interactive Applications with R and Shiny. So, Yann, if you want to share your slides, the floor is yours.

Thanks very much. OK, you should all be able to see the slides now. Hello, yes, my name is Yann Ryan and I'm a postdoctoral researcher at the University of Helsinki. To give you an idea of what I look like: I have medium-ish brown hair, I am clean-shaven, and I'm 37.
I'm wearing a plain white shirt. So this is my presentation, Building Interactive Applications with R and Shiny. To give you a little background about where the tutorial came from: it was, as has been explained, a response to the Jisc, National Archives and Programming Historian call, Computational Analysis Skills for Large-Scale Humanities Data, and so I developed a tutorial called Building Interactive Applications with R and Shiny, which is currently in the final stages of review on the Programming Historian website and will hopefully be published fairly shortly. In this presentation I am going to talk a little about why I think this is a useful skill for humanists working with large-scale data to have; then I will talk a little about the technology behind the tutorial, Shiny, and give a very brief walkthrough of the tutorial itself; and then, if there's time at the end, reflect very briefly on some of the challenges around making a tutorial with this kind of content.

First of all, why do I think it's useful to teach these interactive application skills? Interactive elements or user interfaces can make scholarly outputs more accessible and readable. They can be particularly useful for sharing large-scale humanities data, partly because they allow us to switch between broad and focused views, which is otherwise a difficult thing to do when working with, and communicating results from, large-scale data. They can also help to communicate uncertainties or ambiguities in data, by allowing the user to form their own perspective on the data or research results being shared. More and more research outputs include interactive elements of some kind, and an intermediary like Shiny, which I'll explain in a little bit, makes these kinds of interactives relatively easy to produce; that was really the motivation behind creating the tutorial, to teach a skill which is relatively easy to pick up and which allows users to make interactive applications or visualisations.

To give a couple of examples of the kind of thing I'm talking about: lots of people will be familiar with the sort of interactive visualisations found in data journalism, which have been quite common for some time; here I'm showing an animated GIF of an interactive visualisation from a website called The Pudding, I think it's an analysis of Spotify data. But more and more we're finding that more traditional scholarly outputs include interactive visualisations or applications; this next slide has an animated GIF of a website called Tudor Networks of Power, which allows users to browse through and explore the Tudor state papers. And to give a couple of examples specifically using Shiny: this is an interactive application built by some researchers at the University of Utrecht to browse through certain elements of historical newspaper data, and another way to use Shiny is to embed applications into project websites, so the example here is from a project called Mapping the Gay Guides, which embedded a map of their results within the project website itself.
So, to give a brief overview of what Shiny is: it's a package (a package is a set of functions or commands with a particular purpose, made for a programming language), and Shiny is a package for the programming language R which can be used to create interactive applications. The basic Shiny application has a couple of components. It has a user interface, which contains the visual inputs, the sliders and buttons and so forth, and also contains the outputs, that is, the outputs of whatever code happens in the background. The second component is a server, which contains the back-end code for the application itself. Usually what happens in a Shiny application is that a user adjusts some sliders or toggles on a web page; the server listens for changes to these sliders or inputs, and updates and displays whatever relevant outputs the application is designed to show. The benefits of working with this: one very important one is that it works with an existing popular coding language, R, which is a very general-purpose language, and it allows you to use all of the various functions of that language but make them interactive, in a relatively easy way. It's free to use and it's open source, although you do need to host the app somewhere, which generally costs money unless you're just using it for personal use. And what I've found is that it's very good for making prototypes or exploratory apps, or for making internal tools within a research project. It does have some drawbacks. You need to have some knowledge of the programming language behind it, R, so it's not really possible to learn without knowing some of that. It's also fairly restricted, so you won't be able to build the very artistic or creative interactive visualisations you might see elsewhere, which require more customisable JavaScript or something like that; and, as I said, the hosting can cost money as well.

To move on to a very short preview of the lesson itself: readers will learn how to create a very basic Shiny application, then they'll be walked through the basic principles of what's known as reactive programming, and they'll learn how to work with the basic construction of a Shiny web page user interface. It doesn't teach you how to publish the application to the web, or basic coding with R beyond what is strictly necessary to make the application itself. To give you an overview of Shiny: as I mentioned earlier, there are a couple of main elements. There is the user interface, which is where the visual appearance of the app itself is stored; there is the server component, which contains all the code and logic used in the application itself; and the third element is a command within the programming language to run the application. Once you put all three into a particular type of script and run the command, the application runs in a web browser and can be run locally and tested. To go into slightly more detail about these components: the user interface is the basic building block of a Shiny application; it is built around a grid system of rows and columns and allows for customisable layouts.
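As a hedged illustration of those three elements (not the lesson's actual code, and with a made-up slider range standing in for the newspaper dates), a minimal Shiny script looks roughly like this.

```r
library(shiny)

# User interface: a sidebar layout with a panel of controls and a main panel
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      sliderInput("years", "Date range:",
                  min = 1620, max = 2019, value = c(1700, 1850), sep = "")
    ),
    mainPanel(
      textOutput("summary")   # placeholder output, filled in by the server
    )
  )
)

# Server: the back-end code that listens for input changes and updates outputs
server <- function(input, output, session) {
  output$summary <- renderText({
    paste("Showing titles first published between",
          input$years[1], "and", input$years[2])
  })
}

# The command that runs the application locally in a web browser
shinyApp(ui = ui, server = server)
```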
So one typical example, and the one used in the tutorial, is a sidebar layout, which contains a main panel and then a side panel containing things like user controls. You can see here that it's built by nesting a number of commands within each other, which then correspond to the various parts of the web page, which you will afterwards populate with things like sliders or maps or whatever it is the application is going to do. These controls are known as widgets, and they form the user input for most Shiny applications: interactive elements like sliders, selectors and so forth, of which I've given a couple of examples here, which allow the application to be controlled or various options to be ticked or selected. To give you an example, this is one we use in the tutorial itself, a slider input widget: as you can see, it's just a slider with two ends which you can drag, and it updates a couple of values underneath; these are the values you'll use to make changes to the code and the eventual output.

The example use case in the tutorial uses a dataset of metadata for all British and Irish newspapers held by the British Library, about 30,000 titles or so. This dataset can be used to chart the growth of the press over time, to understand the Library's collection practices, and perhaps to help shape our understanding of things like industrialisation, demographics and so forth. It's a freely available, public-domain dataset on the British Library's open research repository. The question we're looking at in the tutorial is the change in the growth of the press over time, and specifically its geography: this went from an industry based mostly in capital cities, to something regional, and then eventually to a local newspaper industry, over the space of 400 years. The tutorial walks through creating an interactive map and allowing the user to explore these changes. The tutorial itself has three coding steps. First, users create a very simple user interface; then they create a reactive dataset of places, which contains a count of their appearances in the data as well as their geographic coordinates; and then this is used to create an interactive map using another R library called leaflet. The first step is to create a user interface, so we add this sidebar layout to the UI component of the application, which creates an empty website, and then we put the various inputs and outputs into it; you can see again here the basic layout of the application, the sidebar panel and the main panel. The second step is to create a reactive element. A reactive element is a special type of object within R which listens for changes to a related input, in this case the start and end dates picked with the slider; any time it detects a change, it re-runs the relevant code. In this example the reactive element is a dynamic dataset of newspaper counts, so you can see that as I change the slider here, the count of newspapers updates itself in this kind of mini dataset on the right-hand side. This reactive dataset can then be passed on to other R code or packages.
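Again as a hedged sketch rather than the lesson's own code, the reactive dataset and the map it feeds might be wired together roughly as follows; the file name and the column names (first_year, place, lat, lng) are hypothetical stand-ins for whatever the prepared title list actually contains, and the UI would need a corresponding leafletOutput("map") in its main panel.

```r
library(shiny)
library(dplyr)
library(leaflet)

# Hypothetical prepared metadata: one row per newspaper title, with the first
# year of publication and coordinates for its place of publication
title_list <- read.csv("newspaper_titles_with_coords.csv")

server <- function(input, output, session) {

  # Reactive dataset: re-runs whenever the "years" slider changes, returning a
  # count of titles per place within the selected date range
  map_data <- reactive({
    title_list %>%
      filter(first_year >= input$years[1], first_year <= input$years[2]) %>%
      count(place, lat, lng, name = "titles")
  })

  # Interactive leaflet map, redrawn from the reactive dataset
  output$map <- renderLeaflet({
    leaflet(map_data()) %>%
      addTiles() %>%
      addCircleMarkers(lng = ~lng, lat = ~lat, radius = ~sqrt(titles),
                       label = ~paste(place, titles, sep = ": "))
  })
}
```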
For this application we use the leaflet package, a very useful, well-developed package for R which allows you to create interactive maps. Again, if we update the slider, you can see that the map updates a number of points on this interactive map underneath; it has some very basic interactivity, zooming and hovering over places, and it's quite easy to add additional information to that. Those are the basic steps of the tutorial. There are lots of other uses for Shiny: it's not just for making visualisations; you can make quite complicated applications, quizzes, write to persistent storage, and even make slideshows. There was some interactivity in this slide deck, which was made with Shiny.

I mentioned the benefits at the beginning, so I'll just briefly talk about the challenges, because I'm guessing I'm almost out of time. Some of the things I had to grapple with while making the tutorial: readers are more likely to have some coding skills, so less likely to be starting from scratch, and pitching the lesson at the right level is something of a challenge, giving enough information about the underlying coding language, R, without spending too much time teaching the very basics. In general, with interactives like this, which rely on live code, there are more issues and challenges around hosting and preservation than with something that has a static output, and because they rely on live code there are more likely to be compatibility issues as time goes on. It's also the case that tutorial sites are naturally built around static screenshots and code, so there's less of an infrastructure there at the moment for teaching interactive elements in this way. That's the end of my presentation; thank you very much for listening, and thank you very much to the organisers of the conference, but also to Jisc, the National Archives and the Programming Historian, and everybody who has been involved in the project and in reviewing and editing my tutorial. Thank you very much.

Thank you very much indeed, Yann, for another fascinating presentation of what is going to be an important resource. We have 20 minutes or so for questions, and there are now more than a couple in the Q&A. Before I plunge into those, can I invite the rest of the panellists to switch their cameras on, if they're happy to, and then if they have questions or comments that they want to contribute they can just wave their hands at me and we'll go on from there. The first question, then, is for Yann, and it's from Becky Scott: what are the pros and cons you see of learning R as opposed to learning, say, Python?

I think this is the eternal question, but really it's just whichever one works well for you. There are probably lists of things which are better in one and better in the other. I like R because it has a very widely used IDE, the kind of standard interface that many people use, which is quite easy to work with, and if you're coming from a non-programming or non-command-line background it is much more of a gentle introduction.
I know that Python is better for some things, for sure: for machine learning, and for data science and AI work generally, Python is the much more popular one. But another thing is that you can actually use both at the same time: there are ways now of combining R and Python code in the same scripts, which I do sometimes, so just because you learn one doesn't mean that you can't use some of the functions or skills from the other if you need to.

Thank you very much indeed. There's a question here from Isabel Holowatty which I think is for both Yann and Chahan, about the preservation and sustainability of these projects: to what extent are these applications, tools and outputs future-proof? Yann, do you want to start, and then I'll come to Chahan?

Yes, that's a good question, and I think it is a problem, particularly because these are interactive tools and they rely on live code. Something like Shiny does have a benefit over less open interactive applications, in that the code is open source and it's possible to create an environment within R which means you can recreate exactly the conditions in which the application was made in the first place; so using something like Shiny and R has some advantages over more bespoke interactive visualisations. But it is difficult; it's not that easy to preserve them, which is obviously particularly important for research outputs. They live on some sort of live hosting site, and if that goes down, or somebody stops paying for it, the application might stop working. So it's an interesting challenge, and one that libraries and so forth will probably have to think about more as time goes on, as people create more and more of these interactive applications, which are becoming relatively easy to produce and more likely to be part of research outputs.

Thanks, Yann. Chahan, do you want to jump in on that as well? What does sustainability look like to you here?

Yes, thank you for the question. Because of the evolution of AI, these tools are forced to keep evolving in the future, but the main issue for easy access to these collections is the amount of data required, and the approaches are already sustainable: they have been adopted across teams and tools, and data formats are now completely interoperable, so it is clear that for OCR and HTR we have defined a lot of things that are sustainable. But I think the most important thing here is not really the sustainability of the tools but the availability of the data: once you have annotated your documents, you can use them to train new models even if the technologies have changed. As for the sustainability of the model itself, yes, of course you will always find new developments that make an older model less useful, but if you can keep and store the data, with a full description of the specifications of your dataset, the sustainability is there.

Thank you very much indeed. We've got a couple of questions for John and Jenny. The first that came in is from tpool: if this were to be used in a catalogue search context, do you foresee any issues with word embeddings making searches that are too broad for the average user?
In other words, are they likely to be deluged with thousands of potentially relevant resources? John, I think you were going to take that one?

Oh no, I think I'm going to answer that one. Well, I haven't really used it in a catalogue search context, but I imagine that's really a user interface decision, depending on what you actually want the users of your archive to achieve when they get their results. You could use the interface to prompt them to include additional keywords, you could combine it with a keyword search, or you could supply them with additional documents that are related in similar terms using these embeddings; so you could use a combination of things. You can constrain the word-embedding discovery of documents by saying, I only want to see a hundred, or you can look at the distances between the documents, because the whole point of word embeddings is that they make everything mathematical, so you can use the similarity between documents to say, these ones are really close and these ones are further away, depending on how much you want to give back to the user. One of the advantages of this over keyword searches is that keyword searches miss documents when the same word isn't present in the other paper, which doesn't help with work done years ago that might be about the same thing but doesn't use exactly the same language. So they enhance each other, but word embeddings are much more contextual than keyword document searches alone, and at the same time you can constrain everything: you can say how far or how close you want to look, or combine the two together in a catalogue search context.

Thank you, Jenny. John, do you want to add anything to that?

Yes. Jenny just brought in the element of time, and this speaks to the other question in the Q&A about change over time and the kind of impact that might have. That is not an area we're actively working on, but it is an area where, certainly within the digital humanities, I know Barbara McGillivray, who's at King's, has done work on how word embeddings change over time, and it makes sense that you can pick up this kind of thing, because essentially the context in which a word is used changes, and therefore so does the embedding. What you need is a way to say 'before' and 'after', or some way of separating those two time periods, but it's not impossible. There are also more sophisticated approaches: the area in which we do our work has progressed a lot beyond word embeddings, especially driven by Google and Facebook, to models where the same word can get different embeddings based on being used in different contexts. For instance, the simpler version of word embeddings that we've used can't distinguish between a bank as in a financial institution and the bank of a river. But the benefit we get is that we can run this approach on a laptop: Jenny has been crunching the data for the entire EThOS dataset on just a modern laptop, whereas we would need something a lot more powerful to get those more subtle embeddings. So the answer is yes, it can be done.
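Picking up Jenny's point about using distances between documents to constrain what a search returns, a minimal sketch of that idea in R might rank documents by cosine similarity to a query document; doc_vecs here stands for the matrix of averaged document embeddings sketched earlier, and is an assumption rather than anything from the published lesson.

```r
# Cosine similarity between two document vectors: values near 1 mean the two
# documents use language in very similar contexts
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# Similarity of every document to a chosen query document
query <- doc_vecs[1, ]
sims  <- apply(doc_vecs, 1, cosine_sim, b = query)

# Return only the ten closest matches, rather than deluging the user
head(order(sims, decreasing = TRUE), 10)
```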
Can I follow that with a question of my own, really converting a point made in the chat into a comment slash question, about what we have been calling, in quotes, 'misclassification'. You don't for a moment, I think, give the word the pejorative sense it might carry; could you expand a little on that? Is the interest here actually, to an extent, in the 'misclassification'?

Yes, I think there are a couple of things that are really interesting about this. One, and I believe it was Andrew's point, is of course that an archivist is taking in contextual information that sits outside the text. For instance, a PhD coming from Imperial is very unlikely to be classified as social science, even if it is, say, a thesis on physics in education, simply because the context is Imperial as a science institution rather than a social-science one, speaking very generically. Then, expanding on what Jenny was saying, there is the idea that you could use it as a prompt, to say 'this document looks like this', and also as a way of approaching interdisciplinary texts. Where a document's expert classification and its textual classification differ, those are the ones that may deserve more attention, or that might point, for instance, to an emerging area of research. Those points of real interest are the ones that do not sit neatly within disciplinary boundaries; they misbehave and are therefore 'misclassified', but, as I said in the chat, I mean that in a technical sense, not as a claim that there is one fundamentally true, correct classification for a document.

Thank you very much. Do please keep adding questions as they occur to you, and the panel can wave their hands if they want to jump in. But John, could this be made even more, quote unquote, accurate? Is there more refinement you can put into this, or are you really working with third-party technologies and applications that limit you, so that you get what you get, or is there another step? Could it be improved, if that is the right word?

Well, I'd like to let Jenny speak to the technical improvements, of which there are many. Beyond that, one of the things we would like to see happen, and one of the reasons the EThOS team at the British Library are interested in this in our discussions with them, is whether we can start to close the loop. They have, for instance, lots of documents with no classification information at all: can we feed back at least a high-level classification that could help improve the structure and retrieval of those documents? So there are ways, even now, to begin to feed that back.
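A purely hypothetical sketch of the kind of comparison John describes: putting an expert or catalogue classification next to a text-derived one and treating disagreements as points of interest (possible interdisciplinary or emerging work) rather than as errors. The file name, the column names and the assumption that a predicted label already exists are all invented for illustration; this is not the authors' code.

```python
import pandas as pd

# Hypothetical input: one row per thesis, with the catalogue's expert label
# ('assigned_subject') and a label inferred from the text ('predicted_subject').
df = pd.read_csv("thesis_classifications.csv")

# Rows where the two labellings disagree are candidates for closer reading,
# not corrections: they may be interdisciplinary, or sit in an emerging area.
disagreements = df[df["assigned_subject"] != df["predicted_subject"]]

print(f"{len(disagreements)} of {len(df)} theses sit across the two labellings")
print(disagreements[["title", "assigned_subject", "predicted_subject"]].head(20))
```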
Everything that we have done is open source and in Python, and eventually, obviously, it will all be on the Programming Historian site with the tutorial; even the full workings are in Python. But in terms of technical improvements I'd like to turn that back to Jenny and let her speak to the more specific things that can be done.

Yes; in terms of the word embeddings, they have gone from very simple things, where you have a huge matrix of vocabulary and you simply record that this word is in this document and that word is in that document, to contextual word embeddings, which we have talked about, where the context of words is calculated within the corpus that you feed into the program, so if you only choose a subset rather than everything in the world, you are only getting the context within that subset. Now there are the BERT embeddings in particular, which are the latest, and they are very complicated to code; it is not very straightforward. Most of this has been done with Python libraries that we have fought and battled to get to work: the material we have presented to you is the part that works really well, but there is other, more complicated material that takes a long time just to get running, and it is all changing; every year something new comes out that is better than the last. So it really is evolving all the time, and trying to keep up with the technology is as big a challenge as anything else. We also did quite a bit of work on phrases, trying to capture new phrases, old phrases, any phrases really, across time; that is another area people are moving in on, but again it is a case of keeping up with the technology all the time to make it more and more useful.

Thank you very much indeed, Jenny.
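A minimal sketch of the contextual embeddings Jenny refers to, using the transformers library with a generic pretrained BERT model (an illustrative choice, not necessarily what the authors used): the same word receives a different vector in each sentence, which is what allows such models to separate the two senses of 'bank' mentioned earlier. It glosses over the subword-token and pooling details that make this fiddly in practice, and it needs rather more compute than the static embeddings above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    # Return the contextual vector for the first occurrence of `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    position = tokens.index(word)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, position]

river = vector_for("they walked along the bank of the river", "bank")
money = vector_for("she paid the cheque into the bank", "bank")

# The two 'bank' vectors are related but clearly not identical.
print(torch.cosine_similarity(river, money, dim=0).item())
```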
I can see James Baker waving at me that I should introduce him, but before I do: there is still a question in the chat from Isabel Holowatty. I think John picked up on some of that, but Isabel, if you want to refine and rephrase it and post it again, if there is something more specific you would like to dwell on, that would be great. James, who we're going to hear from in a minute, is part of the core Programming Historian team. James, fire away.

Yes, I'll probably introduce myself properly when I come in in a minute. I just want to move quickly from the conversation about simplicity to some of the things Chahan talked about. Chahan, I recall that in your application, when you submitted to our call for papers, you talked about things like minimal computing, and you have talked a bit today about the value of small. I wondered if you could reflect a little on what you think, methodologically, is the value of starting small, with simple AI processes that can be run on local laptops, particularly when you are teaching an approach or even just applying it in research, as opposed to going straight for the brute-force approaches that are more common in some areas of computing. I think your article is a really great exemplification of some of those values.

Sorry, was that for me? It was for Chahan. Oh, sorry, I misheard; I apologise, because I agree with you, though I did not catch everything. So, concerning the amount of data and working with only a few examples: yes, it will be important in the future to work with small amounts of data, and not only for OCR and HTR; it is really the main problem of deep learning. You bring a huge amount of data and you hope that everything will work with that huge amount of data; if it does not work, you have to bring in still more data, and at the end you do not know whether the neural network has learned something or whether you have simply provided so many examples that it is learning by heart, not learning at all, just reproducing exactly what it has seen in training. So this is one of the key steps for the future of deep learning, and I completely agree with it. In my opinion, for HTR and OCR in particular, one of the key ways to overcome this is to work with specialised, dedicated architectures for specific languages. It makes no sense to work at the character level, for instance, when you are dealing with a very cursive script or with Arabic documents, because there is in effect no separate character to recognise; working at the word level or the sentence level is more accurate, because that is roughly what a human does when reading Arabic. So yes, creating more customised architectures will be the key step for reducing the amount of data needed in the future. Another example: if you are working with Chinese you have something like 30,000 different characters to recognise, and if you rely on a massive data-driven approach you would have to provide hundreds or thousands of samples of each character, which is simply not possible with such an alphabet. So yes, we have to think about other approaches.

Thank you very much, Chahan. I think we are getting close to the end of this session, and I don't see any more questions coming in on Zoom, in which case I will move straight on to our final presentation, given by James Baker, who you have heard from already. James is Director of Digital Humanities at the University of Southampton and also part of the core Programming Historian team, and this is an opportunity to reflect on the project as a whole from the Programming Historian end, as we have already heard from the JISC and TNA perspectives. James, you don't have any slides, I don't think, so the floor is yours.

Yes, I don't have any slides because I was really constructing what I want to talk about while hearing these wonderful papers, thinking about how they came to fruition, how they have gone through our processes and how, now that they are very close to publication, we can think about the implications of having work of this kind as part of the Programming Historian. So yes, I'm James; I have been on the Programming Historian team since about 2017 and I am a trustee of the Programming Historian charity. I'm in my late thirties, white skin, blonde hair gradually and inevitably receding, black-rimmed glasses, mostly smiling, with a blurred background, and I don't know the colour of my t-shirt because I'm rather colour-blind, or colour-deficient: maybe it's grey, maybe it's blue, I'm not really sure. It's lovely to be at DCDC again and lovely to see so many familiar faces in the chat. As a little framing for the reflections: this is the first project of this kind that the Programming Historian has been involved with since its founding in 2008, that is, one actively working with funders and supporters, and this particular project
on developing computational skills for work with large-scale collections started in August with a call for papers; we made some selections in the autumn, the submissions came in around January, and all of these articles are now in the pipeline to publication. As many of you will know, and as you will have seen today, the 'historian' in Programming Historian is rather broad, as are our audiences: we have about 1.5 million readers per year, from people in Indian cities known for their tech sectors, who often look at the Python tutorials, to GLAM-sector colleagues who are running peer-learning workshops or code cafes, to rapidly expanding communities of readers in Portuguese- and Spanish-speaking South America as we have diversified into those languages, which I'll come to. Based on that tradition, and on who we are as a project, I have three main reflections on how this project has made us think about what we do, about our model of being the Programming Historian and how we work, and about how that might shift or change.

The first relates to Isabel's question on sustainability. As a publication that has been around since 2008, we clearly have a challenge with articles on large datasets, because they often need to interact with data services rather than with datasets. We have been burned on this before, by an article that relied on an endpoint from a large institution we thought we could trust; that endpoint went offline, and the sustainability thresholds we have built since then, in response, mean we choose to be careful about both the data and the kind of code used in the articles proposed to us. I don't mean sustainability in a climate-crisis sense here, although we do care about that as well, and our static build is low in resource intensity, both for us in creating it and for the users of our articles, which I think is a really important part of our architecture. So we had to tread really carefully in this project, and at the call-for-papers stage we selected articles that use services we hope we can trust, like EThOS, which we have seen today, or Arquivo.pt, the Portuguese-language web archive. It has been really important to me that we have managed to retain some of that sustainability in how we work.

The second main reflection relates to something Chahan said: that we should not become a code book that is simply attached to computation, and that we should not drift towards big services like Google Colab. We believe there is a real role for us in a space increasingly dominated by ephemeral notebooks attached to cloud computing. Before this particular project landed, my favourite article exemplifying this was Zoë Wilkinson Saldaña's article from 2018 on sentiment analysis for exploratory data analysis, which you can find if you go to our lesson archive and type in 'Saldaña' or 'sentiment' or something like that. You can easily run the code in that full article-length piece in five minutes, but the real value of the article is in the analysis and discussion of the outputs, which prompts the reader to reflect on and consider what is actually happening during the course of the sentiment analysis. I wondered, before we kicked off this project, whether the articles we received would drift us more towards a code-book model, with thin description
holding it together, rather than keeping a slightly tighter focus on methods. But the authors have really demonstrated that the analysis of large data can take the form of tutorial-length articles that entangle reflexive methodology with getting things done. If you are someone who comes to us just wanting to get things done, you can skip through, cut out the bits of code and get things moving; but if you want to think about the underlying technologies, how they interact with the data these articles use, and the methods used by the author, that is all there, and I think it remains very valuable. Broadly speaking, as you have seen today, we still have an ethos of people submitting to us things they have actually done, as opposed to things they want to show other people how to do, and that remains, for me, a really valuable part of what we do. I know these articles will continue that tradition, and it is interesting to me that the project has not drifted us away from it.

Finally, the third of my main reflections is that the project has underscored, by virtue of trying to produce these articles simultaneously (we ran a call for papers and our authors almost all submitted at roughly the same time, in January this year), the extent to which publication services that include peer review, that do copy editing and that, as we do, have translation work embedded in them take significant labour and time. We often forget that in open access; we can forget that the things being produced are significant investments. I say that because we have been publishing on this model for a long while, and bringing in translations for the last four or five years, but having lots of articles arrive at the same time has really exposed, in a way you do not see when articles come in on a rolling basis, the effort and the real time it takes to produce quality work. This is especially true because our peer-review process focuses on article sustainability, as I mentioned, which brings a lot of different elements into the review, and because it focuses on writing for global audiences, in ways that support the reader in, say, India, who may not understand a particular North American colloquialism, and that also enable translation. We are a multilingual publisher, and in order to enable translation we encourage our authors to write in a certain way and then edit the articles through peer review towards a more global writing style. Two of the articles in this series are also being published for the first time in a language other than English, so we are doing this multilingually, not just starting with English but starting with all the languages and trying to interlace them with each other; we currently publish in English, French, Spanish and Portuguese, and that is one of our challenges. Indeed, one of the articles has already been significantly localised during its translation, with slight revisions to the way the datasets work and slight
revisions to the way the case studies work. I think this model is really powerful for us, because localisation can be incredibly important for communities to really get to grips with a particular article; but it is also to say that the investment and the time it takes are very different from just publishing a code book in two minutes.

Before I hand back to Peter for the final words of closure, I just want to thank our funders, JISC and TNA, once again, and Joe and Paola in particular for their enthusiasm towards the Programming Historian; Peter Webster for his leadership; Tiago, who is also on the call today, for holding everything together during the course of the project; the authors for their hard work; Peter Findlay from JISC, who I think is lurking on the call as well, for sorting out the project set-up phase; and all our institutional partners. We have over 30 paying institutional partners from across the world, and if you are a library person you can now even buy your subscription via the JISC subscription manager, which is great; those supporters enable our work to happen. Thanks also to all our trustees, the team and the staff at the Programming Historian for intervening as and when needed during the course of this project, because a complex project like this, even when you resource it and allocate particular time to individuals, always spills a little around the sides and needs the wider team to be involved. So thank you to everyone for such a wonderful project, which has produced, I think, some really wonderful articles forming this special series.

Thank you very much indeed, James. May I second your thanks to that long list of people who have made this project what it is, and may I thank everyone for joining us today on this call; I hope it has been useful and enlightening. My thanks also to the DCDC team, who have made sure that everything has run so unobtrusively smoothly today. Do please look out for the new articles as they become available on programminghistorian.org in the next few weeks, and do follow the Twitter account with the handle @proghist, that is p r o g h i s t. It remains only for me to thank, finally, all of our speakers for their admirable timekeeping, their clarity and such interesting presentations. Thank you very much indeed.