I'm going to be introducing the first two speakers, who are going to tell us a little bit about handwritten text recognition. So we're going to have Jacqueline Somburg and Caroline Picoski from the McGill Library. Jacqueline is the outreach librarian at ROAAr (Rare and Special Collections, Osler, Art and Archives), and Caroline is the metadata and electronic resources librarian. They will talk about how they have applied handwritten text recognition to a couple of McGill's archival collections using the Cortex platform: the kind of practical applications they've encountered with the technology, the challenges, and best practice in using this technology to access archival collections. So I'm going to now hand over to Jacqueline and Caroline, and I believe we have a video from them.

So good morning or afternoon, wherever you're joining from today. Jacqueline and I are delighted to be here with you to share our work on enhancing the discoverability and accessibility of handwritten text by applying handwritten text recognition, or HTR, to McGill University Library collections. This presentation will describe our process for applying HTR to two of our archival collections using the Cortex platform. So our next slide now is going to show that, first and foremost, the fundamental goal that has driven our work has been to make our collections more accessible. We're keeping all of our stakeholders and users in mind, from faculty and students to researchers both at and external to McGill, as well as members of the public. The value of these unique collections is tied to their use and discoverability, so ensuring that these collections are as findable and accessible as possible has been the why behind everything that we've done. And this is why we were committed to trying to apply HTR to our collections as one additional access point to these materials, along with cataloging metadata and the digitized images of the materials themselves.
So with that in mind, I will now turn things over to Jacqueline to speak about the context and the specific collections we've been working with.

So thanks, Caroline. Here at McGill, we do have four rare collections under the umbrella of ROAAr: Rare and Special Collections, the Osler Library of the History of Medicine, the Visual Arts Collection and the McGill University Archives. The library's been collecting materials for a long time. We started in the 1850s, and now, through purchases and donations, the holdings are rich research collections, including a great many manuscript materials and over 250,000 items unique in the world. So we invite the curious everywhere to discover our collections, but we have found that handwritten materials do present unique challenges and unique barriers to access. To this point, handwritten materials have really only been searchable by human eyes, requiring time-consuming transcription and double checks, or crowdsourcing and double checks, to generate searchable transcripts. So for this pilot, we were excited about the potential of AI-driven handwritten text recognition, and we chose to explore the options because handwriting really can be a major barrier. Our librarians have had experience working with students in the classroom setting, and it's been eye-opening to see that even history students, who receive the most training in working with primary materials, can be put off by the time and effort needed to make effective use of handwritten documents. So we're excited about the potential of HTR to enable new search techniques. So we chose two collections. The first one is actually this one, the Doncaster Collection: 12 manuscript recipe books and over 1,300 culinary and medical manuscript recipes. It dates from the 1790s through the 1840s, and mostly from the Doncaster area of South Yorkshire, centered on Hooton Pagnell Hall. Why this one? Well, it's entirely manuscript and it's fairly small in size.
So the scope was appropriate, as well as the content. It was also the subject of an active research interest at the time, and as such, it was a good test case for transcription functionality, because that project was generating transcripts of its own that we could use as a comparison for the AI-generated transcripts. So we ran with this, and we published it in April of 2020, at a time, unfortunately, when all of us were focused on the growing COVID-19 pandemic. The second collection that we worked with is a collection of fur trade materials here at McGill, and this collection includes records that document the finance, accounting and administration of the North West Company. And the North West Company was active from the 1770s through about 1821, at which point it merged with the Hudson's Bay Company, which has more name cachet internationally, I believe, and is more recognizable worldwide. In many ways, the histories of those two companies are the history of European colonization in the landmass referred to as North America. Now users can take a detailed look at the impact of the fur trade across the country through the lens of the documentary evidence: account books, ledgers, estate documents and correspondence. You can see James McGill's signature here. Although headquartered in Montreal, the North West Company extended its reach as far as the Pacific Ocean and the Arctic Ocean. In these records as well, Indigenous knowledge and Indigenous peoples were really a principal part of that fur trade, and they were also, or sorry, they are also therefore present in these records, albeit indirectly. You can see on screen here that some different Indigenous groups are named in these documents. They are named by exonyms, to be sure, but they are noted in the account books, and the material also attests to the critical role of Indigenous peoples in the fur trade.
They were the primary providers of furs, as well as the crucial sources of knowledge for the European fur traders as they established their networks in North America. Letters of several of the company partners expressed the complicated nature of the relationship between the peoples. I want to take a moment to acknowledge that McGill University is located on land that has long served as a place of meeting and exchange amongst peoples, including the Haudenosaunee and Anishinaabeg Nations. ROAAr honors, respects and recognizes these nations as the traditional stewards of the lands and waters on which McGill University stands today. Here we meet to study and exchange ideas, but the legacy of those Indigenous peoples is here in this place as well as in this collection, and records of their knowledge and influence are listed here on screen in tidy black and white. As the archivist Tom Nesmith writes, Indigenous peoples provided technological, agricultural, military, cartographic, economic, medicinal, weather and wildlife information. They are named, described and extensively quoted, sometimes in their own languages, in the records. They were sketched and photographed and filmed, and the archives of their knowledge helped to create the archives of the Europeans that they encountered. This collection in particular is one of those in which we see the influence of Indigenous peoples. It also represents commercial, political and social interactions between Anglophones and Francophones, and you can see that in some of the employee contracts, which you see here in French: a printed form contract, annotated by hand. So this was actually one of the challenges for HTR, but with that context, those are the two collections that we chose. I'll pass it back to Caroline for a dive into metadata. Thanks, Jacqueline.
So our goal was to take the Doncaster and Fur Trade collections and import them into the Cortex platform, and I'm going to take some time now to review the metadata considerations that were involved in the import and how we tried to make decisions that would allow the metadata to function as additional access points to the materials once they were within Cortex. A good starting place, I think, is to ask: where were these collections to begin with? And if we go to the next slide, you can see a representation that the collections do exist physically, of course, at McGill. In terms of this work, I was mostly concerned with how they're represented online in our archival catalog, which is built with AtoM. On this next slide, we're going to take a closer look at the James McGill fonds, which is part of our Fur Trade collection, as Jacqueline has described. And as you can see, there was already a lot of detailed metadata in that online catalog associated with the fonds and the individual items in these collections. And luckily this meant that the groundwork was really there; there wasn't a lot that I had to do to prepare this metadata for ingestion into Cortex. Now we're going to take a look at where things ended up and where all this metadata is now. You're going to see a snapshot of the mapping exercise that we did, which was primarily set up by my colleague, Megan Chellew, and which was to help us get ready to perform the import into Cortex. So at this point, we were trying to pick and choose the most important or most used fields from AtoM to carry over into Cortex, keeping in mind the user's experience and looking for ways to enhance their options for discovering the content via whatever different pathway they chose. You'll see the final product of that mapping exercise, which was the final field mapping that was built up in Cortex, and it includes all the fields that we opted to include on each item's display in Cortex.
We had two different field mappings, one for Doncaster and a separate one for Fur Trade. And while building the field mapping, we also spent time deciding which fields should be controlled vocabularies and which would be text fields. The controlled vocabularies in Cortex allowed us to do things like create fonds-level landing pages, and allow for greater browsing opportunities overall for users. And later on, you're going to see the controlled vocabularies; they show up as little grayed-out terms that are affiliated with each item in Cortex. If we go to the next slide, you'll see a representation of the work that I was doing to prepare some of the exported metadata. So essentially, I worked with metadata exported out of the archival catalog, and then I edited that using a combination of Excel, Python, and, as you see here, OpenRefine. And if we go to the next slide, you'll see an example of an Excel file that's ready for upload directly into Cortex. I dealt with just one fonds at a time in my case, so I only ever did about a maximum of 50 items at a time, but it would be possible to do more. And I made some changes after this initial file upload within Cortex, because that's an option using batch editing and some of their other features. So I would say, as a final note on metadata, the workflow we have is fairly well established now for future metadata transfers of this sort, and we would have a standard in mind for doing transfers, at least into Cortex. Tools like Excel and OpenRefine, as well as Python, and the batch editing that's available in Cortex, all ensure that this metadata is easy to work with and rework as needed. So at this point, we're going to actually step out into the Cortex site and take a bit of a site tour through our online platform and the online exhibits that we have. So we're starting on the home page of the McGill Cortex site.
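As a rough illustration of the cleanup-and-batching step described above, here is a minimal Python sketch. The field name "Title" and the batch size of 50 are illustrative assumptions, not the actual AtoM export schema or Cortex upload format; real workflows at McGill used a mix of Excel, OpenRefine and Python.

```python
def prepare_batches(rows, batch_size=50):
    """Clean exported metadata rows (e.g. as read by csv.DictReader)
    and group them into upload-ready batches.

    NOTE: "Title" is a hypothetical field name used for illustration.
    """
    cleaned = []
    for row in rows:
        # Trim stray whitespace that creeps into exported text fields
        row = {k: (v.strip() if isinstance(v, str) else v)
               for k, v in row.items()}
        # Skip rows with no title -- they cannot be displayed as items
        if row.get("Title"):
            cleaned.append(row)
    # Group into batches of at most `batch_size` rows each,
    # matching the "about 50 items at a time" workflow described
    return [cleaned[i:i + batch_size]
            for i in range(0, len(cleaned), batch_size)]
```

Each returned batch could then be written out as its own spreadsheet for upload, with any remaining corrections handled afterwards via Cortex's batch editing.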
Cortex simplifies and automates our workflows for transcription, but we also use their tools to build exhibitions to accompany this material and to make the material available online. So the search box that you see on the home page will search across all of our collections. But as we've mentioned, we started with one collection in particular, which was Doncaster, so we'll take a look now at the Doncaster-specific landing page. Now that we have two collections in our Cortex system, we're making use of things called list pages, which is a website template that will build pages based on your controlled vocabularies or collections. This function makes generating pages simple, and it still allows you to customize the search options and appearance. And now we'll actually step over into our second collection page, which is the Fur Trade landing page. And it's possible for users to also do more of a browse, using the find function, if they would prefer to do that. So we're going to take a look within Fur Trade at how you could browse through our different fonds pages. And we're going to jump directly into our James McGill fonds, which is the one we've been discussing so far. And finally, we also want to take a look at our Fur Trade exhibition site, just so you can see what that looks like, to bring these collections to life all the more. So if you click into the exhibition site, we're going to jump over to the second slide just so that you can see an example of how you can toggle to the full screen to get a better look at an image. And you'll see that there's also a nice image caption associated with the image, so that you can get a bit of a description about it. The interface does let you embed other external links. So what we did here was embed a contemporary map showing the location that this fire insurance plan corresponds to today, which is quite interesting: if you want to walk down to the Old Port, you can see where these buildings stood.
One of the other functionalities of the exhibition is that it is powered by and linked to all of your assets in Cortex. So if you click on this related page, it takes you out to the actual contract of Jacques Ratel of L'Assomption, a voyageur, who signed his name there, as I zoom in and out a little bit too quickly. His contract is there, marked X at the bottom, but that's what the interface looks like when you look at an individual item. You have the summary here. You have the metadata listed below, with those controlled vocabularies that Caroline pointed out in her comments. You can click through any of those and it'll take you into the list of related assets. And then the other thing that is important is, of course, you can look at the transcript right here. And the other difficulty that our collection posed was, of course, that it's multilingual, in French and English. And it is manuscript, but also tabular manuscript data. So I worked with Cortex to select samples for testing their HTR accuracy. I'm going to flip back to our slide deck now. And what we did was try to hit all of the major categories of material, and to have a clean and a messy example of each category: clean correspondence, a clean ledger, a clean map, a clean daybook, etc. And then what we did was create a clean and corrected transcript based on the transcript generated by the AI. Testing showed us that the generated transcript online, as the viewer sees it here, doesn't actually show the line breaks that exist if you download the transcript. So that's one mismatch between the viewer experience and the downloaded transcript. And we found that the accuracy of detecting column breaks was actually quite high, and character recognition is very high. But with the tabular format, the breakdown is in generating a transcript that makes sense in order, because the line breaks do tend to really confuse the AI.
Another thing on this one that causes issues for the AI recognition of the handwritten scripts is actually the page curve at the binding. So we newly discovered the importance of very flat and very clear digitized images. We used the historic handwriting scripts, and the HTR provided by Cortex does have different scripts that are better for different time periods. This is the corrected transcript in process that we provided to Cortex so that they could test the accuracy character by character, word by word, and line by line. And they did a lot of work and have all of the documented accuracy ratings for the test samples that we gave. This is an example of correspondence in a mediocre personal hand, which actually belongs to James McGill, and the generated transcript is there. The accuracy ratings were variable depending on the legibility of the script, but overall impressive. And it opens up keyword search across the entirety of the collection. The one thing that breaks down, of course, is tabular data, especially with non-text characters: things like ampersands or apostrophes really did confuse and break down the legibility of the generated transcript. So for us, the project showed us that the hard work put in to make our archival collections available, even through AtoM, is well worth it, because that prepared the way for transcription workflows in this new software, in Cortex. We've gained knowledge of the strengths and weaknesses of this particular AI-driven HTR throughout the pilot, and it gives us a new lens to look at our collections. We know what type of material returns the greatest character and word recognition, and we know also the inverse: what materials are better suited to manual and human transcription at this point. I'll close now with some words from my colleague Ellis Ng, who curated the fur trade exhibition along with us.
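Character-by-character and word-by-word accuracy testing of this kind is conventionally reported as a character or word error rate: the edit distance between the generated transcript and the corrected one, divided by the length of the corrected text. A minimal Python sketch of that calculation (illustrative only, not the actual tooling Cortex used):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def error_rate(reference, hypothesis, by_words=False):
    """Character error rate by default; word error rate if by_words=True."""
    ref = reference.split() if by_words else list(reference)
    hyp = hypothesis.split() if by_words else list(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

A lower rate is better; computing it per line as well as per document is what surfaces problems like the tabular line-break confusion described above.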
As librarians and archivists, we can't possibly know all the potential uses of the material we care for, which is another reason why HTR and other developments that can help overcome barriers to research are so valuable. We want to empower people to make use of the material in whatever way they want, rather than dictate the ways in which it can be used. Thanks for listening. Thanks.

Thank you very much, Jacqueline and Caroline. Really, really interesting presentation. I'm sure we'll come back to some of the issues about transcriptions and the challenges; some of this actually came up in this morning's session with the Programming Historian and some of these approaches. Our next speaker is Jenny Bunn from the National Archives, and Jenny is Head of Archives Research. She has over 25 years' experience as an archival practitioner, educator and researcher, and has been interested in the intersection and interplay between archives and technology throughout her career. She's going to talk to us about AI, the latest craze, and what it might mean for the GLAM sector. So thank you, Jenny, and I'm going to hand over to you.

Thank you, Paola. The title of this presentation was in... Oh, I must introduce myself. Sorry. I am a white woman. I'm quite short. I have brown hair; it's tied back. I'm wearing a blue and white pattern top, which is the same top I wore on Tuesday, if anybody was here on Tuesday. So that's me. And the title of this presentation was inspired by an article by Terry Eastwood, which was called 'Nailing a Little Jelly to the Wall of Archival Studies'. And in that article, Terry Eastwood set out to dismantle an edifice of ideas to see of what it is really made and whether it will stand up to close scrutiny. And in this presentation, the edifice of ideas under scrutiny is that of artificial intelligence, because in my opinion, this is something that badly needs pinning down. These days, it is all AI this and AI that.
And I, for one, am increasingly put in mind of the emperor's new clothes. So what follows is a personal reflection on the way I currently make sense of AI in the GLAM sector. And I will be asserting a kind of definite view, but I do so not because I want to persuade you to the same view, only because I want to encourage you to take steps to form one of your own. When I'm asked what AI is and asked to define it, I choose to do so in the terms on this slide. And as you can see, those terms are informed by looking back into the history of AI and to a summer workshop held at Dartmouth College in New Hampshire, which is commonly regarded (and by commonly regarded, I mean, i.e., by Wikipedia) as the founding event of artificial intelligence as an academic field. Now, the exploration that started at that workshop is still very much ongoing, and this hypothesis is still far from proven one way or another. But what is certain, however, is that, as is the case with quite a lot of academic exercises, the knowledge and techniques developed as a result have found fruitful and profitable application in any number of real-world contexts. And in this respect, another quote I particularly like is from Ralph Cornis, who wrote in the pages of the Records Management Journal way back in 1989 that artificial intelligence, 'which many saw as the wave of the future, will arrive by osmosis. Other branches of IT will steal its clothes. It is already starting to happen.' Osmosis can be defined as the process of gradual or unconscious assimilation of ideas or knowledge. And it is this unconsciousness that I see as the problem: given what we are starting to realize about the potential applications of AI, to remain unconscious of that assimilation seems to me to be asking for trouble. One of the first new areas to spin off from the quest for AI was that of robotics.
Now, this field has achieved considerable success in proving the hypothesis that one feature of intelligence that can be described in such detail that a machine can simulate it is the ability to follow a clearly set-out procedure, that is, a program. Indeed, machines can do so quite literally, such that they are able to do so arguably more slavishly and consistently than any human ever would be able to. And because they can simulate this feature, they can and do perform routine and repetitive manual tasks, albeit under very carefully controlled conditions, because they are still not good at simulating that feature of intelligence that allows us to react to something unexpected happening, like a bike suddenly veering in front of our self-driving car. More recently, another spin-off from the exploration of artificial intelligence has been the field of machine learning. And the feature of intelligence that machine learning simulates is our ability to process data to iterate models that allow us to make judgments or predictions on the basis of similar data in the future. So to some extent, then, this could be seen as learning by experience, but in the case of machine learning, that experience is exceedingly limited in range. But even so, the machine simulation of this feature has advanced to the extent that it already mimics the almost unconscious internalization of the arrived-at model, which leaves many experts unable to explain fully why and how they have come to those conclusions or arrived at those insights. Then again, the simulation also mimics the well-known problems of reaching any conclusion on the basis of data that is inaccurate, incomplete, and/or unrepresentative in some important respect. Training and refining these machine mental models takes a lot of time, a lot of energy, and a lot of data.
But models have now been created that do allow machines to simulate perhaps more advanced features of intelligence, such as the ability to parse natural language or to recognize an object appearing in an image. So, as the phrase goes, fake it until you can make it. But while the machines may now be able to fake it, actually what we make of it is still up to us. Well, what then should we make of it, I guess, is the next question. And to start to answer this question, I feel it's instructive to think about what others are making of it. In particular, for the big players in technology, AI is another service that can be sold to us. As to why we might want to avail ourselves of this service, down the right you can see a list of the sorts of things we are being told we can achieve if we choose to employ the power of these new services. So in the GLAM sector, we may be less interested in fraud detection or combating financial crime, but achieving the last two in bold, I would say, falls very much within our sights. Knowledge mining, according to the sales pitch, involves uncovering latent insights. This is absolutely one of the things we want to facilitate. As GLAM organizations, we know that our collections are crammed full of latent insights; that's partly why we put so much effort into preserving them, such that this latency can be realized. Document process automation, on the other hand, involves turning documents into usable data at a fraction of the time and cost by automating information extraction. Particularly, perhaps, in the more archival parts of the GLAM domain, we have more documents than we can possibly process using manual methods, but it's also generally true, I think, that the holdings of GLAM organizations quite often are not usable data in that digital, data-science way of things. So that's the big technology players. What about GLAMs?
So here we have the results of a project which show that knowledge extraction and metadata quality are two of the main reasons why GLAMs are interested in AI, and they're also, of course, the areas where its application has been found to be most useful. Now these two are of course related, because knowledge extraction or mining, whether it is AI-assisted or not, is always ultimately dependent on data quality. And this connection may also explain why the potential of AI is seen perhaps most as pertaining to the application areas of collections management and discovery and search. So this is where GLAM is currently focusing its attention in relation to AI, with possible applications such as audience analysis and machine translation still of interest, but not as much. So, nailing a little jelly to the wall then: let us narrow our focus accordingly and look at what we are currently making of AI in respect of, one, turning our collections into usable data, and two, uncovering latent insights. First then, turning our collections into usable data. One thing that all GLAM institutions are very aware of is that spinning straw into gold is quite a resource-intensive enterprise. The added value of our material curation and care ensures and allows others to realise all sorts of values from that material, be that cultural, aesthetic, emotional, informational or evidential. Ensuring and allowing this outcome has always involved material preservation, along with the collection and curation of lots of additional intelligence, information or metadata that allows the relevant material to be found, understood and sourced when required. With the arrival of new digital ways of processing, however, to be considered usable, all this material and information is now expected to be accessible in a new form, which is amenable to that type of processing. We are entering the world of, so to speak, collections as data.
So spinning collections into data is a sort of constant background task and reality for GLAM institutions, and we are starting to make something of AI in this regard. For example, it is through the application of machine learning that projects such as Transkribus have allowed for leaps forward in handwritten text recognition, such as we've just been hearing about. This handwritten text recognition allows for the spinning of such text into a digital form. Once so spun, this text in a digital form becomes amenable to another application of AI, that of natural language processing. One project to try to make something of this was the Cybernetics Thought Collective project at the University of Illinois. Working on the digitised papers of leading cyberneticians, they used these techniques to automatically generate metadata such as sentiment scores, and you'll see some of these in a minute. I've introduced this metaphor of spinning straw into gold because it also allows me to talk in terms of a gold standard. All GLAM institutions want their collections to be both usable and used, and yet the gold standard of what is needed resource-wise to achieve the levels of usability that are needed to attract the levels of use that are needed to justify the level of resource expended in the first place just keeps expanding, and I often find myself asking: where will it end? So this, then, is me on a bad day. I want to be helpful. I want to make my collection usable, but at what point does this become too big an ask, or too impossible an ask? At what point do all those who are gaining the use of and value from all of this usability I'm generating need to sit down together and renegotiate where the work of and resource for usable ends and that of and for use begins?
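Sentiment scores of the kind just mentioned are, in their simplest form, produced by counting a document's words against lists of positive and negative terms. A toy Python sketch of the idea follows; the tiny word lists are illustrative assumptions, and this is not a description of the Cybernetics Thought Collective project's actual pipeline.

```python
# Toy lexicon-based sentiment scoring, just to illustrate the kind of
# metadata such projects generate. The two word lists below are
# illustrative; real pipelines use full lexicons or trained models.
POSITIVE = {"success", "agree", "useful", "good", "progress"}
NEGATIVE = {"failure", "problem", "error", "bad", "dispute"}

def sentiment_score(text):
    """Score in [-1, 1]: share of polarity words that are positive
    minus the share that are negative; 0.0 when none are found."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

Even a crude score like this makes the uncertainty discussed later tangible: the number is only as trustworthy as the lexicon and the transcription behind it.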
On a good day, though, I do recognise that this renegotiation is already happening. It's happening, for example, in the way that academic funders and institutions are starting to acknowledge the need to cost in the work of so-called research technicians and, by maintaining their own research data repositories, to take on some of the cost of ensuring that value once gained is not lost again but remains usable and available for reuse. So, returning to the original question of what we are making of AI in respect of turning our collections into usable data, I think it is clear that we are making some progress with applying it to accomplish specific tasks in service of that goal, such as handwritten text recognition, but that in others, such as perhaps the generation of metadata, our efforts remain at a much more experimental stage. We turn then to what we are making of AI in respect of uncovering latent insights. As I mentioned previously, uncovering latent insights is to some extent what the GLAM sector is all about, certainly perhaps in terms of why it is seen to have public value. So what then are we making of AI in this respect? One answer to that question has already been hinted at: that we are only doing this indirectly, that what we in GLAM are making of AI are the means to automate the turning of our collections into data that is usable digitally, because we anticipate that this will lead in turn to the uncovering of latent insights. And numerous digital humanities projects, such as the one featured on this slide, would seem to suggest that our anticipation is justified. But perhaps a better question to be asking is what latent insights could AI help GLAM to uncover that are of more direct use to its practice, rather than to that of research. My answer to this is that, in its simulation of internal mental models on which we base future judgments or inferences, and its blunt surfacing of the uncertainty, or if you prefer the subjectivity, of
that process in statistical and numerical terms, certain applications of AI force us to confront the fact that the same is true whether these models are held by a machine or a human being. We've long been aware of the biases, uncertainty and absences within our collections, but perhaps we might now be able to develop tools to help us become more specifically aware of them, to quantify and communicate them more effectively, to allow more informed use of our collections. Possibly, in raw informational terms, the machine-generated sentiment score presented here is not that useful, but its presence, and the questions it generates about the degree to which we can be confident in any judgments drawn from the material being described, is nonetheless, to my mind, still helpful. So, to conclude then, my view is that AI is an ongoing academic exercise to explore the hypothesis that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. In exploring this hypothesis, techniques and technologies have been developed that do simulate certain features of intelligence, which, once simulated, allow for the application and development of machines to perform tasks which could previously only be undertaken by human beings. And in the GLAM context, the main task we are currently looking for machines to perform is the automatic datafication of our collections, be this conversion into a digital form or, where material already exists in a digital form, the conversion of it into more structured and refined digital forms. And once in these forms, the material undoubtedly becomes more usable, that is to say, usable in more ways, e.g.
over greater distances or at greater scale. But whether automated or not, such conversions do not come cost- or risk-free, and while we may all want to play with the shiny new toys, we must not let this distract us from those costs and those risks. Thank you. Thank you very much, Jenny, very well-made points. I had a number of questions coming to mind about the expectations we have when engaging with the outputs of these technologies, but again, we'll leave that for later. So our next speaker is Stephanie Decker, who I think will be speaking on behalf of a number of colleagues on the projects that they have worked on. Stephanie is a professor of history and strategy at the University of Bristol and a visiting professor at the University of Gothenburg in Sweden, and she'll be talking about a piece of research that she and her team have carried out with funding from the AHRC on contextualizing email archives. And I have to say that some of the team members who put together the presentation, but who I don't think are speaking with Stephanie today, are Adam Nix from Birmingham Business School, David Kirsch from the University of Maryland, and Santhilata Kuppili Venkata from the National Archives, or, I believe, ex-National Archives. Anyway, Stephanie, over to you. Thank you, Paola. Just to say, I don't normally sound like this, I'm on my third day of COVID, so my apologies for that. Maybe starting with the visual description you asked for: I'm a white woman in my 40s, I've got brown, chin-length hair, you see me in my office with books in the background, and I wear a black jacket and a brown turtleneck. So, as Paola already said, I'm presenting here on behalf of our research team. Two of my team have joined us today, that is David Kirsch and Santhilata Kuppili Venkata. Adam unfortunately couldn't make it, because he would only have been able to join later.
So we've been working on a series of projects, the first one being Contextualizing Email Archives, together with the National Archives, and the second a project led by David in the US on the EMCODIST tool, which is the one we developed in the first project. So they're largely projects dealing with the same issue, the same problem, which is what we'll be talking about today. For DCDC we've called this "digitally curious: context beyond the search window". Really, what we're interested in is how people will actually interact with born-digital resources when they come to collections and seek to find material for research; it might also be, for example, other users in the GLAM context trying to understand what is in a digital resource. The project and its questions really come from our own problems as researchers, trying to figure out how we will make use of digital resources in the future. So our project takes a user perspective, which is somewhat different, perhaps, from some of the earlier presentations. For us, the question of how born-digital sources are going to be vital for future research draws on historical ideas, so looking backwards in time. That might not only be historians (we all work in a business and management school context); it might also be cultural studies, literature; lots of people want to go back to material in digital or non-digital collections. And our main problem at the moment is that a lot of these born-digital collections remain inaccessible. As a team we've written a paper on this, about dark archives and what will happen if some of these archives actually become accessible, but at the moment, for legal reasons, researchers have very little opportunity to develop the understanding of how they might want to research these resources, an understanding that some archivists and owners of collections may already have but may not be able to put in the public domain. That means researchers may not quite understand the ethical implications
but also the practical implications of how we would be using those resources, because most of us have only been trained in the use of physical and pre-digital sources. So, apart from people who work in a particular computing science or big data context, for many of the rest of us it is not at all clear how we will actually make use of digital sources, and people have a lot of preconceptions but not necessarily a lot of knowledge or opportunity to test them. So our research project tries to look beyond the current issues with preservation, which we understand are very significant; there are arguments about how born-digital records are shaking up traditional archival processes, simply because of their size and some of the privacy concerns with digital resources. Privacy in particular, for us here in the UK and Europe, with GDPR as a particular issue, means that a lot of material simply can't be made accessible, or can't quite be worked on in the way that maybe in the future we'll be able to. Lise Jaillant has made the point that addressing this problem of access might require collaboration between those that manage the collections and those that ultimately want to use them, and this is the spirit in which we've tried to address some of these questions in our work. Looking at access issues, existing work, for example from the Wellcome Trust, has argued that users want optionality, but they still expect some curation from archivists because they're still inexperienced in the use of these materials. That doesn't answer the question of how archivists will be able to provide that curation, given the size of some of these born-digital resources. So our key question, in both projects really, has been around how users will actually engage with born-digital material once these access issues have been navigated, because from our experience of working with these sources, the problems don't end the moment we actually get access to the digital resources, and at
the moment we don't know that, because so many of them are not yet accessible. So our first research project, Contextualizing Email Archives, was very much focused on exploring this gap between the current efforts to preserve and the means by which researchers might actually want to engage with these materials: what happens after they've been preserved. For this we collaborated with the National Archives; however, we actually used data that isn't in the National Archives, because these are emails of a failed US dot-com company that David Kirsch, our collaborator, had already worked with in a research context and made available for research via the Linguistic Data Consortium. "Linguistic data" already tells you this is large-scale, what's called non-consumptive use, so it's very much about modeling interactions, a large-scale data-analytic use. So we collaborated with the TNA, with a digital archives specialist, specifically Santhilata Kuppili Venkata, and our aim was really to look at how we can make email archives available for search and study while drawing on the relational, network properties of the format, which have been well researched, but also making use of them for people who might want to actually read the emails and know what people were saying to each other. We were particularly interested in emails because we think there are specific issues here: the network resource. You could say emails replaced correspondence; that's not really true, it's more like a telephone conversation, perhaps, because of the back and forth. So an individual email, outside the context of its thread, can be hard to interpret. Emails also come in the plural, and who's writing to whom, who's in the cc and who's in the bcc: it's not just content, it's not just information, it is actually also context for understanding the meaning of a conversation. So when we see non-historians or non-qualitative researchers look at emails, they often look at aspects such as
frequency, networks, timing and sequencing, language or content, each of them in isolation. We see all of these as context that we actually need in order to interpret a historical source. Now, we particularly look at organizational email, because we believe it is going to be a very useful historical source, and for that in particular you need both the individual and the network aspects. A couple of other assumptions we made: how we approach something is not how everybody else will approach it, so we need to accommodate really diverse research questions; users will probably work iteratively, with the sort of tacit and somewhat messy approach that many historical researchers have, where you go in with a general question and what you find determines how you rephrase the question; people will come with different levels of experience; and ideally you want to offer them relatively complete access to a whole corpus, so they understand what's not there as well as what's there. And we tried to think about how what we are proposing differs from how you might previously have accessed these sorts of resources. Previously you may have relied heavily on the archivist: you might have come in and said, okay, I want to know what's in here on e-business trends in the early millennium, which is definitely something the source we looked at covers. You might look at finding aids or a catalog structure, probably ask the archivist if they know the resource, read closely, then go back to targeted search, so you iterate backwards and forwards between a structure and what you find, and you begin to understand how the structure works for you. Now, in a digital resource, these structures are unlikely to be there, or not there in quite the same way. So say you come in with a search query: your classic way of looking at a text resource might be keyword search, so you search for "e-business", "trends", and the years, and that doesn't necessarily give you
the intersection of all of these themes. So the prototype we developed in the first project, which we call EMCODIST, is trying to be a search tool that closes this gap left by the simpler search approaches. The first option within that tool is phrase matching, which allows you to draw all these terms together, but there's also the problem of what happens if people didn't refer to it as "e-business trends" in that particular collection or at that particular time. So the EMCODIST+ version, which uses attention-based content encoding, tries to also bring up the synonyms and related terms that were used, which the tool should help us identify. So where you don't know what terms to look for, the tool is supposed to fill that gap, though this is something we're currently testing, to see how well it actually works. And in order for this to work better, we are trying to bring context in, to really think about what we know about a source and how we can use that to help such a tool work. For this we need some external sources of context in addition to the emails themselves: something like an organizational chart, if we're looking at a business or organizational context; known relationships between individuals (are they at the same managerial level, or are they the CEO and the PA?); or market or geographic events. As an outsider, I might not know that this or that happened in the life of the source and is likely to have generated a lot of emails that might be interesting, and I might not know the precise terms; how can we make sure a new user can find that? That's what we're trying to do with the EMCODIST tool. EMCODIST stands for "email contextualization discovery tool", and the idea is really that we can draw on different sources of context to help people find better results in emails; in our experience, the results you find when you search emails are not always that great, not always quite what you're
looking for as a researcher. In the second phase of our project (we're on to the second project at the moment), we're really interested in how researchers would actually use this tool, because it's all very nice that we designed it, but (a) does it do what we designed it to do, and (b) is that actually what other researchers want? So we are currently working on a web-based version of the tool for early August, to then have an opportunity to review user behavior. We're making it clear that we're logging the activity, to see how user-friendly the tool is and how well people can navigate it; we're requesting user evaluation through a short survey; and in particular we're interested in how users navigate the empty search box. What we're assuming is that there's a Google-ification of search: you ask Google anything, you put it in the box, and it throws out an answer, because it's got the totality of the internet behind it. Now, obviously, when there's a single digital resource behind the search box, these resources are limited, and if you use the wrong terms it's not necessarily going to give you any results. So this is one of the problems we anticipate, but we also want to see how people will phrase a query when they do not actually know what's in the resource, because that's the fundamental problem of coming to a new research area, coming to a new archive: you never quite know what's in it, and without the traditional finding aids, for digital sources it will potentially be even harder. For this EMCODIST test version we're using the Enron email corpus, largely because Enron is in the public domain; we don't have any of the privacy issues, because all of it was made available and has been online for well over 10 years. The Enron dataset is also interesting in a different way: if you know what Enron is, you know Enron because of the accountancy fraud, and if you come to an email dataset and you want to read the emails about fraud, how do you type your query? You put in "fraud" and you expect to get all the
emails about fraud, but realistically, people don't say they're committing fraud while they're committing fraud, so how do you find those emails? That's a good way into the question of how you phrase any conceptual query that isn't a keyword per se: how do people do that, and what does the tool need to do to actually help them do it? Secondly, the Enron dataset is interesting because it's a little bit misleading: it was actually collected as part of the investigation into Enron's involvement in the California energy crisis, which predated the accountancy scandal. So while some of the emails relating to the accountancy scandal are in there, the collection was actually made before that scandal broke. So this dataset doesn't quite contain what people think it contains, and you need to know pretty well how this dataset came to be in the public domain to know this, and most people don't. So again: how far can a tool help people navigate a resource that they think they know something about, but probably don't know that much about, and that is, as a result, not going to throw out quite the results they're expecting? So these are our ongoing projects. EMCODIST in particular is something that I think we have on GitHub (that's the link at the bottom of the page here), but that's not currently the usable version; we're working on a web-hosted version to use from August onwards. If you're interested in the code underneath, as I said, it's on GitHub under "contextualizing email archives" and you can have a look there. So this is our presentation. As I said, it is on behalf of my team; I'm Stephanie, David and Santhi are here, and Adam can't make it, but Santhi is our technical lead and can also answer any questions you might have on the more computing end of things. All right, thank you very much, Stephanie, another really interesting presentation that has a lot of links to the ones that we've heard
before. If I can just invite all the speakers to come back on video, please. And again, as we've been saying throughout the presentations, if you have questions, please pop them in through the Q&A function and we'll go through them. Are we all back? Okay, thank you. Just before we go into individual questions, I had a question for all of you, who are thinking in different ways about the user side of things. I was wondering if you had any thoughts or reflections. For example, in the first presentation you talked about the challenges of HTR, of applying that technique, and I was curious to know what users have thought about it, what kind of feedback you've got in terms of the results and how people have been able to discover the text. But equally, Jenny, from your point of view, I think you made a really good point about striking the balance between the amount of resources and what can then be expected as outputs; what's your understanding of the approach and the reactions of patrons who might want to then use those datasets? And equally for the email projects, I'm not sure if others are now able to use the datasets that you have put together and analyzed. So, that kind of user perspective first, and then we'll go into something specific for each of the projects. Caroline and Jacqueline, do you want to go first? Sure, I'll take a stab at that. One of the things with this collection is that, in tracking usage, we have quantitative numbers generated by Google Analytics tracking on the website itself, to see how many people have viewed it, but we don't really have a mechanism in place for qualitative feedback on the user experience. It's not something that we have actively tested, so we don't necessarily have that subjective user experience, apart from our own internal testing with our team on using the platform and that sort of thing. And I know that Adam Matthew and Quartex have a booth, and we could probably all
go and visit them in the exhibition hall and ask more about other collections too, and what sort of testing other collections have done, for Adam Matthew or Quartex-hosted collections, because we are not the only people who have signed on and piloted and tried out Quartex tools. But from our point of view, we focused, in this presentation and generally, more on the construction side and less on the user perspective, so all we have at the moment is the quantitative numbers, which we'd have to report back on. Caroline, do you have anything to add to that? No, I think that's right. I think the focus of our project was a bit more like kicking the tires, so to speak, of just testing out this software and how it would apply to our collections, from the more technical point of view, but we haven't done as much of an analysis of the user experience, I would agree. Thank you. Jenny, do you have any comments? Yeah, I think it's interesting: I see myself as a user quite often, and actually what I want to use AI for comes before the end users' search and discovery; there's the archivist's search and discovery, and that's about trying to get a grip on moving these records from live business systems (I'm thinking more as an archivist here, that's my background) into something that is the archive system. And I think that's quite interesting, because email is quite an interesting example of this: what we're actually doing is moving the entire live business system with it, so that all of the contextual information that Stephanie was talking about is built into that system and we can port it through. But how can we perhaps use that? If a user is interested in the same sort of thing that an archivist is interested in, which is understanding the workings of a company, understanding how the
functioning of that organization has led to those records, or is documented in those records, then we have a common purpose and we can probably use the same tools to do that. So I think it's quite useful to have that renegotiation I was talking about, the renegotiation of boundaries, and I think one part of that is recognizing that quite often we're interested in the same sort of thing, and we want to get the same intelligence and the same insights from this data. So I think that's quite an interesting development. And it's easy when it's in an email sort of system (well, it's not easy, sorry, Stephanie), but when you're trying to bring that sort of coherence and that context between an email system and what's going on in other formats, in SharePoint, in whatever, it becomes harder: how do you gain that insight and that coherence over the whole of it? So I think it's really exciting; I'm really excited by what's happening in some of the live business systems and the sorts of things that are being created that could actually be ported through, so that the user has access to them for research purposes at a later date. But what we port and how we port it, that's the question for me. Thank you, that's absolutely right. Stephanie, do you have anything to add? Yeah, you asked about the availability of the datasets. We're basically working here with two email datasets. The first one, from the first project, is a company we cannot name and is under pseudonym, so we're calling it Aurora Tech, and obviously all the individuals had to be anonymized. This dataset is available via the Linguistic Data Consortium, but it's really available as a dataset, and I think the problem we found is: well, it's great that you have a dataset available, but if you don't know what
to do with that dataset, as a person curious about what's in there, that's very problematic; plus, you're not actually allowed to use it in quite the same way. So, as part of the first project, we created a website that's a little bit of an exhibition of these emails (that's what I posted in the chat, the dot-com archive), and we used the first prototype of the tool to find our way around the emails and tell a couple of stories about the business: what they were doing, and what the things were that generated all these emails. But bear in mind all of this is anonymized: we anonymized all the names and we anonymized the company. We're trying to give as much context as we can, and we use narrative to give the whole thing a context, which is the dot-com boom and bust, using the story as a way of showing the organizational side of being maybe not one of the winners of the dot-com boom, which is where we see the historical significance of some of these resources, because all that information is in email.
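The relational properties described here, who writes to whom, who is in the cc, and how often, can be lifted from raw email headers with nothing more than Python's standard library. The following is a minimal sketch of that idea; the messages and addresses are invented for illustration and come from neither the Aurora Tech nor the Enron corpus:

```python
from email import message_from_string
from email.utils import getaddresses
from collections import Counter

# Invented sample messages standing in for a corpus of organizational email.
raw_messages = [
    "From: ceo@example.com\nTo: pa@example.com\nCc: cfo@example.com\n"
    "Subject: Q3 numbers\n\nPlease circulate the draft.",
    "From: pa@example.com\nTo: ceo@example.com\n"
    "Subject: Re: Q3 numbers\n\nDone.",
    "From: ceo@example.com\nTo: pa@example.com\n"
    "Subject: Board pack\n\nSame again for the board pack, please.",
]

def sender_recipient_edges(raw):
    """Yield (sender, recipient) pairs, including Cc/Bcc recipients."""
    for text in raw:
        msg = message_from_string(text)
        sender = msg.get("From", "").strip()
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
        )
        for _name, addr in recipients:
            yield sender, addr

# A weighted edge list: how often each sender wrote to each recipient.
edges = Counter(sender_recipient_edges(raw_messages))
print(edges[("ceo@example.com", "pa@example.com")])  # -> 2
```

Frequency, network, and sequencing views of a corpus all start from counts like these; the point of the projects described above is that such counts are context for reading the emails, not a replacement for reading them.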
The second one is the Enron email dataset. Now, that's been available for years, and many a researcher has downloaded it, but if you don't do large-scale computational analysis it's not super user-friendly, so we're really trying to design the tool to make it more navigable. We know a couple of people who have worked with it (Adam on our project has worked with the Enron corpus), and we know other people have really struggled to make use of those emails in a qualitative sense. So, I think David posted earlier that once we have the test environment ready, which we hope to have in August, we're happy to share it with people, because we really want to see whether it makes things easier for people and whether it works the way we're hoping it does. So, datasets: sometimes they're there, but even when they're there, they're not terribly useful to researchers, and that's really the problem we want to look at. That's right, thank you; people will be looking forward to the tool becoming available on a larger scale. Going back to the HTR project, we've got a question about searchability: how do you approach searchability with regard to archaic language or spellings? Is it something that you've come up against? I'll take a stab at that one as well, and then perhaps throw it back to Stephanie's team, because their tool is solving, or seems to be addressing, a lot of those questions about looking for something when you don't necessarily know the language or vocabulary used in the corpus in which you're searching.
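The vocabulary-gap problem raised here is the one EMCODIST+ addresses with attention-based content encoding. As a much cruder stand-in for such a learned encoder, the idea can be sketched with hand-made term clusters and cosine similarity; every term, cluster, and document below is invented for illustration and says nothing about the actual EMCODIST implementation:

```python
from collections import Counter
from math import sqrt

# Hand-made term clusters standing in for what a learned encoder (e.g. a
# BERT-style model) provides: related words mapped onto a shared dimension.
clusters = {
    "e-business": "ecommerce", "online": "ecommerce", "web": "ecommerce",
    "trends": "trend", "trend": "trend", "forecast": "trend",
}

def encode(text):
    # Bag of words, with cluster members folded onto one dimension.
    return Counter(clusters.get(tok, tok) for tok in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a)
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

docs = [
    "quarterly forecast for our online business",
    "minutes of the facilities meeting",
]
query = "e-business trends"

# Literal keyword search: neither document contains the exact query terms.
keyword_hits = [d for d in docs if all(t in d for t in query.split())]

# Encoded search: the first document scores above zero via shared clusters.
scores = [cosine(encode(query), encode(d)) for d in docs]
```

Here `keyword_hits` comes back empty while the encoded comparison still surfaces the forecast document, which is exactly the gap between phrase matching and content encoding that the tool is meant to close.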
For our part, our colleague Ellis, whom I quoted at the end of our talk, worked with us to create a search guide that sits at the beginning of our fur trade collection. We didn't do the same level of work for the Doncaster project, but for the fur trade collection in particular, we focused that tool on the exonyms versus the names that Indigenous groups call themselves. That was, in some ways, our closest stab at archaic language, because for the moment all of the Indigenous groups in this collection are identified by the names that the francophone or anglophone European fur traders and merchants used for them, not by the language-specific names those groups use to identify themselves. So if you're searching for something, you may not be using the vocabulary as it was used in the collection. What we did was create a search guide and point to other resources where you could find appropriate search terms. We also pointed to the pretty well documented search guidance provided by Quartex itself, because they wrote the best manual on using their own tool, so rather than creating new material, we drew on the extensive documentation they provide. But as for this question, I saw it come up in the Q&A, and as I was listening to Stephanie's talk I was thinking: they're addressing part of this, in a different corpus of course, but that tool and the way it functions is probably worth looking into for the archaic language question as well. Thank you.
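A search guide like the one described can also be expressed directly as a small query-expansion table. The sketch below is purely illustrative: the pages are invented and the variants table is hand-built (using, for example, "receipt" as the period word for a recipe, common in collections of this era); a real guide would be compiled from the collection itself and from community-supplied naming resources:

```python
# Hand-built table of modern query terms and the period variants a reader
# might actually find on the page (illustrative entries only).
variants = {
    "recipe": {"recipe", "receipt"},
    "medicine": {"medicine", "physick", "physic"},
}

# Invented stand-ins for transcribed manuscript pages.
pages = [
    "a receipt for gooseberry wine",
    "an excellent physick for the ague",
    "household accounts for the year 1812",
]

def expanded_search(query, docs):
    """Return documents matching the query or any of its known variants."""
    terms = variants.get(query.lower(), {query.lower()})
    return [d for d in docs if any(t in d for t in terms)]

print(expanded_search("recipe", pages))  # finds the "receipt" page
```

A plain search for "recipe" would return nothing here, which is the same mismatch between a searcher's vocabulary and the corpus's vocabulary that the search guide addresses for exonyms.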
Thank you, Jacqueline. And from this, I just wanted to go back to something that, in a way, you all spoke about, but Jenny, I think you powerfully made the point about the level of effort, of resources, of time that goes into these things, and obviously a lot of labor also went into the HTR project and the email archives. So I was wondering whether people could comment a little on whether there is a sweet spot, a good balance. I suppose the short answer is probably not, but when is it that we feel we have reached the point at which previous efforts are feeding into future efforts and becoming a bit more effective and efficient? When is it worth doing, and when not? And what kind of expectations are we setting up for ourselves, our organizations, and eventually our users? Perhaps, Jenny, can I start with you? Yes, no, that's fine. I mean, I think it's this kind of usability arms race. Everything I've got in my collection is usable. Well, that's not true; most things in my collection are usable, but it may be that you have to come to Kew to use them, right? Everything is usable, so it's a question of what level of usability you're at. I don't know that there's a sweet spot. Where you put your resources is always a choice, and those choices exclude some and include others; it's tough. So I think that's a constant negotiation, and that negotiation is going to have to be ongoing. Where I try to see more optimism, perhaps, and this is possible in the archival space, is that things are now being born digital, and it may be that if I can get upstream and solve the problem upstream, then we don't have to expend all of this resource when records come through; we don't have to do all of this conversion again and again and again; we can bring things in in the right
way, if you see what I mean. So in some ways that's where I see the hope. I do think that there's a lot of money involved in big tech and a lot of money involved in current business systems; that's where the money is, and that's where people perhaps have the resources to develop some of these techniques, to develop these kinds of sense-making AI that allow people to get a sense of their material and draw insights there. And if I can use some of those artifacts and bring them across, so that I don't have to do that work myself, then they'll be available for others longer term. So I think that's perhaps where I see hope. Is there a sweet spot? No. Is there a possibility, perhaps, in some aspects of GLAM, particularly where the stuff that we're dealing with is now born digital, that we might be able to leverage some of that? Maybe. Thank you. Stephanie, as well as your team: your other colleagues have now come on video, so David and Santhi, please chip in if you would like to share your views. Yeah, I'm sure David and Santhi have something to say on that as well, but I thought, for us, the key question is not just a sweet spot but also where the responsibility lies, right? Is that with the people who curate the resource, or with the people who are going to search the resource? And at the moment, the people who are actually going to be using the resource (as in, not the people managing the resource, who obviously also use it) are very, very different in terms of technical ability, and this is really why we went with the "digitally curious": people may have some idea, but they're not really computing scientists. And maybe in the future this won't be a problem, because the classic training for everybody will be that they're an information management specialist, in 30 or 40 years from now, I don't know, but in the meantime it's
also a question of whether an archive should give us these tools or whether we as researchers should come with the tool, and I think that's this thing about both sides of the search room: we're not clear who's supposed to be doing what for whom, and, as Jenny said, who's going to be included and excluded by these choices. So I'm just seeing if David or Santhi would like to come in on that. Probably I could go; sorry, go ahead, Santhi. Probably I could answer that, adding to Jenny's point a bit: this is already happening. All these tech giants are doing this because they have huge amounts of processing power and huge capabilities to work with the latest technologies, and from that kind of research we are getting open-source packages and open-source resources to use. One such resource is the BERT embeddings we have used for our search facility. So it is happening, but again, it depends on what the end user requires, and also on how we process and ingest our archives, and in what way; that depends on what the end user wants. So until we know the end user (we have to get feedback from the end user to really do the processing at the source), it is probably a loop, and it will take some time to stabilize. Thank you. David, you were about to comment? Yeah, I just think, to underscore the "digitally curious" piece: this is where I think there is not just a sweet spot, there is a big mass of opportunity and challenge in the middle here, and I think it is why we're both excited but also a little daunted about what lies ahead. And just to underscore Santhi's point: there are now all these variations on the BERT embeddings. When we started a couple of years ago, BERT was the state of the art (and Santhi was the state of the artist), and now of course there's RoBERTa and all the other variants, you know, so the
kind of forefront of the technology, if you will, keeps moving, and then, you know, does that change our capability, and how important is that? Our focus has really been on the user, trying to center the user and the user experience, the kinds of questions that users want to ask, so that hopefully, by focusing on that, we don't lose the lead dog, if you will, in the technology game. Definitely, and the issue around skills has kept coming up in a number of sessions, literally all the sessions I have attended at the DCDC conference: keeping up with all that. Jacqueline and Caroline, did you have any other comments? Okay, the only thing that I thought of, yeah, sorry, the only thing I thought of, going back to the question of the sweet spot, of the balance between the effort you put in and the reward for the work, and where that balance lies: the thing that comes to mind with our particular project is actually the point that Caroline made about the archival groundwork and the descriptive groundwork that was done, and how that made this particular application forward-compatible and much less work than the initial description. So we give an enormous shout-out to the people who originally created the finding aid and descriptions for that collection. It's a lot of work, but now that collection is usable in a way that is future-compatible as well. So that was my only other point: there is a sweet spot, and the effort pays off when you can take that clean, usable dataset forward into future projects.