I think it's time to get started. I'm Cliff Lynch, the director of CNI, and it's my pleasure to welcome you to this project briefing session that is part of our spring 2020 virtual meeting. Our virtual meeting is now about at its midpoint; it will run through the end of May, and there's plenty more to come. Today we have a really timely conversation. Mike Furlough from HathiTrust and Stuart Lewis from the National Library of Scotland will be talking with us about initial steps toward building a global registry of digitized works. This is something that clearly has been badly needed for a long time. It's challenging in the sense that it's not immediately obvious who should take charge of such an effort or how to do it sustainably. But nonetheless it's really important, and certainly recent events, the closure of most of our libraries and our inability to get to our physical volumes during the pandemic, have really underscored the importance of efforts like this going forward. We'll hear from Mike and Stuart, and then we'll take questions at the end. There's a Q&A tool down at the bottom of your screen, and I invite you to enter questions as they occur to you during the presentation, although we'll deal with them all at the end. But certainly there's no reason to wait to put them in. Diane Goldenberg-Hart from CNI will materialize into existence at the end of this and will moderate the Q&A. And with that, I'm going to turn it over to Mike. It just remains for me to thank our speakers very much for doing this presentation and to thank you for joining us. Over to you, Mike.

Thank you very much, Cliff. And thanks to everybody for joining us this afternoon or this morning, depending upon where you are. I am really sorry that we were not able to meet together in San Diego at the end of March as we originally planned. There is one positive thing that came out of us not being able to meet in person: Stuart Lewis is able to join us.
Stuart and I have worked together on this project over the last year and a half to almost two years now. Stuart, do you want to say hello and just wave?

Yeah, good evening from Scotland.

So I'm going to start this off, and then Stuart will come back on, and we'll trade off on topics here. Let me give you a quick roadmap for this presentation this afternoon. We will take a few minutes at the start to talk about the background for this project to develop what we are calling a "global," with quote marks, registry of digitized texts. We'll explain how this Global Digitised Dataset Network came to be, and I'll talk a little bit about the reasons that brought us together. At that point, Stuart will join us to talk about the research we did into why we need such a service, what it could be used for, the use cases for it. I'll come back and talk about some data analysis and aggregation that we did as part of the project. And then Stuart will wrap it up and lead us into a conversation about what a sustainable project and a scalable data set for this service might look like. Let me comment about scope to start with. This was a project that we did for a year; we had one year of funding, so we had to scope this pretty carefully. As a practical matter, we focused on texts, most specifically texts manifested as books or book-like materials, that is, monographs and serials that have been digitized at your library through your own local efforts or through mass digitization programs over the last 20 years. The project was led by our PI, Paul Gooding, in information studies at the University of Glasgow. And I should have said at the outset, thanks to Paul for letting us borrow some slides that he had used in previous presentations here; we've added a few things and pushed them around. The network itself included HathiTrust; I was a co-investigator for the project.
It also included the National Library of Wales, the British Library, and of course the National Library of Scotland. Research Libraries UK joined this project as well, as an advisor and an interested party, given the interests of the research libraries in the United Kingdom in this particular effort. So we came together, as I said, with funding from the Arts and Humanities Research Council. They announced a program in the fall of 2018, titled US-UK Collaborations in Digital Scholarship and Cultural Institutions, and this followed on from several earlier programs and efforts that they had. The real focus for them was to look at digital scholarship: how could the existence of digitized materials support researchers and their needs? But the impetus for this project began with more of a library management kind of question, and it came from Stuart initially. In the fall of 2018, he got in touch with me and said that the National Library of Scotland was exploring its digitization strategy. It was trying to figure out how to prioritize materials that should be digitized, and it was curious to know whether HathiTrust could help to avoid duplication of effort. Were there really ways for us to work together to help identify things held in the National Library of Scotland that had already been digitized and therefore did not necessarily need to be digitized again? My response was: let's absolutely work on this, because we at HathiTrust had a really strong ongoing interest in understanding the scope and extent of mass digitization. The collection that we hold in HathiTrust is substantial. It represents a very large portion of what's been scanned through mass book digitization programs. I'm going to exclude for the moment the kind of work that has been done by ProQuest, Gale, Adam Matthew, and many other vendors, really high-quality work and at fairly significant quantities.
But when I was thinking about mass digitization and putting this slide together, I was thinking about work like Google's and the Internet Archive's. We have in HathiTrust today about 17.4 million volumes; that's a book on a shelf. That's about 8.8 million titles, both monograph and serial. But that's not everything that's been scanned, and it's not even everything that Google has scanned. The HathiTrust collection is largely representative of North American mass digitization programs, not so much of Europe, not so much of Asia. And so to get at Stuart's question about how to understand the scope of digitization and how to identify those materials that had already been scanned, that really required further work, and that's what led us into this project. So at this point, let me hand it over to Stuart to talk about how the investigation got underway and the work that we did in the following months.

Thanks, Mike. Yeah, as Mike said, we had just one year of research funding from the AHRC, so we had to be quite careful about the scope. Obviously, there was no way we could build a global data set of digitized texts in that time, so we had to look at what we could do. We really tried initially to look at two questions. The first one being: how feasible is it to aggregate these records? We all know that aggregation of metadata on one hand sounds very easy, but actually the devil really is in the detail. So how easy or feasible would it be to do that? And then secondly, if it is feasible, where would the value come to users? Who would the users be, and how would they use it? And then ultimately, would the cost of developing such a resource be outweighed by the value that it could bring? We can just go on to the next slide, Mike. Thanks. So we had a number of objectives and deliverables that we wanted to deliver.
The first was to undertake a trial matching of data from the UK libraries who are partners against the existing HathiTrust data set. So could we, obviously it's not global, it's a tiny, tiny proportion, but could we at least aggregate our different data sets together? We then held a number of workshops to explore what the benefits would be, who those benefits would go to, and what they would deliver to people. We then wanted to actually deliver a data set, prove that we could aggregate this data, and create a data set that we could publish. And then we wanted to look at: if this is successful, how could we move it forward in terms of concrete steps? What might those look like in terms of creating a truly global data set of digitized texts, and any services that could be delivered around it? We'll have the next slide again, Mike. Thanks. So when we first submitted this grant proposal, we really concentrated around what we thought were three main use cases. Obviously the primary and most simple use case is simply: I'm a reader, there's a known text I would like to read, is it digitized somewhere, can I access it? Right now, we can use search engines to a certain extent. We can search the big sources; we can go to HathiTrust, we can go to the Internet Archive, we can go to the British Library, we can go to national libraries. But at some point you just have to give up. If we had a single data source, then people would be able to search that and find out very, very quickly whether their item has been digitized or not. The second use case we were looking at is all around digital scholarship and scholars who are looking for corpora of texts that they can work with. So: I'm a scholar, I want to look for anything that talks about London in the 1850s.
It would be so much easier if I could just search all libraries, find all those digitized materials, and download them in one go, rather than having to search all these different libraries and put that set of texts together. So that's another big use case. And then the third one is really around collections management. Obviously the one we looked at there was, as Mike mentioned earlier, if we're doing our own mass digitization programs, how do we, for example, reduce duplication, so that we're less likely to duplicate works that are already openly digitized elsewhere? And indeed there are other collections management uses for that data as well, whether it's print retention or other things. So we first worked on this through team meetings in Chicago. We did a lot of brainstorming around the traditional agile user stories, you know, "as an X, I want to do Y so that I can achieve Z." And we really expanded those three initial use cases quite a lot. Then we took those to a workshop in London in June last year. We had an audience of about 30 or 40 people from a whole range of backgrounds: some scholars, some librarians, archivists, metadata experts, different groups, and we just looked at what their use cases would be. Then we undertook an investment exercise: everybody was given a sheet of colored dots and asked, where would you get most value from this? And obviously we were able to use that group also for discussions about things like feasibility, stakeholders, how we could move forward, and so forth. And really five themes emerged out of those, expanding on the use cases: one around efficiency, cost, impact, and value, particularly around collections management and how that can help us within the library sector; and discovery and access.
There was a lot there around readers. There were questions that came up about provenance as well, and really understanding: it's great seeing a big list of digitized texts, but where have they come from, and how did they get there? That would be very important to some users. There were the research case studies as well, particularly around digital scholarship and creating corpora of texts from different libraries. And then finally: what else could we do with this? Some of the more exciting parts actually came when you say, well, now we have 20, 30, 40 million texts digitized, can we actually do things at a scale that we never could before? Some of those can be very simple: can we load them into our discovery systems as a single data set, so that we can make some of our print collections available digitally as well? But are there things that we can do with those texts that we can only do once we have aggregated them? For example, once we have the OCR, does that make it much easier to do matching and deduplication than if we're just looking at metadata? So we looked at those areas as well. And then finally, we undertook a survey of the wider community. Obviously we wanted to get many more voices than those that could attend the workshop, so we did an open survey and had a number of responses. That brought in more use cases we hadn't really considered before, around teaching and thinking about where this data set sits in relation to other services. And again, like with the workshops, there did seem to be a real clear interest in pushing this forward as an idea. But at the same time, some caution came through: not everybody understood the concept, and there were a lot of very valid questions along the lines of, you're talking about a global data set, but you've only got a very small subset here, and things like that.
How do we balance between larger and smaller organizations and different types of collections? And obviously with any discussion like this, it all comes down to metadata and its quality and the analysis that allows or doesn't allow. So those were sides that we had to consider as well. And so, back to Mike to talk about, based on those results, what we were able to actually do with our own data and aggregating it together.

Thanks, Stuart. In this part of the presentation, I'm going to be summarizing the work of several of my colleagues at HathiTrust: Natalie Fulkerson, Josh Steverman, Martin Warin, and Heather Christenson. There is a lot more to say about the work I'm going to be talking about here; they could give a much more detailed briefing on some of the analytic work that they did. And I will show you a link to a blog post where you'll be able to find out more about this part of the work in just a couple of seconds. But you recall the original inquiry that led to this project, right? It was: how do we determine what has been scanned, and how can that inform digitization strategy? So for librarians looking to analyze collections, there was a real issue there. It seemed to be about the ability to cluster titles, identify that they're equivalent, and deduplicate, right? Avoid that duplication. But duplication isn't always a bad thing. For a general user or a scholar relying on such a resource, you would expect that some of them might want to find duplicate copies to hedge against the potential for errors in digitization, missing pages, defacement, something like that, or even to examine textual variants. So part of what we were exploring in this part of the work was: how could we better do clustering of titles? How could we potentially better inform and represent such clustering in a resource like a registry? And I will say that in a year, we can only really skim the surface of this.
The first thing we did was rely on an existing process that we have in HathiTrust to analyze holdings against bibliographic data that we hold. Very briefly: all of the HathiTrust member libraries give us information about their physical holdings, and then we periodically analyze those holdings against the HathiTrust catalog. We use that match data to help inform our collection development and our shared print program. It supports some access services, such as our current Emergency Temporary Access Service; it informs fee calculations; and so forth. Typically what we're doing is looking at identifiers in records to cluster and match on in order to do this. So our first step was to gather metadata from our partner libraries in this project. And you can see here a pretty wide range in the quantity of records provided to us. The British Library provided us records only for items that have been digitized, and I believe all of this was from their Google partnership. Many more items have been digitized by the BL than what we were working with. The National Libraries of Scotland and Wales provided us with extensive records for their collections, including their print holdings as well. And then HathiTrust had about 17 million items at that time. As I said, we have always tended at HathiTrust to rely on identifiers, and specifically an OCLC record number, to be able to match bibliographic metadata against holdings. We've relied on this because the vast majority of the HathiTrust collection has this identifier in our bibliographic data, and it's very common in our member libraries. But our member libraries are largely North American. And what we determined very quickly in this project was that these record numbers, these identifiers, are less common to rare to non-existent in the records that were provided to us by the national libraries.
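The identifier-based matching described here, grouping records that share an identifier and setting aside the rest for other methods, can be sketched roughly like this. This is an illustration only, not the project's actual code; the record shape and the `oclc_num` field name are hypothetical placeholders, not HathiTrust's real schema.

```python
from collections import defaultdict

def cluster_by_identifier(records, id_field="oclc_num"):
    """Group bibliographic records that share the same identifier value.

    Records lacking the identifier go into a 'residual' list that would
    need to be matched by other means (title strings, statistical models).
    """
    clusters = defaultdict(list)
    residual = []
    for rec in records:
        ident = rec.get(id_field)
        if ident:
            clusters[str(ident).strip()].append(rec)
        else:
            residual.append(rec)
    return dict(clusters), residual

# Toy records: two share an OCLC number, one has no identifier at all.
records = [
    {"title": "A History of Scotland", "oclc_num": "12345"},
    {"title": "A history of Scotland.", "oclc_num": "12345"},
    {"title": "Early Welsh Poetry"},  # no identifier: falls to residual
]
clusters, residual = cluster_by_identifier(records)
print(len(clusters["12345"]))  # → 2
print(len(residual))           # → 1
```

The residual list is exactly the problem Mike goes on to describe: when only 1% to 30% of a partner's records carry the identifier, most records land there, and cheaper identifier matching can't help.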
They seem to be more common in the records for print, or rather undigitized, materials that were provided to us. When we were looking at this, the proportion of records with OCLC numbers ranged from 1% for the BL to about 30%, I believe, for Wales. So there's pretty wide variance there, and obviously that's only going to get you so far if you don't have the presence of these identifiers. We have at HathiTrust also done some work with other registries and looked at other identifiers to try to provide these kinds of matches, for example LCCNs, ISSNs, ISBNs, and so forth. The problem, and really the problem of the OCLC ID too, is that these identifiers really date back only to the '70s, maybe the '60s in some cases. You're not going to find them in very early 20th-century or 19th-century records unless those records have been updated or modified over the years. In our collection, for example, only 15% of the records in HathiTrust have ISBNs. So while we did some record matching using those identifiers, it was not sufficient to tell us a whole lot about overlap. Obviously normalization is going to be a significant undertaking for any registry like this that gets developed. I'm going to talk very briefly here about four other methods we explored to do item, title, and record matching. The first two methods relied primarily on string matching in titles. That has the benefit of being quite precise and not very computationally expensive, but variation in cataloging practices, variation in titles, and so forth might exclude true matches; you may not get all of your matches, in other words. We also explored some statistical methods, including a bag-of-words approach and some machine learning approaches, to help us identify potential matches as well. That turned out to be quite promising. We did not have enough time in the project to take it to fruition and do it at a very large scale.
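To make the two families of methods concrete, here is a rough sketch of an exact normalized-title comparison alongside a simple bag-of-words cosine similarity. This is not the project's code; a real pipeline would use much richer normalization and, for the statistical route, trained models rather than raw word counts.

```python
import math
import re
from collections import Counter

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace for exact matching."""
    stripped = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def bag_of_words_similarity(a, b):
    """Cosine similarity between word-count vectors of two titles."""
    va, vb = Counter(normalize_title(a).split()), Counter(normalize_title(b).split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Exact normalized matching catches trivial cataloging variation...
assert normalize_title("A History of Scotland.") == normalize_title("A history of Scotland")

# ...while the bag-of-words score also tolerates word reordering.
score = bag_of_words_similarity("History of Scotland, A", "A History of Scotland")
print(round(score, 2))  # → 1.0
```

The trade-off Mike names shows up directly: the exact comparison is cheap and precise but brittle against real cataloging variation, while the similarity score recalls more true matches at the cost of ambiguous, threshold-dependent results.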
So that's work we're going to have to come back to. But that type of work also brings other challenges: it's computationally expensive, and it does not eliminate the need for human intervention. A quick summary on this: metadata normalization is going to have to happen in order to do this kind of analysis, and duplicate detection and clustering across heterogeneous metadata sources is challenging. There are a lot of trade-offs to look at. I think the experience that we had suggests that you're going to need a multitude of approaches, right? Maybe a cascading approach, where you use identifiers where you can, then use other methods like machine learning, if you can afford to do that, where you might need to do further kinds of matching with more ambiguous results. Now, as Stuart said, we did make a promise as part of this project to deliver a record set of aggregated metadata from the partners. We refer to this as the proto-registry, which is really a fancy word for a data file, a flat data file that we're going to publish. There was no functionality promised. We did play around with some of that to visualize it, but it was not something we could maintain. We scoped this part of the work to make sure that when we published the metadata as an aggregation, that is, the records of all of these digitized items from these four collections, that record set would be fairly consistent and fairly complete. So we looked first to a model, again, that we employ at HathiTrust for publishing an inventory, known as the hathifile. We looked for common fields in the records and then aggregated based on the information pulled from those fields. This table gives you a very quick overview of what minimal metadata was included, as well as a DOI you can go to to get the data set and play with it to your heart's content.
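The flat-file aggregation described here, pulling a set of common fields out of each partner's records into one consistent inventory, might look something like the following sketch. The field names are invented for illustration and only loosely echo the hathifile model; the published proto-registry used its own column set.

```python
import csv
import io

# Hypothetical minimal common fields, loosely modeled on an inventory file.
FIELDS = ["source", "source_id", "title", "pub_date", "oclc_num"]

def aggregate(sources):
    """Flatten records from several collections into one consistent TSV.

    Only the agreed common fields are kept; missing values become empty
    strings so every row has the same shape regardless of origin.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    for source_name, records in sources.items():
        for rec in records:
            row = {f: rec.get(f, "") for f in FIELDS}
            row["source"] = source_name  # record where the row came from
            writer.writerow(row)
    return out.getvalue()

tsv = aggregate({
    "nls": [{"source_id": "NLS-1", "title": "A History of Scotland", "pub_date": "1901"}],
    "hathitrust": [{"source_id": "mdp.001", "title": "Early Welsh Poetry", "oclc_num": "99"}],
})
print(tsv.splitlines()[0])  # header row with the five common fields
```

Keeping a source column per row is one simple way to preserve the provenance that workshop participants asked about, even in a flat file with no functionality attached.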
And that is all I have to say about data analysis and aggregation at this point. What I'd like to do now is turn this over to Stuart to wrap up. And you're muted; there you go.

Yeah, thank you very much. So, just in terms of the conclusions and future work, quite quickly: a point Mike always really drums home to us is that registries are critical but undervalued infrastructure. We forget about them sometimes; we forget about the infrastructure we need that will really bring more value to what we do. So this could be quite an important piece of infrastructure like that. Through the engagement work, we get a sense that there really is a need for a resource like this. And as Cliff mentioned in the introduction, maybe the current pandemic actually gives it even more value than ever before. Our students and researchers can't get to physical libraries, but if it's hard for them to find what has been digitized, then could this be part of that answer? And then really, for us as we look forward, what might a sustainable project look like? As I said earlier, things always come down to metadata, but equally there's always the question of what's good enough metadata to actually make this happen. Does it actually matter, for example, whether we can do data matching? Does it matter that if somebody searches for something, they find five duplicate copies, but the system doesn't know they're duplicates? And the cost versus value of doing data matching: it's complex, and it would bring value, but at what cost? How do we scale this? Obviously, we were working in the low tens of millions.
How does this scale, or indeed what is the scale of the problem right now in terms of worldwide digitization and the amount of items out there? The big one, obviously: business models, funding, sustainability, buy-in, participation. And these are actually, I suppose, what we're very interested in, almost from the Q&A: your views on this, as the sort of partners who would be ideal participants in such an endeavor. What are your thoughts in terms of should we be building it? How do we build it? How do we fund it? How do we get your participation? Obviously there are issues around updates, keeping it accurate, that sort of thing. There are questions around other communities, too: there's work in other communities looking at similar problems, things like IIIF discovery and that sort of thing; it's part of the answer, not all of it. And then the big, big one: the work we've done so far is far, far from being global, and there's the challenge then of multiple languages and character sets, how we bring all that together, and whether it merges in the same way when we go beyond anglophone collections. So really, it'd be great to have questions and answers. Thank you very much for coming. It's been a pleasure to present to you. We're really interested, I suppose, in your feedback as well, in terms of this as a project, this as a concept, and what you think of us going forward with it. So thank you very much.

Wow, thank you. Thank you, Stuart. Thank you, Mike. Lots of grist for the mill in that talk. Really interesting and such an important issue indeed, especially right now. And we already have folks who have questions, so let me just jump right into that. Again, type your questions into the Q&A, and Stuart and Mike will answer those live.
So our first question comes from Melissa Levine, who asks: this approach is focused on published materials. Any thoughts on how to extend this concept to other kinds of materials? For example, special collections type materials, museum and archival collections? Do you want to take that, Mike?

I can, yeah. Thanks, Melissa. We didn't really spend much time worrying about that, and it's not to say that it's not an important problem; it's just that our focus was entirely on published works. It's not to say that it could not be done, but at least my own limited experience with metadata on archival materials is that the aggregation of that might be even more challenging, given that there's not the kind of standardization in the data across the board that you find in bibliographic metadata for published works. So I think that's one issue. I wouldn't say it can't be done; I would simply say that we did not have the opportunity to spend a whole lot of time thinking about it. I think it's clear that it would obviously be valuable, and potentially even more so because of the difficulty in locating archival material. And there's not a WorldCat for that in the same way. So Stuart, do you have anything else to add?

No, no, absolutely. That's, yeah, as you say.

All right, thank you, Melissa, for that question. Before I move on to the next question, I just want to remind everyone: go ahead and type your questions and comments into the Q&A. As Stuart invited our attendees earlier, share your thoughts and your ideas about potential applications, potential partners, and that sort of thing. This is definitely a forum for comments as well. Our next question comes from Kristen, who asks: can you talk us through how this works on the user side? For example, archives usually require registration of some sort, which allows for stats on the types of users, et cetera. Are you gathering that data for users, or do you plan to?

Shall I answer this one? So, yeah.
So this was certainly something we ran into with the prototype that we developed, because a lot of people tried it out, and they searched and came up with some good results, and particularly for items in HathiTrust, they were going, but I can't access that, but I can't access that. Obviously, a proportion of what's in HathiTrust, due to copyright and fair dealing, is available only to HathiTrust members. So something this would certainly need to do, which we didn't actually do in the prototype, but one of the big lessons we learned, is that as we aggregate this data, we also need to know what the access options are for each item. Is it open? What sort of license has it got with it? So that when we present this as a faceted search, you can tick the box that says show me only open access, Creative Commons, reusable items, and facet down that way. And maybe there'll be a box that says I'm a HathiTrust member, and then it will open up more content. So having that, at a basic level, so that we can at least facet on it, and lighten up or darken different parts of the collection depending on who you are, where you are, and what access you might have. There would be work required then on how we make that usable but also understandable to people, so that they understand what they're searching and what they will have access to, because we all know nobody likes getting search results that they can't then access. So yes, that would be a very important part of it. And I suppose in terms of the second part, in terms of statistics: the phase one project, the one-year AHRC project, was really looking at feasibility. It didn't look into how, if we did this live, we would do things like gathering statistics.
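The kind of access faceting sketched here could, in a very simplified form, look like the following. The rights codes and the member rule are hypothetical placeholders invented for illustration; a real registry would need a normalized rights vocabulary shared across all contributing collections.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    rights: str  # hypothetical codes: "pd" public domain, "cc-by" licensed, "ic" in copyright

# Rights codes treated as openly readable by anyone (an assumed policy, for illustration).
OPEN = {"pd", "cc-by", "cc-by-nc"}

def facet_by_access(items, member=False):
    """Return only the items the current user could actually read."""
    allowed = set(OPEN)
    if member:
        allowed.add("ic")  # assume members may view some in-copyright works
    return [i for i in items if i.rights in allowed]

catalog = [
    Item("A History of Scotland", "pd"),
    Item("Modern Novel", "ic"),
]
print([i.title for i in facet_by_access(catalog)])         # → ['A History of Scotland']
print(len(facet_by_access(catalog, member=True)))          # → 2
```

This is the "lighten up or darken different parts of the collection" idea: the same aggregated data set, filtered per user so nobody is shown results they cannot open.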
We did get feedback around that, saying people would actually value more statistics. Not every library is able to collect access statistics to the level they would want, particularly maybe where materials are getting used in other ways, around digital scholarship and things like this, and this might allow that; it would just be another indicator of usage. But I think it would be usage statistics for the site rather than necessarily gathering data about users as such.

Great, thank you. And thanks, Kristen, for that question. And we have a question from Robert, who asks: are you developing or leveraging standard criteria or metadata for faceting on open access, licensed, et cetera? That could be very helpful for improving user experience, for example via SimplyE. Do you want to jump on that one, Stuart?

Yeah, I can, I suppose. Again, this is where we've done a prototype to prove we can do this, but we haven't looked into the details then to say, okay, we've learned we need rights in here, what would be the best way of doing that? And obviously, as Robert says, something like SimplyE would probably be an ideal standard to do that. So it's a great suggestion, thanks. And obviously, if we do move forward, then we can look at those different areas where we need to make decisions, and that would be a key one.

Yeah, that's exactly what I was going to say. I think it's clear that we would need some kind of standardization and use of a normalized vocabulary for that.

Right, right, yeah, makes sense. Thanks for that question, Robert. Just want to remind everyone to type your questions into the Q&A. Robert has a follow-up here. Let's see, he comments: mapping OPDS to your approach could be very effective. A thumbs-up to the answer.
So I want to go ahead also and invite anyone who might like to make a comment live, or who has a question and wants to interact directly with Stuart or Mike. If you want to do that live now, you can raise your hand and I can unmute you. We have a couple more minutes here, so we can do that. There should be an option to raise your hand, and that will signal to us that you would like to make a comment or ask a question live, and we can go ahead and unmute you for that. While we're waiting to see if any other questions come in, I just want to remind everybody that this is part of CNI's Spring 2020 virtual meeting. Thank you so much for being here. Thank you to our panelists for presenting today, and I'm just sharing with you there in the chat box a direct link to the schedule for the rest of the meeting, which will go on through the end of May. Plenty more offerings to come, including a couple more sessions this afternoon if you're in our time zone: one right after this on implementing effective data practices, and after that, statistical consulting in the library. All right, seeing no more questions coming in through the chat box, I will simply thank our presenters and our attendees once again. It was our pleasure having you here, and we really appreciate your being here with us.