Alright, I think it's about time to get started again. Welcome back. You've joined the second day of the virtual event of the CNI Fall 2021 Member Meeting, and in this segment we're going to be starting with a project briefing from our colleagues at Harvard. Just to put a tiny bit of context around this: when we got the proposal in for this project briefing, I really thought it was important to put it in our portfolio of synchronous project briefings. As you'll see, this whole issue of identifying what's really in the public domain and can be openly shared is a very thematic issue; it came up yesterday. There were a lot of discussions of the evolution of Google and the book scanning project, and it continues to bedevil organizations like HathiTrust and HathiTrust members, who have been doing a lot of very expensive, excruciatingly slow manual review. To the extent that we can see automated methods for speeding this up, I think this is of potentially great interest to our community. So I'd like to welcome our colleagues from Harvard: Stephen Abrams, Suzanne Wones, Kyle Courtney, and Ming Tao Zhao, who are going to fill us in on this work. I think that Suzanne is going to lead off, so I will turn the presentation over to her at this point. Welcome, and take it away.

Thank you very much, Cliff. We're very excited to be here today, and thank you all for joining us for our discussion of this particular experiment and project. We do think it's very promising, and we look forward to telling you more. I think that Stephen may have already put a link in the chat to our wiki site, which has the full report and a lot more detail than we can cover in our time with you today. But just to introduce the team: I'm Suzanne Wones, the associate university librarian for discovery and access at Harvard Library. With me are Stephen Abrams, our head of digital preservation,
Kyle Courtney, copyright advisor in the Office of Scholarly Communication, and Ming Tao Zhao, systems analyst and application developer in our imaging services unit. So we're going to talk to you a little bit, and Stephen, if you want to move it forward... there we go. I can talk a little bit to start about why we undertook this project, although Cliff gave us a great tee-off spot, because it is exactly for the reasons that he mentioned that we undertook this experiment. We've been creating digital versions of items from our special collections for over 20 years, and we've made millions and millions of digital objects as a result. The whole goal of this work has been to make these items accessible to the world of scholars. We've selected the items we digitize very carefully, and we've used processes designed to favor items we believe to be in the public domain, but it's not always clear which individual items within a larger collection are in fact in the public domain, and our general counsel's office has always advised us that we really do need to do an item-by-item review to determine specific rights before we can say that something is or is not in the public domain. To be very specific: we now have more than 16 million bibliographic items that are already digital or digitized. We would like scholars to be able to use them freely for their own scholarship and for their creative endeavors. So far, because of the manual review involved, we've only been able to share 90,000 through the Digital Public Library of America. And we have over three million items that were published more than 130 years ago, and like I said, we've been very careful about how we select things for digitization. We believe these are likely to be in the public domain, but to do the manual review that we have been advised to do would take us 570 years.
So I think our scholars are a little more eager to get their hands on these items, and to be able to freely use them, than to wait 570 years. We needed a better solution. And with the vision and legendary enthusiasm of our colleague Wendy Gogel, we created a plan to test an automated process for determining the rights status of the items in our collection, and then applying the standardized rights statements, which I think are really poised to help us all move forward in standardizing the use of these collections. So I'm going to hand it over to Stephen, and he's going to talk you through the experiment that we ran and the solution that we found. Stephen, over to you.

Thank you, Suzanne, and hello everyone. As Cliff introduced us, I think this is a really common problem that we are all facing, and we're very happy to share the preliminary results of the work that we've been doing in trying to shift away from, or rather complement, manual review with automated determination at scale, by looking at pertinent cataloging metadata. In doing so, we've got a couple of key imperatives. We built into our process the ability to dial in an acceptable level of risk; we can do that by adjusting various threshold dates that we'll be looking at a few slides from now. We also tried to design this process so that it comes up with a benchmark quality, or level of confidence, that is no worse than we are currently seeing with our existing manual review. Once we've determined the status of this material, we are going to mark the relevant cataloging records with standardized rights statements, so that we're capturing this information in a public, standardized, and both human- and machine-readable way, and so that there should hopefully, in future, be no ambiguity in the discovery process.
More information is available at this link, which I also put in the chat; it will show up on the final slide in case you haven't had a chance to note it down. Basically, what we've come up with is an algorithmic decision tree: 18 decision points that lead to 19 instances of one of four standardized copyright statuses as promulgated by RightsStatements.org: No Known Copyright, No Copyright in the United States, Copyright Undetermined, or In Copyright. I realize that this diagram is very, very small and you can't really see it, but if you follow the links to any of our reports on the website, you'll be able to zoom in on it, and we will be giving a little bit more information about it. Obviously this decision tree is probably quite similar to others that you might be familiar with, such as Peter Hirtle's groundbreaking work in this regard. I've been involved in this long enough to remember when his first chart fit on one page; now I think it's about a 20-page document. There are a number of other examples, and they all should look quite similar. So basically, we are looking at catalog metadata. There are, for the most part, five prominent metadata elements: publication status (whether the item is published or unpublished); the creation or publication date; if published, whether it was published under corporate or individual authorship; and the birth and death dates of the author. We are pulling all of this metadata from a central discovery catalog that we run, which normalizes native MARC metadata, ArchivesSpace metadata, and visual metadata from JSTOR Forum down into MODS. So it is simpler for us to pull it from there, rather than trying to go out to all of these different native catalogs. In doing this, we very early on adopted an explicit policy that we were always going to err on the side of caution. We are prepared to accept a certain number of false negatives,
in order to avoid any possibility of false positives. In other words, there's probably material that is in fact in the public domain but is on the edge, a little ambiguous, so we're going to assume that it is possibly in copyright, because what we want to avoid at all costs is saying something is in the public domain when it really isn't; that is the thing that leaves us open to legal and other sorts of reputational liability. And I think we've been able to do that quite well. So ours is an initial proof-of-concept test. As Suzanne mentioned, we have about 16 million bibliographic items; that's actually a subset of the roughly 25 million items of all forms that we have in that central catalog, but for this proof-of-concept test we chose to look primarily at the bibliographic information, because frankly the metadata there is a little more reliable, so it seemed a little easier to do. From the 16 million we created a test set of about 60,000 that we believe to be fairly representative of the larger set. We ran the algorithm using a primary criterion of a 130-year threshold for publication. There are other nuances around that, but in general, something published over 130 years ago should be fairly obviously in the public domain. We're showing the numbers both for material that has been published and for material that is manuscript or unpublished, plus a small number of items where the publication status was actually unknown. What we're seeing here is that for about 20% of the materials we looked at, we are pretty confident the item is in the public domain, or at least in the public domain in the United States. Kyle will be talking a little later about some of the intricacies of dealing with what is an international environment with lots of conflicting national legal jurisdictions and rules.
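To make the shape of this kind of threshold-driven logic concrete, here is a minimal sketch in Python. This is not the project's actual code: the four status labels are the RightsStatements.org statuses named above, but the function name, parameters, the simplified branching (the real tree has 18 decision points), and the 100-year author-death threshold are all illustrative assumptions.

```python
# Illustrative sketch only -- NOT the Harvard algorithm. The real
# decision tree has 18 decision points; this condenses the core idea:
# missing or ambiguous metadata always falls through to the more
# restrictive status, and the date thresholds are parameterizable.
NO_KNOWN_COPYRIGHT = "No Known Copyright"
NO_COPYRIGHT_US = "No Copyright - United States"
UNDETERMINED = "Copyright Undetermined"
IN_COPYRIGHT = "In Copyright"

def rights_status(published, pub_year=None, published_abroad=False,
                  death_year=None, current_year=2021,
                  pub_threshold=130, death_threshold=100):
    """Assign one of the four standardized statuses, erring on the
    side of caution whenever metadata is missing."""
    if published is None:
        return UNDETERMINED                 # publication status unknown
    if published:
        if pub_year is None:
            return UNDETERMINED             # no usable date: assume the worst
        if current_year - pub_year > pub_threshold:
            # Foreign publications get the narrower US-only status
            return NO_COPYRIGHT_US if published_abroad else NO_KNOWN_COPYRIGHT
        return IN_COPYRIGHT                 # conservative default
    # Unpublished material: term keyed to the author's death date
    if death_year is not None and current_year - death_year > death_threshold:
        return NO_KNOWN_COPYRIGHT
    return UNDETERMINED
```

With the default 130-year threshold, a book published in 1880 comes out as No Known Copyright, while anything with a missing date falls through to Copyright Undetermined, matching the false-negatives-over-false-positives policy.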
We also ran this whole test twice: once with the 130-year threshold that I'm showing here, and, just to see what would happen, with a slightly more conservative 140-year threshold. We were actually a little, but pleasantly, surprised to see that it made very little difference. A very minute fraction of things that were previously not in copyright fell into copyright, and a slightly larger set, but still less than 1% of the full test set, moved from No Known Copyright to the slightly more restrictive status of No Copyright in the US. So, as I said, we were a little surprised, but it was a very pleasant surprise. It's probably clear to all of you watching that there are a number of issues and questions we might want to consider. First of all, we are using a relatively small sample size, less than 5% of the total number of bibliographic items. However, we do feel that it is relatively representative in terms of the type of material, the distribution of dates, and so forth. We suspect that if we move on to doing the full set, we'll probably see a lot more metadata variation, and there will undoubtedly be more of the sort of weird errors that crop up, but we don't expect to see anything that would cause us to radically reconsider the design or structure of the algorithm itself. Significantly, we are placing great reliance here on metadata that was never intended for the purpose to which we're putting it. There are lots of dates in the MARC record, but they were put in there primarily for purposes of discovery and for other sorts of administrative reasons; they were never intended or collected for purposes of copyright review. Similarly, as anyone who's a cataloger knows, there's a lot of missing and incomplete metadata, and a lot of non-uniform practice.
And frankly, there's a lot that is just incorrect: things that are obviously incorrect, and I'm sure there are things that are not obviously incorrect in there. Just to give you a bit of a flavor: you would tend to think a date is a pretty simple thing; we deal with dates all the time. But here is just a subset of the variety of ways in which dates or date ranges can be represented. These are all actual examples that we pulled out, and there are many, many more we could come up with. So that makes things a little more complicated in terms of implementation. What are we doing to mitigate some of these problems? Well, to begin with, we have developed an extensive set of algorithmic rules that we are using to normalize the variety of metadata descriptive practices. We've also built in a very conservative set of rules for recovering from identified errors and for interpreting things that might be ambiguous: we always interpret things in the most conservative possible manner. As for risk tolerance, all of the date thresholds we've built into our automated process are parameterizable. There are four key date thresholds: two related to publication, two related to the author's lifespan. These are the default values that we've developed in conjunction with our legal advisors, and they're all easily changed. As I showed you earlier, we ran it with both 130 and 140 years as our initial publication threshold, and that's pretty easy to do. It took about 30 minutes for us to process those 60,000 items, which sounds pretty fast. If we ran this against the full 16 million, we estimate it would take about six days, which seems a little long, and I'm sure there are optimizations we can make. But compared to 570 years, this is lightning fast.
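To give a sense of what such normalization rules might look like, here is a small, assumed sketch (the project's actual rules are in the published report; the function name and the specific patterns handled here are illustrative). Keeping the latest plausible year mirrors the stated policy of always interpreting ambiguity in the most conservative direction, since a later date makes an item look more likely to still be in copyright.

```python
import re

def latest_year(raw):
    """Recover the most conservative (latest) plausible year from a
    messy catalog date string, or None if nothing usable remains.
    A later year makes the item look more likely to be in copyright,
    so ambiguity always cuts toward the restrictive outcome."""
    if raw is None:
        return None
    s = raw.strip().lower()
    # MARC-style unknown digits: '18uu' is read as 1899, the latest
    # year that pattern could denote.
    s = s.replace("u", "9")
    years = [int(y) for y in re.findall(r"\d{4}", s)]
    years = [y for y in years if 1000 <= y <= 2100]   # discard junk values
    return max(years) if years else None
```

For a range like "1875-1880" this yields 1880, and for an unparseable value like "n.d." it yields None, which the decision logic would then treat as an undetermined (restrictive) case.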
Now I'm going to turn things over to my colleague Kyle, who's going to take on the thornier issue of the legal concerns that are raised by all this. Thank you.

Good afternoon. I'd like to point out I have the easiest and best job: taking three minutes to describe all the legal complexities of this project. So I won't hang on too long, but really and truly these are just considerations for assessing risk. They do not derail most of the works that are clearly and affirmatively in the public domain, and therefore naturally low risk, but for the moment let's consider a few. If you're aware of the Twin Books case, you might be booing at the screen right now. This is a terrible decision, which has been highly criticized in the courts but not overturned, and it focuses on foreign materials and US copyright law. In that case the court stated that US copyright commences with publication overseas, regardless of whether that publication overseas complied with US copyright law. It messes up our clock to a certain extent with regard to some of this, so it's not a great decision, and there are other ambiguities there that affect whether we can determine that foreign works are definitively in the public domain. Some other considerations: subsequent publication of manuscripts. Something started as a manuscript, we have it as a manuscript, we digitized the manuscript, but it was signed over to a publisher and published later in some capacity, which could alter the copyright calculus. There's also possible copyright renewal, which was actually a requirement for a while: you had to renew your copyright in order to keep it. Many of us use the Stanford renewal database to find this information, and fortunately, by most of the studies, very few works were actually renewed compared to the copyrighted materials that existed at the time; it's a small percentage. Then there are inserts and materials: this is when a work was published with an insert in it.
In the physical world, that's a fold-out map, or a series of photos. They may have been published in the book with permission for that publication only, and so we were wondering whether the use is limited to that publication; maybe that could be beyond the scope of our effort. I'm less concerned about inserts than some of my colleagues, but that's my opinion. And then, last, geo-blocking. We investigated this, and we certainly will continue to, but some cultural institutions and other organizations have looked at blocking certain IP ranges for materials and countries that might have copyright terms longer than those in the US, or different realities of what is required to get copyright. This is where Twin Books or other litigation may have raised the risk level for something to determinatively not be in the public domain. So that is my three minutes on those important points, and I turn it now back over to my colleagues to consider some other points and finalize.

Thanks, Kyle. We do want to make sure that we have time for questions, so we'll endeavor to quickly cover our next steps and then we can move into that phase. You're welcome to pepper Kyle with legal questions, and Ming Tao is here to answer some of the technical questions, which we've really glossed over. So, on the next steps: we are going to apply the algorithm to all 16 million bibliographic items; Mike, I'm glad to see in the chat that you thought six days was a reasonable amount of time for that. We'd like to then use that to release a rights-clear, public domain corpus of our digitized materials. Hopefully that will be in Harvard Digital Collections, and it will be very clear and easy for scholars to know that they can use and reuse those items as they wish. At the same time, we're also going to implement the addition of standardized rights statements at the time of cataloging for our prospective records.
So that going forward, we'll be able to use that information automatically. And then we can aspire to apply the algorithm beyond the relatively easy MARC records to our other forms of cataloging, to catch the rest of our retrospective works. We are expecting to add a little bit of work to the process as we apply the algorithm to the full corpus: we are going to do some manual random sampling for quality control, to ensure that the algorithm performs at scale with the same effectiveness as in our small pilot study. That's one of the ways we're addressing the scale issue that Stephen mentioned earlier. We're also going to investigate ways we can make it even more effective and do more automated determination. Right now there are some things we would just have to determine manually; we would like to investigate integrating with external databases for better author birth and death information, and investigating the automated identification of copyright notices. So those are some of our aspirational goals. Just in summary, we feel like this really has validated the idea of an automated approach to rights determination: it's technically possible, and we feel the results really are reliable. We'll continue to test that and do some manual checking as we go forward, and we'll let you know if we hear otherwise, but it really feels like this is a very reliable process for moving forward. In the chat, and again here on the slides, is the link to the wiki where we have the full report. It has a lot of detail, and I encourage you to take a look at it at your leisure. And we wanted to make sure we thanked our colleagues who are not with us here today. Wendy Gogel, again, was really the visionary who started us off on this work.
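The manual random sampling for quality control could be done along these lines. This is a hypothetical sketch, not the project's stated method: the function name and the stratified-by-status design are our assumptions, shown because sampling within each assigned status keeps rare outcomes (such as the small No Copyright in the US bucket) from being missed in the review.

```python
import random

def qc_sample(records, per_status=5, seed=42):
    """Draw a reproducible random sample of algorithm outputs for
    manual quality-control review, stratified by assigned status.

    records: iterable of (record_id, status) pairs.
    Returns a dict mapping each status to up to per_status record ids.
    """
    rng = random.Random(seed)        # fixed seed -> repeatable sample
    by_status = {}
    for rec_id, status in records:
        by_status.setdefault(status, []).append(rec_id)
    sample = {}
    for status, ids in by_status.items():
        rng.shuffle(ids)             # random order within each stratum
        sample[status] = ids[:per_status]
    return sample
```

A reviewer would then check each sampled record's assigned status against a manual determination, estimating the error rate per status rather than only overall.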
Jonathan Hort provided us with a ton of legal analysis and consultation; all errors are, of course, our own. And Vanessa Venti and Robin Wendler were wonderful data wranglers and metadata gurus for us, and did a tremendous amount to help us figure out how our systems work and how we could work with the metadata they hold. So with that, we'll move on to the questions phase, over to the chat.

There's an initial question here asking whether our intention is to make this a shareable open source tool. In theory, yes; we would certainly be open to doing that with everything we've done. However, the nature of our implementation means we're not sure how useful it would be for us to actually give you our code. As I said, we are not running against our native catalogs; we're running against a custom central discovery catalog that we built here, which has that normalized MODS in it, and that's the source of the data we're pulling. So unless we shared that catalog with you and you populated it, the actual code itself, I don't think, would be as useful as you might think. But the report, the algorithm itself, and the whole set of normalization rules are all available at the link you'll find here, freely available, and we would encourage people to look at them and review them. We would like to hear back whether they seem to make sense; if you have other opinions, that would be a wonderful conversation for us to open up, as we certainly feel it would be very useful if we as a community could slowly converge on a set of common criteria that we could all be using to make these kinds of determinations, whether manual or automated. So there's a question from Mike Furlough regarding the renewal data in the Stanford database. Kyle, I don't know if you want to take that one.
You did mention it a little, but if you want to add to that. I mentioned it as a methodology for assessing items which we think fall inside that window, but as Michael points out here, it mostly concerns works less than 95 years old. You know, I have a lot of things to say about that, but for purposes of this project we're just defaulting those to in copyright. The Stanford database is not as big and vast as you might think; as I said, renewals were comparatively rare, covering a small fraction of the total copyrighted works. So it could be integrated later, but I think that's for down the road; for now we need to solve our longer-term problems and find as much low-risk public domain material as possible. But that's a great question.

Erik Mitchell has a very good question for us on whether we have any insight into how we might improve metadata quality over time. Well, that is an excellent question. Not at this time. I think there are a variety of experiments in our metadata management group to do that processing, to improve metadata in bits and pieces, but we have a lot of metadata, so to really standardize it and clean it up is an extremely challenging process. The tool that Stephen mentioned, our LibraryCloud database, does attempt to normalize metadata, which is a way of working around metadata that can be very inconsistent and messy. That said, when we do encounter errors in the process, either recoverable or unrecoverable, it all gets logged, and the idea was always that at some point we would feed that information back to our cataloging groups.
Like in all of your institutions, I'm sure, they're facing such an enormous backlog just trying to keep up with new acquisitions that it is difficult for them to set aside time to go in and clean things up, especially a lot of these items which are really, really old and probably don't rise to the top of their priority list.

I'm just going to chime in here; this is Diane. It doesn't look like we have any more questions in the chat. So thank you so much to all of our panelists; we very much appreciate your coming and talking about this wonderful project. There's clearly a lot of interest. Folks, if you want to add questions or comments in the chat as we move along through the rest of the meeting, please feel free to do so, and hopefully our panelists will chime in with responses. And with that, I see we are reaching the end of this time, so thanks again to all of you; we very much appreciate your taking the time to come and talk to us today. And thanks to all of our attendees. We're going to break for just about five minutes while we transition to our next session, which will be Octopus, with Alexandra Freeman, a project out of the UK which is helping reshape the research record for science. So stay tuned for that, and thanks, everyone. We will see you in about five minutes.

Thank you so much for that presentation; it was really nice and really encouraging, because this is such a problem. I hope you will come back and tell us what happened after you run the 16 million records, especially if it's successful. Definitely. All right, thank you so much.