My name is Melissa Levine. I'm very glad to see all of you. I am the director of the Copyright Office at the University of Michigan Library. Thank you for joining us. We've been having a larger conversation among the three of us, and some other colleagues across the profession, about publishing and using historic copyright data online in new modes, and we have several different projects looking at how to make that data more accessible and more usable. This is John Mark Ockerbloom. He is the digital library strategist and metadata architect at the University of Pennsylvania. It's really hard; I just feel like saying "really smart people who do stuff," but there's an actual title. John is going to report on his IMLS-funded project at the University of Pennsylvania, where he is looking to publish a comprehensive online inventory of first copyright renewals of 20th-century serials in order to make it easier to establish their public domain status, and how putting that data online has also enabled interlinking with rights registries, crowdsourced bibliographic databases, and Wikidata. John is the only one on the panel who has a Wikipedia page, so what he says is worth listening to. We don't have a microphone, so I will try to speak up; I am more soft-spoken than I mean to be. My other colleague is Greg Cram. He is the associate director of copyright and information policy at the New York Public Library, and he's going to be reporting on a project he's been doing on the structured digitization of original registrations, and he'll discuss how this can be used for rights determination and more. So I'm just setting the stage; most of our time is going to be spent with my two colleagues. This is a story about how we have been trying to do some very difficult things for a very long time.
We are talking about collections, collaboration, really this ongoing conversation, because it's driving us all nuts that we can't deal with what should be a fairly simple question: the copyright status of works in our collections. It's very frustrating. The library community in particular, but also museums and archives, are very committed to getting it right, and they're very respectful of the law. In some ways, this is reflected by the number of grants the IMLS, for example, has issued that deal with some aspect of this problem, including some of the ones I'll talk about today. I also participated in a project to produce a book called Rights and Reproductions, edited by Anne Young at the Indianapolis Museum of Art, which is a great resource for museums, libraries, and archives. This conversation, though, has involved a lot of tenacity. Every time I try to put this issue down and let somebody else carry it, we just can't let it go. So I'm going to show you a few things just to give you context. This is a screenshot of the Library of Congress's Copyright Office website. For the last 20 years or so, you have been able to search Copyright Office registrations on their website, but only for 1978 onward. A decision was made, I think, early in the Copyright Office's investment that what people really need to know about is all that new stuff, and that nobody really needs to know about the old stuff. So it wasn't invested in. This is the second half of that page, and it shows how you would search: by title, name, keyword, registration number, document number, or command keyword. So it's not a very granular kind of search, and it doesn't hit a big swath of the period that really matters to us.
Now, I'm jumping around a little bit in time, but in 2007 the Stanford Libraries (hi, Mimi) launched what was really a pivotal project called the Copyright Renewals Database. They created a tool for searching renewals of books published in the United States. It only covers books, and it only covers renewals, in part because if a book from this period wasn't renewed, it's probably in the public domain. So this was not only an incredibly useful tool, but an incredible proof of concept and a sense of possibility. Oh, they didn't teach us button pressing in law school. Okay. Stepping a little further back for context: Project Gutenberg goes back, I think, to 1971, and the Stanford project relied in part on Project Gutenberg's data. So we're all standing on each other's shoulders. The Stanford project, in turn, was really important in developing what became the Copyright Review Management System. I called this session Hiding in Plain Sight; you'll recognize this image as one of those color blindness tests, and that's the metaphor. My sense is that we have a lot of this information; we just can't interpret it, we can't see it. This project, if you're not familiar with it, involved three grants from the IMLS and a tremendous amount of cost sharing. I don't recommend that part; it's really complicated. But it involves 17 libraries and over 60 reviewers across all of the different libraries. And we won the American Library Association's L. Ray Patterson Copyright Award for work on intellectual property. I'm particularly proud of that because it's the first time the award has been given to a group, and I think that's really significant and goes with the theme of my talk. The Copyright Review Management System project was possible only because of the interface we were able to develop based on the scans we had in HathiTrust.
So there were actually human beings who looked at an interface where you could pull up a book, an actual scan, and make certain determinations as to whether it was or wasn't in copyright. We were really just looking at books, and again, 1923 to 1963. The project continues at a much reduced scale and is currently part of HathiTrust's operations; my colleague Kristina Eden now works with HathiTrust. These numbers are, I think, from November 1: total determinations, almost 315,000 works so far. Okay, so all of this has been going on over the last many years, and Greg and I have been talking for at least five, probably more, of those years. There's been some progress. The Copyright Office has articulated certain modernization efforts; we got really excited about this about 10 years ago, so we're waiting. They created something called the Virtual Card Catalog as a proof of concept for registration cards, which are in this format, like a card catalog, for works registered 1870 to 1977. There's another tool called the Catalog of Copyright Entries, which I'll explain momentarily. In that timeframe, Zooniverse, if you're familiar with the crowdsourcing tool, has become another possibility for transcribing some of these records. We're also looking at how machine learning might help with some of this; when I explain what the CCE is, this will become more apparent. And Greg and I in particular have had an ongoing conversation with two colleagues at George Washington University Law School, Zvi Rosen and Bob Brauneis, who are obsessively looking at this data. This is a picture of Zvi and Greg, I want to say in 2016, so the conversation continues. This is how all the data is organized in these card catalogs. So the Copyright Office's Virtual Card Catalog is just that.
If you've ever tried to do a copyright search, you basically have to understand a very 19th-century kind of process, and you need to move around these materials. So frankly, even if you really know what you're doing, there's a degree of less-than-entire confidence in your results. The Virtual Card Catalog looks like that; it's a virtual card catalog. You can, quote, go into these virtual drawers and do a search. I don't mean to be completely dismissive; it may be, I hope, a good first step, but all it does is replicate the existing analog universe. So there's a little more imagination and investment, I hope, still to come. The point is, it hasn't gotten a lot easier, even though we're developing more and more tools. I think we're getting there bit by bit, but we still struggle with copyright determinations, and we still struggle with this concept of orphan works. So, I mentioned the Catalog of Copyright Entries. Are any of you familiar with it? Okay, so I should explain what it is. These were basically ledgers published by the US Copyright Office each year and distributed to federal deposit libraries through the 1970s. So if you were at a physical deposit library, you could see a ledger for a given year, and there was a separate listing for each category of works: for books, for serials, for visual arts, and so forth. You still have to move around the ledgers in the same way you would have had to move around the catalog records. And there's some question about whether these entirely match up, whether you have the same level of confidence. Nope, this is definitely a joke: it would be "good enough for government work." The ledgers have been scanned by HathiTrust, and you can also see them in the Internet Archive and on The Online Books Page.
So this, not randomly; I know HathiTrust, I work with them a lot. This is a screenshot of what the top part of the page looks like in HathiTrust, and because all the material is in the public domain, it's viewable. You can click on any one of these options, and a typical page will look something like this. Now, the good thing about this is that it's typed or printed, not handwritten. Many decades of the cards at the Copyright Office are handwritten, or typed, in lots of different formats, so I don't think they lend themselves easily to OCR. But this probably does. At the very least, there's a consistency that allows for transcription. So Greg and I have been in a conversation; I'll tell you about his tack in a moment. What I had started wanting to do was a Zooniverse-style transcription project, starting with maybe one of the smaller categories. Visual arts is probably the category with the fewest registrations, so we'd exhaustively do that particular category over all the years and see what that looks like. With the Zooniverse tool, I think most of what we would end up investing in is, frankly, the publicity and engagement to keep people engaged, though a lot of that also could be managed by graduate students. So I wanted to do a crowdsourcing thing. I also saw, at a WebWise meeting a few years ago, somebody from Dartmouth, Mary Flanagan, I think, talking about the psychology of gamification and how she had developed several tools geared toward transcribing or describing archival materials. So there'd be a picture of a lumberjack, and people would say lumberjack, tree, ax; and if you had hammer, head, sharp, that would be an outlier and get kicked out. So I was thinking about how these tools could be used together to accomplish several different things.
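The outlier-kicking idea just described can be sketched as a simple agreement filter over crowdsourced tags. This is a hypothetical illustration, not code from any of the projects mentioned; the function name and threshold are assumptions.

```python
from collections import Counter

def filter_tags(tag_sets, min_agreement=2):
    """Keep tags that at least `min_agreement` contributors supplied.
    Tags suggested by only one contributor are treated as outliers
    and dropped, much like the gamified tools described in the talk."""
    counts = Counter(tag for tags in tag_sets for tag in set(tags))
    return {tag for tag, n in counts.items() if n >= min_agreement}

# Three people describe the same picture of a lumberjack; the third
# contributor's tags don't agree with anyone and get kicked out.
contributions = [
    ["lumberjack", "tree", "ax"],
    ["lumberjack", "ax", "forest"],
    ["hammer", "head", "sharp"],
]
print(sorted(filter_tags(contributions)))  # ['ax', 'lumberjack']
```

A real project would tune the agreement threshold to the number of contributors per item, but the principle is the same: independent agreement is the quality check.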
Greg's taking another path, but he's actually doing something, and I'm just talking still, so that's good. I had sort of tabled this, and then last summer I was at the Huntington Library and I saw this project, which I believe was also supported by the IMLS. The Huntington has a collection of Civil War cipher code books, so they're not readable, and they did a Zooniverse transcription project so that you can actually access and search these materials now. And I thought, this is fate telling me I'd better get on the stick, because this is exactly a proof of concept of what I'm talking about. The last two things I wanted to say: I am talking to some of the people at the University of Michigan School of Information to see whether, given the consistency of the CCE, we could do something with machine learning. Maybe Stanford can do that for me. But it seems possible that you could get through 40 to 60% of the work, at least as a first run, with some kind of machine learning tool. I don't know, but you certainly could do that while you're doing the crowdsourcing and while you're looking at quality confirmation of what Greg's going to tell you about. The last thing I wanted to mention, there we go, is just to give you a final technical sense of the categories in the Catalog of Copyright Entries. They cover books and pamphlets, including serials and contributions to periodicals; periodicals; dramatic works, including works for oral delivery, so speeches; music; maps and atlases; works of art; scientific and technical drawings; and so forth; and commercial prints and labels. I don't think we're going to solve the entire problem for the 20th century, but we will get, I think, a long way into it, if I can do a search and find every book about John, every movie John made, every poem written about John, and then be able to make some determinations about the interrelationships among those things.
The other last detail is that this includes assignments and transfers of copyright, which is a really important component of making the best guesses as to where the rights might reside. So with that, I relinquish the clicker. Well, you just saw from Melissa the various categories of copyrights that the Copyright Office tracks. I'm going to be mostly talking about periodicals, or more generally serials, which basically includes periodicals that aren't quite so periodic. I'd like to talk about a project funded by the Institute of Museum and Library Services that's designed to bring a lot of that 20th-century material to light. And I also want to show how the data we've compiled in this project can contribute, and how we're actually starting to link it to a much larger set of useful knowledge about our cultural heritage. So I'll be talking about a guide our project just released in draft form to help people determine whether serial publications, journals, magazines, newspapers, and so on, are under copyright or in the public domain. You can use it to identify material in the public domain as late as 1989, so you can do things like digitize it. And I'll talk about the data that supports this guide. Specifically, as Melissa was saying, we have a complete inventory of all periodicals published up to 1950 that have copyright renewals, and we have this now increasingly in structured, machine-readable form. That set of data includes both the inventory and a growing knowledge base of machine-readable periodical information that's increasingly linked with other knowledge bases. And finally, I'll discuss some ways we'd like to work with others in taking advantage of this data and growing it. Having this data online could help a lot in doing research. My wife works for the Science History Institute; she writes a lot about various scientists from the 20th century.
She was recently working on an article about Alice Hamilton, who was one of the people who established the field of industrial toxicology; that's basically studying the effects of chemicals on people who worked with them or lived with them. She was one of the first people to sound the alarm about lead in gasoline. And she had a long career, both as a scientist and as a social activist, working with groups like Hull House in Chicago. Now, we have pretty good access to research libraries where we are, but there are definitely some gaps in what's really available online by or about Hamilton. You can find a fair bit of what she published in her early career, because pre-1923 copyrights are no longer in effect. And you can also find some websites created fairly recently about her. But it's hard to find a lot of the things that she published, or that were published about her, during the later part of her career. And there's a fair bit out there, a lot of it in periodicals of some sort: she published in scientific journals, and she also wrote in magazines for the general public. This is a common problem if you're interested in the 20th century, between the rise of commercial radio and the rise of the World Wide Web. Copyright, or often uncertainty about copyright, keeps many sources from the era offline and largely unavailable to people without easy access to research libraries. Basically, if something was published after 1922, it might be under copyright, and all too often things that might be under copyright aren't online. But there are various ways that works published after 1922 can be public domain now. They can be government works, as Melissa mentioned, that are not subject to copyright, and HathiTrust has made a lot of those available online. They can be works published before 1989 without a copyright notice.
And that actually covers a fair number of interesting sources that weren't necessarily published commercially, including things like newsletters and bulletins of scholarly societies, and zines, and many other publications that are on the margins and are quite interesting for that reason. And then finally, there are works published before 1964 whose copyright was not renewed, and as you've seen, there are a lot of these. I have to give a lot of credit to HathiTrust and their Copyright Review Management System: as you saw from Melissa's slide, they've opened up more than a quarter million books published after 1922 that they've identified as public domain, usually because there was no renewal. They've developed an efficient system for identifying those books at scale; you don't otherwise open up a quarter million books unless you've got that. But they haven't opened up post-1922 serials like they have books. And I can understand why, because if you're opening up a book, you often just have to find out whether the book was renewed. You look it up in the Stanford database or wherever, and you're done. Serials are harder, because each issue of a serial might have its own renewal, but each contribution to a serial might also have its own renewal. That could be every story, every article; that's a lot of copyrights you might need to check. Now, fortunately, there are ways to avoid having to check every last article. What saves us is the fact that relatively few periodicals had any renewals at all. I found this out back in 2006, when I looked at periodicals published between 1923 and 1950 in JSTOR and found that only a small portion of those periodicals, that thin slice in yellow on the right there, had any issue renewals. And only that tiny red sliver at the very top renewed all their issues. It turns out that the picture doesn't actually change that much if you add in contribution renewals.
So the number of periodicals that have either type of renewal is small enough that we can go through all the volumes of the Catalog of Copyright Entries that cover periodicals, make an inventory of all the periodicals that had an issue or a contribution renewal, and record the date of the first issue or contribution renewal to appear. And that's what we did; well, myself and a couple of interns, with IMLS support. We found, after we'd done that, that not only is a lot of pre-1964 scholarship unrenewed, but so is a lot of the material in pre-1964 newspapers, and an awful lot of the material in special-interest or local-interest serials. Even in popular national magazines, where we see renewals in larger numbers, there are still significant portions that were not renewed. So how do we tell what's public domain and what isn't? Well, here are three articles that are mentioned in Alice Hamilton's Wikipedia article. They're all from that pre-1964 era when works had to be renewed to stay under copyright. But were they renewed? There are two here from the Atlantic and one from the New York Times. Let's take that first one from the Atlantic, the April 1933 article about Hitler. As I said, we have a guide we've prepared; it's now available in draft form at this URL. You don't have to scribble it down; we'll release the slides. You can basically follow through it section by section, a checklist of questions you can ask to determine whether the thing is in the public domain. So let's walk through it using that 1933 Atlantic article. First, we have a couple of sections just to make sure what you're interested in is in scope. We're only covering serials published in the US; we don't want to deal with non-US publications, or things that weren't published, because those have complicated rules and we don't want to get into that.
And for US publications, our guide only covers things after 1922, because anything before then is definitely public domain already; just as, starting next month, everything from 1923 will be public domain, and I've been waiting for that for a really long time. We also throw out anything from 1989 or later, because unless it's something uncopyrightable, like a government publication, we can assume it's under copyright. Okay, then we have a few sections, which I won't go into in detail, that check whether there are any special cases you have to do extra copyright checks for. If the serial has things like maps or drama or music, any of those special categories Melissa mentioned, we have not systematically surveyed those categories, so you have to do some extra checks. Fortunately, most periodicals are not like that. There are also a few things, like annuals and similar publications, that might have been renewed as books, so you might have to search the Stanford database for those as well as doing periodical checks. Most serials do have images of some sort, and some of those do introduce complications: syndicated comics actually did get renewed a lot if they were nationally syndicated, and they appear in a lot of newspapers. But other than that, the renewal rate for images on their own is so low in the period we looked at that, in practice, you don't have to check them separately unless they have their own copyright notice or a credit indicating they came from some other publication. So let's go to the heart of the checklist. This is a simple date check. Look at the date of the issue, or the volume, or the article you're interested in; remember, we're looking at an Alice Hamilton article from April 1933 in the Atlantic. Now do a quick lookup to see if that serial has any copyright renewals on or before that date.
If you're looking at something published in 1950 or later, you can search the Copyright Office's online catalog that Melissa showed earlier. And if you're looking at something published in 1950 or earlier (for 1950 you have to do both, sorry), you can look up the serial's title in the renewals inventory that we've produced. If it's not in that inventory, then there wasn't a renewal for that serial. So let's take a look at the inventory. As I've said, and I hope this is somewhat readable, we have a list of periodicals that includes every one that had a copyright renewal before 1950. The address, again, is on the slide, and you don't have to copy it down. It's arranged alphabetically, so if we're wondering about an April 1933 Atlantic article, we Control-F or scroll down to "Atlantic." And there it is. We see there that the first issue of the Atlantic to be renewed is March 1934. That's a good sign for our 1933 article. But we also see that the first contribution to the Atlantic with an active renewal is from 1923. That's not so good; it suggests that maybe what we're interested in was renewed. Now, if I were trying to copyright-clear a whole bunch of stuff at once, like CRMS does, I could just stop here and move on to the next item, fine. But maybe I'm really interested in this article, and if I am, I can take some more time to check things out. That's what the next couple of sections in our guide are for. They describe a few other checks you can make to zero in on the item you're interested in. Like: was there a copyright notice on the publication? There is for the Atlantic, yes, and for most other commercial publications, but there might not be, as I say, for stuff that's more outside the mainstream. Or, more to the point: was there a renewal for the specific issue or article you're interested in?
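The first-pass date check just described can be sketched as a small function. This is a hedged illustration of the logic, not the guide's actual wording; the function name and return strings are assumptions, and the real checklist has more branches (special categories, notices, and so on).

```python
from datetime import date

def first_renewal_check(issue_date, first_issue_renewal, first_contribution_renewal):
    """Rough first-pass check for a pre-1964 US serial item.
    If the serial's earliest known issue renewal and earliest known
    contribution renewal both come after the issue date (or there are
    none at all), no renewal could cover that item. A value of None
    means no renewal of that kind is known for the serial."""
    renewals = [d for d in (first_issue_renewal, first_contribution_renewal) if d]
    if not renewals or min(renewals) > issue_date:
        return "no renewal on or before this date; likely public domain"
    return "a renewal may cover this item; check issue and contribution renewals"

# The Atlantic example: an April 1933 issue, first issue renewal March 1934,
# but a contribution renewal as early as 1923, so a closer look is needed.
print(first_renewal_check(date(1933, 4, 1), date(1934, 3, 1), date(1923, 1, 1)))
```

The point of structuring it this way is that for bulk clearance (the CRMS-style workflow) you can stop at the first "may cover" result and move on, while a researcher interested in one article goes on to the finer-grained checks.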
Well, we point to some data sources, like the online Catalog of Copyright Entries that Melissa showed, that you can search to see if there's a renewal for your issue or article. You could even go through and survey them to compile all the renewals for your serial over a particular time period. That's time-consuming, but it can be done. We certainly don't have the time to survey all the serials in our inventory like that, but we have surveyed a few of them, and we're happy to take data from other people who want to adopt other serials and do that. In particular, this is where we get to the structured data part of the talk. You may have noticed, when you were looking at this inventory, that a lot of the titles have web links. Clicking on one of those links takes you to a page with more information on that title. If we click on the link for the Atlantic, for instance, we find that we have a lot of information on it besides just the first renewals. We have links to online digital content; we've got links to Wikipedia and Wikidata; and we have a list, in fact, of all the issues that were renewed through the 1940s. If we scroll down, we also find that someone has added information on all contribution renewals through the 1930s. And we see that, okay, for April 1933, yes, there are some renewed articles here, but not any article by Alice Hamilton. So we're good. All right, that someone was actually me in this case, but it doesn't have to be. It could be you, or it could be somebody else who's interested in this particular serial. Behind this page is a JSON file that I created, or that you could create, or somebody else could create. And that JSON file has a lot of structured information on that serial. Among other things, it includes structured, linkable information on each renewal, including dates in ISO format, names that use authority identifiers, and more.
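To make the "structured, machine-readable" point concrete, here is a minimal sketch of querying a record shaped like the one just described. The field names and the second author are hypothetical; the project's actual JSON schema may differ (its field documentation is forthcoming, as noted later in the talk).

```python
import json

# A hypothetical serial record with ISO-format dates, loosely modeled on
# the description in the talk; field names are assumptions.
record = json.loads("""
{
  "title": "The Atlantic Monthly",
  "first_issue_renewal": "1934-03-01",
  "contribution_renewals": [
    {"date": "1933-04-01", "author": "Edith Wharton"},
    {"date": "1933-04-01", "author": "Another Author"}
  ]
}
""")

def renewed_authors(record, iso_date):
    """Authors with a contribution renewal on the given ISO date.
    Because the dates are ISO strings, plain equality works here."""
    return sorted(r["author"] for r in record["contribution_renewals"]
                  if r["date"] == iso_date)

# Alice Hamilton is absent from the April 1933 renewals, so her article
# appears not to have been renewed.
print(renewed_authors(record, "1933-04-01"))
```

This is exactly why ISO dates and consistent field names matter: the same check that a human makes by scrolling the page can be run by a script across every serial in the inventory.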
All of these JSON files are retrievable from our site, or you can download them in bulk from GitHub. And the structure enables a number of interesting applications. For instance, you could take this set of JSON files and a spreadsheet of serial holdings that has ISSNs and date ranges, and run a script that automatically identifies serials in those holdings that are likely to have public domain issues well past 1922. Or you could take the name authority identifiers and use them to link to further information about a rights holder. I mentioned there are a few April 1933 renewed articles. There's one here by Edith Wharton. You'll notice a permissions link next to her name; if you click on it, you go to a database at the University of Texas called Writers, Artists, and Their Copyright Holders, or WATCH. That tells you who to contact, and where, if you want permission to use Edith Wharton's article. And we link to other external sites. We link to sites that have tables of contents compiled by publishers or fans of a particular serial. We don't have one for the Atlantic, but there is one here for Galaxy, which was an influential magazine that published science fiction stories. So if we take the contents lists that fans of Galaxy have compiled, and we subtract from them the list of copyright-renewed contributions to Galaxy that we've compiled, the result is a list of contributions to Galaxy that do not appear to have a renewal. Then you can use that list of stories to double-check and digitize, as people are now doing. Here's one of those stories from the first issue of Galaxy that's now online at Project Gutenberg. There's one last kind of link I want to mention, and that's links both to and from Wikidata.
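The holdings-matching application mentioned above can be sketched as follows. Everything here is a hypothetical illustration: the record shape, CSV column names, and sample serials are assumptions, and the year logic is simplified to the 1923-1963 renewal window discussed in the talk.

```python
import csv
import io

# Hypothetical serial records shaped like the project's JSON files.
serial_records = [
    {"issn": "1234-5678", "title": "Example Review",
     "first_issue_renewal": "1948-06-01", "first_contribution_renewal": None},
    {"issn": "9999-0000", "title": "Example Gazette",
     "first_issue_renewal": None, "first_contribution_renewal": None},
]

# A hypothetical holdings spreadsheet with ISSNs and date ranges.
holdings_csv = """issn,title,start_year,end_year
1234-5678,Example Review,1930,1960
9999-0000,Example Gazette,1923,1963
"""

def likely_pd_ranges(records, holdings_text):
    """For each held serial, report the held years within the 1923-1963
    window that fall before its earliest known renewal; issues from those
    years have no renewal that could cover them."""
    first = {}
    for r in records:
        dates = [d for d in (r["first_issue_renewal"],
                             r["first_contribution_renewal"]) if d]
        first[r["issn"]] = int(min(dates)[:4]) if dates else None
    out = {}
    for row in csv.DictReader(io.StringIO(holdings_text)):
        cutoff = first.get(row["issn"])
        end = min(int(row["end_year"]), 1963)
        if cutoff:
            end = min(end, cutoff - 1)
        if end >= int(row["start_year"]):
            out[row["title"]] = (int(row["start_year"]), end)
    return out

print(likely_pd_ranges(serial_records, holdings_csv))
```

A real script would also honor the guide's other checks (notices, special categories, book-style renewals), but this shows why machine-readable ISSNs and ISO dates make bulk screening a simple join rather than a card-by-card hunt.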
Wikidata is building up a growing corpus of bibliographic information to support a wide variety of projects, including WikiCite, Wikisource, and things like WikiProject Newspapers, which is trying to put those things online. A while back, Wikidata folks reached out to me, and they created a data property to link to the copyright information I have in my dataset, and I returned the favor. As a result, I can do things like link out to Wikipedia articles automatically, retrieve ISSN data, and get access to a lot of other data on a serial that I'd rather not have to manage myself. And they get easy access to my knowledge base to do things like clear serial issues for digitization. So, to review, here's what we've done so far. We have a large set of increasingly structured data on serials and their copyrights that's now comprehensive enough that you can use it, along with other data, to make practical determinations of whether serial publications from a large part of the 20th century are under copyright or in the public domain. We've published a guide for using this data to make those determinations; you saw a bit of it before. It's currently in draft form, but we hope to make an official release as part of our celebration of Public Domain Day on January 1st. And we've defined a set of data structures that others can also use to contribute further copyright information and link it to a variety of related data sources elsewhere. Basically, structured data is great; you'll hear more about this from Greg in his presentation. So what's next? Well, we're hoping to improve our draft guide with feedback from the community. If you do want to read it over and get comments to us by Christmas, we should be able to take them into account for the January 1st release. And we'd like to test the guide ourselves at Penn to clear some serial copyrights.
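As a small illustration of the Wikidata side of this linking, here is a sketch of building a SPARQL query for the Wikidata Query Service to retrieve a serial's ISSN. The query structure and, I believe, the ISSN property P236 are real Wikidata conventions, but the item ID used here is a placeholder, not any particular serial.

```python
def issn_query(qid):
    """Build a SPARQL query for the Wikidata Query Service
    (https://query.wikidata.org/sparql) that fetches the ISSN(s)
    of the item with the given QID. P236 is, to my knowledge,
    Wikidata's ISSN property; the QID is whatever item represents
    the serial you're interested in."""
    return (
        "SELECT ?issn WHERE {\n"
        f"  wd:{qid} wdt:P236 ?issn .\n"
        "}\n"
    )

# "Q12345" is purely a placeholder QID for illustration.
print(issn_query("Q12345"))
```

Posting that query string to the endpoint (with an appropriate User-Agent and `format=json`) returns the ISSNs, which is exactly the kind of data it's nicer to fetch from Wikidata than to maintain oneself.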
If you'd like to try it out yourself for serials that you're interested in, we'd love to hear from you how it went and see what we can do to improve the guide. And we're going to keep growing our structured dataset, since we want other people to use and contribute to it. Our first order of business is going to be publishing documentation on the fields used in the JSON files you just saw, so you can see what goes in them and what to put in them. We'll also continue to create JSON files and Wikidata links for the serials already in our inventory; they're not all JSON-ified yet, but they're growing that way. If there are particular serials you're interested in, please let me know; we might be able to show you how to create some files for them. In a few cases, somebody has asked about a serial we didn't list; in one case, serials on Mennonite history. They asked about two serials, I looked them up and found only a renewal or two between them, so we just created the files and put them up. So that's fine. And finally, we'd love it if you'd just spread the word about this work. Our guide will be CC BY once it comes out of draft status. Our data is CC0, so you can port it around as you like. We hope it can be used and adapted by all sorts of libraries and cultural heritage projects. Now, I've got a lot of people to thank for helping with this work. Really quickly: big thanks to the IMLS, again, for funding it, and to the Penn Libraries for supporting it; to our wonderful interns, Allison Minor and Carly Sewell, who did a lot of the data wrangling; and to folks at places like HathiTrust and elsewhere who supported our grant proposal and have looked over drafts of the guide. Thanks also to the information communities I linked to, especially the Wikidata community, for reaching out to me. You can find out more at the URLs in red.
And if you'd like to learn more or help out, I'll be glad to talk with you here, or contact me at any of the links below. Thank you very much. With that, I'll turn it over to Greg.

Thanks, John. So just to emphasize, this data that we're talking about is the record of creativity in the United States. It is the most complete record we have of the creative efforts of people in the United States. These records document a significant part of the literary, musical, artistic, and scientific production of the United States from 1870 to 1977. And unfortunately, as we've said, these records are locked away in the Copyright Office. As Melissa said, you can go to the office, it's just down the street, a little ways down the road. You can go wander the catalog, pull out some drawers, and look at some of the cards. And for years, these records were accessible only if you traveled to DC to see them in person or you paid someone to do that search for you. And as you might guess, that search is expensive. So there's been some progress here. Melissa said these cards are making their way online today. The virtual card catalog is a great first step. It's a first step, though. It's great that we have access to these digitized cards. We can see the scans, and we appreciate that the office was not gonna sit on the scans for another decade waiting for another search engine to be built. So we're happy that the Copyright Office put them out in any form. But the experience of using the virtual card catalog is nearly identical to using the physical collection. You gotta know which drawer to go to. You gotta open that drawer. You gotta flip through the cards to see which card is the one you're looking for. And oh, by the way, there are 45 million cards. So one of the problems with the cards is that they are, for the most part, just images of the cards. There's no OCR.
There's no optical character recognition of the cards, or at least none that you can search on today. And that means you've gotta use them as though they were a 19th-century technology. You can't run searches over those cards. Melissa said there's another set of records where the information from the cards goes, and that's the Catalog of Copyright Entries, the CCEs. Every six months, the Copyright Office issued these CCEs, and many libraries, including almost all of yours probably, have these on microfilm today. The Copyright Office, in cooperation with the Internet Archive, imaged them, took photographs of them. Here's the renewal for F. Scott Fitzgerald's The Great Gatsby, and it looks like this. The data that you see here is actually pretty valuable. It tells us that the book was renewed in 1953. It also tells us more: it tells us that it was a child of F. Scott Fitzgerald who actually filed the renewal. And so now I have another name to go find if I were gonna try to clear this book. Unfortunately, though, with the CCEs that are currently available through the Internet Archive, the experience is similar to that of using the physical CCEs. You've got to page through. You've got to know which volume you're gonna go through, because the OCR that was run over these things is not high quality, let's say that. And therefore it produces a lot of false positives and a lot of false negatives. So if you did a search for a word, you might get lots of search results, only some of which are good for your search, and likely you've missed some because the text is just not there. And it's important to have really accurate data because, as you've heard from these two, what we're looking for in many cases is the absence of data, the absence of a renewal.
So if you don't feel confident in the data source, if running a search and getting a null result doesn't let you say, well, that means there was no renewal, that the record genuinely does not appear in the data, then you can't rely on that search. So others have tackled this problem of transcription in small slices of the record. We all pay homage to the Stanford renewal database, because it is the grandfather of a lot of this work, but it's not a complete dataset: it's just the renewals, and just for Class A book registrations. So this should be easier, right? You shouldn't have to sit through half an hour, almost 45 minutes, of us complaining about how awful these records are. The use of some of these tools that we've described requires specialized knowledge. You have to know which drawer the particular card is in; you have to know which page to look at in the CCEs, and which volume, and which date, and which type. Those searches all take time, and it takes effort to do this work. You've got to have specialized knowledge, and in some cases it's really difficult even for experienced folks. This should be easier. This is just data. What it should look like is something closer to this. You should be able to enter a database and do a search for a keyword that gives you highly accurate and well-formed results. For example, if you search for the words "flying express Dixon," you should get a search result that looks something like this. And here's all the information related to the book, The Mystery of the Flying Express, written by Franklin Dixon. And you should get all of this data in well-formed fields. And you should also be able to click on the button that says "original records," expand it out, and look at the actual records that the data is based on. You should be able to see the CCE. You should be able to see the renewal. You should be able to see the card catalog card.
And you should also be able to see any assignments and transfers that happened as the work was being processed in the Copyright Office. And these fields should actually be turned blue. They should be linked out. They should be in triples. We should be able to go find data sources. So, for example, if you were trying to find the publisher, Grosset & Dunlap, and find all of the books published by them, you should be able to click on that link and get all of the results. We're not there yet. So we gotta do some work. Before we can get to that point, we need to begin with the data. And our intervention into the space, the New York Public Library's intervention into the space, was announced in April of this year. We announced publicly that we were going to start this project that Melissa and I had been talking about for far too long, and launched this thing. Our goal is to make the Catalog of Copyright Entries, the CCEs, searchable and produce high-confidence search results. It's funded in part by the Ford Foundation and the Arcadia Fund, but that funding is only gonna get us so far. We're always looking for more funding. So we're gonna accomplish this task with two different methods. The first method is transcription. That means taking the image file, reading the letters, the numbers, the characters, and making them machine-readable. So, for example, this is the registration for that book, The Mystery of the Flying Express. It's a Hardy Boys book. Here's what an OCR result should look like. It's kind of unformed; it's just a blob of text, but it is a transcription. And there are problems in our data, for example, where I've got 1s that should be l's or I's. So we use humans, as sanity checks, to help us make sure we're getting the right 1s or l's. Our goal is to get to 99.9% character-level accuracy. That is a high level of accuracy, because we want accurate data.
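The transcription-and-review loop Greg describes, measuring character accuracy against a hand-corrected sample and routing confusable characters to a human, can be sketched in a few lines. The confusable set and tokens here are illustrative assumptions, not the project's actual QA pipeline.

```python
import difflib

def char_accuracy(ocr: str, truth: str) -> float:
    """Approximate character-level accuracy of OCR output against a
    hand-corrected ground truth, as a similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, ocr, truth).ratio()

# Characters OCR engines commonly confuse (1/l/I, 0/O, 5/S, 8/B).
CONFUSABLES = set("1lI0Oo5S8B")

def needs_human_review(token: str) -> bool:
    """Route a token, such as a registration number, to a human sanity
    check if it contains easily confused characters."""
    return any(ch in CONFUSABLES for ch in token)

print(round(char_accuracy("A1l2345", "A112345"), 3))
print(needs_human_review("A1l2345"))  # True
```

In practice the accuracy target (99.9%) would be checked on a sampled page against a full manual transcription, not token by token; this just shows the shape of the measurement.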
But transcribing the records only enables us to search the records, not to understand the data. And so the next part of this is to parse out those records. We want to break those records into their constituent parts so that we can search on fields. We can facet our searches to say, I only want to search for books written by Einstein, not books about Einstein. So this registration record for The Mystery of the Flying Express would be broken down into something that looks like this: title, author, publication date, registration date, publisher, place of publication, and the registration number, all the data fields that appear in this record.

So, problems. We've now done just about 30,000 pages of the 450,000 pages of the CCEs, and, surprise, surprise, we encountered some problems. Some problems are caused by us, and more problems are caused by the Copyright Office. Here's what I mean. The first problem is that when we worked with the data produced during the pilot, we basically produced two separate documents for each page of the CCE. One had the clean OCR and one had the clean parsing. The problem was that we didn't think about linking the two in the initial run of the first 10,000 pages. And so I've got to go back and link those two together, because in that search engine I showed you a little bit ago, those things have to be linked. So I've got some work to do to remediate the first 10,000 pages. We put a fix in for the next set of pages, but that was one problem. Another problem is bad printing. So, for example, the registration number in this record is really faint; it's on the third line, all the way on the right. At first we thought this problem was just in the Internet Archive's copy that came from the Copyright Office. So we went and pulled the scan from HathiTrust and found the same problem.
So I said, well, fine, we're gonna go check our microfilm and see if the problem persists across all three. And as it turns out, it does; it appears in all three. It's faint. So there was some bad printing in the CCEs. We also have more printing problems, namely inconsistent printing. I've got four different shades of typeface on this one page. And in this case, every odd page in this volume has this problem, not the even pages, just the odd pages. So clearly there was a problem at the printer, and this problem persists across all the copies. It's not just one scan; it's every single one. These printing errors cause the OCR, the process that we use to get the text out of the images, to fail every time. No engine is able to read all of this text in one pass. You have to do a lot of manipulation, and that means it's more expensive. Other problems are not in the printing but in the data itself. There are some edits that look something like this: small handwritten edits. And these edits were made, presumably, by the Copyright Office, because this scan came from the Copyright Office. So which registration number do I go with? Which number is the canonical form? Here's another one: someone in the Copyright Office decided to paste a record into the CCE. Luckily they did it in the right column, at the right column width, so, great. But that raises some authenticity issues. Is this the canonical record, or is the record that wasn't modified the canonical record? Here's another one. I've got examples of someone having a really bad day. They took catalog cards, photocopied them, and then glued them into the front five or ten pages of a particular volume. So it's not in the normal form; it's just a catalog card that someone pasted in. Is this a canonical record? I don't know. Our approach is just gonna be to digitize it, to convert it, and note where it appears.
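The field-parsing step Greg described a moment ago, splitting a printed entry into title, author, publisher, and registration number, might be sketched like this. The entry text and the regular expression are simplified, hypothetical illustrations; real CCE entries vary far more in layout.

```python
import re

# A hypothetical, simplified CCE-style entry; real entries vary widely.
entry = ("DIXON, FRANKLIN W. The mystery of the flying express. "
         "New York, Grosset & Dunlap. © 15Sep41; A157573.")

# Pattern for this simplified shape:
#   AUTHOR. Title. Place, Publisher. © date; reg number.
PATTERN = re.compile(
    r"^(?P<author>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<place>[^,]+),\s+(?P<publisher>[^.]+)\.\s+"
    r"©\s*(?P<reg_date>\S+);\s*(?P<reg_num>[A-Z]+\d+)\.$"
)

m = PATTERN.match(entry)
fields = m.groupdict() if m else {}
print(fields)
```

Each named group becomes a facetable field, which is what makes "books written by Einstein, not books about Einstein" possible once the data is parsed.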
The other problem we've had with the data is that we assumed there are identifiers that are unique to each entry. Turns out they aren't. In fact, we found some evidence that the Copyright Office reused registration numbers. This is a set of two registration records that include the same registration number. In fact, they appear in the same volume, one page apart. They're two different books. So that means you might not be able to search solely on the registration number and get your exact result. Instead, you've got to do other kinds of searches, and your confidence in the results starts to go down when the data's got these problems. But after some digging, we actually found that this wasn't a case of reusing registration numbers; it was an example of another problem that we've come across: typos. This data has typos in it. And so if we're trying to be faithful to the record, which one do we incorporate, the one with the typo, or the one that we know is fixed? We're actually gonna choose both: we're gonna include the typo and then, in the back, include the fixed version. If these aren't enough issues, don't worry, I've got a page with another set of issues you can look through. But despite those problems, we're getting some data back. We've already used some of the data coming out of this project to help us understand the completeness of the HathiTrust corpus. This is a comparison between the number of registration records for books and the number of books within Hathi's corpus, spread out over time. For example, in 1930, you'll see a difference: Hathi might lack about 1,000 titles that were registered with the Copyright Office but aren't actually in the corpus. That indicates to us that there may be some areas where we need to go digitize some books, because they're not represented in the corpus.
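A corpus-completeness comparison of this kind reduces to counting registrations and holdings per year and taking the difference. The titles and years below are invented placeholders, not real data from the project.

```python
from collections import Counter

# Hypothetical (title, year) pairs; invented for illustration only.
registrations = [("Title A", 1930), ("Title B", 1930), ("Title C", 1931)]
corpus = [("Title A", 1930), ("Title C", 1931)]

reg_by_year = Counter(year for _, year in registrations)
corpus_by_year = Counter(year for _, year in corpus)

# Per-year gap: registered titles the digitized corpus may lack.
gap = {year: reg_by_year[year] - corpus_by_year.get(year, 0)
       for year in sorted(reg_by_year)}
print(gap)  # {1930: 1, 1931: 0}
```

Run at scale over the parsed registration data, the same counting points at the years, and ultimately the titles, where digitization gaps may lie.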
It also helps us to understand the size of the public domain. This data is starting to reveal the number of book titles that were renewed each year. What we're finding is that only 25 to 40%, and I'm gonna be really broad and say 25 to 40 right now, were actually renewed, which means that only 25 to 40% of books published in the U.S. between 1923 and 1964 are protected by copyright. The other 60 to 75% are not. That means we expect to find as many as 315,000 books, published at a time when renewal was required, to be in the public domain. 315,000 books. Hathi's already found a number of these, but not all 315,000 are represented in the corpus, and therefore CRMS has not yet found them. So knowing these copyrights weren't renewed might help us prioritize which books we're gonna digitize next. This data is also already feeding our understanding of the size and scope of the long tail of book publishing, as we try to go after that long tail and bring it back to library users in ebook form.

So, I'm from the New York Public Library, not the U.S. Copyright Office. Why is this an NYPL project? Well, first, I'm sick of waiting. This is driving us insane. At NYPL, we have a team working on researching, analyzing, and entering rights information. We've already looked at over two million objects at this point that have been digitized by NYPL, with ambitions for more. And so, to recognize their contributions to this work, I wanna introduce you to the team. There's me, here's my photo. There's a person named Kaya Wahamans, who works for me. And that's it. It's just us. Since May of 2014, this team has added copyright status determinations to about 1.6 million captures for about 430,000 items. We've got about 11% left to review, and as you can tell by our burn-down chart, we've slowed down a little bit because we're starting to get into the harder stuff.
That means my team feels like this sometimes. We continue to review new digitization at NYPL and chip away at our backlog, but what we spend a lot of time on is actually going into these records to determine whether a work was renewed, or even registered, when we're thinking about risk. Having a searchable database cuts our time per item dramatically. That makes us go even faster than Kermit. So we still have a mountain to climb. There are 450,000 pages of CCE records. That's a lot of DTDs to build. That's a lot of analysis to do. There's a lot of work to be done. Just to emphasize it: we have completed over 32,000 pages. That's the blue slice of the pie, with another 8,000 in progress right now; that's the very small green slice. We've skipped, for now, about 50,000 pages, either because we've found that the work was already completed, thank you, Stanford, or because it's duplicative of other records that exist within the corpus. That's the yellow slice; those are usually indexes, or things that are repetitive of data that's already been processed. But that leaves us about 80% left to do. And that's just the CCEs. We want to tackle all of the historical records, not just the CCEs; we want to go after the cards, the transfer records, the assignment records. That means there's an additional 45 to 50 million cards that I want to go after and include in this database, and another 20.2 million pages of transfers and assignments that have been microfilmed. So we've got some work to do. We've got a lot of work to do. But if we're successful, we're gonna have a really rich trove of data that's not only useful for libraries determining the copyright status of a work; it's gonna benefit others in significant ways. It could help answer questions about geographic trends in certain creative industries. Where was the heart of music publication from 1900 to 1950? How did it shift? Where did it go?
Economists want to study the production of works over time, and they've expressed interest in this dataset to understand how creative works have been produced in the U.S. So, how can you help? As I said, we're still going to funders to help us close that 80% gap. You can help us with user stories. If you know researchers on your campus who might be interested in this dataset, or if you are interested in this dataset, let us know. We wanna hear how this data might be useful to you in ways that we hadn't thought of yet. You can also check out the GitHub account and pull the data down right now. As soon as we get the data back, we upload it to GitHub and make it available to everyone for free, and free in this case means both in the gratis sense and in the free-beer sense. These records are our shared cultural heritage, and from the beginning NYPL has taken the position that we are going to make them available without charge and without restriction. We know that they will be scooped up by legal database providers and you will be charged an arm and a leg in your Lexis searches, sorry, but we remain committed to keeping the data as open and as available as we can, either through our GitHub account or, ultimately, through a search engine. So stay tuned to all of this work. We're all doing interesting work on this. Stay tuned; we'll have more updates in the future, and let us know if you wanna help out. So we've got some time for questions. Love to hear them.