The next session coming up, which I'm really looking forward to, is being given by Jefferson Bailey of the Internet Archive, an old friend of CNI's. He's going to talk about the work the Internet Archive is doing to support perpetual access to open scholarship. Without further ado, I'm going to hand it off to Jefferson to take it away. Welcome, Jefferson.

Great, thank you, Cliff, and thank you to the CNI team for the opportunity to talk. And of course, thanks to everyone for attending; hope to see you at the next in-person one soon. So as Cliff mentioned, I'm from the Internet Archive, and I'm going to talk about our Internet Archive Scholar project. Let's get the slides going here. It's our 25th anniversary this year at the Archive; of course it's also December, so it's our end-of-year fundraising campaign as a public-service nonprofit, so feel free to click and give if you'd like. Otherwise, as most people know, our mission is universal access to all knowledge. Here's the usual stats-of-all-the-stuff-we-have slide, which people usually ask about. There are lots and lots of millions you can dig through there, but I'll be talking today about our work on scholarly preservation, which is a somewhat more recent endeavor than some of our other areas of archiving. The stats around that are at the bottom, and I'll talk more on later slides about what those numbers mean.

So Internet Archive Scholar is our project name. What were the goals we came up with? In 2017 or 2018 we basically decided that we needed to do a better job of archiving and providing perpetual access to scholarly materials. We already had plenty in the Archive at that time through digitization, web harvesting, user uploads, things like that. But we did not have intentional collecting, partnerships, product development, programs in all those sorts of areas, and so that was something we decided to pursue; this is me updating on that a couple of years later. One goal we consistently pursue is open infrastructure for public good and access to knowledge. We own and operate all our own data centers, so we're not using any commercial cloud services, and we are a nonprofit, as I mentioned. I think that's an important aspect. But for this specific project, we surveyed the landscape and wanted to see what we were already good at doing and how that could be applied to preserving these materials. So instead of specialized archiving services, we wanted to take open-scholarship automation and apply it to some of our existing technologies, especially around collecting the web. And instead of specialized curation methods, we wanted to build tools that identified scholarly objects where they lived, identified the metadata associated with them, which is of course often found in DOI registries and similar sources, and bring those two together for better discoverability, not just better preservation and access.

What were some of the challenges we felt we could help mitigate? The print-to-digital transition in scholarly publications has created custodial challenges: many publications and journals never really think about preservation, many of them are volunteer driven, and they might have a short lifespan and not really think about long-term access.
The traditional curation methods we used to rely on for deciding what to preserve, or what to purchase in subscription deals, don't really hold up anymore, especially with the proliferation and ease of creating journals with systems like OJS, or web-only publications. Traditional expectations around authors depositing into repositories have gotten a lot more complex, especially when there's a lot more to a scholarly output than just the version-of-record article. There's increasingly a problem for long-tail publications, for publishing entities outside North America and Europe, and for smaller humanities and non-STEM publications. And of course things that are web-only or completely born-digital are even more at risk for long-term access.

As for the access challenges: we know we have about a billion PDFs in the Wayback Machine, and we think about 7% of those are scholarly materials. But how do you find them? You have to know the URL, so that's not a very good access method. The web collection is not well searchable at scale, apart from small curated collections that have ample metadata. And scholarly outputs are all over the web, of course; it's not just publishers, it's also a lot of things in institutional repositories, YouTube videos (like this talk will be, somewhere, at some point). So the diversity of platforms to which people can publish scholarly outputs has created challenges. There are no great quality-assurance or quality-control mechanisms for any of this, and prior to this project we didn't really have targeted harvesting methods.

So we came up with two approaches, a top-down and a bottom-up concept. The top-down is what we actually harvest and archive: there are persistent identifier systems, which is great open infrastructure, so we go across Crossref and DataCite DOIs, Zenodo, registries, other metadata sources, and manifests of outputs from other aggregators. We can take this whole ecosystem of not just the actual published items but the metadata around them, which often contains the URL and location of the actual output, and use those as signals of what to go out and archive. We can of course archive journal websites and those kinds of things too. So we have a larger top-down ecosystem that can be used as a signaling mechanism for what to go out and archive, and this is all on the public web, right.

The bottom-up is that we're already collecting a lot of things at scale, we're digitizing a lot of things at scale, and we're getting content deposits from institutions into the Archive. So we can assess those to determine whether they are scholarly materials, if that isn't already known. The bottom-up approach is basically: take what we already have, determine whether it's a scholarly output, and then add it into our search and discovery platforms. That's using machine learning and automated tools, basically.

And of course there's the more partnership-forward approach: partnerships build services. All the code and APIs around this are open source, and we're using our own infrastructure, which is of course owned by a nonprofit entity. We can also make all this content discoverable; because it's open access material, we can make it discoverable in other discovery services too, which is great.
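To make the top-down idea concrete, here is a minimal sketch of polling the public Crossref REST API for recently indexed DOI metadata and collecting the URLs as crawl seeds. This is an editor's illustration of the approach, not the Archive's actual pipeline code; the date filter and field selection are just one reasonable way to query the registry.

```python
"""Sketch of 'top-down' signal harvesting: pull recent DOI metadata
from a public registry (Crossref) and collect URLs as crawl seeds.
Illustrative only, not the Internet Archive's actual pipeline."""
import requests

CROSSREF_API = "https://api.crossref.org/works"

def doi_seed_urls(from_date: str, rows: int = 100) -> list[str]:
    """Return candidate landing-page/fulltext URLs for newly indexed DOIs."""
    params = {
        "filter": f"from-index-date:{from_date}",
        "rows": rows,
        "select": "DOI,URL,link",  # only the fields we need
    }
    resp = requests.get(CROSSREF_API, params=params, timeout=30)
    resp.raise_for_status()
    seeds = []
    for work in resp.json()["message"]["items"]:
        # The registry record usually carries the landing-page URL...
        if "URL" in work:
            seeds.append(work["URL"])
        # ...and sometimes direct links to the PDF/XML fulltext.
        for link in work.get("link", []):
            seeds.append(link["URL"])
    return seeds

if __name__ == "__main__":
    for url in doi_seed_urls("2021-12-01", rows=5):
        print(url)  # these would feed a crawler's seed list
```

The same pattern applies to the other registries mentioned: the metadata record carries a URL, and the URL becomes a crawl target.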
This includes the ability to deposit and redistribute this content into distributed digital preservation systems through partnerships. We already work closely with many national libraries around the world, so we can leverage those relationships both to provide them with content they may not be getting, and, if they have open content that they're interested in having other people provide access to or preservation of, we can do that as well. A lot of these indices and services will be familiar to many: Crossref, PubMed, arXiv, and other aggregators like Unpaywall and CORE. We have worked closely with the Semantic Scholar project, we used Microsoft Academic Graph before it shut down, and we work closely with folks like DOAJ and DOAB and places like that. So this is using either partnerships or their open APIs to get metadata, to archive it, to interpret it, to determine what can be archived. And as an archive, we of course archive all of this metadata we're getting as well. That's proven useful in some cases for researchers or other projects who don't want to abuse someone's API and are looking more for a database dump. So one side benefit of the project is that we can provide persistent access to the metadata stores too. Many of these services are also doing that themselves, but more copies in more places, all the better.

On the infrastructure goals around the project: as I mentioned, we own and operate everything. We have a lot of automation, not just in the web harvesting but in the content processing and indexing, but we also do a lot of manual quality assurance and interpretation of that material. So it's really trying to solve some of the human-resource challenges of working at the scale we're trying to collect at.

For the metadata, the catalog that supports Internet Archive Scholar, fatcat, is a wiki-style catalog that can be publicly edited. You create an account and so on, but basically journals and individual authors can go in and correct the metadata. We're consuming metadata from DOIs and whatnot, and that is often incomplete, so we're also providing a mechanism for metadata improvement, and we have had some successful examples of giving information back to the metadata provider. An example: we work pretty closely with ISSN, and we can take the journal home pages in the ISSN data and archive those pages. In some cases the website has moved, or it's redirecting to a new location, and we can give that information back to ISSN so they can update their records. So there's a little reciprocal metadata enhancement, which I think has been one success of the project.

What are the components? Just to talk a little about the technology: we have a lot of content crawling and processing infrastructure. We are consuming all these APIs daily and then launching web crawls around them, either for singular artifacts like individual PDF URLs, or seeding larger crawling initiatives for a whole journal website and things like that. And it's not just PDFs; we're of course dealing with HTML-only publications and XML as well. All of this content gets matched with the metadata that drove the archiving in the first place; in many cases the supporting bibliographic record is a DOI.
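As a concrete version of that ISSN example, here is a minimal sketch of checking a registry-listed journal home page for moves or redirects. The ISSN and URL are hypothetical, and the real workflow reports corrections back through the partnership rather than returning a toy dict like this.

```python
"""Sketch: check a registry-listed journal home page for moves/redirects.
Illustrative only; the ISSN, URL, and report shape are made up."""
import requests

def check_homepage(issn: str, listed_url: str) -> dict:
    """Fetch the listed home page and note where it actually resolves."""
    try:
        resp = requests.get(listed_url, allow_redirects=True, timeout=30)
    except requests.RequestException as exc:
        return {"issn": issn, "listed": listed_url, "error": str(exc)}
    report = {
        "issn": issn,
        "listed": listed_url,
        "resolved": resp.url,       # final URL after any redirects
        "status": resp.status_code,
    }
    # A changed host or path signals the registry record may need updating.
    report["moved"] = resp.url.rstrip("/") != listed_url.rstrip("/")
    return report

if __name__ == "__main__":
    # Hypothetical registry entry, for demonstration purposes.
    print(check_homepage("0000-0000", "http://example.org/journal"))
```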
We are of course also creating bibliographic records ourselves, extracting data from the PDF or the HTML to create the equivalent of a DOI record that we can then use. And all of this gets matched: the web crawls go into the larger Wayback Machine collection, and the bibliographic metadata goes into fatcat, the big catalog backend system that basically marries the archived content in the Archive with the bibliographic metadata. The search and discovery layer on top of that is IA Scholar, at scholar.archive.org. And here's a little diagram: we use a number of different harvesting technologies, browser-based, link-based, and manually driven; content goes into the Wayback Machine, gets analyzed as part of all this, and then goes into the catalog.

When dealing with PDFs, we have a couple of different machine learning tools that we use. One is to identify PDFs in any crawl. We of course have crawls directed intentionally towards scholarly material, and websites or URLs from DOIs or wherever else, but we also have generic web crawls that we're just conducting as part of our web archiving. So we have a tool that can look at any PDF and make an informed, machine-generated guess as to whether it is a scholarly output or research paper, and it is pretty good at that. That happens first. If it is, then it goes to GROBID, which is not something we developed, though we have contributed code to the project; it works on PDFs to extract bibliographic metadata and full text and put it into XML. That's useful. If the result doesn't match an existing identifier, we use a tool called fuzzycat, which we did build, which does fuzzy matching to try to say: does this match a DOI or some other PID, knowing the author and the title and maybe some part of the abstract? So there's a little fuzzy matching. And if there's still no match, we just create our own record, and that is what goes into the catalog.

What does that look like? We're archiving about 40,000 scholarly objects a day. Most of those are PDFs or papers, but of course there are also datasets and protocols and conference proceedings and things that are not the traditional journal article publication. So we're doing feeds not just from DOIs and arXiv and PubMed and things like that, but many others. To date we have around 170 million PDFs processed. As I mentioned, this is a lot less than is in the Wayback Machine going back historically, but that's since we started this intentional project, and it's about 240 terabytes overall. A lot of that is OA journal home pages, of course, plus the PID signals I already mentioned, as well as URL-list and manifest sharing with places like Unpaywall and Semantic Scholar. We're also hitting every OAI-PMH feed we can find to try to discover more content there too.

This is the somewhat complex catalog data model, which is sort of FRBR-ish, for those who are into that sort of thing. It gets very complex: there are many versions of a single article. There's the preprint, potentially multiple preprints. There's the author's copy. There's the pre-publication version, which someone might be able to put onto their faculty homepage on their university website. There's the version of record, and there's the retracted or updated version of record.
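Here's a rough sketch of that PDF pipeline in miniature. The GROBID REST call reflects the real service API (assuming a local GROBID instance on its default port 8070); the classifier and the fuzzy matcher below are crude stand-ins for the actual machine learning model and for fuzzycat.

```python
"""Sketch of the PDF pipeline described above: (1) guess whether a
crawled PDF is a research paper, (2) extract header metadata with
GROBID, (3) fuzzy-match against existing records. Only the GROBID call
reflects a real service; the rest are illustrative stand-ins."""
from difflib import SequenceMatcher

import requests

GROBID = "http://localhost:8070/api/processHeaderDocument"

def looks_scholarly(pdf_bytes: bytes) -> bool:
    # Placeholder for the ML classifier; the real model scores layout,
    # fonts, reference sections, and so on.
    return b"References" in pdf_bytes or b"Abstract" in pdf_bytes

def grobid_header(pdf_path: str) -> str:
    """Send the PDF to GROBID; returns TEI-XML with title, authors, etc."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID, files={"input": f}, timeout=120)
    resp.raise_for_status()
    return resp.text

def fuzzy_match(title: str, candidates: dict[str, str],
                threshold: float = 0.9) -> str | None:
    """Very rough stand-in for fuzzycat: match an extracted title to a DOI."""
    for doi, known_title in candidates.items():
        ratio = SequenceMatcher(None, title.lower(), known_title.lower()).ratio()
        if ratio >= threshold:
            return doi
    return None  # no match -> mint a new catalog record instead
```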
So that's just for the article; then of course we're also trying to find datasets, protocols, conference talks, whatever else might be out there that is associated with it. So it gets pretty complex, and then we try to make evaluative decisions about which version should be highlighted in search and discovery. Of course we're associating all the metadata from these external sources as well, so there's a very complex data model around that. A lot of this is web captures too, which adds extra complexity, because there's a landing page, which might have the abstract and some other information on it, as well as the PDF that is embedded there; and any web capture is actually multiple individual files: the CSS, the JavaScript, the HTML, the PDF. So it can get pretty complex pretty fast, but of course we're trying to abstract all that away. And we do have a public ability to look at coverage per journal, mostly for internal use, but it is queryable, so you can do coverage dashboards by journal or by other filters. (Sorry, I was accidentally muted there.) I mentioned the edit stream: people are actively editing and submitting edits, Wikipedia-style, to our metadata catalog, so that's been an important aspect of the work as well.

So, some final numbers before the last couple of slides about partnerships and products. I think we have about 180 million research outputs. These numbers get a little fuzzy, because sometimes it's hard to identify things with absolute certainty at this scale, especially with the two to three people working on this project. About 120 million are papers; a paper can of course be a conference proceeding, it might occasionally be gray literature, it includes a lot. But IA Scholar, the discovery layer, is only highly known, identified, and verified works that are open with full text. So we're at about 32 million open access scholarly articles with full text and a lot of metadata for research and discovery. We think a little less than half of those have no other known digital preservation solution currently; we're comparing against the Keepers Registry, which I'll talk about a little later, and other known digital preservation sources. So I'm not saying we're sure it's exactly 14 million, but it's definitely a pretty big number, and I think that's a good outcome of the project: we have provided preservation infrastructure at a relatively low cost for a relatively high number of OA papers and journals. We're also starting to include some of the digitized material within the Internet Archive in the IA Scholar project; that's ongoing. We had started basically all born-digital and from the web, but we're starting to add more open scholarly material from other sources.

If you're interested in the tech stack: it's a relatively large Elasticsearch index and a pretty big PostgreSQL database, but for the most part it's only using a couple of medium-sized VMs, so infrastructurally it's not a huge project, though it is a couple hundred terabytes of data, at least, in the archive. I won't do a live demo, because that would be dangerous, but it's at scholar.archive.org; please go check it out.
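For those curious what that FRBR-ish shape might look like, here's a toy model: one abstract work grouping multiple concrete releases, each backed by archived files. The field names are illustrative only, not fatcat's actual schema.

```python
"""Toy sketch of the FRBR-ish catalog shape described above. Field
names are illustrative, not fatcat's actual schema."""
from dataclasses import dataclass, field

@dataclass
class File:
    sha1: str            # content hash of the archived bytes
    mimetype: str        # e.g. application/pdf, text/html
    wayback_url: str     # where the capture lives in the Wayback Machine

@dataclass
class Release:
    title: str
    stage: str           # e.g. "preprint", "version-of-record", "retracted"
    doi: str | None = None
    files: list[File] = field(default_factory=list)

@dataclass
class Work:
    releases: list[Release] = field(default_factory=list)

    def preferred(self) -> Release | None:
        """Pick the release to highlight in search and discovery."""
        for stage in ("version-of-record", "preprint"):
            for rel in self.releases:
                if rel.stage == stage and rel.files:
                    return rel
        return self.releases[0] if self.releases else None
```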
Contact me or any of us with any feedback. We launched it in Q2 or so, so it's been about six months, and there's a fair amount of continued UI work and other features that we want to do, but I think it's pretty good and pretty responsive for the maturity of the project, which is early.

What else is in there? We have just released a relatively large citation graph product, and the intent of that was to find all the citations within the corpus and make sure they were linked to other content in the Internet Archive. So if there's a web URL in a paper's citation, we have extracted the citation and linked it to the Wayback Machine capture; I have some stats on this later. There are plenty of command-line utilities, open APIs, and bulk datasets that I'll talk about as well.

A little on what we're working on currently. We're trying to focus especially on non-PDF version-of-record articles and outputs; this means HTML-only publications, which are pretty complex to archive, and it's hard to know whether you're archiving them well. We're doing a lot with datasets, both through mirroring data repositories and through partnerships and development like that. We're trying to look at secondary scholarly outputs, like protocols and conference-talk videos and other things that are associated with the original article but live on other platforms: SlideShare, YouTube, things like that. We have some tools for people to submit their own papers for automated crawling. We have full-text search, expanding citation integrity, and bulk data sharing.

Of the partnerships, I wanted to highlight a couple; I'm short on time, so I'll go pretty quickly. We are working with the Directory of Open Access Journals (DOAJ), CLOCKSS, the Public Knowledge Project (which runs OJS), and the ISSN Keepers Registry on Project JASPER. The idea is that the smaller OA nonprofit journals in DOAJ basically submit their content for preservation, instead of us going out and crawling it, and they do that as part of submitting their article metadata to DOAJ. So it's trying to get preservation upstream, to where the journals themselves are already doing things, instead of preservation being an afterthought at the end of the process. I'm pretty excited about that, and there'll be more news on it next year.

We're also working with the Center for Open Science and their OSF platform on an integration so that things in OSF are automatically archived in the Internet Archive. We have started with research registrations and hope to expand that. So that's thinking about open science and open knowledge preservation not just for articles but also for data and other things, like research registrations, that might not traditionally be thought of as scholarly outputs.

I'll note that we did ask Google Scholar if we could call the project IA Scholar, so we did not steal the name. We have a good relationship with Google, and, though we haven't announced it yet, IA Scholar is indexed in Google Scholar and we're expanding that; we started with about 18 to 20 million items. Everybody uses Google Scholar, so it's great to have your stuff there. So I'm excited about that partnership. Also, we are joining the Keepers Registry, so we'll start publishing our preservation holdings in the Keepers format alongside the, I think, 12 to 14 other keepers.
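As an illustration of the citation-graph linking just described, here's a minimal sketch using the Wayback Machine's public availability API (a real endpoint); the citation-extraction step that produces the URL is assumed to have already happened.

```python
"""Sketch: link a web URL found in a paper's citation to its closest
Wayback Machine capture via the public availability API. Extracting
URLs from reference strings is assumed to be done elsewhere."""
import requests

AVAILABILITY_API = "http://archive.org/wayback/available"

def closest_capture(cited_url: str, when: str | None = None) -> str | None:
    """Return the Wayback URL of the capture closest to `when` (YYYYMMDD)."""
    params = {"url": cited_url}
    if when:
        params["timestamp"] = when  # prefer captures near the paper's date
    resp = requests.get(AVAILABILITY_API, params=params, timeout=30)
    resp.raise_for_status()
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None  # no capture yet: a candidate to queue for crawling

if __name__ == "__main__":
    print(closest_capture("example.com", when="20201215"))
```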
So I'm excited about that. And then we've been working pretty closely with the Semantic Scholar project, sharing what we've ingested, mostly, but also some seed lists and things like that. Also, the Internet Archive is working on interlibrary loan, exploring some ILL pilots, and our Internet Archive Scholar material is included in that, so I think that's a great mechanism to actually get this material into the ILL systems. That's an exciting one.

I mentioned the citation graph: it's about 1.3 billion citations. We matched 14 million things to the Wayback Machine, 20 million to digitized books in IA's Open Library program, and a whole lot of stuff matched to archived Wikipedia pages. We just released it last month, and I'm excited to see what scholars or data miners can do with a relatively large citation corpus. And we do have a number of sample datasets for people to use, if interested.

I mentioned products development: we have been working on this throughout the project with many of our institutional partners that might already use other Internet Archive services, but I'm excited to see how this can work in the OA, scholarly communications, open infrastructure, and open science communities. We're obviously already working with many journals, either through Project JASPER-style partnerships or directly, to have them either deposit their material or have us archive it. We can offer pretty easy, low-cost, and even free preservation services to journals. And if they don't have persistent identifiers, we can help them get persistent identifiers, which is important for making their information more discoverable in our system and in others.

For institutions, we've had interesting conversations along the lines of: we have a lot of scholarly material from your institution; do you want it? It is obviously challenging for many to convince their own researchers to deposit into their institutional repository, and hopefully we can help some universities get content they might not be getting through faculty deposit and the like. They can also archive their websites using our web archiving services, and we can use the same processes to analyze their own web archive collections and extract scholarly material from them. And for researchers, we're providing systems where they can deposit some of the harder-to-archive outputs of their work, be it protocols or conference talks or videos and demos or multimedia material, and of course providing datasets out of the project that they can use in their own analysis.

So that's it, 25 minutes. I did want to give a shout-out to the project engineers, Bryan Newbold and Martin Czygan, and also to David Rosenthal and Vicky Reich, who many of you know and who have helped in the project. We have gotten support for this from the Andrew W. Mellon Foundation and from IMLS, so thank you of course to them as well. Thanks. I see three chats, so let me try to open the chat box.

Yep, you've got a question in the chat, Jefferson. Okay, thank you, Michael: "Are you drawing content from other long-term archiving systems such as LOCKSS or Portico, in addition to CLOCKSS? Or are there too many legal complications? How active is the collaboration with the Keepers, or in the industry?"
And I'll just give a footnote on that question: I am very much involved in long-term archiving here in Germany, and the legal complications are clearly a major problem, so your list of collaborators is interesting, and I'm wondering whether you're hitting those legal issues as well.

I haven't hit them too much yet. We have joined Keepers officially; we haven't announced it yet or published our Keepers report yet, but that'll happen next month, I think. Among the keepers involved in Project JASPER, there is a discussion of how we can distribute the data deposited there to essentially anyone who wants to preserve it. We do have integrations with LOCKSS and CLOCKSS for content sharing directly, and that has mostly been from the Internet Archive to them so far. We would of course be happy to take any content that is license-legitimate or legally viable to be deposited with us, and we have done that in some small cases, though not as intentionally for scholarly material; certainly we've worked a lot with LOCKSS over the years on content sharing, but that wasn't always necessarily scholarly material. I know they are working on this too, and I think there's a lot of potential there for material that can be widely distributed for preservation at many institutions. I think all of these parties are interested in doing that. Thank you, good question.

I know it's very early days, Jefferson, but one of the things that I've been watching is the move to make the Internet Archive one of the keepers covered in the Keepers Registry, and your participation in the JASPER project. I think JASPER is a really important response to the obvious missing material in the Keepers Registry that's open, you know, the long tail of open access journal publishing, and I'm just wondering if you can give us any insight into how that's going so far.

Sure. Certainly one of our goals, if I didn't mention it, was explicitly to focus on the long tail, the at-risk, and the on-the-web; things on the web are at risk of disappearing quickly. And as I mentioned, the long tail really doesn't have the people resources or the financial resources to think much about digital preservation, so the long tail, the nonprofit, the non-commercial, as well as the non-Europe/North America, has been something we've been very explicitly focusing on. I also like the JASPER project because it gets preservation to where people live, instead of relying on the preserving entity doing outreach and communications, which is challenging, or expecting that people will come through the front door, which is also challenging. It is very early stage. We have integrations between all the partners; that happened within the last couple of months. It's still in pilot, but all the technology on the DOAJ side for people to deposit their open access material, for it to be distributed amongst the players for preservation, and for us to report that in our own Keepers reports back to the publishers, is done. I would say we're at maybe 10 or a dozen participating journals so far. So, very early stage, because a lot of those systems were built in the last couple of months, but I think pretty successful.
And I think at least our goal, and I think the goal of some of the others, is: how can we do a Project JASPER anywhere? Right now, today, you have to be a nonprofit OA journal in DOAJ. I forget the exact number of qualifying journals, but I think it's about 7,000, so it's a pretty significant set within DOAJ. But can we apply that model to other directories or registry services or other open infrastructure services that journals are already working with? I think so. So I think it's not just a great project but a good blueprint for making it easier for the long tail to make sure it gets preserved.

And a tremendously good fit with the Internet Archive's capabilities. So, anyway, thank you for that update. I think we are about at time. I really appreciate you coming; there's just a huge amount of information in these slides, and you're obviously doing an awful lot of wonderful work on behalf of our community, so thanks so much for coming to update us. Really appreciate it.

Thank you. Yeah, thanks for letting us talk, and if anyone is interested in partnering or wants more information, definitely reach out.

All right. And now we're going to take a short break till about 2:30 Eastern time, so about seven minutes or so. See you then.