Archive, and I manage the Turn All References Blue project there. I'm just counting down for a couple more seconds here, and I think we will start right about now. So, really great to be here at what is turning into an annual update about the Turn All References Blue project at the Wikimedia conference. I've got a bunch of slides, and I'm going to run through them fairly quickly.

First of all, this is the team. The core team, at least; there are others involved. Some of you may recognize some of the longtime Wikipedians here: Maximilian Doerr, Jake Orlowitz, James Hare, Stephen Balbach, and, for the summer, Grace Chen, a Google Summer of Code intern.

The project we call TARB, Turn All References Blue. We recognize that in the future, pretty much everything that anyone has ever written or recorded is going to be a click away. We're working to accelerate that process by adding links to references, especially those on Wikipedia sites.

First of all, part of the problem: link rot. It exists, and it's really bad. The web is fragile and ephemeral; good links go bad. There's also content drift. There's no change control mechanism for the web. The content associated with a URL can and often does change without any kind of warning or notification system whatsoever. So if someone references content at a URL on one day, the content at that URL on the next day may be radically different. To address these issues, we work to archive much of the public web and then keep up to date with links in Wikipedia articles.

But first of all, how do you add links to things in the first place? There are really only two fundamental ways you can do that. One is you can edit the object. If you have access to the object itself, then you can go in and change it. Maybe it's a PDF file. Maybe it's a Word file. Maybe it's a website that you control. But that's actually the minority of cases. One usually doesn't have access to the things one wants to add a link to, unless those things are Wikipedia pages, and in those cases you can annotate them and you can edit them. You can edit Wikipedia pages quite nicely. If you can't edit a page, then all you can do is annotate it: you can add metadata to it. So Wikipedia provides a wonderful foundation for adding links to things and editing those links, because one can edit the Wikipedia articles themselves.

How do we go about doing this? Well, we identify the resources, say, for example, in a Wikipedia article. We parse the semantics. We go out and find related documents and resources, and then we actually make the edit. We've been doing this for many years now, and we've been doing it through a process where we get approval to run the Internet Archive bot software on individual Wikipedia sites. It's been very labor-intensive, and it's a relatively slow process. We wanted to change that. So we put in an application to allow for a global bot approval for bots, not just our bot, but any bot. Any bot could apply to get global approval to run. We did that a few months ago, and we're really happy that we got approval: approval to be able to apply for Internet Archive bot to run on all Wikipedia sites. So then we used this new approval process and applied for Internet Archive bot to be approved. And we're happy to say that on July 17th we got that approval. That means there are now 576 Wikipedia sites that we can run the Internet Archive bot software on to identify and fix broken links. We've run this software at more than 150 sites, and it is currently actively running on 120 of them worldwide. Over the coming months, and maybe a small number of years, we'll expand the number that we're running on, hopefully to all 576, if not more, Wikipedia sites. So that's a pretty big development since the last time we reported in. And how do we go about doing this?
Well, we basically listen to the Wikipedia event stream API for URLs that are added to or changed on Wikipedia sites. We listen to that continuously. Then we archive those URLs and also related links. We basically do a one-hop crawl, so all of the outlinks on all of the discovered pages, we archive them as well.

This is a look at the number of URLs (the count is the key number) that we archive from the event stream API by day for the last several days. It's about 7 million URLs a day. You might say, wait, that makes no sense whatsoever; there aren't 7 million URLs added or changed. That's because we archive a seed URL and then we archive the outlink URLs as well, and all of the embedded resources. Here's an example: the page www.cnn.com on June 17, 2021. If one were to archive that page, all of its 338 embeds, all of its 174 outlinks, and all of the 37,746 embedded URLs in all of those outlinks, one would archive 31,259 individual URLs. And that's what we go about doing.

So here's what it looks like when we add Wayback Machine links to Wikipedia articles. In most cases, these are links that have gone bad, or maybe were not formatted properly, or were at risk in some fashion. One can now go to a Wikipedia article and, instead of getting a 404, be brought to the Wayback Machine and see an archived version of the cited page.

Today I can report progress to date: there are more than 25 million archived web links, mostly from the Wayback Machine, but we don't discriminate against other archives; we use other archives where appropriate. We've added these links to more than 150 Wikipedia sites. And so far this year alone, we've added more than 3 million links. So, a lot of progress there on helping to make not just the web but Wikipedia more resilient, more reliable, and hence more useful to people. The real-world impact of this shows up in a report that Wikimedia Foundation Research did in April of 2019.
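
The listen-then-archive loop described in this part of the talk can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Internet Archive's actual pipeline: the event payload shape (`added_links`, `link`, `external`) and the in-memory `pages` mapping are hypothetical stand-ins for the real event stream and the real crawler.

```python
import json
from typing import Dict, Iterable, Iterator, List, Set


def external_links_from_events(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield external URLs found in server-sent-event lines.

    Assumption: payload lines look like 'data: {json}' and the JSON object
    carries an 'added_links' list of {'link': str, 'external': bool} items.
    """
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments, keep-alives, and other SSE fields
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore malformed payloads
        if not isinstance(event, dict):
            continue
        for entry in event.get("added_links", []):
            if entry.get("external") and entry.get("link"):
                yield entry["link"]


def one_hop_urls(seed: str, pages: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Return the seed, its embeds, its outlinks, and each outlink's embeds.

    `pages` is a hypothetical lookup from URL to {'embeds': [...], 'outlinks': [...]}.
    Outlinks of outlinks are not followed, which is what makes it one hop.
    """
    to_archive = {seed}
    page = pages.get(seed, {})
    to_archive.update(page.get("embeds", []))
    for outlink in page.get("outlinks", []):
        to_archive.add(outlink)
        to_archive.update(pages.get(outlink, {}).get("embeds", []))
    return to_archive
```

Because every seed URL fans out into its embeds, its outlinks, and the embeds of those outlinks, a single added citation can account for thousands of archived URLs, which is how a modest number of edits turns into millions of captures per day.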
There are a lot of links that people click on from Wikipedia articles to the Wayback Machine, more so than to any other external source that got clicked on during that time period.

Here's an example of another phase of our work, where we add links to cited books. This was a citation in a Wikipedia article about Martin Luther King Jr. You can see that if you click on reference three, citation three, you are taken right to a preview, basically a two-page preview, of a digital version of that book available from archive.org. How do we go about doing this? Well, a couple of years ago the Internet Archive basically bought a bookstore called Better World Books. It's quite a large used bookstore. We turned it into a nonprofit, so Better World Books is owned by a nonprofit organization that is a sister organization to the Internet Archive. And we maintain a wish list of books that are cited in Wikipedia articles and have not yet been digitized or linked. You can see the warehouse of books there on the left. And on the right, those are pallets of boxed-up books. They say Internet Archive wish list on them. Those are books that are packaged up and about to be sent to the Philippines to be digitized, and then we can link them from Wikipedia articles. So you can see that the investment the Internet Archive makes in this process of trying to help add links to citations in Wikipedia articles is quite substantial.

And this is another example where we have Wikipedia citations directly to books. This is one that I personally did in the early days of COVID. I went and read the Wikipedia article about the movie Contagion while I was actually watching the movie, going back and forth between the Wikipedia article and the movie, which is actually pretty accurate. And I noticed that reference 26 was to an academic book about a particular kind of virus, but it wasn't linked. You couldn't get it anywhere.
I tried to track the book down. I even wrote to the author of the article in the book, and he said he had lost the hard drive that he had produced it on and didn't even have access to it. So I bought a copy of this book and donated it to the Internet Archive. And now if you click on this link, reference 26 in the Wikipedia article about the film Contagion, you are brought right to a preview of the chapter on Hendra and Nipah virus from the cited book.

Additionally, we've done some experimentation to add additional links. Here's an example of populating a section called works cited by the book. This happens to be a book about the burning of books, which is really a history of the destruction of knowledge. And all of those works-cited-by-the-book links point to digitized versions of books at the Internet Archive. So we've been adding a lot of links to books: more than a million links to more than 250,000 books available from archive.org, across 50 Wikipedia language editions, in the last year.

We're working to accelerate the effort of sourcing non-English-language books. This is a quick overview of some of the organizations that we're working with around the world to help us source books in Greek, German, Arabic, Hebrew, and a variety of Indic languages. If anyone, and this is frankly a broad shout-out for this entire presentation, has ideas about how we can do the work that we're doing better, more efficiently, faster, more collaboratively, please, please, please do reach out to me.

These are some things that we're working on. This is kind of in the labs. You know how on a Wikipedia article today you can do a rollover and see a pop-up of something from another Wikipedia article? Well, we think one should be able to do the exact same thing for a cited book or journal article or other reference. We've got some tooling to do on our back end to support this.
We pretty much know how to do it, and we look forward to experimenting with rolling this out in the coming months.

Another thing that we think would be helpful is to populate the further reading sections of Wikipedia articles. Here's one that we did for my favorite Wikipedia article, about Easter Island. I look at my personal bookshelf and I have a lot of really high-quality books and other references about Easter Island that were not cited in the Easter Island article. So we think that, in addition to the Wikipedia Library, populating further reading sections of Wikipedia articles could help editors write better articles.

This is an upcoming project. There was a recent paper, cited on the left-hand side there, that looked at citations to software in journal articles. It turns out that it's obviously very important to be able to cite the methods and the software used in a particular science experiment or piece of research, to help others replicate it. You don't just need the data to replicate the work that was done; you also need access to the methods and the software. So we partnered with GitHub and archived about half a petabyte of citations of underlying resources from GitHub repositories. And we're working with a number of others to link these up.

Another project we're working on is with opensyllabus.org, to expand the corpus of syllabi worldwide, in multiple languages, that could be referenced. Open Syllabus has done a great job of mining this information to extract useful URLs. We just finished a pretty extensive crawl with this data and accumulated about 30 terabytes' worth of material that we're now data mining so we can extract more syllabi. This is something that we think will be very helpful to link up to Wikipedia articles.

A few months ago, the Internet Archive launched scholar.archive.org, Internet Archive Scholar.
There are more than 30 million open access journal articles available through Scholar. We understand that, because of, say, OAbot, there are some policy issues around adding links to academic papers, but we're actively working through them, and we hope to accelerate adding links to a whole range of resources: not just web pages, not just books, but also more journal articles and more magazine articles.

This is a brand-new project having to do with a microfilm collection that the Internet Archive has been digitizing. We've started to add links to articles cited in some of these older magazines and journals.

I just have to note this. It's kind of astonishing: one of the most important dimensions of Wikipedia articles is the citations, the ability to reference an external resource. And yet there is no comprehensive, up-to-date, maintained, and available database of all of these citations and the links and metadata associated with them. I think that's why Liam and others put together what I thought was a fabulous proposal for what was referred to as the shared citations database. They put it forth for funding by the Foundation. Unfortunately, it was not funded in this most recent round. So we're working with Liam and with many others from the community to begin to build out a Wikipedia citations database that will be available. Internally, we had to build a lot of this ourselves for the work that we've done with the Turn All References Blue project. And now we're asking ourselves what we can do to expand that to more Wikipedia sites, to include more citations and more metadata, and then also to make it available to anyone who wants to use it, to help Wikipedia become the best service that it can be. This is an ongoing effort. I would more than welcome anyone's ideas about how we can accelerate that work.
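
To make the idea of a citations database a little more concrete, here is a minimal sketch of pulling citation records out of wikitext and flagging the ones that have a URL but no archive link yet. It is not the schema or tooling the project uses. It assumes English Wikipedia style `{{cite ...}}` templates with `url` and `archive-url` parameters, and the regular expression is a deliberate simplification that ignores nested templates and pipes inside values.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Simplified matcher for {{cite web |...}}, {{cite book |...}}, and so on.
# Assumption: no nested templates and no '|' inside parameter values.
CITE_RE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


@dataclass
class Citation:
    kind: str                # e.g. "web", "book", "journal"
    params: Dict[str, str]   # template parameters, keys lower-cased

    @property
    def url(self) -> Optional[str]:
        return self.params.get("url") or None

    @property
    def archive_url(self) -> Optional[str]:
        return self.params.get("archive-url") or None


def parse_citations(wikitext: str) -> List[Citation]:
    """Extract one Citation per matched cite template."""
    citations = []
    for match in CITE_RE.finditer(wikitext):
        params: Dict[str, str] = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                # Split on the first '=' only, so URLs with query strings survive.
                name, value = part.split("=", 1)
                params[name.strip().lower()] = value.strip()
        citations.append(Citation(kind=match.group(1).lower(), params=params))
    return citations


def needs_archive_link(citation: Citation) -> bool:
    """True when a citation links out but has no archived copy recorded."""
    return bool(citation.url) and not citation.archive_url
```

A table of such records, one row per citation with its links and metadata, is roughly the kind of thing a shared citations database would hold; a citation with no URL at all (a book cited only by title, for instance) would need a different lookup, such as matching it against digitized books.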
A huge thank-you to the Wikipedia community and the Wikimedia Foundation; the Archive-It project; the Archive Team, which is a group of volunteer archivists that the Internet Archive works with; various folks at archive.org; the Wayback Machine team, which does a lot of the heavy lifting on the back end to archive URLs; Better World Books; and especially a shout-out to Chris Freeland from the Internet Archive. Also all of the archivists who have come before us, and Alexis Rossi in particular, who initiated this project to help fix broken links about 10 years ago; all the book scanners and book authors who make it possible for us to even link to digital versions of the books; Brewster Kahle, of course; and the Internet Archive members, the more than 100,000 patrons who support us, especially as this year is the 25th anniversary of the Internet Archive. We're thrilled to have this relationship with the Wikimedia Foundation and the Wikipedia community. And then finally, everyone who loves links, lists, footnotes, and facts.

I'll close it at that. I have just a few more minutes to take some questions, but I also want to note that I put a link to this presentation in the Etherpad. I'm going off of full screen. I'm going to StreamYard, and I think that just worked. I can't hear anybody, so I don't know what's working. It just says two minutes left. Is there anything in front of your computer speaker? No. Please remember to speak slowly. I try. Okay. And I don't even know if there's anyone else here in this room. Check your Etherpad for questions. I will do that. Here we go, I'm loading it up. Okay. Questions. Oh my gosh, lots of them.

Hey, so there is a capability for anyone to flag things on archive.org. I would encourage people, if they see something that they think would be considered hate speech, to please flag it. On the right-hand side is a flag, and there are human beings who review those flags every single day.
We take action to curate that material. So thank you very much for that question. What do the dots represent? They represent places where Internet Archive bot has been running. Do you consider... Yeah, absolutely: Wiktionary, Wikibooks. I don't know about OpenStreetMap per se. And then Daniel: you know, we don't work today to add things to Wikidata per se, but that is something that, for James Hare on our team, is obviously near and dear to his heart. What is the idea... Well, you know, on copyright, I would say controlled digital lending is not a circumvention of copyright. I'll leave it at that. It's a fair use argument. As many of you know, this will be tried in the courts, but we present controlled digital lending as a completely legitimate method and model for libraries to lend out, I'm not sure about this, one copy of books that are owned by the library, with control, with digital rights management, and in particular to allow a preview, basically a couple of pages of a cited book, under fair use. So there is that one. And I don't see any other questions. Okay. And I think it says 40 seconds. Listen, my email is mark@archive.org. Thank you very much. Bye.