Hello! Can folks hear me? I'm just hearing hold music. Okay, great. Can people hear me? A little bit of feedback would be great. Great, okay. Excellent.

Okay, well, listen, this is a little weird because I can't see anyone and I don't hear anyone, but I'm going to assume that there are people on the other end of this screen. My name is Mark Graham, and I'm really happy to be with you all today, virtually. I have enjoyed participating in person at other Wikimedia conferences, and I'm really sorry that I'm not with you in Singapore today. I will be there next year. This is the annual Wikimedia update from the Internet Archive. I'm going to go over some material that those of you who have been working with the Internet Archive for a while probably already know, but I guarantee you there will be some new material, and I will give an update on the court case.

First of all, the team. I just want to acknowledge that the core team, which we call Turn All References Blue, or TARB, consists of some faces I'm sure many of you are familiar with: Stephen Balbach, Maximilian Doerr, Jake Orlowitz, James Hare, and Dr. Sawood Alam.

We've got a general big idea. The big idea is that in the future everything is going to be connected, and we'll just take it for granted. When one references any material whatsoever, in any medium, all related and referenced material will be a gesture away, a click away. It'll all just be there, and we won't even think about it any longer. We'll take it for granted. But at this point we're not there yet, and so we have a lot of work to do. The Internet Archive, as a library, has decided to take on a piece of this task with a focus on Wikipedia in two major areas: one is dealing with links that have gone bad, and the other is links that can be made better.

First off, link rot: when good URLs go bad. This is basically the dreaded 404. On the left we see what one might get on the live web, "page not found," whereas if you go to the Wayback Machine for that same URL, there's a really nice archive. That is what this work is about. The other area of work is around content drift, where what's available at a given URL at a given point in time changes, such that at another point in time it may be a radically or subtly different thing. In either case, this is where web archiving comes in, where we effectively take snapshots of what's available on web pages at different points in time.

So the foundation of the work that we do with the Wikipedia sites starts with doing a really good job of archiving, first of all, the public web writ large. We archive more than a billion URLs a day. But in particular, we ensure that we archive URLs that are referenced in Wikipedia articles across all of the more than 320 Wikipedia language editions. We do that by listening to the EventStreams API; we also listen to the Enterprise API. We extract new URLs and then submit them to our infrastructure to be archived and made available via the Wayback Machine. And this is a recent view of URLs, broken down by day, that we have archived based on them becoming available to us through the EventStreams API. You'll see that the numbers are about a million a day there.
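[Editor's note: the talk describes this pipeline only in outline. Below is a minimal sketch, in Python, of listening to Wikimedia's public EventStreams feed. The endpoint is real; the filtering and the URL-extraction hand-off are illustrative assumptions, not the Internet Archive's actual implementation.]

```python
import json
import requests

# Wikimedia's public EventStreams endpoint (Server-Sent Events).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def watch_recent_changes():
    """Yield (domain, title) for article edits as they happen, wiki-wide."""
    with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames carry the JSON payload on "data: " lines.
            if not line or not line.startswith("data: "):
                continue
            change = json.loads(line[len("data: "):])
            if (change.get("type") == "edit"
                    and change["meta"]["domain"].endswith("wikipedia.org")):
                yield change["meta"]["domain"], change["title"]

# Hypothetical hand-off: fetch each revision, extract newly added external
# URLs, and queue them for archiving in the Wayback Machine.
for domain, title in watch_recent_changes():
    print(f"extract URLs from: https://{domain}/wiki/{title}")
```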
But we don't stop there. We also go to the URLs that are referenced on the web pages associated with the seed URLs, the base URLs that we get. Let's say, for example, that the URL we want to archive is a CNN article. In this case, we're looking at one particular CNN page. So you start with one page. That page has 333 embeds on it; those are page requisites, page elements: HTML, JavaScript, CSS, images, et cetera. So we have to get all of those 333 URLs as well. And then on that page, in this case, there were 174 other pages that were linked, and all of those pages have their own collections of embedded page objects. This is an illustration of how you can start with one URL, and if you do what's referred to as a one-hop crawl, archiving all of the embeds and all of the referenced pages and all of the embeds on those pages, the process turns into 31,259 URLs. It gives you an idea of how things scale pretty quickly. You can imagine that if you went one more hop out, these numbers would grow fairly dramatically. So in addition to archiving the seed URLs that we get from the EventStreams API, we also archive the outlinks from those URLs.
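[Editor's note: as a rough sketch of the one-hop idea described here, this is how one might enumerate a page's requisites and outlinked pages. The library choices (requests, BeautifulSoup), the example URL, and the plain GET to the public Save Page Now endpoint are assumptions for illustration, not the Archive's production crawler.]

```python
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def one_hop_urls(seed_url):
    """Collect the seed, its page requisites, and every page it links to."""
    html = requests.get(seed_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Page requisites: images, scripts, stylesheets, and so on.
    embeds = {urljoin(seed_url, tag.get("src") or tag.get("href"))
              for tag in soup.find_all(["img", "script", "link"])
              if tag.get("src") or tag.get("href")}

    # Outlinked pages; each of these has its own embeds in turn,
    # which is how one seed grows to tens of thousands of URLs.
    outlinks = {urljoin(seed_url, a["href"]) for a in soup.find_all("a", href=True)}

    return {seed_url} | embeds | outlinks

# Hypothetical submission step: Save Page Now accepts captures at
# https://web.archive.org/save/<url> (the real service rate-limits and
# authenticates; a bare GET is a simplification).
for url in one_hop_urls("https://www.cnn.com/example-article"):
    requests.get(f"https://web.archive.org/save/{url}", timeout=60)
```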
So let's say, for example, we had archived a page on, oh, I don't know, an anthropology report about some Mayan site, and that page referenced a lot of other pages. It would be a shame if we only archived the base page, and then, when a researcher wanted to go deeper and clicked through to get an archive of an outlink, they got a "page not found." That would be very sad. So we work really hard to be both broad and complete, as well as to go very deep, to get as much as we can.

Just as an illustration, a case in point: this is an archive of an interactive page from the 117th Congress of the United States dealing with January 6. The interesting thing is that this January 6 page by the US Congress was probably one of the most watched pages on the web, because many people knew that when the 118th Congress came in, they would eliminate the committee associated with this series of web pages, and they would get rid of the website. And they did exactly that. There are a lot of archives of much of the site, but this particular page, an interactive timeline of what happened on January 6, was to the best of our knowledge only archived by the Wayback Machine. It is not available even through the Library of Congress or other archives. It's a testament to the importance of being diligent and trying to get things when they're published, because you never know; you don't know what you've got until it's gone. One of our mottos is: if you see something, save something.

So what do we do? We start with making sure we're doing a good job of archiving what is published through the Wikimedia Foundation and the related websites. Then we run software called InternetArchiveBot. Today, InternetArchiveBot runs on 215 of the 334 language editions, and also on 58 additional wikis. This software runs through an individual Wikipedia site; in some cases it can do that in a day, in some cases it takes a couple of months. It looks for broken links and tests each of them three times over the course of multiple days, so that we reduce false positives. If we really think a page has gone dead, we try to find an archive of it in the Wayback Machine, and if we can find an archive, then we edit the page.
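[Editor's note: the decision logic just described can be sketched as follows. InternetArchiveBot itself is written in PHP; this Python sketch assumes simple HTTP checks and uses the Wayback Machine's public availability API, which is real, while the retry spacing and thresholds are illustrative guesses.]

```python
import time
import requests

AVAILABILITY_API = "https://archive.org/wayback/available"

def looks_dead(url, attempts=3, wait_seconds=60):
    """Treat a link as dead only if it fails every one of several checks."""
    failures = 0
    for _ in range(attempts):
        try:
            status = requests.head(url, allow_redirects=True, timeout=30).status_code
        except requests.RequestException:
            status = None
        if status is None or status >= 400:
            failures += 1
        time.sleep(wait_seconds)  # the real bot spaces its checks over days
    return failures == attempts

def wayback_snapshot(url):
    """Ask the Wayback Machine's availability API for the closest capture."""
    data = requests.get(AVAILABILITY_API, params={"url": url}, timeout=30).json()
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# If both conditions hold, the bot would edit the article so the citation
# points at the capture instead of the dead live-web URL:
# if looks_dead(url) and (snap := wayback_snapshot(url)):
#     edit_citation(url, snap)  # hypothetical editing step
```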
So the first ta-da of this presentation: I can announce that we have now fixed more than 18 million formerly broken links across 215 Wikipedia language editions, and furthermore, that 4 million of those were fixed in the last year alone. Now, this is not the number of total links that have been fixed, or the number of total Wayback Machine links on Wikipedia sites; those numbers would be much larger. These are the ones that our InternetArchiveBot software has gone in, actually found to be broken, and successfully fixed.

In addition to fixing broken links, as I noted, we also work to make links better and more useful. One way we do that is to identify opportunities to add links to things that are referenced, specifically books and papers where there is no link to the referenced resource yet, where there's not a link to Google Books or another source for that particular resource. We add links to those resources. So we work to turn book citations blue. One of the ways we do that: similar to how we archive much of the web so that we can then replace broken links with those archives, a few years ago the Internet Archive bought Better World Books, one of the world's largest used bookstores, and donated it to one of our sister nonprofits. So technically, Better World Books is owned by a sister nonprofit of the Internet Archive. We take books right off of the conveyor belts that have been referenced in Wikipedia articles, and we prioritize the digitization of those books so that we can add links to them.

The second ta-da is that I can announce that there are now more than 1.8 million links, across about 60 Wikipedia language editions, pointing to books on archive.org. More than one million of those links were added by our software, and more than 800,000 were added by Wikipedia editors. In addition, our software has added more than 141,000 links to academic papers and articles across about 60 Wikipedia language editions.

It's a two-way street, too. I just want to note here the importance of linking in general. This is a book that actually came out today on Kindle; it comes out next week on paper. Written by a China scholar, it's called Sparks, and it's about China's underground historians and their battle for the future. I'm sure this book will be of great interest to many Wikipedians. There are 85 Wayback Machine links in this book. This is a case where an author recognized the importance of preservation and did not rely only on live web links for the book. In fact, the author just told me today that several of the links in this book have already gone bad, have already turned into 404s. So it was good of him to have used Wayback Machine links.

A couple of things are coming up. One, in the lab, is something we call Reference Explorer. It's an attempt to create a kind of Swiss army knife for references that appear on objects, where the objects could be a Wikipedia article, a PDF, or any given web page. This is in development, and I welcome the opportunity to collaborate with others who are working in this space. We recently cataloged about 15 citation-helper apps and have been evaluating them, and we're interested in collaborating, once again, with other people who have worked in this ecosystem.

What's coming up next? One thing is finding and fixing soft 404s. What we've done so far has focused on hard 404s; that's when you get an HTTP status code of 404. A soft 404 technically returns a status code of 200, so it's a good page from the web browser's perspective: there's nothing wrong with the page whatsoever. But from a human being's perspective, as you can see in this example, it's pretty obviously not a good page. In fact, it may even say "error 404" on the page, but it is still, once again, a status code 200. So being able to reliably identify soft 404s, and then go look for replacement good 200s for them, is something that we are developing software capabilities for as I speak.
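[Editor's note: the talk doesn't say how the detection works, so the heuristics below are purely illustrative assumptions: a sketch of the kinds of signals a soft-404 detector might use (error wording despite a 200, a near-empty body, a silent redirect to the front page), not the Archive's actual approach.]

```python
from urllib.parse import urlparse
import requests

# Illustrative signals only; a production detector would use many more.
ERROR_PHRASES = ("page not found", "error 404", "no longer available",
                 "does not exist")

def is_soft_404(url):
    """Flag pages that return HTTP 200 but look like an error page."""
    resp = requests.get(url, allow_redirects=True, timeout=30)
    if resp.status_code != 200:
        return False  # a hard error, already handled by the existing pipeline

    lowered = resp.text.lower()
    # A request for a deep path that silently lands on the site's front
    # page is a classic soft-404 pattern.
    redirected_home = (urlparse(url).path not in ("", "/")
                       and urlparse(resp.url).path in ("", "/"))

    return (any(phrase in lowered for phrase in ERROR_PHRASES)
            or len(resp.text) < 512
            or redirected_home)
```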
Okay, I've got eight minutes left here, so: the publisher lawsuit. I'm sure some of you have questions about this. I'm going to read from a blog post that we made available today. The blog post actually has an image of a Wikipedia article, the Martin Luther King one, right there. It's titled "What the Hachette v. Internet Archive Decision Means for Our Library." Our library is still strong, growing, and serving millions of patrons, but the publishers' attack on basic library practices continues.

Last Friday, so this is hot off the press, the Southern District of New York issued its final order in Hachette v. Internet Archive, thus bringing the lower court proceeding to a close. So this part is finished. We disagree with the court's decision and intend to appeal. In the meantime, however, we will abide by the court's injunction. The lawsuit only concerns our book lending program; that's fairly key. This was a lawsuit about book lending. The injunction clarifies that the publisher plaintiffs will notify us of their commercially available books, and the Internet Archive will expeditiously remove them from lending. Additionally, the judge signed an order in favor of the Internet Archive, agreeing with our request that the injunction cover only books available in electronic format, and not the publishers' full catalog of books in print. Separately, we have come to an agreement with the Association of American Publishers, the trade organization that coordinated the original lawsuit with the four publishers, that the AAP will not support further legal action against the Internet Archive for controlled digital lending, that is, the lending of books, provided we follow the same takedown procedures for any AAP member publisher.

So what is the impact of these final orders on our library? Broadly, this injunction will result in a significant loss of access to valuable knowledge for the public. It means that people who are not part of an elite institution, or who do not live near a well-funded public library, will lose access to books they cannot read otherwise. It is a sad day for the Internet Archive, our patrons, and for all libraries. Because this case was limited to our book lending program, the injunction does not significantly impact our other library services. The Internet Archive may still digitize books for preservation purposes, and we may still provide access to our digital collections in a number of ways, including through interlibrary loan and by making accessible formats available to people with qualified print disabilities. We may continue to display short portions of books, as is consistent with fair use, for example, Wikipedia references, and then we say, as shown in the image above. The injunction does not affect lending of out-of-print books, and of course the Internet Archive will still make millions of public domain texts available to the public without restriction. So thanks to your continued support, our library is going strong, growing, and serving millions of patrons. Libraries are going to have to fight to be able to buy, preserve, and lend digital books outside of the confines of temporary licensed access. We deeply appreciate your continued support, and we will continue this fight.

So that's the end of my remarks, and there are four minutes left here, and, oh, I can see people now. This is great. I can see the back of people's heads. Okay, yay, thank you. I would give many of you hugs, because I know many of you, if I could see you there. So thank you so much. It's really good to see you from Half Moon Bay, California. I don't know if I can take any questions; I'd be happy to if I can, but I also can't hear anything here.

Thank you, that was great. It was interesting to hear you talk about the IA bot fixing broken links and rotten links in an automated way. I understand that a very large number of links have also been fixed manually by Wikipedians. Is that number in the millions as well?

Yes, it is. And I can come up with some numbers about that; we're doing some research, and I don't have that number handy here, and there are a number of additional numbers that I will try to share. But yes, this has been an ongoing effort. Wikipedians have, since the beginning, used Wayback Machine URLs. We've also, in some cases, gone in and proactively added Wayback Machine URLs. We did that, for example, for the Ukrainian and the Russian Wikipedia sites when the war started, as part of many things that we did relative to the war, and as a way to help strengthen the integrity of those sites, because we recognized that probably a lot of the URLs would end up going bad. In fact, in the case of Ukraine, many of them have.

Thank you. Just one last question, and then we'll move to the next session. The Turn All References Blue project looks fantastic. Is there scope to expand it, or to work with you? I'm particularly interested in report literature: NGO reports and government reports and that sort of thing, which often don't have any links.

Absolutely. I'm just mark@archive.org; please do reach out to me. That's probably the most important thing I wanted to say. I personally, and the team, work to be responsive. When we expanded and scaled up to more than 150 language editions, we also scaled up our efforts to support the people who were posting bug reports and so on. So please, please do email me, and I will work to try to help meet whatever needs you might share.

Thank you for turning around; I can see you. There you go. Okay. Hi. Okay. Thank you. Thank you.