I want to welcome Mark. Hi, Mark. Welcome. Hi. Mark is the director of the Wayback Machine at the Internet Archive. Super cool stuff. And his talk is going to be great today. It's called Turn All References Blue, about linking and backlinking. What do you call it? I've heard it called a few things, Mark. What do you call it? Linking, adding links to things. Yeah. Yeah, just adding links to things. OK. All right. So good. I like it. 28 people in here. I see Chris, Doug, Dave, among others. So I'm going to turn it over to Mark now. And I'll monitor the chat. Please ask your questions either in the chat or the Q&A. And if we have time, like I said, we only have half an hour. But if we have time, we're going to try, like heck, to get some questions answered. So all right. Mark, over to you.
Excellent. Thank you very much. Hi, everybody. I just wanted to note that I can see my presentation, but I can't see you. So it feels like I'm just talking to my computer here. My name is Mark Graham, as noted. I'm going to talk a little bit today about a project to help turn more things into links, basically. The idea is turn all references blue. And this is a project of the Internet Archive. The Internet Archive is a 25-year-old nonprofit. We're actually celebrating our 25th anniversary this year. Our mission is universal access to all knowledge. And one of the ways we do that is by being the best library that we can be. And so on a daily basis, we work toward that goal. Once again, the title here is Turn All References Blue, and just a little bit of visioning: in the Star Trek future, everything that any human has ever written or spoken will be available through a command, a gesture, a thought, or what have you. We're not there yet. But along the way, we think that we can add value to things that are digital by connecting them together. And so here we just have to make the requisite nod to Memex and Vannevar Bush and Project Xanadu.
I know other people and other presentations in this conference have spoken about the influences of Ted Nelson, Vannevar Bush, Doug Engelbart, and many others at length. So I'm just going to skip over that, except to say, yeah, they influenced me greatly and continue to inspire our work. I'm going to step back a little bit and say, what are we talking about when we're talking about adding links to things? There are really two different ways to add a link to something. The first one is you can edit the actual object itself, change the underlying primary document. And the second is you can add metadata to that object, which we often refer to as annotating. I think of all of this as annotating, but I'm just going to use the phrase adding links to things generically. And why do we do this? One of the challenges, and one of the reasons that we work with links that are added to things, is that the links in the web themselves are somewhat ephemeral. Links go bad when the underlying thing they link to is changed or deleted in some fashion. Here's an example of a URL that on the live web returns a 404, page not found. And then an archived version of it on the right is much more satisfying. There's another issue, which is content drift, meaning that the underlying thing that something links to can change over time. So we have a web of addresses, but not necessarily addresses that are tied to specific information that's fixed in time, because it can change. So what we've been doing about this is we've been working closely with Wikipedia. We have a piece of software called Internet Archive Bot that's now running on 78 of the 321 Wikipedia language editions. And we've been going in and looking for broken links and looking at how we can improve links. And we've been adding a lot of links. Here's an example, a Wikipedia article about Open Firmware. And you can see that many of the external links down at the bottom connect to archives from the Wayback Machine.
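The dead-link repair described here can be sketched in a few lines. This is a minimal illustration, not the actual Internet Archive Bot code. It assumes the public Wayback Machine playback pattern `https://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is a `YYYYMMDDhhmmss` prefix. The function names, the choice of which status codes count as dead, and the example URL are all hypothetical.

```python
# Hypothetical policy: treat "not found" and "gone" as dead links.
DEAD_STATUSES = {404, 410}

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine playback URL for a capture near `timestamp`.

    `timestamp` is a YYYYMMDDhhmmss prefix such as "20210617".
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def fix_reference(url: str, status_code: int, timestamp: str) -> str:
    """Return an archived URL when the live link is dead, else the original."""
    if status_code in DEAD_STATUSES:
        return wayback_url(url, timestamp)
    return url


if __name__ == "__main__":
    # A live check would fetch the URL; here the status code is passed in
    # so the sketch runs without network access.
    print(fix_reference("http://example.com/gone", 404, "20210617"))
    print(fix_reference("http://example.com/alive", 200, "20210617"))
```

Passing the status code in, rather than fetching inside the function, keeps the decision logic separate from network access. It does not address content drift, where the page still returns 200 but says something different.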
They may have been born this way: the editor might have used a Wayback Machine URL, or our software might have found a broken link and edited it to point to an archived version on the Wayback Machine. And to date, our software has edited more than 25 million URLs on Wikipedia sites, pointing about 23 million of them back to the Wayback Machine. This year alone, we added about 1.5 million links. And the proof is in the pudding here on the effect of this. This is a look at clicks from Wikipedia, English Wikipedia, to third-party sites. And you can see here that for this particular time period, web.archive.org, which is the Wayback Machine, was by far clicked through to more than any other external site. I just want to make a little note here to show a little bit about web archiving. The basis for everything I just said is that we've been archiving a lot of Wikipedia and a lot of the web for a long time. And you might not think about this, but for cnn.com on June 17, 2021, you start with www.cnn.com. That's the seed page. There are 174 outlinks on that page. Those are links to other pages. And just one of those pages had 338 embeds, for a total of about 30,000 embeds. So this is an example of how you can start with one URL and try to archive that one URL and all of the pages linked from that URL and all of the page resources, and you get a number like 31,000. So it's a lot of heavy lifting on the back end. We archive more than a billion URLs a day into the Wayback Machine. In addition to working with links to web pages, we've been working to add links to citations in Wikipedia articles where the citation is to a book. So here's an example of the Martin Luther King Jr. page, and down at the bottom you see the citation to a book, and we added a link directly to a digital version, a preview of a digital version of that page, at archive.org.
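The cnn.com figures describe a one-hop capture: the seed page, every page it links to, and every resource embedded in any of those pages. By the numbers quoted, that is 1 + 174 + roughly 30,000, or about 30,175 URLs, in line with the "number like 31,000" mentioned. The sketch below shows the counting logic on toy data. The function name and the example.com URLs are hypothetical, and it is not a description of how the Internet Archive's crawlers are built.

```python
def one_hop_capture_set(seed, outlinks, embeds):
    """Return every URL needed to archive `seed` plus one hop of outlinks.

    `outlinks` maps a page URL to the pages it links to.
    `embeds` maps a page URL to the resources it embeds (images, scripts).
    Shared resources are counted once because the result is a set.
    """
    pages = [seed, *outlinks.get(seed, [])]
    urls = set(pages)
    for page in pages:
        urls.update(embeds.get(page, []))
    return urls


if __name__ == "__main__":
    # Toy data, not real crawl output.
    outlinks = {
        "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    }
    embeds = {
        "https://example.com/": ["https://example.com/logo.png"],
        "https://example.com/a": ["https://example.com/logo.png", "https://example.com/a.js"],
    }
    capture = one_hop_capture_set("https://example.com/", outlinks, embeds)
    print(len(capture))  # 3 pages + 2 distinct embeds = 5
```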
We do that through a process of identifying the resources in Wikipedia articles, trying to understand the semantics of what's there, and then finding those documents either at archive.org or other sites, and then editing or annotating the document. To accelerate this process, the Internet Archive helped buy another organization called Better World Books, and we turned it into a nonprofit. This is a used bookstore. So it's a sister nonprofit to the Internet Archive, and we were able to pull books off of the conveyor belts there at Better World Books and put them on pallets. On the right-hand side of the screen here, you can see a pallet, and you can't quite see the label, but it says Internet Archive Wishlist. So I think we pulled something like a million books off the conveyor belt last year and sent them to be digitized. We also are working with a number of other organizations around the world to try to source non-English language books. And I can say that we've added more than a million links pointing to more than 250,000 books from archive.org across 50 Wikipedia language editions, and that work was done in the last year. Upcoming, we wanna add the functionality such that when you're looking at a Wikipedia article, you can do a rollover and get a preview of the book that's referenced in the Wikipedia article as a pop-up. And we also want to add further reading sections at scale to Wikipedia articles. Today, Wikipedia articles contain links to things that are cited, but they don't contain a lot of links to things that are not yet cited. And we think it'd be helpful to populate Wikipedia articles with links to additional resources that people might want to follow if they wanna go deeper and learn more about the topic. In all these cases, we define success as doing something at scale, which means numbers like hundreds of thousands or millions. I'm gonna shift a little bit away from Wikipedia and talk about books in general.
The bottom line is that while more than 100 million books have been published, very few of them have been digitized, let alone have links. For example, all the books that were born as paper books. I don't know, I just happen to have one here. This book was born as a paper book and it's never gonna be republished by the publisher as a digital book. We digitized it at the Internet Archive. There are no links in it. And then even born-digital books, like books that come out today on a Kindle, often don't have links either. Here's an example, a couple of new books. Active Measures by Thomas Rid: even though it's got 75 URLs to archive.org, none of the links in the Kindle version of the book are clickable. Silicon Values is a book that came out a few months ago. Once again, tons of URLs in the book, but none of them are clickable either. I could talk about the why of that. It has to do with the pipeline of the publishing industry and, frankly, a lot of influence that Amazon has. And so this is not so much about technology, it's about policy, but it points to a lack of appreciation and understanding for the value of links in general. And so we're working on trying to influence this process and getting more born-digital books with links in them available. Here's an example of a book by Tim Harford, The Data Detective, that just came out recently, and it has links in it, which is great, and the links are clickable. And in at least one case, it ended up going to a dead link, however. So that points to the importance of the persistence of links. Especially if a book is gonna put a URL in it, you want it to be alive for a long time. So I'm gonna encourage that we use archive URLs, Internet Archive Wayback Machine or other archive URLs, Perma.cc, so that they're persistent. A little bit of information about a project we did a couple of years ago with the Digital Public Library of America.
When the Mueller report came out, we noted that it had more than 2,000 footnotes in it, but that only something like seven of them were clickable. And so we saw a lot of opportunity to help make the Mueller report more accessible to people by adding links to it. So we did a lot of research and we added 747 links to the Mueller report. In this case, we republished it as a new PDF object with the Digital Public Library of America. So remember the example earlier: you can edit the document, or you can annotate the document, you can add metadata. So here's an example where we actually edited a new version of the EPUB. But in addition, we also produced a version of the Mueller report that we annotated. And so for this, we had to do a little bit of engineering work. We ended up using a custom version of PDF.js and also the Hypothesis client to produce this view that you're looking at here, where anyone can go to a URL on the net and see an annotated version of the Mueller report without using any kind of software. You don't have to have the Hypothesis client installed on your browser. And I could talk a little bit more about how we did that. We think this is a very interesting opportunity to annotate existing PDFs and add links to them. I should note that we work in a lot of media at the Internet Archive, television news, for example. And so here's an example where we annotated, or added metadata to, a video archive, linking to a book that was referenced. And this is where Donald Trump was talking about how someone wrote a book about what a great environmentalist he was. And so I thought to myself, really? That's interesting. So I got the book. I donated it to the Internet Archive. It was digitized. And then I was able to add a link to this video archive pointing directly to the book. If anyone wants to read the book and learn more about what a great environmentalist Donald Trump is, you can now do it. Here's an interesting one.
You might have noticed a few weeks ago, there were a lot of emails from Dr. Fauci that were released under a FOIA request by BuzzFeed, I think. And in the emails that were released was an email from someone saying, Dr. Fauci, you have to read this article. It's so important. And it included the URL you can see on the screen, a medium.com URL. And if you put that URL into the live web today, you come up with an error message. But if you put it into the Wayback Machine today, you get what you can see on the screen here, including that yellow bit at the top. The yellow bit at the top was some context that we added. The Internet Archive added this context programmatically to this playback URL, noting that the underlying live web page was deleted by Medium because it violated their content policy. And you can read more about that here. So here's an example where we've added some annotation to a web archive such that in the future, when people wanna try to understand what was going on with this particular article, they have more context for it overall. And in this case, it just worked. When I got the email that was released under the FOIA, I put the URL into my browser and I was able to see the context here that we had added a few months ago. We are working now to add links to books we've digitized at the Internet Archive. I noted earlier about helping to ensure that books that are born digital, like, say, for example, on Kindle or Apple or Google, have links in them. But what about all the books that we're digitizing? They never were digital. We're digitizing them for the first time. They have no links in them, obviously. So we're working on a project to identify opportunities to add links to newly digitized paper books. We're also working on a project to help ensure that software that's cited in academic papers is linked.
Obviously, academic papers have links in them to other academic papers, but often they use underlying software and methods, and they don't necessarily link to the underlying software or those methods, which is important for data reproducibility. So if you have access to the data, that's one thing, but if you don't know what version of the software or method was used to analyze that data, then there's a missing piece there. Also, we're working with Open Syllabus. They have analyzed 7.3 million syllabi, and from that they've been able to extract out, excuse me, they've identified 7.3 million syllabi, 7.3 million objects that are syllabi for college and university courses, that then contain metadata that can, in many cases, link out to books. So we wanna ensure that everything that the syllabi reference and link to is preserved, and increase the links in the syllabi to improve the access to those underlying resources, the reading lists, et cetera, that the professors might assign. You might know we recently at the Internet Archive launched scholar.archive.org, which contains about 28 million PDFs of completely open academic papers. Those are all accessible through the Wayback Machine. And so we're now starting a project to add more links to Wikipedia articles that point to the underlying academic papers. That is not so much of a greenfield opportunity as adding links to books was, because in many cases academic papers are born digital and links were already added to them at Wikipedia, but not in all cases, and especially not in older publications. So for example, we're working with some microfilm that we have and digitizing material that had not yet been digitized, and we see an opportunity to add more links to those resources. And just for fun, I wanna end here with a note about annotation and the web. And this is a point that Brewster Kahle, our founder, actually made at an I Annotate conference a couple or three years ago.
And what he said was, the web already has an annotation system. It's called Twitter. And I think quantitatively, if not qualitatively, he's right. You know, if you think about what tweets are about, in many cases tweets are about something that's accessible via the web. You're tweeting a comment about an article or a book or a paper or something like that. It's an annotation. But we don't necessarily think of it in that way, for a variety of reasons. One is because it goes in the direction of the annotation, if you will: the comment is the primary object, as opposed to the thing that is being commented about. So here's a little hack. This is an experimental new version of the Wayback Machine browser extension. And it's got a little button there that says tweets. So what one can do is, one can be on any page on the web and then click this tweets button. And then it will submit a query into Twitter and you'll be shown the web pages, excuse me, the tweets that are about that web page. So here I was on the I Annotate 2021 web page. I looked to see who had annotated it, if you will, who had tweeted about it, and their field to Jean had written that tweet about it. So I titled this talk as, like, a progress report and a call for how people can help. And what I hope is that I've inspired some of you to think maybe more broadly about annotation and what it can mean and what some of the opportunities are. I don't know how to quantify this, but my hypothesis is that the opportunity to add links to things that are already digitized is huge. And that we've only begun to scratch the surface, whether that be on Wikipedia or elsewhere. Wikipedia articles are probably the lowest hanging fruit because you can edit them, right? You can go in and you can change them. But almost everything else, you can't go in and edit the underlying document. You have to add metadata to the document in a more traditional annotation process.
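The tweets button described above amounts to turning the current page's URL into a Twitter search. Here is a minimal sketch of that idea, assuming a plain search for the page URL at `https://twitter.com/search`. The talk does not say what query the Wayback Machine extension actually sends, so the function name and the query shape are assumptions.

```python
from urllib.parse import urlencode


def tweets_search_url(page_url: str) -> str:
    """Build a Twitter search link for tweets that mention `page_url`.

    The page URL is percent-encoded so it survives as a single query value.
    """
    return "https://twitter.com/search?" + urlencode({"q": page_url})


if __name__ == "__main__":
    print(tweets_search_url("https://example.org/talk"))
    # https://twitter.com/search?q=https%3A%2F%2Fexample.org%2Ftalk
```

This inverts the usual direction he mentions: the web page becomes the primary object and the tweets are looked up as comments on it.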
So whether it be the web or books or academic papers, government documents, software citations in general, there's a lot of opportunity. And I personally, and people on my team, there's a whole team working on this Turn All References Blue project. That was my 20-minute timer, because I wanted to leave 10 minutes for some conversation. My entire team is available to help you. And there's lots of people and organizations that I want to thank here, and I will leave it at that. Thank you.
Okay, I'm coming back on to pipe up a little bit here and I just want to say, oh, great. Thank you. Really great. And man, the way you just went through those slides at an amazing pace, but still keeping us connected to what you were saying. I did see one question in the chat. Oh, and somebody's raising their hand. So I'm going to allow this person on if we have time to talk to them. What do you think, you want to have someone on stage, Mark? Yeah, how do I do that? I'll do it. Okay, great. By the way, I can see you now. I got rid of the presentation. So it feels much more comfortable. Yeah, I also unhid myself. So, okay, I'm going to shut up and let Jeremy ask a question.
Oh, I was typing this in there. I had no idea. Hey, Jeremy, you've done this. I'm up here in Half Moon Bay. Yo. Right on. Yeah. So I've had a book, a favorite book of my own, that I was just searching your archive for. And I'm looking for where I can upload the images. I had it professionally digitized. So go to archive.org and, top right-hand side, press the upload button. Goodness gracious. And if you need any help, just mark at archive.org. Mark at archive.org. Okay. If it was a snake, it would have bit me. So thank you. You're welcome. Oh, thank you, Jeremy. How do I do that?
I have a question for you, Mark, which is, do you ever get into good trouble with this? You know, you put something on the Wayback Machine and someone's like, no, I don't want anyone to ever see that again. Oh yeah.
And I slap you with a SLAPP lawsuit. I mean, how do you deal with that? Respectfully, I would say. But no, seriously, respectfully and responsibly. And we generally respond to requests from legitimate rights holders or just people who might be embarrassed. You know, I mean, it's the web, right? So yeah, I'm not gonna go into details, but yeah. Okay, that's good. That's good. There's a whole team that deals with that, patron services, and that's an important dimension of our work at the Internet Archive. I gave out my email address, but I also wanted to just say info at archive.org. We have a whole team of people that handle inbound email. We monitor Twitter. We're experimenting with real-time chat. Just a variety of ways that we're trying to make ourselves more accessible and responsive to our patrons, because, and I wanna keep coming back to this idea, the goal is to help us be the best library that we can be. And what does that mean? First of all, it means that we have stuff, right? But more than that, it means that people are able to discover this material and that they're able to access and use this material. And so this is our guiding value in almost everything that we do. And whether that be archiving television news or books, we digitize something like 3,000 books a day, or archiving much of the public web, and something just went away.
Okay, a question. How do you think about security and links? I mean, that's kind of a big question. That's like, how do you think about security on the internet? I'm reading that book, This Is How They Tell Me the World Ends, about zero-day exploits. For a variety of things, I would just say as a practical matter, we do have software that runs in the background that processes files as they're coming in, looking for malware. There are automatic processes. There are human processes. There's a variety of methods in place. But it's the web, so you gotta be careful out there.
Okay, I magically put another question on the stage. I see that. I'm curious if there's an interface for the public to help add annotations to digitized books. Hi, Chris. We're working on some stuff. Chris, I'm gonna get you in contact with the team that is working on that, the Open Library team. Open Library is kind of a catalog interface to a lot of the books that we have digitized, the more than four million books, but also metadata on many, many more millions of books. And so this is something there. I know there's some experimental projects going on around this and maybe you can help us with it. So yes, yes, yes. Thank you, Chris.
We have two more minutes. Did somebody wanna raise their hand and come up on stage? I mean, Mark is like an efficient machine here. So we do have two, one whole minute. So come on up if you want. Come on up if you want. But I'll just also say, at the Internet Archive, before COVID, we used to have this lunch every Friday and the team would go around. We'd have between 50 and 100 people there every Friday and people would share about what they were working on that week and it was really great. Well, in the time of COVID, we don't do that anymore. So what we do now is we have a Friday lunch all on Zoom and between 50 and 100 people come to it. And we've been practicing this thing for the last many, many months where we kind of do a deep dive on some dimension of the Internet Archive and the team will present about that. We've got a catalog of those available now, the video of those up and accessible, and I can put a link somewhere to it. So you can watch and you can do a deep dive into those incredible presentations about Open Library and annotation and things like that. So... So if you can drop a link in the chat, that would be great. You might also... Yeah, that's probably the place to put it. This has been really great, Mark. I want to thank you.
And my head is still just spinning with how efficient this whole thing was and how you took a half an hour and really filled it. So great. Thank you so much. Right on. Cheers. Thanks. Thanks everybody. All right. Go team.