Well, welcome back to our closing plenary session for the spring 2019 meeting. From what I've seen, it's really been a wonderful set of sessions, and I'm particularly grateful to some of our colleagues from the campus research computing group, who are holding a workshop here tonight and tomorrow, for joining us for some of our sessions and sharing some of what they're doing earlier today. I'd like to thank a few folks before we start our closing plenary, which I'll introduce in a moment. First off, I'd like you to join me in thanking all of our presenters. They've made it a wonderful, rich, enlightening, and exciting conference, so let's thank them all. I'd also like to express my thanks to the CNI staff. They always make this look easy and make it run really smoothly, and that is an art form, one that I am very grateful they are so skilled at, so thank you.

With that, let me get on to the main part of what we're here for this afternoon, and that is a talk by Michael Nelson. Many of you know Michael. He's been a longtime CNI attendee, and he's shared his work with us on several occasions. His work is often in collaboration with other folks you would know here, people like Herbert Van de Sompel, Martin Klein, and others, and also with his students, of course; as with all good professors at the research frontier, part of what he does is bring his students into active work on that frontier. If you've attended many of his talks here, you'll note that they have been very deep looks at specific areas around, for example, how we can preserve and track the web. When you aggregate his body of work as he's shared it here and in publications, he's really developed, I think, an exquisite appreciation and understanding of just how fragile our history, our archive of the web, is, why it's so fragile, and why it's getting more fragile. I think he's going to try to give you that sort of broad insight at a high level. In addition, Michael and I both share a growing concern that while what we have in place is really fragile, even in an environment of what I might characterize as benign neglect, it's now being placed into an environment that in some cases is actively hostile. That's something I think both of us worry about a lot. I believe he's also going to share some of his thinking about the vulnerabilities there and some of the steps we might take to address them. Mike is a professor of very long standing in the computer science department at Old Dominion University and has, as I said, amassed a wide-ranging and deep body of work in this area. I think we're in for something special today, in that he's really going to try to synthesize some of that for us rather than share a particular result at great depth. Getting that kind of perspective, I believe, is going to be very valuable for us. Please join me in welcoming Mike Nelson.

Thank you, Cliff. First thing: this is the title of my talk, but this is actually the title that I wanted to use. When I submitted it, I was told very politely that I could not use that title and needed a more appropriate subtitle. So I'm really happy to be here, and happy for this opportunity. I want to thank Cliff for inviting me. I also want to thank my collaborators, Michele Weigle, Herbert Van de Sompel, Martin Klein, and a whole host of students. But the reason for this title: how many of you are Blade Runner fans? Philip K. Dick, right?
So for those of you who aren't, I'm going to try to make you into a Blade Runner fan before it's all over. Philip K. Dick's book came out in 1968; it was called Do Androids Dream of Electric Sheep? And then Ridley Scott, in 1982, came out with a film, renamed Blade Runner. And in 1993, here's the obligatory library reference: the Library of Congress inducted it into the National Film Registry for aesthetic significance. So I'm trying to tell you this is serious cinema, right?

The part it begins with is that Blade Runner is set in 2019. The very opening credits say "Los Angeles, 2019." And then we pan in and we see this horrific post-apocalyptic scene: civilization has collapsed because of pollution, there are robots that live off-world; it's a dystopian post-apocalyptic future. Now, both Philip K. Dick and Ridley Scott predicted a lot of things, and because it's 2019, there are all kinds of links. By the way, I tweeted the link to these slides; there are going to be a gazillion links in the slides, so you can go online and find all the material there. There are a whole bunch of links about what Ridley Scott got right, what he did not get right, and so forth. So we're not going to talk about flying cars and so forth. What we are going to talk about, I'm going to tie back to the larger canon of Philip K. Dick's work. A lot of the themes he covers are things like identity, the relationship between self and other, memory, the essence of humanity, what it means for something to be authentic, the difference between reality and a simulation, and this concept of an unreliable narrator.

Now, the idea is that anything we say today, in the world that we live in, has to be said and expressed in a tweet. So here's Blade Runner summarized in 239 characters, with one character left over. We're in this post-apocalyptic future where replicants are off-world slaves and not allowed back on Earth. They're indistinguishable from humans, and to make them more manageable, they have fake memories implanted in them. If they return to Earth or otherwise deviate from their programming, they're to be identified and then retired, which means killed, of course. The movie begins with the Voight-Kampff test, which is this idea that we can tell a replicant from a human by asking them a series of questions, because otherwise they're indistinguishable. We ask them a series of questions, and based on their empathetic response, particularly empathy towards animals, we can tell whether or not they're real. And as the movie begins, this guy is, in Star Trek terms, a redshirt; he dies almost immediately, because the replicant he's interrogating eventually realizes he's failing the test and then kills him.

All right. So we've got robots indistinguishable from humans, we've got off-world slaves, we've got a perpetually dark and stormy Los Angeles. This is all the essence of good science fiction and good cyberpunk, right? But that's not our 2019, right? Well, the future is already here; the cyberpunk future is already here. On the left-hand side, you see a description of the Jeff Bezos situation, and when you read it like this, you're like, yeah, that's definitely at least a couple of cyberpunk novels that I've read. And then, from the equally fascinating and horrifying demos from Boston Dynamics, it is clear that the cyberpunk future is already here. Now at this point, you're wondering, when is he going to start talking about web archiving?
And Herbert warned me not to go too fanboy on Blade Runner, but it's really tempting. What I'm trying to establish here is that web archives are science fiction, right? Web archives are enabling a reality foreseen by Philip K. Dick and several other authors, where we can insert bespoke fakes into our collective memory, our cultural record. Now, web archives are like science fiction because there's a paradox: we need significant and continuous investment today in order to take a page and say "this page used to look like that." It takes continuous investment to be able to say that.

Now, in this audience it should be obvious that web archiving is not file backup; I have this slide just in case. The essence of backup is that you want to prevent, detect, and repair changes. In web archiving, we're continually changing the content to better simulate the past. So in essence, a web archive is a simulacrum of the past. It's not really the past. The essence of a web archive is to modify the holdings that it contains. Now, here's a page from CNI in 1997. The web and HTML were very simple then. But we can see that when the page is replayed through the Internet Archive's Wayback Machine, it rewrites the links and also inserts a banner to give you metadata about the past: it was captured on this date, it is a copy of this URL, there are these additional copies available. It's completely transforming the page to give us a simulation of the past.

Now, it's not just rewriting links that's involved. David Rosenthal, on the left, had a seminal paper in about 2005 where he addressed these concerns. It used to be we had this idea that someday we'll forget how to render JPEGs. That's never going to happen. But if it did happen, the web archive could silently transform the content into a format that's more amenable to our processing today. Now, the JPEG situation hasn't happened, but it has happened with Flash. On the right-hand side, you see an HTML page, a YouTube page, that was archived in 2010. And when I try to replay it today, my browser does not want to, because it's Flash, and Flash was a terrible idea. So we're continually modifying the web archiving playback software. Here are the GitHub release pages of OpenWayback and pywb, two of the most popular open source versions, and you can see the updates are happening all the time. And they're doing this in order to have a more authentic version of the past.

Now, this version of the past is no longer just for vanity, "this is what the page looked like in 1997." Just a year ago, there was a court case in which screenshots of Internet Archive pages were allowed as evidence. On one hand, that's a terrible idea, and we'll explain why. In this case, they had someone from the Internet Archive go and verify that what was in the screenshot matched the Internet Archive's business records. What we didn't establish in this court case is whether or not the Internet Archive's records matched what actually happened in the past. That's an open question.

So why is it hard to recreate the past? If the web were just JPEGs and PDFs and MP3s and static files, this would be really easy, right? But real HTML pages, the pages we enjoy interacting with, are not that simple. They're actually super complex and involve several hundred embedded resources. So here's a Twitter page from a couple of days ago, and obviously there are links that need to be rewritten and so forth.
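To make that link rewriting concrete, here's a minimal sketch of the kind of transformation a replay system performs. This is illustrative only, not the Wayback Machine's or pywb's actual implementation; real rewriters also handle relative URLs, CSS, JavaScript, and many edge cases.

```typescript
// Minimal sketch of archival link rewriting (illustrative only).
// Absolute URLs in href/src attributes are redirected back into the
// archive at a fixed capture time instead of out to the live web.
const ARCHIVE_PREFIX = "https://web.archive.org/web"; // assumed replay endpoint

function rewriteLinks(html: string, captureTime: string): string {
  // captureTime is a 14-digit Wayback-style timestamp, e.g. "19971210123456"
  return html.replace(
    /(href|src)=["'](https?:\/\/[^"']+)["']/gi,
    (_match, attr, url) => `${attr}="${ARCHIVE_PREFIX}/${captureTime}/${url}"`
  );
}

// A 1997-era link now points into the archive rather than the live web:
const snippet = `<a href="http://www.cni.org/about.html">About CNI</a>`;
console.log(rewriteLinks(snippet, "19971210123456"));
// <a href="https://web.archive.org/web/19971210123456/http://www.cni.org/about.html">About CNI</a>
```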
Beyond the links, there are embedded resources, audio, video, and so forth. And then the bar at the top that says "12 new results" is inserted by JavaScript, so this page is continually updating. It has advanced media in there. This is super difficult to capture and render as it was in the past. Now, the main problem here is JavaScript. JavaScript is why we can't have nice things, and if we can stop JavaScript in our lifetime, then we will have accomplished something.

So let's look at an example. Here is a page from the Fish and Wildlife Service, archived in 2013. We load this page and we get an eagle at the top. Now, we hit reload immediately; this is the archived page, not the live web. When we hit reload, we get a tiger. We hit reload again, and we get a mountain. Now, what's going on? We're hitting reload, but I'm getting a different page. In this case, it's pretty simple. We look inside, and I promise this will be the only code slide that we have. When we look inside the HTML, there's some JavaScript, and it's clear that it's randomly deciding which one of the images to show you. This is pretty simple: we just make sure that we reload the page enough times in our crawl that we get all three images, and then we can recreate all three versions that somebody would have seen in 2013. With most pages, it's actually a lot more difficult to figure out what's going on.

Now, this is an animated GIF. It is a CNN page, also archived in 2013. In the top right-hand corner, you can see when my student hit reload; he was waiting an hour in between. In the top left-hand corner, you can see that on each reload, the archive itself is not sure how many copies of that page it has. This is actually going to be a source of much confusion. If you have three copies, it probably knows, hey, you have three copies. Once you get above some large number, exactly how many copies you have is not entirely clear; it depends on which indexes can answer in time and so forth. The more disturbing factor is in the bottom left, where you see that we're getting different temperatures for the high temperature in Atlanta in the summer of 2013. Now, I'll buy the 90 and the 80 degrees, but the 39 degrees? I'm willing to bet it was not 39 degrees in Atlanta in the summer. So every time we reload this page, we get a different result. We can't replay the past here: the combination of the embedded resources and the JavaScript ensures that our simulation of what CNN looked like in 2013 is flawed. It will never be 2013 again. So in some sense this page is lost; we can't go back and get it. Now, you can say maybe the headlines are there, the important parts of the page, but we don't always know in advance what the important parts are going to be.

Now, this actually manifests itself in a lot of different ways. This is something we identified a long time ago, and we called them zombies: when you replay a page, the live web leaks into it and sometimes produces results that are jarring. This is a CNN page that was archived in 2008, but my student, Justin Brunelle, rendered the page in November 2012, right before the 2012 election. You see in the right-hand corner the Obama-Romney advertisement for the debate, while in the left-hand pane we see that Obama is running against McCain. This combination of resources could not have existed on the live web. This happens because the JavaScript runs dynamically and creates a URL that the web archive could not transform in advance, and so the page reaches out into the live web.
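To see why the rewriter can't catch this, here's a browser-side sketch of the general pattern; the names are hypothetical, not CNN's actual ad code.

```typescript
// Sketch of how a runtime-built URL defeats static rewriting. The
// crawler's rewriter only transforms URL literals it can see in the
// crawled HTML and JavaScript; a URL assembled at runtime never appears
// verbatim, so it is never rewritten. Hypothetical names throughout.
const adHost = "ads.example.com";   // hypothetical ad server
const adSlot = "front-page-banner"; // hypothetical ad slot id

// No "http://..." literal here for a rewriter to spot, and the
// timestamp makes the URL different on every single load:
const adUrl = ["http:", "", adHost, "serve", adSlot].join("/") +
              "?t=" + Date.now();

// At replay time this request escapes the archive entirely, pulling a
// live 2012 advertisement into a page archived in 2008.
fetch(adUrl)
  .then(resp => resp.text())
  .then(html => {
    const slot = document.getElementById("ad-slot");
    if (slot) slot.innerHTML = html;
  });
```

In 2008 the real code would have used XMLHttpRequest rather than fetch, but the effect is the same: the browser, not the archive, resolves the URL at replay time.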
Now, about two years ago, the Internet Archive fixed this and prevented zombies. Mostly fixed it; it's not complete. But this is something that can happen, where the live web and the archived web mix and produce a result that didn't happen.

More disturbing is when you don't reach out into the live web, when everything is legitimately archived. Here we have a Weather Underground page for Verena, Iowa. Anyone from Verena, Iowa here? I figured not, right? The text in the bottom left-hand corner says it's going to be a miserable day: 41 degrees, rainy and cloudy and so forth. That web page was archived in December of 2004. The radar image shows a clear sky, and it was archived in September of 2005. There's a nine or ten month difference between the HTML page and the JPEG of the radar image. Both are archived resources; we did not reach out to the live web. But when they're recombined, they produce a page that never existed.

Now, I'm claiming that you should be concerned about this, but you're probably thinking: the weather forecast for the middle of nowhere, Iowa, in 2004, who cares? This is hardly the essence of dystopian cyberpunk. Why do we care about that? Well, there are cases where these temporal violations begin to look like tampering, and they actually have more real-world impact.

Does anyone remember the Joy Reid case with her blog? It was about a year ago. Around 2005 she wrote some homophobic things on her blog. She apologized, and then she started saying, I never wrote those things. And then people on her team used something called robots.txt and basically blocked the Internet Archive from replaying those pages, because she didn't like what was coming out of them. And then she started saying, oh, it's been hacked, and so forth. What the people on her team didn't realize is that there's more than one web archive. So we went and got copies of the same posts from the Library of Congress web archive and were able to demonstrate that the screenshots being circulated actually matched the text that the Library of Congress had.

One of the common complaints at the time was: well, you're showing me this replayed page, but I don't believe it, because she was a prolific blogger, posting several times a day, she had a big audience, and there were a lot of comments on the blog. And I assume everyone in this room remembers blogs; if I were talking to a younger crowd, I'd say it was like Twitter, but a long time ago. So when you replayed the page, you saw that there were no comments, and people thought, well, that's weird; clearly you're lying to me, because it does not match what I remember from the past. What I found, and this is actually kind of complicated, is that there's a temporal violation. She used Blogger at the time, but the comments were served from a different service, Haloscan or so; I don't even think it exists anymore. The HTML page had some JavaScript in it, and it would load a separate JavaScript file that was an index of which posts had which comments. In one example that I worked through, the HTML page was archived January 1st, I'm just kidding, January 11th, and the JavaScript index page that was loaded with it was archived almost a month later, in February. And that was the index that corresponded to the HTML pages from February, not January. So when you recombined them, and the index failed to find a post's comments, it would just say "comments," as if there were no comments.
Many of the posts, most of the posts, then appeared to have no comments. There'd be a little bit of overlap, so some posts would show comments, but most of them didn't show that many, because these things were so far out of sync. Now, I can tell I'm already losing part of the audience here. So imagine me trying to explain this to someone who doesn't want to believe that Joy Reid wrote those posts, because it's very complicated: no, there's the HTML, there's the JavaScript, and there's this other JavaScript index. It sounds like you're telling a bad science fiction story, where you're introducing characters as needed: "but that was archived later; I know I didn't tell you about that before." So I understand why people don't think I'm telling the truth. But the reason this page does not match people's historical expectations is these temporal violations. And this has more real-world implications than the Verena, Iowa weather forecast.

But it's not just JavaScript. Now, admittedly, JavaScript is almost always the villain, but there are some cases where cookies by themselves are the villain. This is something we published, blogged about, not too long ago. I could do a whole talk about how web archives and Twitter work at cross-purposes. Here's a case where we have Obama's page, his official page, replayed from the archive not that long ago, in 2017, and I can't read it. His posts are in English, because he writes in English, but because of how Twitter sets up its support for multiple languages, the template information comes back in Urdu. And Urdu is a lot like Arabic; it's a right-to-left language. So the template information for when he joined, and how many followers and tweets and replies and so forth, that's all in Urdu. And clearly this supports the narrative that he's a secret Muslim, right? Because why else does his page look like that? What ends up happening when it's crawled is that the Internet Archive associates the Urdu-language page with the URL of the English-language page. The blog post explains how that happens, but if you're already inclined to believe that he's a secret Muslim, then this is a smoking gun, right?

Then we have another example, where the combination of cookies and JavaScript does all kinds of crazy stuff. Here we have an Indian journalist, and we replay his page from just a month or so ago. In the bottom right-hand side, in yellow, we see that the template information is in English. Then in the blue at the top right, the template information is in Portuguese. And then there's JavaScript running, and it waits a minute or two or whatever it is, and it says "20 more posts"; that's the red bar in there, and that is in Urdu. So we actually have three different languages combined into one page. It's clear there's no Urdu-Portuguese-English conspiracy happening here. But there's also a temporal violation, because if you click on those 20 more posts, they're not the next 20 posts from the archiving time; they're 20 more from some time in the future. So you can click and get these posts and those 20 posts, with a gap in the middle where you don't have any posts. So we're replaying this page, and you'd think, you're archiving a Twitter page, how hard can that be? It turns out it's really hard. The idea here is that web archives are unreliable narrators. And the problem is, if we have an unreliable narrator, we have to start to question everything we've been told.
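One mitigating note: violations like the Verena radar image and the Joy Reid comment index share a detectable signature, namely an embedded memento captured far from the root page's capture time. The Memento protocol (RFC 7089) exposes a Memento-Datetime header on archived responses, so a crude drift check is possible. Here's a minimal sketch, assuming the archive answers HEAD requests and emits that header; the URLs and the 30-day threshold are purely illustrative.

```typescript
// Sketch: flag temporal violations by comparing the Memento-Datetime
// header (RFC 7089) of an archived page against those of its embedded
// resources. Illustrative only; a real tool would also extract the
// resource list from the replayed page itself.
async function mementoDatetime(url: string): Promise<Date | null> {
  const resp = await fetch(url, { method: "HEAD" });
  const header = resp.headers.get("Memento-Datetime");
  return header ? new Date(header) : null;
}

async function reportDrift(pageUrl: string, resourceUrls: string[],
                           maxDriftDays = 30): Promise<void> {
  const pageTime = await mementoDatetime(pageUrl);
  if (!pageTime) { console.log("Not a memento:", pageUrl); return; }

  for (const res of resourceUrls) {
    const resTime = await mementoDatetime(res);
    if (!resTime) continue;
    const driftDays =
      Math.abs(resTime.getTime() - pageTime.getTime()) / 86_400_000;
    if (driftDays > maxDriftDays) {
      console.log(`Temporal violation: ${res} drifts ${driftDays.toFixed(0)} days`);
    }
  }
}

// Hypothetical example: a December 2004 page whose radar image was
// captured in September 2005 would be flagged with ~260 days of drift.
reportDrift(
  "https://web.archive.org/web/20041215000000/http://example.com/weather.html",
  ["https://web.archive.org/web/20050901000000im_/http://example.com/radar.jpg"]
).catch(console.error);
```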
Now here's another example, where the stakes get higher. Imagine we had a president who admitted obstruction of justice in a national TV interview. Purely hypothetical. And in August of 2018, he tweets that Lester Holt was caught fudging the tape. We know this didn't happen, right? And then a journalist replies, on the right-hand side, and says, look, the NBC website has had this video up since August of 2017. Now, if you're a conspiracy theorist, if Lester Holt fudged the tape, then the copy on the live web is going to be the fudged copy, right? So what we need to do is consult the web archives.

And now I'm going up against the dark angel of demos here; I'm going to try to pull something up so I can show it to you. So here is the Internet Archive's page, and the first time that NBC link shows up in the archive is not 2017, it's 2018. Now I'm going to click on this, and you need to pay attention, because it's going to happen; I can't really control it because, you know, JavaScript, right? So I'm going to click on this and pull up the page, and at first it's going to look legitimate. Come on, network, you can do it. So there's Lester Holt. I'm not touching the keyboard; it's going to click through by itself. And then, the damning evidence of how Lester Holt fudged the tape. All right, who else is on the network at this time? I'm not sure why it's looping, but the video is of a letter carrier walking in the snow and falling down. It just sits there and loops over and over again. After his bag slides down, it goes back and starts playing again. And this is the damning evidence that we were going to catch. Here's the original source of the interview; this is what the archive has. Now imagine trying to convince a skeptic that Lester Holt did not fudge the tape. We go to the archive, and this is what we get. Something's gone horribly, horribly wrong here.

All right. The essence here is that this came about through errors in crawling, probably compounded with errors in playback. But at some point, the errors in crawling and the errors in playback become indistinguishable from someone tampering with the record. Kate Starbird, and Cliff Lynch, and others have basically said this: the goal of disinformation is not necessarily to make you believe a specific thing, but to cast doubt on the entire process. So if we consult the record to find out what was actually said, and we get this humorous video of a letter carrier falling down, at some point, with the Obama playback in Urdu and the Verena, Iowa weather, are you still trusting web archives? Can we really trust this? Disinformation applied to web archives doesn't mean that we have to believe a specific narrative; we just need doubt about the entire archiving process.

Now, in fairness, I have to say we're unaware of any specific hack against the Internet Archive carried out to accomplish some larger goal. What I'm saying is that web archives are not immune to this kind of hack; it's just that the theater of conflict has not yet moved to web archives. So here are some examples. Here's one of the first tweets from Jack, "just setting up my twttr," from 2006. And in 2006, Twitter's just this weird gateway between SMS and the web. About 10 years later, we have the IRA influencing the election through Twitter accounts.
Here's Mark Zuckerberg's Facebook page, archived in 2009, where we learn interesting things, like he likes Katy Perry and Lady Gaga and other super important things. Less than 10 years later, we find out about Cambridge Analytica. Gmail, when it came out, and this is one of the first blog posts about it, was awesome; all the other webmail services sucked. Gmail gave you a gigabyte of storage when that seemed impossibly large. And then it's used in a spearphishing attack to release the DNC emails, right? And here's one of the first articles I could find, from 2002, about the Internet Archive; there are probably others, but this is the one I found. So what's going to go on the right-hand side? What's the big exploit that we're going to have here? I've given you little teasers, but of things that don't really matter. What's the big social impact going to be? Why do we expect things are going to be different for web archives?

One of the things I want to convince you of is that our trust model for web archives is still firmly rooted in a 1980s, early 90s mindset. I started working at NASA, first as a student in the 80s and then, in 1991, as a full-time person. In the late 80s and very early 90s, we had supercomputers that were very expensive, and I used an X terminal to connect to the machine. I was one of literally hundreds of people on the machine at any time. So there was one machine and many users. Then about 1992, they decided I was worth some investment, and so they bought me a SPARC IPX. Anyone remember their first Sun workstation? I was just talking with Simeon about this. It was thousands of dollars, probably $10,000 or more, but it was mine; it sat on my desk. I set up the first NASA digital library, the first NASA homepage, on that machine. There was a one-to-one relationship: that was my machine, and I was able to do a lot of things with it. Now, I got one of my students to take this picture: today I don't even know how many computers I have access to. I get email reminders from the inventory police telling me they have to go through, and I look at these and I'm like, I don't even know where these computers are. There are so many computers that we have.

The idea here is, and this is an email I got from a guy named Brewster Kahle when he was setting up a startup company called WAIS Incorporated; anyone remember that? It's this weird email of "don't talk to the press about this company that we're setting up" and so forth. On the left-hand side, he's just blasting out this email to everyone who had used WAIS in the past, and you see my name on it. But in the bottom left-hand corner, highlighted in red, you see root at this machine, root at that machine, root at this other machine. Back when machines were expensive, if you sent an email to root at some machine, you had a pretty good idea that the person at the other end was a responsible person. They would not have been given control of thousands, if not millions, of dollars' worth of computer equipment if they were not a white hat. Clearly this seems quaint now, right? If you get an email from root at some machine today, you're not going to believe it. But in the 80s and 90s, that meant something. And what I'm saying is that web archives are like the Unix mainframes of the 80s and 90s: we implicitly trust root. How well do you know root at archive.org, right? And I don't mean, did you meet Brewster at a conference one time.
I mean, do you know him well enough that if you emailed or called him, he would reply? For some people in this room, that might be true. But in general, it's not true, right? Our entire national digital preservation strategy is dependent on Brewster Kahle not being evil. If he's running a sleeper cell, we're all doomed, because he's got all of our stuff. Now I'm going to ask: how well do you know root at all these other archives? And there's at least one of them up here that you absolutely should not trust. There are a bunch more archives that I didn't bother to find icons for. So there are a gazillion archives. We can no longer implicitly trust root at archives.

All right. Up until this point, we've looked at failures of crawling and failures of replay, and the weird results they produce. Let's talk about deliberate fakes. Now, we've had deliberate fakes for some time. We've had Victorian photo collage, and then copy-and-paste to make funny music videos and so forth. The one on the right is Brian Williams rapping Gin and Juice, the Snoop Dogg song from 1993. It's hilarious; you should listen to it. Now, clearly this is done for comedic effect, and because we have the cultural context, we know it's not real. But there are other fakes that require knowledge and skills and access. Even if I could create some of these things, I don't have access to a monastery in Europe to go and place them for discovery there. So some of these fakes exist in the real world, but the threshold to create them is much higher.

We have to worry about deepfakes now, deepfakes being a combination of "deep learning" and "fakes." Here we have a couple of clips from Reddit: two people asking for the creation of deepfakes and one person advertising the capability to create them. Now, we're not going to look at those examples, but we have a safe-for-work example you've probably seen. And the mail carrier is still falling down over here. This was truly surprising for me; I was just really surprised. [The Jennifer Lawrence/Steve Buscemi deepfake clip plays: "So you're a huge Bravo fan." "Oh..." "Who was your favorite and least favorite housewife of all the cities?" "My favorite is probably Lisa Vanderpump."] All right, you get the idea, right? That's pretty convincing, right? We can tell it's fake, but, you know, given the right circumstances, we'd have to be concerned about this.

While I've got this loaded, I'm going to show you another fake. [The Jordan Peele/Obama deepfake plays: "...President Trump is a total and complete dipshit. Now, you see, I would never say these things, at least not in a public address, but someone else would. Someone like Jordan Peele. This is a dangerous time."] So you get the idea of that, right? We're going to come back to the Jordan Peele thing, but I figured I'd save on the switching. So that's the level of deepfake in the Steve Buscemi/Jennifer Lawrence combination: we just put his face on and maintained her voice. And in the Jordan Peele Obama version, we essentially had Jordan talking and mouthing through Obama. So they're related approaches. And it has become mainstream, in the sense that there's now a website; you no longer have to go to the deep dark reaches of Reddit to get this material. You can go to a website and create it. It's not that difficult. Now, I'm going to put aside for a moment the illegal and creepy and all the other aspects of the deepfakes that you see on Reddit and so forth.
As far as detecting deepfakes: this will happen. There will be an arms race, where we get good at detecting them, and they get good at covering them up, and so forth. Preventing them is never going to happen. And the idea here is that mementos of a past, even a fake past, are core to the human condition. We're going to be so attached to these that we're going to continue to find ways to create them. This is actually a significant plot point in Blade Runner, where, on the left-hand side, Leon, the guy who killed the interrogator in the first scene, is eventually caught and killed because he returns to recover his photos. He knows they're fake, but he's so attached to them that he risks his own life to get them.

So now imagine your next Thanksgiving dinner, with the Obama and Peele video. You just take that three or four second clip, edit it out, embed it in an HTML page, and use JavaScript to rewrite the URL that shows up in your banner. You also set the datetime to November 2016. You claim that the deep state has removed the page from the live web, and that that's evidence of how deep the conspiracy goes. And then you have endless conversation with your uncle. This is not just hypothetical. Here's an attack we've demonstrated: a page inserted into the Internet Archive that shows Brian Williams rapping Gin and Juice. If we zoom in, we see that it's at a fake URL that did not exist then, and that it has a datetime of 1992, one year before Snoop Dogg's version came out. So we've "proven" that Brian Williams is the original gangster and Snoop Dogg was copying him. Now, this is obviously fake, and it's fake on inspection. But the point is that we can do this; my student did this. Now imagine it with the Jordan Peele video, just the slice where we don't see Jordan Peele, we just see Obama mouthing it.

We've known about these attacks, and other attacks, from a series of seminal papers, for about two years. Fixing this, preventing the archives from being an attack vector, is going to be a great deal of work. To my knowledge it hasn't happened yet, but only because people haven't figured out that it's important yet. I don't think that's going to continue.

Of course, there are other ways to attack a web archive, or the Internet Archive. They host a Friday lunch; it's really awesome, you can just show up and have lunch with everybody. But it's an insecure facility. Now, they have geographic distribution and so forth, but the thing is, you can go have lunch with all the archivists there, and then go and abduct them and hit them with a lead pipe and have them insert anything into the archive that you would like. Now, you're probably thinking that would never happen. I think there are some journalists and dissidents and so forth who would like to say it could happen. If the Saudis or the Russians or any number of other governments decided they wanted some information to appear or disappear in the web archive, well, they've already killed a lot of people; don't think for a minute that they would hesitate to add a librarian to their list. We're not special. The idea here is that we can't have it both ways. We can't claim we're doing important, world-changing stuff and then pretend that we are somehow immune to the associated dangers.

Now, maybe they don't attack the Internet Archive. Maybe they set up their own web archive. Setting up your own web archive is not that expensive. Now, the long-run costs are still expensive.
You're going to commit to supporting a web archive forever, and forever is a really long time, so there are a lot of costs associated with it. But there are a lot of open source software packages that are quite good, and with AWS, which is not the cheapest way but is the fastest way to get set up with a lot of storage, you could have a convincing web archive set up in about the same amount of time it takes somebody to render the Jennifer Lawrence/Steve Buscemi fake. Then we can start talking about placing fakes into fake archives. I don't know root at any of these archives, and I wouldn't trust them even if I did. But if these archives existed, what happens if most of the time they faithfully replay the pages, and they change just the pages they want to change? What if they're not even four different archives, but really one person controlling all of them to give the illusion of independent observations, essentially mounting a sock puppet attack? Going back to Rosenthal and LOCKSS: they've done a lot of nice work on threat models for digital preservation, but I think even in their work there's an assumption that all the nodes in the network begin as trustworthy components. They might be corrupted later, but what if we begin in an environment where everyone is not trustworthy?

I want to introduce another possibility. What if archives, including trusted archives like the Internet Archive, were targeted to amplify a specific narrative? Prior to this, we talked about disinformation: can't trust anything, mailman falling down, that kind of thing. But what if some agents come with a court order and decide you're going to put this in your web archive? How much is Brewster going to push back on that?

So I don't know if you saw this, but DHS set up a fake university to attract students who want to stay in the US but don't actually want to attend class. This came out just recently, a couple of months ago, I think. And they did a reasonably good job: there's a universityoffarmington.edu, and it has course schedules and all kinds of stuff. It even has a sort of confusing narrative about how the university began, which fudges the dates. But we look it up in the Internet Archive, and we see that the first time it was archived was October of 2016, which is a little disappointing, because that means the operation began under the previous administration. But the archives don't lie about that. Now, we can claim that if the would-be students had checked in the archive, they'd have seen: hey, here's a university that just came online two years ago; that's hardly believable. But what if DHS had decided the Internet Archive was going to host some pages backdated to the 1990s, when all the other universities were coming online? Or, even easier, DHS could have put a robots.txt on that site, which effectively blocks the Internet Archive, and then co-opted all the other archives to hold the backdated fake pages. DHS already got a .edu registration; you're not supposed to be able to do that, but obviously they can. So why do we think they'd stop at that boundary? They just didn't think about it. But now that I've told them, they'll think about it.

Now, at this point you're probably thinking, shouldn't you be talking about blockchain by now? Blockchain to the rescue, with the lasers and sirens and a soundtrack and so forth. We're not going to do that. Everyone's seen all the Rosenthal articles; we all know that's not good.
There's no shortage of "deepfakes versus blockchain" articles. And some of them, I'm going to claim, start to look a lot like a Voight-Kampff test, where you have to prove that your photo is legitimate and so forth. There's a good article that talks about the unintended consequences of that: more likely than catching deepfakes, we will succeed at surveillance, at tracking who posted which photo. On the other hand, I'm less worried, because anyone with pets knows about microchipping and the multiple competing standards for microchipping our pets: we can't even blockchain our pets. I find it hard to believe that we're going to be able to blockchain all the photos that come out of our phones. We'll note that the synthetic pets in Blade Runner have serial numbers; that's how you know it's the future. They figured out how to blockchain their pets.

As far as blockchains and web archives: you might have seen this blog post from almost two years ago, where they claimed, oh, we've timestamped and hashed everything in their archive. Then you read further on, and it says that basically what they did was the MP3s and the PDFs and so forth, which is awesome, hooray, but they didn't do it for the web pages. And they said, we're going to get to that in the future. That's not going to happen. Absolutely not going to happen, at least not in the conventional way you're thinking of.

We have ongoing work now where we sampled 16,000 mementos, archived pages, distributed across 17 public web archives, and replayed each page 35 times over the course of a year. Here we have a visual representation of it. This is a University of Michigan page as archived in Perma.cc. We see that at some point they have some security problems with their server; they upgrade the server; it's offline for a while; we get different content; it comes back online; and then it becomes variable: we never quite get the same thing back each time. And the punchline here is that 16,000 times 35 is a big number. If we do a histogram of how often we get different hashes, about one in eight of the pages gave the same hash all 35 times. And ironically enough, those came from the worst archive, WebCite; the reason it gave the same hash is that it doesn't know how to process JavaScript. The right-hand side is more concerning: one in six pages gave a different hash on every one of the 35 visits. Every time you visited, it produced a different result. And that alone should make clear that conventional fixity-based approaches are not going to work, not through a playback mechanism.

Now you're thinking: hash the screenshot, not the HTML. That doesn't work either. Here is one WARC file, one archived page. Horizontally, we have two different Wayback Machines, and vertically, we have three different browsers. The takeaway is that even with the same source material, which never changed, when it's played back and then rendered, we get six different answers. And which one is the right answer? Some are clearly better than others, but it's not entirely clear which one is right.

Now, why don't we create a LOCKSS for web archives? I've mentioned LOCKSS a couple of times, right? Why don't we do that? Here's the dirty secret: web archives are not very interoperable. And here we have the upside-down tortoise, if you remember your Voight-Kampff test. There are many issues regarding interoperability; I've got 99 problems. I'm going to show you one of those problems, just as a parable.
So here's that tweet I showed earlier, and I pushed it into the Internet Archive, and it looks pretty good; we get the banner and so forth. I take this page from the Internet Archive and I push it into archive.today. That's still pretty good. It's taken away the Internet Archive banner, but inside its own banner it's smart enough to know that the page came from the Internet Archive. Then I take this page and push it into Perma.cc, and it keeps the archive.today banner and adds its own banner, but the content below is starting to change; we're just seeing the banner image and so forth. Now I take this Perma.cc page and push it into WebCite, and it's totally banged up at this point, in part because WebCite doesn't do JavaScript well and has its own idea of frames, but you see in the very bottom left there's this little portion of archive.today that's holding on. And then you're thinking, well, that's WebCite, an archive that's not being developed anymore; that's not fair. So here's that same Perma.cc page pushed back into the Internet Archive, and that's not much better: we just get the banner, but none of the content. Then I take this page, push it back into WebCite, and we've lost everything at this point. This is not the only problem with web archive interoperability, but it gives you an idea. Every web archive thinks it's the center of the universe, and it'll be clever at dealing with any page it finds, except a page from another archive. We hate those guys, right? So the metaphor here is that web archives are like cats. If you're a pet owner, you know that cats don't really have a strong pack mentality; they do their own thing.

All right. Now to summarize: existing trusted archives can be compromised in a number of ways. One, they can crawl malicious pages, pages that know they're headed for the Internet Archive and use JavaScript to attack the archive. Two, we can attack the facilities and personnel. Three, we can just get a court order, and that's probably the easiest way to do it. The other aspect is that the lowered threshold for setting up archives means we're going to have a lot more untrusted archives. People are going to try to commercialize archives, and there will be a lot of activity in this space; most of it will fail, but it's going to produce even more uncertainty. There's no reason to think that Russia and the Saudis and so forth aren't investing in their own archives. They can afford to run a long-game archive, where they set things up and it does the right thing for a long time and then defects later on. They can also set up sock puppet archives, where it looks like there are multiple archives, and our state of interoperability is so poor that we can't track what moves between archives. And again, the nature of web archives is to change content, so the current fixity-based approaches won't apply.

Now, looking forward, I don't have a lot on how things are going to get better, but I'll give you a little vision. We need new models for web archiving and for verifying authenticity; we need a Voight-Kampff test for web archives. The Heritrix/Wayback Machine technology stack is very successful, but it has in some ways limited our thinking about what it means to have an archived web page. I'm just going to make an observation: we spend a lot of time on the web proving that we're not robots. We solve CAPTCHAs, and yet almost half the traffic on the web is in fact bot traffic.
It's true in web archives as well: robots are the primary consumers of archived web pages. I don't know exactly what the solution will look like, but it's got to leverage things like click farms. Click farms exist to drive up fake engagement with videos, so that you have thousands of views and so forth. This is the cyberpunk future as well: people produce videos, and then we have all this fake-slash-real consumption of the videos just to make the numbers go up. But we need to leverage this kind of concept for observing web archives. So I'm going to go back to Cliff Lynch. He had a paper out a year or two ago, and there were a million things in it. One of them is documenting instead of archiving the web: the idea of robotic witnesses, new Nielsen families. This also connects with something I wrote a while back about game walkthroughs as a metaphor for preservation. This was a game that I played constantly in my teenage years, Star Raiders. How many people played Star Raiders? Just me? It was an awesome game. You're thinking, I don't believe it was an awesome game. Well, there's an embedded YouTube link of somebody who's a very good player solving the game, and you can experience how I spent my teenage years by watching that video. I'm going to claim we can learn a lot from recording "this is what we saw when we crawled it." And then we need to figure out, as I replay it, can I measure the difference? We have to assume there's going to be a difference, but how can we measure it?

All right, now finally, some of you are thinking: I'm still not convinced about Blade Runner; I already know I'm not a fan. What can I take away from this talk if I'm not a Blade Runner fan? My wife refers to the movie as "serious white guys talking." So I realize not everyone will be a Blade Runner fan. But after this talk, or later, going home, you might be in crowds of people where you're talking with fans; we saw there are a lot of people here who are fans. So I'll give you two ways to pass the Voight-Kampff test for Blade Runner fandom. Number one, you ask the question: is Harrison Ford's character, Deckard, a replicant? Interestingly enough, the answer depends on the source material. In the book, Deckard is a human; it's clear, and Philip K. Dick said he's definitely a human. Ridley Scott changed it, but there are several different versions of the film. In some versions it's ambiguous whether or not he's a replicant, and in some it's very clear that he is. So you can ask this question, watch all the other fans fight, walk away, and you'll have faked it. The second is the Tears in Rain monologue. Who likes this? You just ask whether it's the greatest monologue in science fiction, or the greatest monologue of all time. All right, thank you. I look forward to your questions.

I'll ask you one, because I think you glossed over it towards the end. This whole thing has been reminding me of the problem of archiving games: what is the thing you're trying to archive? Is it the code, the GETs of the webpage, or is it the experience of it? You did talk a little bit about this, but there was a joke at MIT that if you wanted to archive the digital theses, you would print them out, shrink-wrap them, and send them to the salt mine, because that was the only way we could be sure. And of course that's silly, because a lot of them can't be printed.
But why not just videotape the screen or something? You talked about that a little at the end. So how does that relate to all this research you're doing?

So let me take a step back. I think we need to reexamine what it means to have an archived version. The Heritrix/Wayback Machine approach gives us the HTML, it gives us the JavaScript, it renders client-side, everything's copy-and-pasteable, and that makes a lot of sense most of the time. archive.is, a web archive, actually has a slightly different model: it gives you HTML, but server-side, at crawl time, it flattens out all the JavaScript. That removes all the interactivity, which is both good and bad: the good part is that it freezes the page in time, and it doesn't screw up anymore. Now, we could also do screenshots, and we kind of have that. And then I think we add the game walkthrough, the video component, the robotic witnesses, sort of filming and recording people playing the game. So Google Maps: what does it mean to archive Google Maps? We're not going to archive all of it, but maybe we can archive a walkthrough through Google Maps, and maybe that helps us establish a baseline for what we can expect from the page I'm getting back later, with or without JavaScript. The flip side is that the footprint to archive a page just ballooned. But on the other hand, once we get to real-world implications of incorrectly playing back a page, like governments making decisions, militaries making decisions, political outcomes, maybe that's a level of investment that we have to have. But I don't have all the answers on that, other than to say: read Cliff's paper; it has a lot of good ideas. Anyone else? Anyone want to talk about whether Deckard's a replicant?

So, Michael, I'm trying to sort through this; there are many unsolved problems. And thank you for this uplifting ending with Blade Runner; this was well done. What are your thoughts on getting more archives, or more archiving activity? Other than an upside-down tortoise test, what can we be doing in terms of simply collecting more, and doing more collecting other than through the Internet Archive, as a good first step?

So what I would like to see is more interoperability between independent archives. There's pywb, there's OpenWayback; that's a good start. I'm concerned that some universities are closing down their archiving efforts, and I understand that; it's easier just to pay a subscription to Archive-It. I would like to see multiple independent web archiving efforts, with variation in the technology stack. I mentioned archive.is, archive.today; WebCite is sort of interesting. Archiving solutions that come from different communities have different presuppositions, and some things they get wrong, some things they actually get surprisingly right. I think setting up more Wayback Machines is a good start, but we need to investigate something beyond just Heritrix and Wayback Machines. And I know we're moving towards headless browsing and so forth, but the Wayback playback paradigm is still constricting our way of thinking. So part of this is: what does it mean for images? What does it mean for video? How do we interoperate and say, I observed this, you observed that, let's measure the difference between them? And we need the semantics to express that some of my content I observed independently.
Some of it I got as a seed from the Internet Archive, because I didn't exist in 2005, so I just got a bunch of their stuff, loaded it up, and replayed it. So I'd like to see more of a move towards standards and a focus on interoperability, and I welcome more people participating, even though it comes with the associated security and authenticity risks. Now, I also admit that's completely self-serving, because I do web archiving research, so obviously the answer is we need a lot more web archives. I get that. But on the other hand, I do think it's important that we move in this direction.

Hi, this is more of a comment than a question, but this was really inspiring, or actually, what's the word I'm looking for? My mind is kind of on fire thinking about all of this. And I'm going back in my own head to earlier days in my archival career, when many universities relied on printing emails as a way of archiving, and I was an archivist who made the case to our chief business officer that in a court setting you need to produce the email in native format; you can't just print it out. And I feel like what you're showing us is that, as cool as it is to have a printout of an email, or as cool as it is to see a rendering of old web content, it's not admissible, you know; it's not the native format. And that just calls into question the whole larger digital preservation infrastructure that's required if we are going to have to reproduce evidence. So thank you.

Right. Now, I've sort of cherry-picked my examples, I have to admit; I didn't show you the gazillion cases where everything is unambiguous and works right. So there are certainly some cases where you can say, that probably is what we saw at the time. But there's enough ambiguity that if anyone wanted to hire me for the defense, I could mount a pretty convincing argument that you can't trust this stuff, because I can produce convincing fakes easily.

Hi, Michael, thanks for the talk. This is really important stuff; temporal violations are like the first thing we now talk about when we're teaching our students to work with web archives. But I was wondering, as you were going through the sort of dystopian scenarios, someone coming into the Friday lunch and kidnapping Jefferson Bailey, or the really heavy force of a state actor trying to manipulate these archives, the question that came to my mind is: how is this different from a traditional archive? I've worked at our national libraries in Canada, and I do have to take a lot of it on faith: that the censor who's going through during the ATIP request is being faithful, that someone hasn't manipulated the records, that the person who donated the records was doing so honestly. There are a lot of things where I just put faith in Library and Archives Canada. I don't know if you have any comment on what makes web archiving different from the faith you have to put in a traditional archive.

Right. So I tried to give a little head nod to the fact that we've had fakes forever. Part of it is the reach involved. The IRA had a significant impact on, well, I guess you can argue about how much impact, but they certainly tried to have an impact through social media. They could have used radio, they could have used TV, they could have used conventional communication mechanisms, but for a small investment, they were able to have a tremendous impact. I think that's the key difference.
If they wanted to run a hearts-and-minds operation, and they were able to establish something as trusted as the Internet Archive, or to corrupt the holdings of the Internet Archive, that would be worth their effort, because people implicitly trust the Internet Archive at this point. So I think at some point that empty box in the slide will be filled in with some significant impact, and even if it later comes out that, oh, we discerned it was fake, you can't unring the bell. That's one of the things about a disinformation campaign: even after it's fixed, the damage is still done.

That was just amazing, and I guess if you didn't have enough to worry about before, you should feel really fully satiated at this point. Michael, that was really tremendous; thank you so much. We did take video of this, so it will be available in a bit, and you may want to share it with some of your colleagues. I think there's an awful lot to think about here, but with that, let me draw this meeting to a close. I wish you safe travels, and I hope to see many of you at various gatherings in the coming months, and again at our December meeting in Washington, DC. Thank you so much for joining us.