So, good afternoon. It's really good to be back at CNI. Chris and I have divided this briefing into two parts: the bad news, and the maybe-not-quite-so-bad news. And in my traditional role as Cassandra, I get to go first with the really bad news. So, we just heard Herbert tell us a lot about how to publish on the web the right way. The bad news can be summed up in two parts. One of them is that people who publish on the web don't actually pay any attention to what Herbert says. And the second one is that Herbert was glossing over some rather hard issues along the way. So, it's not your grandfather's web any longer.

Well, I am actually a grandfather, and I'm old enough to remember the web the way it was. Actually, I'm old enough to remember back a long time before there was a web. About this time in 2015 will be the 50th anniversary of the first program I ever wrote. A couple of years later I started life as a Cambridge undergraduate and encountered computer graphics. This is a DEC computer with a 340 display, which is how I started doing computer graphics. And just before that, one of the smartest people I ever ended up meeting co-authored a paper on computer graphics entitled On the Design of Display Processors. That was Ivan Sutherland. He introduced the idea of the wheel of reincarnation as applied to graphics hardware: the idea that the design of the hardware was cyclical. It started out with a fixed-function I/O peripheral that over time grew into a programmable I/O processor, which ended up being a fully functional computer connected to a fixed-function display.

And that wheel is still rolling right along. When we started NVIDIA 20 years ago we designed a fixed-function graphics chip. Almost all our competitors, and within six months of starting NVIDIA we knew of over 30 of them, designed fully programmable graphics chips. This was one of the key performance advantages that allowed us, for the very first time ever, to get arcade games running at full frame rate on a PC. This is Sega's Virtua Fighter on an NV1 chip, preserved on YouTube. The bottleneck was the frame buffer memory, and the advantage we had was that we used every single memory cycle to render graphics. All the programmable chips had to spend some of their memory cycles fetching instructions for their programs out of the frame buffer. Of course, now NVIDIA's chips are fully programmable GPUs. In 20 years they've gone halfway around the wheel, from fixed function to fully programmable.

The wheel also applies to graphics and user interface software. The early window systems, like SunView and the Andrew window system, were libraries of fixed functions, as was the X window system, which ended up being very successful. James Gosling and I tried to move a half turn around the wheel with a system called NeWS, which was a user interface environment fully programmable in PostScript. PostScript is actually a pretty neat programming language, but we were premature and it didn't work out.

What's this got to do with the web? Well, the half turn around the wheel that James and I couldn't manage has actually happened on the web. The web that Tim Berners-Lee invented was a practical implementation of Ted Nelson's utopian concept of Xanadu: a web of documents connected by hyperlinks, encoded in a fixed-function document description language. That web was pretty easy to ingest for preservation, because a crawler could visit a page, easily find the links in it, and follow them to the other pages it needed to ingest.
And it was relatively easy to preserve once ingested, because the content of each document changed infrequently, so two visits in succession would probably obtain the same content. The phenomenal success of the Internet Archive was based on this model. Here, from the Wayback Machine, is the front page of BBC News from more than 15 years ago. The only difficult aspect was replaying the preserved content so that the links resolved to their preserved targets rather than their current targets. It wasn't until the work that Herbert just told us about, Memento, that this part of the problem was resolved.

But, as we should have predicted based on the history that I went through, the web we all use today is a half turn around the wheel from that web. That web's primary language was HTML, a document description language. The browser downloaded documents and rendered the fixed set of primitives they contained. Browsers still do that; of course, once something like HTTP or HTML gets widely used, it can't really be changed in incompatible ways. But mostly what browsers do these days is download and run programs in the current web's primary language, which is JavaScript. JavaScript is a programming language, not a document description language. Your browser is only incidentally a document rendering engine. Its primary function is as a virtual machine host. So I use a Firefox plugin called NoScript. This screen grab shows that at least 11 sites wanted to run programs in my browser as I visited the front page of the New York Times a couple of weeks ago.

So Chris and I organized a workshop last year to look at the problems this evolution of the web poses for attempts to collect and preserve it, and here is a list of the problem areas the workshop identified. Now, clearly I don't have time to go through all of these individually in detail. For that, you should check the document that the workshop generated. And you don't have to take notes about my part of the talk anyway, because the text will go up on my blog, blog.dshr.org, at the end of this talk. And will I then go back and incorporate fixes? No. I take a lot of care writing these things; when they're written, they're done. I don't want people seeing previous versions of what I wrote. So I'm going to try and abstract out of this mess a few big-picture problems.

First, in order to preserve content from the web, you need to be able to access it. Increasingly, as news disappears behind slightly porous paywalls and sites use social network identities to gate access, this is becoming more difficult. Equipping a crawler with a credit card to pay for access is not really a practical approach. But the problem is compounded by another trend: sites that forbid access, for example because the content is behind a paywall, and show you a login page, no longer do so with an error code such as 403 Forbidden. They put up the login page with a 200 OK. And this is a result of that behavior. This is an article from the defunct journal Graft as preserved at the Internet Archive, and the message here is "This item requires a subscription to Graft Online." This, from the CLOCKSS archive, is what the article looks like if you actually have a subscription. A web archive full of pages explaining that the actual content is inaccessible may be an accurate reflection of the state of the web at that time, but it's not really a great way of preserving our digital heritage.
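To make the problem concrete, here is a minimal, hypothetical sketch of the kind of heuristic a login-page detector might use: fetch a page, and if it came back 200 OK, look for tell-tale subscription or login boilerplate in the body. The URL and patterns are invented for illustration; real detectors, like the per-publisher ones LOCKSS uses, are hand-crafted and much craftier than this.

```python
# Toy sketch (not LOCKSS code): spot a "200 OK" response that is really a
# subscription/login page rather than the article it claims to be.
import re
import requests

# Hypothetical patterns; real detectors are hand-crafted per publisher site.
LOGIN_PAGE_PATTERNS = [
    re.compile(r"requires a subscription", re.I),
    re.compile(r"please (sign|log) ?in to continue", re.I),
    re.compile(r'<form[^>]+action="[^"]*login', re.I),
]

def is_disguised_login_page(url, timeout=30):
    """Return True if the page returns 200 OK but looks like a login wall."""
    resp = requests.get(url, timeout=timeout)
    if resp.status_code != 200:
        return False   # an honest 403/401 is the easy case; the 200s are the problem
    return any(p.search(resp.text) for p in LOGIN_PAGE_PATTERNS)

if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(is_disguised_login_page("https://publisher.example/article/123"))
```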
Determining that content that was delivered with a 200 OK code is actually not valid content to preserve is hard. The LOCKSS system is designed to preserve subscription content, so this is a problem we face every day. We have to write really crafty, custom per-site login-page detectors in order to handle it.

So when somebody says HTML, what you think of is something that looks like this. It looks like familiar HTML, but actually it's an HTML5 geolocation demo. Most of what you actually get from web servers now is programs, like the nearly 12 kilobytes of JavaScript included by the three script lines near the top of the page. This is by far the shortest of the three files, and it looks exactly like a program, right? A crawler collecting the page can't just scan the contents to find the links to the next content it needs to request. It has to actually execute the program to find those links.

This raises all sorts of issues. The first is that the program may actually be malicious, and a lot of them out there on the web are. So its execution needs to be carefully sandboxed. Even if the program doesn't intend to be malicious, its execution will require an unpredictable amount of time, which can amount to a denial-of-service attack on the crawler. How many of you have encountered pages that froze your browser? Everybody's nodding. Execution may not even be slow enough to amount to an attack, but it will be a lot more expensive than simply scanning the page for links. So ingesting content from the web just got a lot more expensive in compute terms, which doesn't help with the whole "economic sustainability is job one" problem.

And then again, it's easy to say "execute the content", but the execution of the content depends on the inputs the program obtains. In this case, the inputs are the geolocation of the crawler, and the weather at that location at the time it was crawled. In general, the inputs depend on, for example, the set of cookies, the contents of HTML5's local storage in the browser that the crawler is emulating, the state of all the external databases the program may call upon, and the user's inputs. So the crawler has not merely to run the program, it also has to emulate a user mousing and clicking everywhere in the page, looking for behaviors that trigger new links.

But we're not just finding links for their own sake; we want to preserve those links and the content they point to for future dissemination. If we preserve the programs for future re-execution, we also have to preserve some of these inputs, such as the responses to database queries, and supply those responses at the appropriate times during the re-execution. Other inputs, such as the mouse movements and clicks, have to be left to the future reader to supply. This is very tricky, and there are all sorts of other issues, like having to be able to fake the secure HTTPS connections, and so on. Re-executing the program in the future is a very fragile endeavor. This isn't because the JavaScript virtual machine will become obsolete; it's very well supported by several open source stacks. It's because it's very difficult to be sure which of the significant inputs you need to capture, preserve and re-supply. A trivial example is a JavaScript program that displays the date.
Is the correct preserved behavior to display the date when it was ingested, preserving the original user experience, or to display the date when it is re-executed, preserving the original functionality? There's really no right answer.

Among the projects that are exploring the problem of preserving executable objects are Olive at Carnegie Mellon, which is preserving virtual machines containing executable objects, though I don't believe it's really exploring preserving their inputs, and Workflow 4Ever (Wf4Ever), one of Carole Goble's projects, which is trying to encapsulate scientific workflows and adequate metadata for their later reuse into research objects. The metadata includes sample datasets and corresponding results so that correct preservation can be demonstrated. They've just written a paper showing that generating the metadata for the re-execution of a significant real workflow is a major effort. And even for workflows, which are a simpler case than generic JavaScript, Wf4Ever has to impose some restrictions in order to make it work. You can think of this as analogous to PDF/A, which turns off all the hard-to-preserve aspects of PDF. For the broader world of web preservation, this "HTML/A" approach isn't likely to be robust enough, even if we could persuade websites to publish two different versions, one of them for use and one of them in "HTML/A" for preservation.

Well, an alternative that preserves the user experience but not the functionality is, in effect, to push the system one more half turn around the wheel, reducing the content to fixed-function primitives: not to try to re-execute the program, but to try to re-render the result of executing the program. The YouTube video of Virtua Fighter is a simple example of this kind. It may be the best that we can do to cope with the special complexity of video games, most of which have complex DRM and server-based backends and things like this. In the re-render approach, as the crawler executes the program, it records the display and builds a map of the sensitive areas with the results of activating each of them (there is a minimal sketch of this idea below). You can think of this as a form of preemptive format migration, which is an approach that both Jeff Rothenberg and I have been arguing against for a long time. As with games, it may be that this, while flawed, is the best we can do with the programs on the web we have.

So now the question is, what on earth are all these programs doing in your browser? Mostly what they're doing is capturing information about you so that it can be sold. I won't shed a lot of tears if we fail to preserve this aspect of the web. But some of the captured and sold information drives what you see in the page, such as the advertisements. I've never understood why archivists think that preserving spoof ads, whether they're selling fake products or real products, is important, but preserving real ads is not, even though they dominate our political discourse. The programs that run in your browser these days also ensure that every visit to a web page is a unique experience, full of up-to-the-second personalised content. The challenge of preserving the web is like that of preserving theatre or dance. Every performance is a unique and unrepeatable interaction between the performers, in this case a vast collection of dynamically changing databases, and the audience. Actually it's even worse: it's like preserving a dance performed billions of times, each time for an audience of one, who is also the director of their individual performance.
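Here is the minimal sketch of the re-render idea mentioned above: drive a headless browser, record the rendered display as an image, and record each link's position and target as a map of the sensitive areas. It uses Playwright purely as a present-day stand-in for the PhantomJS-era tooling discussed later in this talk; it is an illustration of the idea, not how any particular archive implements it, and it only handles links, not the arbitrary script-driven behaviors a real crawler would need to probe for.

```python
# Sketch of the "re-render" approach: capture a page's rendered appearance plus
# a map of clickable regions, instead of preserving the executable page itself.
import json
from playwright.sync_api import sync_playwright

def snapshot(url, image_path="page.png", map_path="clickmap.json"):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")      # let scripts settle
        page.screenshot(path=image_path, full_page=True)

        # Record each link's bounding box and target: the "map of sensitive areas".
        regions = []
        for a in page.query_selector_all("a[href]"):
            box = a.bounding_box()                    # None if the element is hidden
            if box:
                regions.append({"href": a.get_attribute("href"), "box": box})
        browser.close()

    with open(map_path, "w") as f:
        json.dump(regions, f, indent=2)

if __name__ == "__main__":
    snapshot("https://example.com/")                  # placeholder URL
```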
So we need to ask what we're trying to achieve by preserving web content. We haven't managed to preserve everything we've done so far, and we can expect to preserve even less going forward. So there is a range of different goals. At one extreme, the Wayback Machine tries to preserve samples of the whole web. Now, when Brewster originally told me that this is what he wanted to do, I thought he was nuts. But I was totally wrong. A series of samples of the web through time, even if they're noisy, turns out to be an incredibly valuable resource. In the early days the unreliability of HTTP and the simplicity of the content meant that the sample was fairly random. The difficulties caused by the evolution of the web are introducing an increasing systematic bias towards the pages that are easy to capture. The LOCKSS program, at the other extreme, samples in another way. We try to preserve everything about a carefully selected set of web pages, mostly academic journals and books. The sample is created by librarians' selection. It's a sample because we don't have the resources to preserve every academic journal, even if there were an agreed definition of what an academic journal is. And again, the evolution of the web is making this job gradually more and more difficult. What we're preserving more and more is the easy-to-preserve parts.

So what web archiving is about is preserving a sample of the web. We're not going to get everything, and the web's evolution is introducing a systematic bias into this. In order to combat this, we need to have a diversity of approaches. Most web archiving at the moment is done by the Internet Archive and by other people using the same tools as the Internet Archive, and this is dangerous because everybody's got the same systematic bias. And the good news, to transition to Chris, is that at least Memento provides a way of unifying diverse collections from the web into a single resource that people can access uniformly. Okay, Chris.

So what I'm going to attempt to do is provide a little bit of hope, a little light at the end of the tunnel. As you can tell from this slide, which was actually produced by a project called 10x working on the semantic web end of the spectrum, the web is an incredibly complex ecosystem today. It's not any one set of techniques, approaches, or implementations. It's a diverse network of everything you can imagine, and all the things that we haven't thought of yet that somebody in their garage is going to put out there in the next two to six months. And as archivists in this context, we are charged with trying to come up with pragmatic, scalable, sustainable approaches to sample, collect, and preserve as much of this material as possible. And as David pointed out, there's a whole other set of issues relating to access to these materials. Because we can't cover the full spectrum of all of that today, what I'm going to concentrate on is some of the innovations happening on the collection and data preparation side of the equation that enable that access. There's a lot of additional research and innovation happening on access, but we're not going to be able to talk about all of that here today.
So in this context, even if you're taking an entire national domain and crawling every registered entity within that sphere, you're going to touch upon everything, at least in these three squares, and a few things that start to spill into what we're referring to as the ubiquitous web, which gets a lot more complex. But today we regularly address those three. And what we used to think was that smaller, more curated collections of content would be easier to manage. Unfortunately that's not the case. If you even try to create a capture of the congressional branch of the U.S. federal government, you will hit every single thing displayed in this visual. The spectrum of content that we're wrestling with on a day-to-day basis continues to get more and more diverse, as David articulated.

So what I want to do is try to set the stage on how we're approaching this. First and foremost, it's important to understand what the classical model of web harvesting looks like. Specifically, you're asking a question of a resource, usually while identifying yourself as a particular client. You might be an iPhone 5; you might be a Firefox web browser of a particular version. But you are identifying yourself in a specific way, asking a specific question, and then waiting for the answers to that question to be returned. You examine the output and then identify which pieces of that response are important to add to the information that you now know is out there and need to go and visit. These are primarily unique URIs, but occasionally you're also identifying specific MIME types that need closer attention or special processing and analysis. So this is the classical model of a traditional web harvester. A lot of what we're basing this on, of course, is the most widely used open source web crawler, Heritrix. But there are lots of other proprietary crawlers out there that obviously use this same framework.

In terms of trying to talk through where things are evolving, it's important to understand the difference between how a crawler interacts with a resource and how a traditional client, even if you're mimicking that client in this traditional harvesting context, might. Browsers are really optimized for getting only as much as you need to display the particular view that the end user is experiencing at that moment in time. A browser can then respond to inputs that allow you to go deeper and deeper into that resource. A crawler can't easily get additional content that would only be initiated and identified by a user interaction. A crawler often cannot, as a standalone entity, execute a script and discover links. There are extractors that have been designed, developed and embedded to try to mimic that behavior, but they don't act in the same way as a browser might. So a lot of what you're having to do with a crawler is contain it a little bit, because it can quickly overwhelm the server and cause unintended consequences for other users' interaction with that resource, which is unacceptable in general web activity. So there are a lot of rules that have to be structured in order to guide the behavior of a crawler while it's in progress. And depending upon what kind of institution you are and what rights you hold, there are specific rules that must be obeyed, or should be obeyed, or might be obeyed in different contexts.
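To make the classical harvesting model concrete, here is a toy sketch of that loop: identify yourself with a User-Agent string, respect robots.txt, fetch a URI, note its MIME type, scan HTML for links without executing any scripts, and wait politely between requests. The bot name and delay are invented for illustration; this is the shape of the model, not Heritrix.

```python
# Toy sketch of the classical web harvesting model (not Heritrix): identify
# yourself, obey robots.txt, fetch, scan markup for links, note MIME types,
# and wait politely between requests. No script execution happens here.
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

USER_AGENT = "ExampleArchiveBot/0.1 (+https://archive.example/about)"  # hypothetical
DELAY = 2.0   # seconds between requests; real politeness rules are much richer

def crawl(seed, limit=10):
    robots = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    frontier, seen = deque([seed]), set()
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        if url in seen or not robots.can_fetch(USER_AGENT, url):
            continue
        seen.add(url)
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        mime = resp.headers.get("Content-Type", "")
        print(resp.status_code, mime, url)
        if mime.startswith("text/html"):
            # Link discovery by scanning the markup only.
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                frontier.append(urljoin(url, a["href"]))
        time.sleep(DELAY)

if __name__ == "__main__":
    crawl("https://example.com/")
```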
And this is where all of the gray zone comes into play, and why there's so much complexity even when you get down to the Memento end of the spectrum, where Herbert was referring to access rights and the issues those raise. You can have a public web archive, but you might not be able to view any of the video content embedded in pages, because there's a restriction published out on the live web by the owner of that content that says: nope, sorry, you can't see this. And depending upon whether you are the national institution that is legally mandated to collect this and make it accessible again, that access may only happen within the four walls of your reading room and not over the public web. So those are some of the things that come into play between the two. And this is just a quick visual that gives you a little bit of an idea of what's happening simultaneously when a browser goes out to a website, in terms of collecting all the resources that might need to be assembled in order to view it.

So what's happening in the web in general? This is not specific to cultural heritage or academia. The proprietary commercial web is also spending a lot of time integrating browsers and traditional crawling methods, primarily for discovery, for supporting the search engines that we all know and love and have come to depend on in our day-to-day lives. However, the difference is that those engines just need to be able to discover the location of a particular resource and try to identify meaningful text and context, perhaps metadata that they can collect around that resource and surface back up in a discovery context. When you're talking about an archival capture, the bar right now is set very, very high: you're trying to recreate an experience that perhaps a human being sitting in front of a device experienced at a date, at a point in time, at a location, on the web. So you're taking all of that into account and attempting not only to capture and preserve it, but then to have a chance of re-rendering that context at some point in the future. So as I talk through this quickly, please keep that context in mind.

There's a lot of tremendous commercial innovation that we can take advantage of, but it only gets us so far, and then we as a community have to carry it further from there, because until some of the corporate and legal compliance that is now affecting publishers around the globe really comes into effect, and there's a mandate to preserve this content for more than 10 years, we're probably not going to see as much innovation there. However, we do think that personal web archiving will continue to provide some interesting innovations for us as a community, because more and more individuals have their entire lives in digital form across a wide array of sites that publish and aggregate and maintain this content on their behalf, whether it be photos, videos, blogs, or other types of content. So again, the commercial sector is starting to tackle some of the same issues that we are.

In general, the experimentation with browsers has fallen into three camps. First, you have extractor modules that are written for traditional crawlers and embedded in them.
The idea being that you send a set of locations to a browser emulator, which may be a fully instantiated browser or a headless browser, that basically records the communication between client and server, records all that information, and passes it on to the crawler for collection. Second, you have standalone headless browser clusters being used for similar purposes, primarily for seeding data collection and for quality assurance during and following a crawl, whereby you can look at all of the resources that are written to a particular repository, compare them with the communication that you have documented, and determine whether or not everything has been captured. If not, it gets queued up either for a special process to go out and explicitly grab a particular file, or just as part of a typical patch-crawl cycle. Third, at another end of the spectrum, you have the concept of a more fully functional engine that allows site publishers to test their design and implementation of standalone websites. That infrastructure is largely available in open source form; there are a number of different projects and programs out there. Much of the International Internet Preservation Consortium community has started to standardize on solutions that integrate PhantomJS. It just happens to scale particularly well, and there are some other benefits from this particular solution. But that approach is also being used as a standalone capture mechanism: using scripts to browse out, collect content, and write it using an archival file format for web content called the WARC file. We use it at the Internet Archive to regularly capture YouTube video at scale. There are weeks when we are capturing millions and millions of videos and writing them directly to WARC. It's not integrated with some other crawl process.

There's also experimentation going on with recording and snapshot generation. If you can't actually capture everything needed to render the experience that an end user might have, as David pointed out, it's being used not just in the video game context, but also just to say: okay, for every registered entity in .com, let's take the front ("/") page and take a snapshot of the live resource. Because at least at this date and time we have an idea of what the default representation was: if you're not logged in, if they don't know anything about you, what did you see? Because the crawler doesn't always have an identity. We can give it an identity; there are mechanisms, as David described, to create a login, to create a persona. But in general, if we're trying to do a sampling at a very shallow level across an entire domain, it doesn't have an identity, and we're presuming there's no login associated with that experience.

So these are just some examples of the browser work that's ongoing. But what's more interesting to look at is recent initiatives around merging browsers and crawlers. As it turns out, the Heritrix open source crawler makes a set of synchronous requests. We could modify that software to also support asynchronous requests, but right now it's heavily architected around synchronous requests. So one of the things that we were trying to do in integrating the browser mode and the crawler mode was to avoid issues of scalability. We can't wait for hours for all the resources to synchronously return. We also can't afford to revisit the same content over and over again; it's incredibly inefficient.
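Here is a sketch of the deduplication idea, under the assumption that identical payloads can be recognized by a content digest: hash each response, and only write a full WARC record the first time a payload is seen. It uses the warcio library for the WARC writing; Heritrix and the deduplicating proxies discussed here have their own, richer mechanisms (including WARC "revisit" records), so treat this as an illustration of the concept only.

```python
# Sketch of digest-based deduplication while writing captures to WARC files,
# using the warcio library. Illustrative only; real systems write "revisit"
# records for duplicates instead of silently skipping them.
import hashlib
from io import BytesIO

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

seen_digests = {}   # payload digest -> URL of the first capture

def capture(url, writer):
    resp = requests.get(url, timeout=30)
    digest = hashlib.sha1(resp.content).hexdigest()
    if digest in seen_digests:
        print("duplicate of", seen_digests[digest], "->", url)
        return
    seen_digests[digest] = url
    headers = StatusAndHeaders(f"{resp.status_code} {resp.reason}",
                               list(resp.headers.items()), protocol="HTTP/1.1")
    record = writer.create_warc_record(url, "response",
                                       payload=BytesIO(resp.content),
                                       http_headers=headers)
    writer.write_record(record)

if __name__ == "__main__":
    with open("capture.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for u in ["https://example.com/", "https://example.com/index.html"]:
            capture(u, writer)
```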
So the crawler in its own right has mechanisms for maintaining this, but when you introduce another element like the browser, which is trying to call everything at once in order to create a particular view, how do you go about managing and maintaining that? So I want to talk to you about three implementations that are starting to try to address this. I'm just going to get my clock going to make sure I have it.

In terms of the key projects, most of these have been incubated by members of the International Internet Preservation Consortium. That's a group of institutions that are constantly trying to trade technology and best practice, and even do open engineering exchanges where people will go for months at a time to help develop specific solutions. But also, regionally, we've found more and more the emergence of projects that are trying to open up and create solutions that are not web-specific, but actually address a broader array of content types and repository issues that institutions might be facing.

So right now we have a project, incubated out of the Open Planets Foundation, that is looking at integrating a browser extractor module into Heritrix, replacing one of the modules that was designed to try to execute JavaScript and discover links. They have implemented the code; it's publicly available on GitHub. They are implementing it at web scale for the .uk domain. They are sending every registered entity through this extractor module to try to identify all the content that might be missed by a traditional crawler, but they're actually replacing the extractor.

We have colleagues at INA, the French national audiovisual archive, who have a slightly different workflow. They crawl a fixed body of resources that publish content within their cultural domain. They crawl everything, and then they identify, through data mining and other techniques, where there are holes in the process. They take those resources and route them through a browser, but they're using a proxy to try to ensure that there's no duplication between those two efforts, so that there's unique capture of resources and not a broad base of duplication.

A final project that got off the ground as part of the NDSA here in the United States was centered on the federal elections of 2012. In that context, rather than sending seeds or targets that may be used to start a crawl process, what was done was the creation of a hybrid model, where you're using a traditional link extractor in combination with a browser-based extraction that also integrated a proxy to try to deduplicate content. What was routed through the browser was initially all HTML and scripts identified by MIME type; later, due to all kinds of factors, we ended up biasing to scripts only. We hit some scalability issues: even with 5,000 resources, the delays and some of the impacts on those individual sites were significant enough that we needed to focus entirely on the scripts and not on the HTML in its entirety.

Those are just three implementations where this is happening. To give you an idea of what we're finding with some of these experiments, and this is just one baseline test I'm using for illustration, there are dozens and dozens of these experiments that have tried to document the differences. In general, if you compare browser-only extraction with traditional link extraction only, you're missing about 30% of the content.
And the reason is that it's really hard to figure out how to tell a machine how to be a person and execute everything that might be happening in every resource, across a broad spectrum of resources. It's not so hard to do that if you have the time and the mandate to do it on a resource-by-resource basis. So if you're a site owner and you're doing site-based testing, you might have your quality assurance team doing this as regular practice, to save you on manual quality assurance of a particular web resource. If you're archiving a journal and you have a particular mandate to optimize exactly how you need to get in there and collect all of the specific publications, then you invest the human time in order to describe what that process looks like. If you've got 5,000 seeds for websites that are thrown up only for an election, by dozens or more publishers and publishing frameworks, there's no way humanly possible, in the window of time in which these sites appear and disappear, that you can provide that manual guidance. So you have to come up with general rules of thumb that you hope to apply across as many of the resources as possible. That's why you're seeing that degradation in the content found.

When you combine something like a PhantomJS implementation, which has the ability to operate as a full browser instance or a headless browser instance, with traditional link extraction, you actually see an increase in the content that's discovered and able to be collected in a timely way. So we're definitely seeing huge benefits from going down this hybrid path, with even just these two components in combination. The problem is that the processing overhead is significant. None of us in this room are Google. We don't have thousands and thousands of servers on which we can run crawlers in parallel on an ongoing basis, 24/7, 365. So we can't scale rapidly enough, or in a sustainable enough fashion, to address all of the execution that would have to go into this. So once again, these are not perfect solutions, but they're getting us closer toward at least beginning to be able to collect, and to have representative samples at least equivalent to those of the Web 1.0 days.

So I want to talk to you just very briefly about some other techniques that, again, are being used in combination but are not directly linked to the crawl process. This is happening as a pre-process, as a parallel process, or as a post-process. One of the most significant areas of infrastructure and architecture by far is data mining and analytics. We operate a 1.3 petabyte Hadoop cluster at the Internet Archive, and if you're interested in learning more about that, we publish all of the information about the configurations and how we allocate MapReduce versus other tasks in the system. But the important thing to understand about this is that it's a piece of infrastructure that's used by every Fortune 1000 company globally. It is a critical piece of infrastructure that's here to stay in terms of the Web in general and how people are interacting with Web data at scale. And Web data at scale could be 10 resources all the way through to the millions that you might see registered at a national domain level. It's frequently used to identify the difference between an embed and an outlink, and to place priorities on collection of those resources and to differentiate them.
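As an illustration of the kind of job that runs on such a cluster, here is a toy Hadoop-streaming-style mapper that classifies discovered URIs as embeds or outlinks. The input format (one tab-separated "URI, hop-path" pair per line) is invented for the example, with the hop-path letters loosely modelled on Heritrix's (E for embed, L for link); a real job would parse full crawl logs and feed a reducer that sets collection priorities.

```python
#!/usr/bin/env python3
# Toy Hadoop-streaming-style mapper: classify discovered URIs as embeds or
# outlinks from simplified log lines of the form "<uri>\t<hop-path>".
import sys

def classify(hop_path):
    # Treat a trailing E (embed) or X (speculative embed) as an embed;
    # anything else, e.g. L (link), is an outlink. Illustrative rule only.
    last = hop_path[-1] if hop_path else ""
    return "embed" if last in ("E", "X") else "outlink"

def main():
    for line in sys.stdin:
        try:
            uri, hop_path = line.rstrip("\n").split("\t")
        except ValueError:
            continue                      # skip malformed lines
        print(f"{classify(hop_path)}\t{uri}")

if __name__ == "__main__":
    main()
```

Locally you could exercise something like this with `cat links.tsv | ./mapper.py | sort | uniq -c` before handing it to the cluster via Hadoop streaming.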
There's a lot of work that also goes into identifying different MIME types that we can then route through other capture mechanisms. Or, in some cases, we need to take, for example, an embedded piece of social media and actually connect it to a direct archive of a feed that runs completely separately from the crawl itself. So I guess what I would emphasize from this is that there are additional, diverse methods being used in practice by all of the partners that I mentioned, and by some innovative up-and-coming companies that are trying to do this in more of a commercial context. But for anyone engaged in archiving the Web, there's no one set of techniques that's going to work, and unfortunately it's not just the browser-crawler integration that I described; it also requires additional forms of capture that give us additional perspective on the samples that we're collecting and then attempting to make available for research and access. Thanks. Okay, so plenty of time for questions.