Hi, everyone. It's great to be here. I'll be talking about web archiving data and tools. First I'll talk about WARC files, which are one of the core components of web archives, and then in the second part I'll demo a tool I've been working on for the past several years.

A little bit about me: I've been working on open source web archiving tools for the last eight years now. Initially I was at the Internet Archive, working on the Wayback Machine there. I created a project called Webrecorder in 2015, and I've been leading development on it at Rhizome, a digital arts nonprofit, since 2016.

Starting with the basics: what is web archiving, exactly? The basic idea is that it's capturing, storing, and providing access to web content. There can be additional parts to it, but those are the key parts as I see it. It's actually quite distinct from scraping or extraction, since we're not trying to extract certain types of data from the web; we want to get everything, so that it's preserved in an archival fashion. It's also not just saving URLs, which is a little bit different. When we think of the web, we think of HTTP as its main protocol, and web archiving involves capturing the HTTP request and response traffic exactly as it comes down the wire.

Another point is that web archiving doesn't have to mean archiving the entire web. That's perhaps how it started, with the Internet Archive doing a lot of crawling, but there are other approaches. In particular, web archives can be quite small and targeted: small, bounded objects of just one or two pages, or a single website. It's also possible to focus on the quality of a web archive rather than on how many pages you've captured, and quality is really important.

So why do web archiving? A really great example is a graphic, now a couple of years old, from the British Library, one of the institutions that do web archiving. The gray and black areas show the pages that have essentially disappeared from the web and are no longer available. Over a time span of 10 years, possibly sooner, anything that's online will likely eventually no longer be available. A pretty important secondary reason, which I want to mention after the last couple of talks, is that web archiving can also be an essential component of reproducibility, because of how much content is actually on the web. I'll get to that a little later as well.

So how is web archive data stored? The key format for web archives is the WARC format, sometimes just referred to as the WARC file. It was created in collaboration between the Internet Archive and many national libraries in 2005, and it's now an ISO standard; there are actually two revisions of the standard, the first published in 2009 and the second in 2017. It's designed to package HTTP requests and responses, and it also supports deduplication, metadata, and storing essentially arbitrary resources in this package format. Quickly, I'll show an example of these WARC records. The records are simply concatenated together, and a single record looks something like this.
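To give a concrete picture, here is roughly what a WARC response record looks like; the values are illustrative, following the general WARC record layout of record headers, a blank line, and then the HTTP response itself:

    WARC/1.1
    WARC-Type: response
    WARC-Target-URI: http://example.com/
    WARC-Date: 2019-05-03T12:00:00Z
    WARC-Record-ID: <urn:uuid:...>
    WARC-Payload-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
    Content-Type: application/http; msgtype=response
    Content-Length: 1043

    HTTP/1.1 200 OK
    Content-Type: text/html

    <!doctype html>
    ...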
It might be a bit hard to see, but essentially there are MIME-style headers before the HTTP headers, and after that comes the HTTP payload. So we're saving the entire HTTP transaction, with additional metadata inserted in front of it. There's also usually a digest, essentially the hash of the payload, so that records can be deduplicated. That covers the HTTP response record. The request record is in a similar format: it stores the request sent during the HTTP transaction, and that includes not just GET requests but any HTTP verb, including POST, PUT, or even DELETE. So archiving a URL actually involves storing both the request that was sent to the server to get that URL and the response received from the server.

Unfortunately, the WARC format does have a few limitations. One is that there's no index of records; it's essentially just a listing. There's no real defined metadata format: you can put arbitrary data in there, but nothing is specified beyond these headers with a payload after them. So it's extensible, which is great, but also a bit limiting in some ways. There's no way to specify starting pages, meaning the URLs you'd want to load first in order to actually browse an archived site, since a WARC can contain any data. And there's not really support for some of the more recent web features: for example, there's no way to store WebSocket traffic in WARC files, or to represent the dynamic history changes you might see in a single-page app.

Beyond the WARC file, the other key piece is the URL index. To actually get something out of a WARC file, you need to be able to look it up by URL. Unfortunately, there isn't really a standard for this; the WARC file is the main standardized web archiving data structure out there, and for indexing there are only a few pseudo-standards. The main one, popularized by the Internet Archive, is CDX: a text-based, space-delimited format for looking up entries in a WARC file. It looks something like this: a single line represents one entry in a WARC file. A more recent variation, CDXJ, puts all of the data after the URL and timestamp into a JSON blob, to make it a little more extensible. The first part of the format is designed to be looked up via binary search; it's specifically formatted so that, for example, the domain is reversed (the so-called SURT form), letting you more easily look up all URLs that end in .com. It's a technique for normalizing the data. The Internet Archive provides a query interface that you can use to look up this data for many pages, essentially anything in the Wayback Machine, and it returns a listing of entries in this format. I'll show a concrete example below.

Another important piece of data for web archives is replay rules. This is a little complicated, but I'll try to summarize. To actually replay a captured web page, you have to replay the request and response traffic, and that can actually be more difficult than the capture. You need to match each HTTP request to a response, and in many cases there may not be an exact match.
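As promised, here is roughly what those index entries look like. The first line is classic space-delimited CDX and the second is the CDXJ variant; all values are illustrative:

    com,example)/page?q=1 20190503120000 http://example.com/page?q=1 text/html 200 UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2 - - 1043 5924 example.warc.gz

    com,example)/page?q=1 20190503120000 {"url": "http://example.com/page?q=1", "mime": "text/html", "status": "200", "digest": "UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2", "length": "1043", "offset": "5924", "filename": "example.warc.gz"}

Note how the reversed-domain key at the start of each line keeps entries for the same site adjacent when sorted, which is what makes the binary-search lookup work.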
In this way, web archive replay is itself sort of a reproducibility problem. An example is something like this: you might have a captured URL that ends in one timestamp, but when the page is played back, a different timestamp is appended dynamically. This happens all the time, because if a user is interacting with a page during the capture process, they might wait one second before doing something, and on replay the page might wait two seconds, producing a different timestamp. So we need a way to fuzzy match, essentially ignoring certain parts of the URL, and there need to be rules to determine which parameters to ignore. Here's a more complicated example: again, a URL where certain parameters are significant and certain ones are not, so we need to match on one of the parameters but not the other two in order to reproduce this particular URL. The request might ask for the first URL when we only have the second one, or vice versa. These rules also need to be part of the web archiving system; I'll show a small sketch of this idea in a moment.

Then there are web archive collections, which are essentially a way to organize WARC files, provide context and metadata for them, and group them into usable units. There's not really a standard for that either. It's also possible to analyze web archives: there's a really great toolset called the Archives Unleashed Toolkit that provides data extraction, text extraction, link analysis, and so forth, and that produces additional data.

So how can all this data be distributed? That's still a work in progress. The data can include WARC files, URL indices, page lists, replay rules, search indices, and other derivative datasets, organized by collection. One of the things I'm currently working on is a spec for bundling all of this web archive data: going beyond what's available in the WARC files and providing a way to specify and distribute all of these other datasets that are part of web archiving.

Next, I want to talk about how you could do web archiving on your own, and for that I want to demo the project I've been working on, which is called Webrecorder. A quick intro: the purpose of Webrecorder is web archiving for all. The idea is that anyone can create a web archive; it uses the browser to capture, and the same browser to then replay, any website. That's the goal of the project: creating both a user-friendly service and apps for people to use, as well as a whole set of open source tools for working with web archives.

I'll go ahead and do a quick demo. This is webrecorder.io, and I'll try to do a live demo, which is always fun. For example, I can enter the hashtag for CSVConf and use the current browser. What you're seeing here is a size counter indicating how much data has been captured so far; this data is being written into a WARC file. Here it's loading Twitter through Webrecorder, which acts as a recording proxy. As I scroll down, you can probably see the size counter going up: more data is being captured, and the counter keeps climbing.
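Stepping outside the demo for a moment, here is the promised sketch of the fuzzy-matching idea, as a minimal Python illustration. The rule format and helper here are made up for the example; real systems like pywb ship much more elaborate, site-specific rule sets, and this is not their actual engine:

    from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

    # Hypothetical rule list: for URLs matching a prefix, drop the listed
    # query parameters before comparing requested and archived URLs.
    FUZZY_RULES = [
        {"prefix": "http://example.com/api/data",
         "ignore_params": {"_", "cb", "timestamp"}},
    ]

    def canonicalize(url):
        """Strip ignorable query parameters so dynamic URLs can still match."""
        parts = urlsplit(url)
        for rule in FUZZY_RULES:
            if url.startswith(rule["prefix"]):
                kept = [(k, v) for k, v in parse_qsl(parts.query)
                        if k not in rule["ignore_params"]]
                return urlunsplit((parts.scheme, parts.netloc, parts.path,
                                   urlencode(kept), ""))
        return url

    # A request made at replay time carries a fresh timestamp...
    requested = "http://example.com/api/data?id=42&timestamp=1556900000"
    # ...but still matches the capture, which had a different timestamp.
    captured = "http://example.com/api/data?id=42&timestamp=1556899999"
    assert canonicalize(requested) == canonicalize(captured)

The hard part in practice is deciding which parameters are significant, since that varies site by site; that's exactly why these rules have to ship as data alongside the archive.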
Back in the capture, I won't click on the live stream; that would be interesting to do, but not now. I can, for example, navigate to a link on this page. If I click on this one, this GitHub repository, the page that's now loaded, is also being archived through Webrecorder. Then I can stop, and when I do, there's a listing of pages. This is what I was talking about with key pages that are part of the collection: there are a lot more URLs in there, but these are the pages. If I click on one of these again, it will show the replay: we're now viewing the replayed version of the hashtag search I just browsed.

Another part of web archiving I wanted to add: we have all the web archive data, but what about the web browser itself? Even if we have all the data, can we still actually replay it later? Browser features change and become obsolete. Fortunately, we can also preserve the browser itself using a Docker image, and using that, we can provide browsers with Java and Flash. To show that part: what I recorded previously used the current browser I'm on, but we also have an option to select a different browser, with various versions of Chrome and Firefox. I'll go to a particular collection that includes a version of Firefox that supports Java; we were able to create that version as a Docker image. So it's actually running, hopefully it will reconnect; I'm connected to the Wi-Fi through my phone, so it might take a moment. What it's doing is running a browser remotely, in a Docker container in the cloud, and streaming that connection. So this is a version of Firefox running remotely, and it has an embedded Java applet. And we might not be able to quite stay connected to the Wi-Fi, so this becomes kind of an immediate issue for reproducibility, I think.

Another example I recently looked for is Flash. I wanted to see how much Flash content is out there, and there's a ton. I found this page with Flash animations for physics. If I just load them in my own browser, I get a message that the Flash player isn't installed. But if I go to Webrecorder, enter the URL, and select the Flash-capable browser, this version of Firefox, then, hopefully this will work, this is that same page, but now if I click on any of these projects, I can see that the Flash is actually loading. I can go through all of these, manually or automated, and capture these Flash applets, which will probably never be converted to JavaScript, and have a working version of them. So this is combining web archiving with preservation and emulation of web browsers, which I think will become more and more important for being able to preserve and reproduce content that is online.

So I can stop that. I was capturing these, and this is the view of the collection; I can select one of the applets here, and now I'm in browsing mode, actually browsing this particular piece. We also have a desktop app being developed; we actually have two, one still in development, and the other is called Webrecorder Player, which allows for viewing these web archives locally.
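For a sense of how the browser-preservation piece works: the idea is to freeze an old browser (plus plugins like Flash or Java) inside a container image, run it under a virtual display, and stream that display to the user. This is a purely hypothetical sketch of the idea, not Webrecorder's actual images or configuration:

    # Hypothetical Dockerfile: pin an old browser in an image so it can
    # still be launched, and streamed over VNC, long after it's obsolete.
    FROM debian:jessie
    RUN apt-get update && apt-get install -y firefox-esr xvfb x11vnc
    # Run under a virtual X display and expose it for remote viewing.
    CMD ["x11vnc", "-create", "-forever"]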
I can also download this collection. Right now it downloads as a single WARC file, because the new multi-collection data format isn't ready yet, so for now we just package everything into WARC files. Let's see if this works: I can open it with the Webrecorder Player app. I actually don't know if this will work; I'm not sure this is the right version. So here's the Flash piece that I captured before, running in the desktop player. And to show that this is actually running offline, I can even disconnect from the Wi-Fi and still browse it. So I was able to capture this particular Flash project, download it, and now run it offline. I'll connect back to the Wi-Fi.

So that's part of the Webrecorder toolkit. All of this, I want to add, is open source, and I'd like to get more people involved, more users and more contributors. So I'll cover some of the tools that we have available.

If you just want to read and write WARC files, the very basics, we have a library called warcio. With it, you can create WARC files with just four lines of Python; we're basically trying to make it as easy as possible. I'll show an example below. This, of course, won't give you what you see in the browser; it's just for a single URL at a time, but you can do that if needed. If you want to package existing files as WARCs, we also have a small library called warcit, which converts a directory of files into a WARC file so that they can be opened as a web archive. Then there's pywb, which is our Python wayback machine and web archiving toolkit, and the core engine that powers Webrecorder. If you want to archive through the browser (I'll share these slides so it's easier to follow along), you can use a simple script to launch a browser and do what I showed initially from the command line. You can also host your own wayback machine using these tools. Then there's the Webrecorder Player, which I showed earlier. The remote browser system, with the older versions of Firefox and Chrome, is also available; it initially came out of the oldweb.today project and is on GitHub as a separate project. And if you want to try everything, the whole Webrecorder configuration is available as a Docker Compose setup on our GitHub as well, which includes the entire package I've shown, with the front-end UI.

And I'm happy to answer any questions about all of this.
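For reference, here is the warcio example mentioned above, based on the library's documented capture_http API (note that requests needs to be imported after capture_http so its HTTP traffic can be instrumented), together with a typical pattern for reading the records back:

    from warcio.capture_http import capture_http
    import requests  # imported after capture_http so its traffic is captured

    # Write the full HTTP request/response for one URL into a WARC file.
    with capture_http('example.warc.gz'):
        requests.get('https://example.com/')

    # Read the records back and print the captured URLs.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))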
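And the warcit and pywb workflows look roughly like this on the command line, going by their documented quickstarts; the collection and file names here are placeholders:

    # Package a directory of existing files as a WARC (warcit)
    warcit http://www.example.com/ ./path/to/files/

    # Host your own wayback machine with pywb
    pip install pywb
    wb-manager init my-coll
    wb-manager add my-coll example.warc.gz
    wayback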