Without further ado, I will just go ahead and introduce you to our presenter. Today we'll be hearing from Josh Hogan, who is the Digital Preservation Archivist at Georgia State University in Atlanta, Georgia, and also a former colleague of mine from the Atlanta University Center. So thank you for joining us, Josh. We're gonna be hearing from you on web archiving today, so I'm gonna just hand it off to you. Take it away.

Hi, everybody. I think I'm gonna go ahead and turn off my camera only because it can be a little bit dodgy, but I will turn it back on for our Q&A. As Gail mentioned, I am the Digital Preservation Archivist at Georgia State University in Atlanta, and I'm also the Archivist for Social Change in the archives. Previously I was the Assistant Head of Digital Services at the Atlanta University Center, Robert W. Woodruff Library, where I was in charge of administering our web archiving initiatives. Despite all that, I want to say that, maybe like many of you, I still consider myself a beginner or newbie in this field, and I'm always learning more about this complex and challenging topic.

My presentation is going to focus on web archiving at a high level: consider what it is and what its origins are, think about what criteria to consider when planning for web archiving, and close with some discussion of different tech tools or platforms that you can use to get started or enhance your web archiving efforts.

Before we get started, I wanted to have this quote as something to keep in the back of your head as a guiding principle for web archiving. I don't know if Kathleen Roe actually said this originally, but she's the person I first heard say it when I attended the Georgia Archives Institute in 2013. She was talking about archives as a whole, but I think the quote may be even more crucial for web archiving, since the temptation with digital materials in general, and the web in particular, is to think that we should grab everything. But just as we wouldn't collect all the paper material or every photograph in our community, we wouldn't want to just grab every website, social media account, or online database. If anything, we need to think about how we can carefully select and care for the most appropriate and high-value web archiving collections that we can.

So what is web archiving? In a nutshell, web archiving is the process of collecting portions of sites, or entire sites, from the World Wide Web and ensuring that that information is preserved in an archive for future researchers, historians, or the public. That can mean a lot of different things, of course, because we might be talking about anything from an entire website to small portions of that website. For example, with social media we might crawl just a particular hashtag and get the tweets from that hashtag on Twitter, rather than someone's entire profile and all their tweets; or we might grab an entire set of websites around an organization. But I think the key to the archiving part here is the provision of ongoing access to the community that you serve. Also, as I mentioned before, archiving appropriate portions of the web is key. As you can see, the image on the left is just a very small representation of how complex the web is in terms of all the different nodes and connections, and how difficult that all is to grab and make available for the long term.
You can also quickly go down a rabbit hole following all those different connections and nodes. On the right, you can see a visual representation of how the web has changed over its relatively short history. Many of us forget that the internet is really still a novel technology as far as human beings go, even though it has quickly woven itself into our lives. The early days of the web, when it was just hypertext links and basic sharing of documents and information, were of course revolutionary at the time, but very simple compared to what we expect now. As you get around to this end of the timeline, we have things like linked open data, the web of science, and social networks of all kinds, and these create challenges that we need to learn how to navigate.

Speaking of the early internet, this is a snapshot of the first page ever on the World Wide Web, uploaded by Tim Berners-Lee in 1991; it has recently been restored to its original address. You can see that this first page is just text and hyperlinks to other text documents. Crawling and archiving something like this is a snap, of course, and would likely take very little time with modern web archiving technologies. Our jobs as archivists, curators, et cetera, would be made a lot easier had the web stayed a collection of documents linked by hypertext, or if brutalist web design were more widely embraced by the web design community: for example, content is readable on all reasonable screens, only hyperlinks and buttons respond to clicks, and the back button works as it's intended to work. It would be super easy to crawl lots of websites designed using those principles.

As it is, we have all kinds of different websites, and these are just three examples. You can see here the YouTube page for the Peachtree-Pine homeless shelter in Atlanta, which has videos, images, and text that we have to navigate. In the middle is a representation of a library exhibit from Georgia State University, which includes 3D models of one of the buildings of the university that was recently demolished. One of the questions I'm trying to grapple with is how we preserve those 3D models: it might be easy to get the files, but how do we make sure that people can experience those models in the way they were intended to be experienced? And on the right side, you can see a Twitter feed for the Atlanta Journal-Constitution. We know how complicated that can get, with video, images, and lots of text, as well as issues of privacy and copyright. So no longer are we just crawling very basic text documents.

Thankfully, web archiving initiatives debuted not long after that first website, and by the mid-1990s we start seeing web archiving technologies and initiatives. On this slide, you might be able to read that Australia's web archive started in 1996, 25 years ago. So we have had a while to start building these technologies, although that has been a very small part of our efforts in archiving. Also starting in 1996 is the most famous web archiving initiative, which I'm sure some of you have used in the past: the Internet Archive's Wayback Machine, which has been going for 25 years and has now archived over 581 billion web pages as of June 2021. A nice part about this is that you can put any page in their little box down here on the right
and save that page to the Wayback Machine, which is a free way for you to make sure there's at least some copy of a page that you think should be preserved. At the very least, if you're concerned about something disappearing, have it crawled by the Wayback Machine.

So now we turn to the question of why we are archiving web resources and websites, and this quote that I've put on here, I think, gets at a lot of that. This one was very personal for me because I lived in Knoxville for a long time, and the predecessor of the Knoxville Mercury was the Metro Pulse, a newspaper that was very similar to Creative Loafing in Atlanta. It was an alternative news weekly, and Jack Neely was one of their columnists. He's talking in this article about how the online archives of his own work, which he used regularly, just disappeared one day because the company that had bought the paper didn't want to keep them online anymore. So really, archiving web resources is still about the same reasons we archive paper resources: having access to data and to the things that inform the historical record. Government documents and institutional records that we have always been used to collecting on paper are more and more living online, and they're much more ephemeral than even the paper was.

So in addition to all the fun things that we can do on the internet, we have lots of important material that we need to preserve as historical, societal, and cultural records. A 2016 article from Costa, Gomes, and Silva reported that 80% of websites do not exist in their original form after only one year, 13% of web references in social media and scholarly articles disappear after 27 months, and 11% of social media posts are lost within a year. And I suspect that's just getting worse.

Another thing to consider about the importance of archiving web resources is not just data access but data analysis. Web archiving can allow us to explore and analyze complex topics. By archiving websites and using emerging analysis tools, we can have rich analysis and modeling of historical and other topics, particularly around, say, politics or societal issues. But this can only happen if we preserve the sites and resources that are important for these things.

So I think these are some of the things you can think about if you're starting out with a web archiving program, questions to ask yourself. And I think these first two questions are very related, if not the same thing worded differently: is my institution uniquely positioned to preserve this material, and is this site specifically related to my institution or to my institution's collecting policy? More and more, it's important for archives and cultural heritage organizations to provide something that's unique, something that really comes from their community and is for their community. Is this site likely to be widely archived by other initiatives? Obviously, it's not down to Georgia State University to archive the New York Times; someone else, and probably many others, will be doing that. Is this site uniquely valuable to the communities my institution serves? And finally, and I think this really gets down to it: do we as an institution have the capacity to preserve this resource, store it, manage it for the long term, and make it accessible for our intended audience?
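By the way, going back to that Wayback Machine save feature for a second: if you ever want to script that step rather than using the box on their site, a minimal sketch in Python with the requests library might look like this. The behavior of the Content-Location response header is an assumption based on how the Save Page Now service has typically worked, so treat that detail as illustrative.

```python
import requests

def save_to_wayback(url):
    """Ask the Wayback Machine's Save Page Now service to archive a URL.

    A plain GET to https://web.archive.org/save/<url> triggers a capture.
    The response's Content-Location header (when the service provides it)
    points at the path of the new snapshot.
    """
    resp = requests.get('https://web.archive.org/save/' + url, timeout=120)
    resp.raise_for_status()
    return resp.headers.get('Content-Location')  # may be None

# Example: archive a page and print where the snapshot landed.
print(save_to_wayback('https://example.com/'))
```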
If we're going to have problems with any of those capacity questions, then maybe that particular site is not a good fit for our web archiving program.

I'm trying to stay within my time here, but I want to look at some of the tools, because a lot of times that's what we want to get down to: what can we use to make these things happen? I want to look at a few of the software platforms or tools that you can use. Some of them are downloadable; you can download them locally and use them on your own workstation. And then I want to look at a couple of the more popular online tools and talk a little bit about those.

HTTrack is one that I've used before. This is one of those "free as in puppy, not free as in beer" open source tools, in the sense that it's free to download and free to use, but you still have to think about: where am I going to store anything that I've crawled? Do I need someone's help? Do I need a person in my institution who runs this, who needs to be paid? Those kinds of things. But it is a very useful tool for grabbing copies of web pages in an automated way. You point it at the page you want, it downloads all the files and folders, and it can quickly take up a lot of space on your machine. So that's probably the big question: where am I going to store this?

Another similar example is the Heritrix archival crawler. This is actually based on the same technology the Internet Archive uses for the Wayback Machine and for Archive-It. It is also open source, and it is, as they say, extensible; it's basically aiming at archival quality with its crawling. Heritrix means heiress, a woman who inherits, and that's why they picked the name, since future researchers will inherit our efforts. This is available through the Internet Archive's GitHub site.

A third open source option, and this one is a little more complicated because you have to have some command line knowledge to use it, is the GNU Wget program. It does crawl files across HTTP, HTTPS, FTP, and FTPS. Again, this is where you need to think about where you're going to store things once you've crawled them: not just crawling, but also provisioning for long-term storage.

This one, Webrecorder, is a browser extension. So if you're not really looking to learn a whole new program, Webrecorder allows you to add an extension to your browser that will then let you grab web pages as you interact with them. (Sorry about that, I think there was a little bit of feedback, but we're good.) Anyway, I've not tried this particular version of the tool, but I have used in-app or in-browser crawlers before, and they can often work well, especially for basic pages.

The big thing about the options I've just gone through, HTTrack, Heritrix, et cetera, to reiterate, is that you have to have storage space carved out locally in some way, and you have to make sure that you can eventually provide some access to it. A lot of the online platforms already allow public access for your researchers to use the material once you've crawled it, and that's an extra step you have to think about with the tools I've shown so far.

So we'll turn to a couple of the big online ones. These services are available for crawling websites, and they do have a public-facing side.
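Actually, before we get into those, just to make the GNU Wget option a little more concrete, here's a minimal sketch of driving a Wget crawl from Python and packaging the results as a WARC file. The flags shown are standard Wget options; the URL and file names are just placeholders.

```python
import subprocess

def crawl_with_wget(url, warc_name):
    """Mirror a site with GNU Wget and also package the crawl as WARC.

    --mirror           recursive crawl with timestamping
    --page-requisites  grab the images, CSS, and scripts pages need to render
    --convert-links    rewrite links so the local copy can be browsed offline
    --warc-file=NAME   additionally record everything into NAME.warc.gz
    """
    subprocess.run(
        ['wget', '--mirror', '--page-requisites', '--convert-links',
         '--warc-file=' + warc_name, url],
        check=True,  # raise an error if the crawl fails
    )

# Example: crawl a small site, keeping both a browsable copy and a WARC.
crawl_with_wget('https://example.com/', 'example-crawl')
```

Remember, though, that the output still lands on your own machine, which is exactly the storage question I keep coming back to.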
The first tool is Conifer. Conifer came out of the same initiative that developed the Webrecorder extension we just saw. Users of Conifer can create an account and crawl sites as they interact with them in a browser. The good news is that you can make these crawls publicly available. Another piece of good news is that you can sign up for an account for free. Unfortunately, there is a limit to how much data you can crawl on a free account, but you can export all the crawls that you do as WARC files and save them to a local server or to your cloud storage. Then again, that gets you back to the point of considering how you're going to make those files available: whether you're going to make just their content available, or you're actually going to emulate the website the way it looked the day you crawled it. There are paid versions of Conifer available, but I've not yet tried those, so I can't really speak to them.

This is just a shot of the back end. You can see that you can create collections and crawl your websites, and you can set each of those to public or private. This is an example of a crawl of a blog post from the Atlanta Studies website on the closing of the Open Door Community in the Poncey-Highland area of Atlanta; the post was written by one of Georgia State University's history professors. Compare that to the screen capture of the actual blog post. All I really had to do was go to that site, click on it, and interact with it, and it created an almost exact copy of the website. I can make that publicly available should I choose to do so, but that of course gets into who has copyright to it. There are all kinds of issues about whether I want to make it public, but at least I have it as a research item in my archives.

Now I'll turn very briefly to Archive-It; I just want to close off with that. This is probably the one you're all most familiar with. It's kind of the big daddy of web archiving tools, and many of you may be using it right now. It does cost real money to use, and it can be too expensive for smaller institutions and individuals. But like I said, it is widely used, and it is a very good overall turnkey solution for crawling sites. That said, it does take a fair amount of time to scope your seeds, or URLs, correctly and to tweak crawl parameters to get the crawl that you really want.

This is just the back end; you can see one of our sites and our 512 gigabyte budget for the year. This is a website for the Peachtree-Pine shelter that I crawled recently using Archive-It. This is how it appears to Archive-It users, and this is the same website as it appears if you go straight to it in your browser. You can see that it works very well. I did have to tweak this, though, to ignore robots.txt and things like that so that it appeared exactly the same. This is the same website as it was crawled by the Wayback Machine back in March, and you can see that there's a difference: if you don't play with those scoping rules, you can still get kind of a wonky playback, as they call it. But the content is all there; it's just a question of whether or not it renders properly.
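Since tools like Conifer and Archive-It both work with WARC files under the hood, it's worth knowing that you can inspect those files yourself. Here's a minimal sketch using the open source warcio library (pip install warcio); the filename is just a placeholder.

```python
from warcio.archiveiterator import ArchiveIterator

# List every captured page (HTTP response record) in an exported WARC,
# along with the technical metadata the crawler recorded automatically.
with open('my-conifer-export.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'),
                  record.rec_headers.get_header('WARC-Date'))
```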
Access is key, but you also have to think about what level of access you're going to give: are you just giving the content, or are you providing a true representation of the original website?

This just illustrates some of the challenges of social media. This is the YouTube site for the same shelter. I don't know if you can see that, but for playback you have to right-click and open each video in a new tab in your browser just so you can watch them. So there are all kinds of little quirks that you have to figure out to make the playback work the way you want it to.

I'll share these slides with Gail so she can send them out to you, but here are a few helpful resources on web archiving. The 2016 survey is a few years old now, but it does show you a lot of the practices that people are using in web archiving. So with that, I'll say thank you very much for inviting me, and we'll open the floor for Q&A. I'll just open it by saying that I'd love to hear anybody's insights or experiences with web archiving initiatives, or any ideas they've had for web archiving at their institutions. Thank you very much.

Thank you so much, Josh. So if anyone has questions, feel free to either unmute yourself and ask, or type it into the chat, whichever works for you.

I don't know if I've put people to sleep or confused them. It's a big topic, so I hope that was helpful and not just something you already knew. That was my big concern.

Yeah, I definitely learned some things I did not previously know. Have any of the folks in the room had any experience using any of these tools? Oh, Dakota, go right ahead. You can either type in your question or unmute yourself; you should all be able to unmute yourselves, I think.

Hi, can you hear me? Yes. Perfect. So one of the questions that I had: I noticed that a lot of open source software is used to crawl these specific sites and capture them. And knowing the nature of open source software, with people being able to fork a project in case the original developers aren't able to maintain it, I'm wondering, not necessarily about alternatives to these programs, but is screen capture, just basic screenshotting, a viable way to web archive? It might not capture the UI elements and how they animate, but in the basic sense of capturing images, is there any use for that?

I think that's a great question, and I would say yes, I think there is a use for that. It's probably a cost-effective way to at least get some representation. And I think that's another question I should have included in the questions to ask yourself: how far do you want to go with the playback and the interactivity? It may be sufficient for the communities your particular institution serves to at least just have a representation of the way a website looked and to get the content off of it, especially if it's mostly text, a news story, for example, or a government document. So I think that could be a cost-effective way for some institutions. That's my take on it, anyway.
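To make that screenshot idea concrete, here's a minimal sketch using the Playwright browser automation library, which is just one option among several; it assumes you've already run pip install playwright and playwright install.

```python
from playwright.sync_api import sync_playwright

def screenshot_page(url, out_path):
    """Save a full-page screenshot as a basic, low-fidelity web archive.

    This preserves how the page looked, but not its text layer, links,
    or interactivity -- exactly the trade-off discussed above.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=out_path, full_page=True)
        browser.close()

# Example: capture a single page to a PNG file.
screenshot_page('https://example.com/', 'example.png')
```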
Looks like we have a question in the chat: does web archiving require any metadata, or does it just preserve URLs only? I think it can include as much metadata as you want it to include, depending on the software that you use. If you use Archive-It, you can put a fair amount of robust metadata into the seeds ahead of time in your collections, specifying what it is, where it comes from, the date, and so on. And it will capture things like the date and time you crawled it, technical metadata that it will automatically extract. But you can also just preserve the URL and that technical metadata, if all you want to do is a quick and dirty crawl of the site. Does that answer your question?

Another question in chat, oh, it's more of a comment: when you web archive, you're also grabbing all the metadata baked into the sites. Absolutely, yeah. Anything in the headers, anything the web designer added, you will get.

I see one about screen capture. Yes, if you're doing screen capture and you're getting lots of large image files, that can quickly eat up your storage. Unfortunately, even though storage is getting cheaper, it's still not super cheap. I think that's why Archive-It charges for what they do: the more storage you buy from them, the more expensive it gets. We get half a terabyte at Georgia State University, but you can get many terabytes if you want, and the price basically doubles with each additional half terabyte. Really, it comes down to the storage. So if you really want a high-fidelity representation, it's going to eat up some chunks.

Another question: what are the copyright issues that institutions should consider? I actually was going to include that, and then I thought it would take forever to talk about those issues. It's a can of worms, obviously, in archives. I'll give you just an example of what I just did with the Peachtree-Pine homeless shelter. We had a donation from them when they shut down of all of their records, and we had a lot of them, like 50 boxes of paper records. To my knowledge, we did not have them sign a deed of gift for their electronic materials. So we have a deed of gift that covers the normal donation, but we've also got an addendum document, and the copyright should still probably belong to the person who designed the site, or the executive of that organization who donated the records to us. So that's something I probably need to swing back to.

Another big issue is social media. Technically, if you tweet, you own the copyright to that tweet, but Twitter also owns some rights to it, and it becomes this whole thing: if you're going to make those things public, how much right do you have to do that? And then that also brings in privacy issues. So with websites, considering they're designed by people who are not us, we have to be careful about whether we make them publicly available on a website or whether we just make them available in our reading room for researchers in a limited capacity. But yeah, that is a very good question, and one that we grapple with every day. Anything else about copyright people might have thought of?

How would Creative Commons licenses apply?
I think so. Ultimately, if a web designer or an organization made all of their material available under a Creative Commons license, and you crawled that material and used it consistently with that license, say, if they specified non-commercial use, an archives could certainly grab that website and allow it to be used by its researchers and students. But you know, it's sometimes hard to determine the licensing status of things like that; people don't normally think about it when they set up their site. That's usually what we ask for when they donate those materials: to either transfer the rights to us or give us that kind of license, a Creative Commons license.

Any further questions, follow-up questions, painful web archiving experiences to share?

I'll share the painful archival experience of dealing with Facebook. Starting in December, they started redirecting all crawls to login pages, so you have to pass them your login credentials when you're crawling. I did that with my personal account as a test, and it got some of my personal information into the crawl. So I set up a work-only Facebook account and used those credentials, and they decided I was a bot and blocked me from using that account. I've been trying to crawl a Facebook page for about three weeks now, and it just keeps getting me in trouble with Facebook. Apparently I'm an awful bot trying to steal people's information, really the worst thing happening on Facebook today.

Oh, wow. I hadn't heard about that issue, but I know from long-term reading that Facebook has always posed issues for web archiving; this is just adding a new layer of complication. Before we sort of wrap this up, I just want to make sure I give everyone ample time to type out any further questions you may have. Going once, going twice. If something pops up in the chat, we'll go with the flow, but otherwise I think that kind of wraps it up.

Thank you so much for joining us today, Josh, and for sharing some of your knowledge about web archiving. I know it's a topic folks in this group have been interested in hearing more about for a while, so we really appreciate it. In terms of upcoming news, with it being summer we don't have a ton of announcements, but definitely keep an eye on the DLF Twitter, the DLF community calendar, and the DLF-announce list for upcoming working group meetings, forum news, et cetera; we'll be posting it all to the usual channels. Our next meeting will be Wednesday, July 28th, same time, 2pm Eastern, 11am Pacific. If you're interested in speaking at a future meeting, whether it's July's or later on, feel free to contact me at info. If you have any suggestions for future topics, definitely send them my way and we'll see what we can do to find someone to speak to us. Otherwise, before we close out, does anyone else have any additional news or anything they want to share with the group?

Well, I'll take the silence as no news is good news, so we'll wrap it up a little bit early today. Thank you all so much for joining us, everyone. We will have those slides available, and the recording of this will go up on YouTube as soon as I have a chance to process it.
And I will make sure to send it out to the listserv when it's ready, but otherwise have a great day, everyone. Thank you, everyone.