Hey everybody, we're back again. This is Tools for Archiving Sites. I have 25 minutes and a lot that I want to demo, so I'm going to go right into it, and there's not going to be a lot of pretext here other than mentioning the different categories of tools we're going to talk about. One thing I want to point out: on the main workshop site, you should definitely check out the session description. If you click Tools for Archiving Sites, I've got pretty much all of the tools I'm going to talk about linked up there, and some of them we'll be spending more time on than others, so if you want more information, it's there.

The first thing I want to mention is the difference between, and I should say what I even mean by archiving. By archiving, I just mean taking a version of a site that exists and making a copy for yourself that you can store someplace else. That can mean a lot of different things, but the main two we're going to talk about are HTML flattening and, I don't really know what to call the other one, so I'm going to call it Webrecorder because it's kind of its own thing. They call it web archiving, but I feel like that's maybe too general-sounding if you're not familiar with it.

HTML flattening is great, and all it does is take a site, like a WordPress site, though it could be built with other technologies, and run a little virtual browser that clicks on every link and makes an HTML-, CSS-, and JavaScript-only version that you can then get as a folder and put anywhere. The advantage is that you could take a site made in a really old version of WordPress that you don't want to run anymore, maybe WordPress 3, really old and probably very insecure, and say: we want to keep this site, but we never need to change it again, we don't need to log into it or anything like that. You could use an HTML flattener to turn that site into just HTML files, and then put those on your Domain of One's Own using the file manager or an FTP client, or on any other web server for that matter. So that's really a great method.

The first tool that does this, and probably the first thing I'd recommend if you're unfamiliar, is Simply Static. Simply Static is a WordPress plugin, so it only works with WordPress sites, but basically you install it and it will make a static version of your site. You can then take that static version and move it to wherever you want to archive the site. Once that's done, and you're sure the static version has been made and is all good, you can remove the WordPress site; you won't need it anymore. A couple of issues with it, though: like I said, it's WordPress-only, and the other thing that's kind of problematic is that in most cases it has to make the static copy in the same location, or the same account, as your WordPress site. Sometimes you can end up running out of space, because it's essentially making a duplicate copy with all your images and everything, so that's one thing to think about. That said, the other reason I'm not going to spend a lot of time on it is that Simply Static is relatively easy to use and pretty well documented, and a lot of times when people are trying to archive these things, they're talking about sites that aren't WordPress. So the next two tools I'm going to talk about together.
The first one is SiteSucker, which is macOS-only, and the other one is HTTrack, which has a glorious, best-website-of-1997 kind of site. It's amazing, I love it. It has a Windows GUI, and it's also command-line based, so I use it a lot at the command line on Reclaim Cloud, or on my Mac sometimes. But SiteSucker for macOS is also great. With both of these tools, you just give them a URL and they'll go make a static version of the site. The thing to keep in mind is that they're going to crawl the entire site: they'll click on every single link and try to make a local version. Usually the default settings are set so that they will not go to other domains, because if they did, you can imagine they could crawl hundreds or thousands of web pages and essentially never finish. Like, what if you linked to Google.com? What then? Or Facebook or something, that's pretty common. So usually they'll say, all right, my blog is jadin.me, and they'll go through and click on every link on jadin.me but skip the ones that go out to external domains.

So those are two tools. I'm really not going to demo them, well, I actually will demo HTTrack here in a little bit, but SiteSucker for macOS is really easy to use and the Windows version of HTTrack is pretty easy to use. The problem is that the more advanced versions of these tools are not easy to use at all. The first is HTTrack's command-line version, which is what I use most of the time when I'm making these. It's not too bad, but you need to set it up, and folks aren't always familiar with how to do that, especially if you haven't used much command-line stuff. It's not necessarily simple.

The other one is Webrecorder. Webrecorder is a whole project that's really fascinating, and the piece I mean here is a browser extension, a Chrome extension I believe. The main difference between Webrecorder and HTTrack or other HTML flatteners is that Webrecorder makes a single file of whatever you're trying to archive, and that file is viewable in a lot of different ways. Basically, you install the Chrome extension and visit a page, it saves the page, and when you're done you end up with a WACZ file that you can take to ReplayWeb.page and view, and that's really, really cool. The problem is that by default ArchiveWeb.page doesn't have any way to crawl an entire site. It's just going to capture the single page you visit, or the couple of pages you visit, and if you want to archive an entire blog, you're going to be doing a lot of manual clicking.

The big advantage of Webrecorder, though, is that because it packages everything in that WACZ file, it's really great for folks who want to view a copy of a website later that isn't necessarily hosted on a web server. Maybe you have a student or a faculty member who says: I want a copy of my site and I'm not moving it anywhere, my site's going to go away, I just want to be able to open it and view it on my computer. That's actually not super straightforward with HTML flattening, because sometimes links are set up in a way that they don't point to the right locations once the site is flattened. It's a little bit tricky. So that's where Webrecorder is really great.

So then you get into the issue of: okay, sometimes you want flattened sites and sometimes you want this Webrecorder thing, but they do slightly different things. And there is a tool that kind of bridges the gap, called Browsertrix Crawler.
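Before getting to Browsertrix Crawler: since HTTrack's command-line version came up, here's a minimal sketch of what driving it from a small Node/TypeScript helper could look like. The output folder, and the idea of wrapping the call in a script at all, are illustrative assumptions on my part; the real requirement is just that the httrack binary is installed and gets a URL and an output directory.

```typescript
// Minimal sketch: flatten one site with HTTrack from a Node/TypeScript script.
// Assumes the `httrack` binary is installed and on your PATH; paths are placeholders.
import { execFile } from "node:child_process";

const url = "https://example.com/";          // site to flatten (placeholder)
const outputDir = "./archives/example.com";  // where the static copy should land (placeholder)

// `httrack <url> -O <dir>` mirrors the site into <dir>; by default it stays on
// the starting domain rather than following links out to external sites.
execFile("httrack", [url, "-O", outputDir], (err, _stdout, stderr) => {
  if (err) {
    console.error("HTTrack run failed:", stderr || err.message);
    return;
  }
  console.log("Flattened copy written to", outputDir);
});
```

Running httrack directly in a terminal with the same URL and -O flag does the same thing; the wrapper just makes it easier to chain with other steps.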
Browsertrix Crawler is a super mouthful of a name, but essentially it uses what Webrecorder does, except it's an actual crawler, so it will go and click on every single link, like HTTrack. What I find is that Browsertrix Crawler is not at all easy to use; it's extremely oriented toward developers, or folks who are very, very familiar with Docker. It took me a little while to wrap my head around it.

So what I've been working on is a script that kind of does the best of both worlds, and I'm going to show you what you can do with it. I'm hoping, this summer essentially, to wrap it up into something anyone can use, run it on Reclaim Cloud if they want to or on their own computer, and basically just type "crawl" and then the URL you want, and it will make both a flattened version using HTTrack and an archived version with Browsertrix. So I'm going to show what that looks like. This is going to be a very live demo, so hopefully it works really well.

And for those of you who haven't seen this space, this is the Reclaim Cloud dashboard.

Yeah, and that's a good point, we haven't really shown Reclaim Cloud on stream. By the way, I'm planning to make this a tool you could run on either Reclaim Cloud, another cloud service if you really wanted to, or your local computer. It won't really be possible to run it on Domain of One's Own, for a variety of reasons, the main one being that it would put a lot of load on your Domain of One's Own server and would probably make it slow or unresponsive while you're crawling sites, which would not be good at all.

So basically what we do here is, and I did some messing around with this earlier, I'm going to delete my previous crawls and make a new one. We're just going to crawl slides.jadin.me, which is where I keep some PowerPoint slides and things like that on my site. So I run archive.sh slides.jadin.me, and it's going to start crawling the site. This might take a little bit, even though it's a pretty small site, but the cool thing I've got set up here is that this is also a webpage, so when it's done, we'll be able to view the results of the crawl and download that web archive file so we can go look at it and see how it turned out. Right now this is super bare-bones and ugly; if you're familiar with this, you'll know it's just Apache's default, I don't know what to call it, directory listing view, but it works. So it's still going right now.

Taylor, while that's running, Ed has a question. He says, do you intend to offer this as a service? He's imagining: hey Taylor, here's a list of sites, could you run your script, send me the archives, and then deactivate them on the network?

I'm intending to offer this as something that is so easy to use that you don't need it as a service. I know it looks a little complicated now, but the idea is that this could be something where you go into the Marketplace in Reclaim Cloud, hit go, and then when it's ready, you pop open a terminal, type "archive", and give it a URL, which would be pretty simple to do. And then we'd also document it, of course, because I know this is a very fast overview right now. The reason I'm doing it this way is that some of this stuff might change a little bit by the time I have it ready for everyone to use. But yeah, the intention is that I want this to be something admins can spin up, and then just delete from Reclaim Cloud when they don't need the archiver anymore.
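To give a rough idea of what's happening under the hood when that wrapper runs, here's a sketch of the Browsertrix Crawler half on its own, driven through its published Docker image. The collection name, the ./crawls folder, and the Node wrapper are placeholders of mine, not the exact commands the archive script uses, though --url, --generateWACZ, and --collection are the crawler's own options.

```typescript
// Minimal sketch: run one Browsertrix Crawler crawl via Docker from Node/TypeScript.
// Assumes Docker is installed and can pull the webrecorder/browsertrix-crawler image.
import { spawnSync } from "node:child_process";
import { resolve } from "node:path";

const url = "https://example.com/";    // site to crawl (placeholder)
const crawlDir = resolve("./crawls");  // host folder that will receive the crawler's output

// The crawler writes its output (including a .wacz when asked) to /crawls
// inside the container, so we mount a host folder there.
const result = spawnSync("docker", [
  "run", "--rm",
  "-v", `${crawlDir}:/crawls/`,
  "webrecorder/browsertrix-crawler", "crawl",
  "--url", url,
  "--generateWACZ",                 // package the crawl as a portable .wacz file
  "--collection", "example-crawl",  // placeholder collection name
], { stdio: "inherit" });

if (result.status !== 0) {
  console.error("Crawl exited with status", result.status);
}
```

A wrapper like the one being described would essentially run something like this plus an HTTrack pass for the same URL, then drop both outputs somewhere you can browse and download them.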
You can always take what you archive here and FTP it over to Domain of One's Own or any other place.

And on that topic, Jackie had a question about what the system requirements would be for an admin who wanted to spin it up.

Yeah, the entire thing will run in Docker. The advantage of that is that we can run it on Reclaim Cloud or any other cloud service, but you could also run it on a Windows or macOS machine, so your own machine would be just fine. The advantage of running it on a cloud service is that it'll be faster, because your actual internet speed on Reclaim Cloud is extremely fast compared to your local machine, where you have to let it run over your home internet connection and make sure your machine doesn't sleep or shut down or anything like that. I will say, if you're doing a couple of crawls, doing it on your local machine is fine, but if you wanted to schedule this to happen automatically, you'd probably want to put it on a server. That said, I think making it available as a one-click installer in Reclaim Cloud sort of bridges that gap, where folks can just delete it when they don't need it anymore, so they don't have to pay for it or shut it down.

Okay, so the crawl's done. It made an HTTrack version, and I'm going to look at that first. This is just HTTrack's default output, but basically I can click on this and here's my archived site. You can see it's at crawler.ca.reclaim.cloud/slides, that whole path, so I can view it and it's all there and it works. The other thing I want to mention here, because I only briefly talked about what HTML flattening looks like, is what this looks like in the file manager. If I go in here, there's a slides.jadin.me folder, then an httrack folder, and under that, HTTrack makes a little slides.jadin.me folder. That's the folder I could download and move to any other site, and it would be archived. If you're curious what this looks like after you move it, I did this kind of thing when I was an admin at St. Norbert; I would throw them on archive.knight.domains and they would look like this.

Taylor, we have a question from Jackie: will this continue if it gets interrupted?

Yeah, so the way I have it set up, you'll notice that I had the terminal open here. If you close the terminal, it'll keep working in the background; it's not going to be interrupted in that way. It kind of depends on what you mean by interrupted. If you shut down the environment, no, it wouldn't continue, so there are occasions where it would not continue. And this is always going to require an amount of patience, because archiving sites actually does take a while. There's a reason I did this tiny little one for the demo. I've been using it for plenty of sites, and oftentimes on a larger site it'll take 20 minutes to run. But yeah, if you start an archive and close your terminal session, it's just going to keep doing its thing in the background.

So that's the HTTrack version, but I also wanted to show, I'll go back to the main site here, the Webrecorder version, because Browsertrix Crawler does this really cool thing.
All these sub-directories I'm having to navigate through are the stuff I haven't automated yet; what I want to do is make this really simple so you don't have to deal with any of it. But if I go and download this, it shouldn't be a super big file, so we're just going to do that. It's 48 megabytes. This capture-date-time.wacz file is a totally portable, self-contained version of the website that you can view if you go to ReplayWeb.page. Webrecorder also has an app you can install if you'd rather do it in a desktop app; it's pretty much the same thing as the website. And if you want to, you could even embed this viewer on another webpage. I haven't messed around a lot with that, but I think Ed Beck has done it, he's talked a little bit about site archiving before, so that's also a possibility.

So now that I'm at ReplayWeb.page, I can just select my file here. You can even store these in Google Drive and things like that; ReplayWeb.page is very flexible. You hit load and it shows you, okay, these are the sites included in this file. If I click on one, it's got a little virtual browser, including the original URL, which is really nice. I really love this view because, and maybe it's a small thing, but when I'm looking at an archive of a site, I do care about what the URLs were, so it's nice that it shows you that. That's really, really slick. So that's what a Webrecorder archive ends up looking like.

The nice thing, I think, about my tool is that it makes both versions, because there are advantages to both. I like the portability of the web archive standard, but HTML flattening is often what domains admins actually want when they want to throw a site on a server to live in a different location. So I personally think in most cases you kind of want to have both.

Well, yeah, every site's built a bit differently, right?

Totally. So I think however you can archive it in some fashion, the better. And you could do things like zip this up and keep both versions, or put one version on Domain of One's Own, and then all you have to do is send this WACZ file out, say it's for a faculty member, you just email it to them and say, here's where you go to play back the webpage. Best of both worlds is what I meant to say. How much time do I have left?

You've got a little over five minutes.

Perfect. Okay, so the last one. This is a bonus tool and, honestly, this one's rough, but I'm excited about it. It's a screenshotting script that I actually started working on years ago, and as I was prepping for this session last week, I thought, oh, this belongs in the same conversation, I think.

Absolutely. Even if it's not exactly the same thing. So what, sorry, we've got you muted, Jim.

What? It better get the best side of Boba Tuesdays.

So what Screenshot Collector does is, it's a script that uses Puppeteer, which I think Google made; it's basically a command-line-controllable version of Chrome.

It's not a Metallica song?

Yeah, yeah, Master of Puppeteers. Anyway, that's not relevant, but basically what this script does is make Puppeteer easy to use if what you want to do is take a lot of screenshots all at once.
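For anyone curious what the Puppeteer side of a script like that looks like, here's a stripped-down sketch of the same idea: read URLs from a text file and screenshot them in batches so you don't open everything at once. This isn't the actual Screenshot Collector code, just an approximation; the file names and the batch size of eight echo the demo that follows, and everything else is illustrative.

```typescript
// Minimal sketch: batch screenshots with Puppeteer from a list of URLs.
// Assumes `puppeteer` is installed (npm install puppeteer); file names are placeholders.
import { readFileSync, mkdirSync } from "node:fs";
import puppeteer from "puppeteer";

const BATCH_SIZE = 8; // how many pages to capture at once; lower this on smaller machines

async function capture(url: string, outDir: string): Promise<void> {
  // One browser per URL keeps failures isolated, but it is heavier on RAM.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
    const name = url.replace(/[^a-z0-9]+/gi, "_").slice(0, 100);
    await page.screenshot({ path: `${outDir}/${name}.png` });
    console.log("+", url); // successful screenshot
  } catch {
    console.log("x", url); // mark the failure and keep going
  } finally {
    await browser.close();
  }
}

async function main(): Promise<void> {
  const urls = readFileSync("urls-example.txt", "utf8")
    .split("\n").map(u => u.trim()).filter(Boolean); // skip blank lines
  mkdirSync("screenshots", { recursive: true });
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    await Promise.all(batch.map(u => capture(u, "screenshots")));
    console.log(`Finished batch ${i / BATCH_SIZE + 1} of ${Math.ceil(urls.length / BATCH_SIZE)}`);
  }
}

main().catch(err => { console.error(err); process.exit(1); });
```

The + and x lines mirror the per-URL report the demo shows, and BATCH_SIZE is the knob to turn down if a local machine starts running out of memory.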
So this is something that will probably take me longer, but I think I'd eventually include it in the site-archiving toolkit. What you'd basically be able to do is make a text file with however many URLs you want on it, tell the script to do its thing, and it takes a screenshot of each one. I made this for Knight Domains, and I can tell you that the last time I used it, I took something like 580 screenshots, which takes a while. The way the script works is it takes them in batches, so I think that run took three or four hours, which is not that long compared to some of the earlier versions I was trying to work with. So I'll demo this really quick. Again, this is extremely early, so if it looks complicated, that's because I haven't had a chance to make it less complicated yet. Sorry.

It's cool to see the process, though, and I'm particularly excited about the script you shared just a minute ago too.

Yeah. So in my case here, it's going to do all of the URLs in urls-example.txt, and I've got a few in there, like my blog and Amazon and so on. Throughout the day I've also been collecting URLs: I did solicit them earlier in a Discord thread, but any time people have linked to things, I've been throwing them in here too. I think this is 36 URLs, which is not that many, and my point is some of these might not work, because I haven't really tried this, and that's fine. We're all going to be fine with it.

So you tell the script to do its thing, and it makes a folder called screenshots, figures out how many URLs it has to work with, and then goes through them in batches. How many it does at a time is configurable. I have it doing eight, which is great on Reclaim Cloud because you have all the bandwidth and resources, but if you're doing this on your own machine, you may need to play with that. When I did this for Knight Domains, my old computer could only do five before I ran out of RAM, or maybe it wasn't RAM, I don't remember, it doesn't matter. You can see it's already on batch three of five, and it gives a little report of what it's doing: the plus sign means it took a successful screenshot, and I think it puts an X if it doesn't. I know there'll be at least one X, because it currently has a bug where it reads a newline character as a URL, and that's just because I'm not even close to done with this. But I wanted to show it because I wanted to see if it would be useful for anybody. I think it could be: one version of this could be something you run once a month, or once every six months, on your domains project, feed it the list of URLs from the last-login script, and get an overview of, this is what every account looks like, this is how many hello-worlds we have, and so on.

Also, just going back to where we started this conversation, which was trying to see the arc of sites, right? I know we talked about tracking disk usage over time as one way, but even a simple screenshot lets you see how these sites are changing and adapting over time. I've enjoyed that kind of experiment, going back and looking at my domains in the Internet Archive, for example. So I think there's value there. Can you imagine screenshotting a site every week and then playing that back as a little video over a year to see how it changes?
Yeah, like a course site or something. It would be great. It would be wild to see the theme switchers, right? Everybody knows the people who switch themes like twice a year. I feel like there are always some of those.

And I'm not one of them.

It looks like Tom said it would be fun to tie in TimelineJS for each site.

Ooh, yeah. Okay, so this is what the screenshots folder looks like, and the cool thing is, if you view it as thumbnails and I zoom in a little bit here...

In the last minute.

Yeah, okay. If I zoom in, you can very quickly, it's amazing how fast the human eye can go, ah, yes. You can see here's our workshop site, and I find this very useful. I will say, when I did this for Knight Domains, it was interesting to see, ah, we have a lot of people using the 2018 theme but not many using the 2019 theme, just because you can pick out even structural elements like that. So anyway, that's all stuff I'm actively working on to make accessible to admins. If this is something you think would be useful, I guess sit tight, because it's coming, but if you have a feature request, feel free to send it. I'm a really bad programmer, so it's very likely I'll just say, that would be cool if I could figure it out, and that's all I'll be able to say. But I definitely want to hear them.

A couple of people said so in the chat. Yeah, thank you all so much for joining. We will see you in a couple of minutes.