I'm good. I'm good. I just came back from a camping trip this week, so I'm showered and feeling good. You showered for this? Yeah, yeah. Oh my God. You wouldn't be able to tell, but I did. So what I'm doing right now, why I might seem distracted, is that I'm actually looking at this new Reclaim.tv and liking and subscribing on Mastodon. Yeah, we've got all kinds of new stuff. I feel like the Reclaim EdTech year in some ways culminated with the conference, which was, of course, not just EdTech but the whole company. Then we took a couple of weeks off in August, and now it really feels like year two. One of the things you and I were talking about is that we need to get back to more off-the-cuff, regular streaming. So we're going to try to make a thing of this a little bit more: hey, we're going to go talk about X. And this is the first one of these. You might be watching on YouTube, or you might be watching on Reclaim TV, which is our new Owncast install, and we'll be trying to stream to both places most of the time. Owncast is a really cool tool; we'll talk about it on its own at some point, possibly. Did it come up in our open source media ecosystem? Yes, it did. Yep. I think you and I have both played with Owncast quite a bit over the years, but it had been about a year since I had really dug back into it, because we had been going all in on PeerTube. And PeerTube is great too; we're not abandoning PeerTube. We won't talk too much about it because we have limited time, but we're trying to rethink and situate things so people know that all the streaming and video stuff we do is both on YouTube and on the Fediverse, basically. What does that mean, and how can we make it easy for people to navigate and know where to go? What we're landing on at the moment is: we'll keep doing stuff on YouTube, of course, and then we're going to use Owncast as kind of the front end. So if we are live doing anything, you just go to Reclaim.tv and you should be able to see us there. That's at least the theory. It's actually happening; I can see us there. Yeah, yes. I know we are live today because I have the thing up. One of the cool things StreamYard lets us do is send our streams to multiple places, so we're using that. And then I think the next step is, when I do streams myself, I like to use OBS so I can really crank the quality to max and do 4K and stuff, and I want to see what that means for simulcasting. I don't even know if it's going to be possible, to be honest, bandwidth-wise, but I'm going to mess with it. I do realize I have to put our title in, because it just says "test" right now, but that's one thing I have to mess with. So yeah, this is the first of these simulcast Reclaim EdTech streams. So we're mass communicating. It was one at a time in year one? Yeah, that was year one. This is year two; we have to do two streams. Yeah, exactly. Year two, two streams. Year three, who knows? So for today, though, Taylor, what is the news? We want to talk about UMW Blogs.
And this is going to be about the technical background, the technical details of the archiving, how it was done, and what we did to make it work. I don't think it's going to be the last time we talk about the UMW Blogs archiving project. In general, I think it would be awesome to talk with some of the folks from UMW, and to have you talk about it too. You wrote a great blog post, a eulogy of sorts for UMW Blogs, the "nobody's going to be creating new content on it anymore" kind of thing. I would love to talk more about that in general, but we thought it would be good to separate out into its own little informal stream the nitty-gritty of: how do you take a site with, you know, 10,000 sites and archive it, flatten it to HTML, in a way that's actually possible? Yeah, let's hear it. How did you do it? I'll start with the question: Taylor, how did you archive UMW Blogs? So basically we used a toolkit called the Site Archiving Toolkit, which I've talked about before. This is a toolkit I put together that really is just a nicer, easier-to-use wrapper, a script around two tools: Webrecorder and HTTrack. Webrecorder and HTTrack are really interesting tools. We've done a little bit about them before, and I'm sure we'll talk about them again. I'd love to do a whole course on archiving at some point, if that's interesting, in the far future. Basically, they do the same thing in different ways. You give these two tools a URL, and HTTrack will give you a flattened HTML structure, a folder of HTML files, basically, representative of what the website looks like right now. So you can take a dynamic site like a WordPress site, or really anything, and get a copy of it as it looks in that moment in HTML. Webrecorder, or Browsertrix Crawler, which is sort of the souped-up version of it that you can run on a server, does the same thing, but instead of delivering a folder of HTML files, it gives you a zipped-up web archive file. And it's very sophisticated. Webrecorder has all kinds of capabilities to capture the funky things the modern web can do. This isn't so much a consideration with WordPress sites, but sometimes other types of tools will do things like use JavaScript to load in other style elements dynamically after the page loads, and Webrecorder can handle that. So HTTrack is great because it's simple and it gets the job done, and what you're left with is HTML files, the same kind of HTML files we've had as long as the web has existed, in a lot of ways. Webrecorder packages it into its own thing, which is still based on a standard, but you need a special way to view those files. What my toolkit does is make it really easy, well, relatively easy: you give it a command of archive and a URL you want to archive, and it will use both of those tools to make those archives and then set them up so that you can view them and move them to a web server for long-term preservation. Is it true, Taylor, that you have basically an installer in Reclaim Cloud to do this? Like, I'm not being facetious, you have created one, right? Yep, yeah. This is a thing you can run in Reclaim Cloud right now.
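For anyone following along, a minimal sketch of what running the toolkit looks like once it's deployed, based on the commands described here; the output layout shown in the comments is illustrative, not the toolkit's exact naming:

```bash
# Archive a single site; the toolkit runs both crawlers against the same URL.
archive https://example.com

# When it finishes, the crawls folder holds both versions side by side,
# roughly along these lines (actual folder names may differ):
#   crawls/example.com/httrack/       <- plain folder of flattened HTML
#   crawls/example.com/webrecorder/   <- archive.wacz plus a replay viewer
```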
So, if you go to the Marketplace, and I'll talk about what this meant for the UMW Blogs archiving process, but if you go to the Marketplace and search for archiving, site archiving, or go to the education category, I have it bookmarked here, you can just click a button and get your own copy of this. I'll demo what this looks like in a second. I've already got one deployed that I use from time to time, but once you've got one deployed, you open up a terminal in it and you just type archive and then the URL you want to archive. So I'm going to give it a few of them. We'll do a small site that I have that I know is going to finish relatively quickly, and we'll do the Reclaim Cloud website. You can give it one URL or you can give it several, and it will go through the list one at a time. If you give it several, you need to separate them with a space. Can I give you a URL to try? Yeah, I actually already started this one, so let me quit it. Since this is about UMW Blogs: there was a proto-UMW Blogs called blogs.elsweb.org that's still online. Well, okay, so we can do that, and this will be a good demo, depending on how long it takes, of why this isn't super straightforward for a WordPress multisite, right? Why couldn't I just say, hey, go archive umwblogs.org, done, you know? Why isn't it that simple? Because this tool simplified a lot of it, but there was still quite a bit of extra stuff. So yeah, we can definitely do that and see how it does. Let me just delete these things. Okay. I'm still going to have it do that tiny site just so we can see a completed one, because this one probably won't finish while we're talking. So what's the other URL? It's HTTP; I don't know if it's HTTPS. I don't know if that will create a problem. Let's see. Let's visit it and we'll see. Yeah, it does, it loads securely, blogs.elsweb.org. I can't believe it's still online. WordPress is resilient. Yeah, it's crazy. So, mine loads not over HTTPS, but if I... But if you force it, it will do it. Yeah. Parts of it are insecure. We'll see how this goes. Because it looks like some images are insecure, I'm going to have it do the non-secure version. And this is a good first step if you're going to use my toolkit to archive something. These tools are not smart, and my toolkit certainly isn't smart. It's basically just a script giving my favorite command line options for using these tools, so you don't have to worry about it. That's nice. But let's say I wanted to archive my blog site and gave it http://jadin.me. If you did that, it would... oh, I guess I'm not redirecting to HTTPS on my website. I need to do that. But a lot of websites force you onto HTTPS. And if you give it the wrong version, right, if you give it HTTP and it redirects, what you'll end up with is one blank page, because Webrecorder and HTTrack will be like, cool, we recorded that redirect. Oh, did you want us to actually record the page behind the redirect? No, we won't do that. Too much. Yeah. So you do want to be careful, and it's one of those things where you've got to play with it, because every site has slightly different quirks. So I'm going to make a guess, just because some of those images aren't loading securely.
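One quick way to check which scheme a site actually serves before starting a crawl, so you don't end up archiving a redirect page; a hedged example with placeholder URLs:

```bash
# A 301/302 in the first response line means the crawler would mostly
# capture the redirect itself, not the site behind it.
curl -sI http://example.com/ | head -n 1
curl -sI https://example.com/ | head -n 1
```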
Let's not do the HTTPS version, just to see how that goes. So anyway, I'm going to give it two URLs and it's going to go ahead. The cool thing about this is it's actually doing them simultaneously, both the Browsertrix Crawler and the HTTrack thing. Right now we're looking at HTTrack's command line output. You can watch it, but you can actually close this and it will just keep doing its thing. Oh, really? So you can close it, and it's basically already in the background? Yeah. For the Docker-inclined, it's running as a daemon, with the -d command line switch, so it will just keep doing its thing. In fact, if you want to quit it, you have to type a special command that will go in and quit the container. Is the command "quit it"? Yeah... I should make that. It's quit-crawlers. I do have this roughly outlined on the GitHub page, and I'm working on good documentation for it, but this has the basics. So basically: archive, space, the URL, and you can give it more than one if you want, and then quit-crawlers. That's the command. Look at you. That's awesome. You can also run this on your own computer if you install Docker on your own machine, if you want to. There are downsides to that, because while this will run in the background even on your own machine, if your machine goes to sleep, it's going to stop, right? So I wanted this to be an option for folks, like, hey, if you don't want to run this in the cloud, you don't have to, but this type of work really is a lot easier in the cloud, especially because you can let it loose with as many resources as you want to give it, as much as you want to pay for, and then as soon as the crawl is done, it costs you basically nothing. In fact, you can delete the environment once you download your crawls. And there's a story there that you'll get to when you talk about UMW Blogs: the time you expected it would take versus the time it actually took, right? Yeah. So, getting to the UMW Blogs specifics. That's how that tool works. If I refresh the page here, sorry, let me do this again. So this is the URL for my archiving environment. If I click on it, it's just running a basic Apache server. If you've ever used Apache and not put an index file in a folder, you'll see something like this. That's literally what my toolkit is doing, because I am not a web developer. And so this recipes one is done. So what you see here is: hey, this one's incomplete, that's the elsweb one; the recipes one is done. That site is tiny, so it probably took a very short amount of time. There are six things in this folder, three of them for Webrecorder, three of them for HTTrack. I can download the completed crawls with these zip files here; I can just click on them and they'll download. I can look at the log if it's stuck. This is great because you can actually look at the log even while it's running, in the incomplete folder, and we'll look at that. Or, if you're just curious, you can look at this and go: all right, it started at 13 minutes past the hour and it finished at almost 15 minutes past the hour. So it took two minutes for HTTrack. And if we go to Webrecorder here, its log, and this log is really messy to look at, this one started at 13:55 and finished at 15:04. So in this case it took about the same amount of time for both of these to finish.
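Putting the pieces from the demo together, a run with multiple sites might look like this; the URLs are placeholders, and archive and quit-crawlers are the toolkit commands described above:

```bash
# Several URLs, separated by spaces; both crawlers run for each one in turn.
archive http://site-one.example.com http://site-two.example.com

# The crawl containers run detached (Docker's -d), so you can close the
# terminal and they keep working. To stop them before they finish on their own:
quit-crawlers
```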
I can also click on the ones here that aren't zips and just browse the actual crawl. You'll notice the full URL is kind of messy because it's in this archiver environment, but it does work. So we can look at this kind of broken site that I have that I never did anything with. Someday, I will. And if I do the same thing with Webrecorder, this is kind of the cool trick. It's easy to make Webrecorder archives, but hosting them is the question. What if I wanted to host an archive somewhere, but I wanted to make it dead simple, so people could run my tool, download the zip file, literally go to shared hosting, unzip it, and be done? That's the goal. And so this is what you wind up with. It takes a second to load, because that's just how Webrecorder works, but you get this fully browsable Webrecorder archive. It'll have the original URL here, which I think is a nice touch. You hilariously get a refresh button. Why is that there? I don't know, because it's never going to be any different, but whatever. There are back and forward buttons in there. And then this will actually search through the entire text. So if I look for the word soup, hey, it shows up on these two pages. Nice, which is cool. And so that's the difference. Now, the actual content of the archive, if I go in here, is kind of simple. If I go to the crawls folder, that's where they all get saved, and let me zoom this in a little. So this one here, here are the two folders. These zip files, again, are literally the same as these; they're just zipped up so you can really quickly download them. And if I go into the Webrecorder folder, here's everything. So there's an index file, and this is the actual stuff, this is the archive. If you didn't want to host this, you could just grab this file and throw it in Google Drive or something for safekeeping. Or if you want the whole thing, so that you could host it someplace else without even having to remember how it works, you just keep the zip file, because the zip file is going to be, oops, I accidentally went back to Reclaim Cloud, the zip file is going to be basically the same, just with that extra HTML file. So yeah, this is the WACZ, the web archive collection, zipped. So this itself is compressed. And then the replay folder just has the JavaScript that makes the player we were looking at, the website, work. And then the index file is real basic. It's just saying, hey, here's the name of that archive, here's the URL you should open the archive to, and embed it. And if you don't want it to look this way, they have some documentation on how you can change it from default to replay-only. So if I change this to replay only, it doesn't actually show that address bar, I believe. Is replayweb.page the name of the tool? Yes, replayweb.page is their little way that you can host, or sorry, read, a web archive file. So my little thing here is designed to make that unnecessary, basically, for certain use cases, right? My goal is: imagine if the way we archived UMW Blogs was to say, all right, well, here's 9,000 files; if you want to look at them, you have to download them one at a time, go to replayweb.page, and click choose file. That would suck, right? Nobody would want that.
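For reference, the index.html the toolkit drops next to each archive is roughly this shape, following the replayweb.page embedding docs; the attribute values here are illustrative rather than copied from a generated file:

```html
<!doctype html>
<html>
  <head>
    <!-- ui.js and the replay/ folder (with sw.js) are the replayweb.page assets -->
    <script src="./replay/ui.js"></script>
  </head>
  <body>
    <!-- source: the WACZ to load; url: the page to open first.
         embed="default" shows the address bar; "replayonly" hides it. -->
    <replay-web-page source="./archive.wacz"
                     url="https://recipes.example.com/"
                     embed="default"
                     replayBase="./replay/"></replay-web-page>
  </body>
</html>
```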
So my little thing basically does what replayweb.page does, but in a way that feels more like the web. And again, I'm not trying to blow this out of proportion. All my script does is simply write this index file for you and grab this replay folder for you. This is documented on the replayweb.page site. They have a thing that's like, hey, if you want to do this yourself, download this folder and write an index file. But I wanted to make that something you didn't have to think about. Is that something the Webrecorder folks built? Yes, yep. Okay, gotcha. Yeah. So, problems with doing this for a 9,000-site WordPress multisite like UMW Blogs. And I should say we archived around 9,000 sites. That was after Shannon, and possibly other folks at UMW, did a lot of cleanup work. They went through and helped people move to their new WordPress multisite; I would imagine that work was happening for basically the entire year before. And also archiving some stuff that was just broken, things like that. Sorry, archiving in the WordPress multisite sense, right? Yeah, exactly, taking sites offline. I think there were as many as 13 or 14,000 sites at some point. Yeah, so we narrowed it down to 9,000. And what we found was: A, the way my toolkit works, where it makes both versions, the HTTrack and Webrecorder versions, well, we don't really need that. We only need one good version, right? The reason my toolkit works this way is so that you can look at both versions and decide which one you want to keep, because those tools work differently, and sometimes one will do a better job or just suit your use case better. One of the things that's cool about the HTTrack version is that all of the links, if you put the files in the right place, are the same links. But that's not how Webrecorder works out of the box, because it's this funky JavaScript application, basically. So what I originally was thinking was: okay, let's use HTTrack, because then all the links will resolve. Well, what I found in my testing is that HTTrack is much slower for larger sites. I'm not really sure why that is; it just is. Webrecorder is really fast. My guess is that if I gave HTTrack 9,000 sites to crawl, it would have taken up to nine months to finish. I did a test of about 300 sites and it took multiple days to finish, so I was like, okay, that is not going to work. But the Webrecorder one went much faster. To speed it up even further, I decided to split this up. I made nine, and ended up actually doing ten, different environments in Reclaim Cloud that were all just doing Webrecorder archiving for this project. So we took our list of 9,000 URLs to archive and split it into basically nine pieces, with a tenth; I'll talk about the tenth one in a little bit. And so they were all working simultaneously. I should say Jim had to pop out for a second, but he'll be back. So this was kind of a big wrangling of files and stuff to make it all work. What I was just saying is that we took these 9,000 URLs, and I made nine environments and split the list into roughly 1,000 URLs per environment, so that we could have nine environments simultaneously all crawling UMW Blogs at the same time. And I will say I was a little bit worried that UMW Blogs itself was going to crack under the pressure of that kind of traffic.
But it did fine. It really did. Well, who hosts it? Yeah, yes, exactly. Well, and that's the thing: if it was hosted someplace else, let's say it didn't crack under the traffic pressure, it probably would have gotten blocked as, like, malicious traffic. Can I ask you something about those 10 instances you were running? How did each one know which sites to do? Did you say, you do these X sites, you do these X sites? Did each of them have a predetermined set? That is an excellent question and a great way to transition. So what I did is I made a special version of my Site Archiving Toolkit just for this project. I wouldn't have to do it this way, but it was just easier for me to take my Site Archiving Toolkit and make a branch called UMW blogs, and this is where all of my work happened for this project, basically. So what I did, and let me zoom this in a little, it's really the same basic thing, but I went in and disabled some functionality. I disabled the stuff that would make HTTrack archives, so we just weren't doing that; we're skipping that part. And then I made a script for each archiving server. Basically, first I used WP-CLI to get a list of all the site URLs, and I made that into a text file. So this is a text file of all the URLs. And then, manually in a text editor, I used VS Code to select 1,000 of those at a time and put them into a script like this. So this is archive.sh, the archiving script that I made, with one line per URL. The nice thing about doing it this way is I could look at the logs and see, well, where is it right now? Right now it's working on the Summer in Spain site at umwblogs.org, so I know it's at site 49 of 1,000. That was really nice for the monitoring aspect of it. And then I made one of those for each environment, and that's how this started. And I ended up having to get even a little bit more funky with it. You'll notice these resume files. These are scripts I made because I found that the Webrecorder version we were using has some bugs. I mean, every piece of software has bugs, but there was a particular bug where once in a while it would get stuck on a site. So I would look in there and notice that it hadn't made any progress in, like, three hours, and I would go in and quit it, take the site it was currently working on, and move it to this stuck.sh file. There were really only 23 sites I had to do this with, including the front page, by the way. Did they have something in common that made them do this? I'm really not sure. I think it has something to do with the size of the sites; once they get to a certain point, this version of Webrecorder was just choking on them and not having a good time. Although, actually, that's not true; some of these were smaller, but some, like the tag site, are massive. That site is a copy of every post on the network from 2008 to 2015, so it's just huge. But honestly, Jim, I didn't take the time to dig in, because it's not like I could do anything about it anyway, right? It's not my software; ultimately the Webrecorder folks on that project would be best placed to fix it. So I set them aside for the moment, said, great, I'll return to these, and one at a time I would add them back to the list. And the resume file is how I resumed.
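A rough sketch of that list-building step, assuming WP-CLI on the old multisite and the per-environment scripts described above; Taylor did the chunking by hand in VS Code, so split is just one way to script the same thing, and the filenames and example URLs here are illustrative:

```bash
# On the old UMW Blogs server: dump every sub-site URL with WP-CLI.
wp site list --field=url > umwblogs-urls.txt

# Break ~9,000 URLs into chunks of 1,000, one list per archiving environment.
split -l 1000 umwblogs-urls.txt umw-chunk-

# Each environment then gets a simple script with one archive call per URL,
# which makes it easy to see in the logs how far along the crawl is:
#   umw2.sh
#   -------
#   archive http://summerinspain.umwblogs.org
#   archive http://ethicsandlit.umwblogs.org
#   ...
bash umw2.sh
```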
So I would go through and say, all right, for instance, UMW 7 got stuck, and ethics and lit was the last one before where it stopped. Ethics and lit was number 827, so it probably got stuck on the one right after it. So I'd make a copy of the file, delete everything before that point, and restart the crawl starting with the next one. That was a manual process, right? But honestly, here's how crazy it is: that MoPo site it got stuck on is Mara Scanlon's modern poetry site. It's crazy that I know that. Oh, and that one looks like it was busted. So yeah, anyway, that's how that worked. It was a bit of a manual process restarting these, but not that bad. Honestly, 23 problematic sites out of 9,000? I'll take it. Good numbers. Exactly. So I created nine of these environments and went in and ran the commands in each one, right? The nice thing is I made this GitHub branch so that I would have the same set of files on each; it was really just me going in and typing bash umw2.sh, basically, and it would run the thing. I also made a different version of my installer specifically for this. This one wasn't published; I just took the JPS and used the import button in Reclaim Cloud to run it. What it does is really the same thing as my normal installer, except it specifically grabs the branch version, the UMW Blogs version, so that when I deployed these nine archivers, they automatically had this set of files right away. And the cool thing is I could use Git to manage all of this, right? So if I made changes, I could just do a git pull and they would all have the same version of everything. Now, there are more sophisticated ways to do this. I even thought about it because I had nine of these; if you had, say, 40 environments, you could use something like Ansible to actually run the same command in each place. But I wasn't going to set up and configure Ansible for just nine environments; that would be a nightmare, and I'm not really an Ansible expert anyway. We have that expertise on the infrastructure side of the company, but it's not something I know that much about. My point is that there are ways to avoid having to type the same command nine times, besides copy and paste, which is of course what I did. So my installer grabbed this special version and deployed it, and I ran that installer nine times, so I had nine environments to work with. Then I ended up making a tenth one where I ran the stuck script, and that one I had to modify further so that it would use the latest version of Webrecorder, which actually seemed to fix some of the bugs I was running into. So that was a bit of a process to figure out. So then what you've got is 10 places where all of these files live. The next step for me, and I'm not going to go into a lot of detail, was to back all of these up to Google Drive. I wanted to have a copy in case I messed anything up. I used rclone to do that, so the files uploaded right from each environment to Google Drive. We've used rclone around here for a few different things; it's basically a command line interface to cloud storage services, a really cool tool. Then I rsynced everything from all of these different places to one container that was going to host all of the archives, basically.
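What that backup-and-consolidate step might look like, hedged: the rclone remote name, hostnames, and paths below are placeholders, not the actual ones used.

```bash
# 1. From each archiving environment, push the finished crawls to Google Drive
#    with rclone (assumes a configured remote named "gdrive").
rclone copy ./crawls gdrive:umwblogs-archive-backup/archiver4 --progress

# 2. Then rsync everything into the single nginx environment that will host
#    the archives, over SSH with key-based auth.
rsync -avz ./crawls/ deploy@umwblogs-archive.example.com:/var/www/webroot/ROOT/
```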
And so I set up an SSH public/private key pair and did the syncing. Some of this I was able to automate and make a little faster. I have a whole set of notes here that were sort of my "all right, let's formulate this terrible long command that will do five things in a row so I can paste it in nine times" notes, basically. So what do you mean by syncing the files? When this was done, right, say umw4.sh, it created a Webrecorder archive that you then uploaded into one space? How did that work? Good question. So it was making one Webrecorder archive per site, per sub-site, just like we're seeing here on my little test one that I was messing with. So I would look at UMW, and the URLs were like umwblogsarchiver4.us.reclaim.cloud. And I actually ran these across every region in our infrastructure, just to balance the load. Again, I was worried that it was going to cause problems in terms of storage and things, and to be perfectly honest with you, it would have in terms of storage, but not really CPU or RAM or anything. So then I would basically click on, like, 50 of them and check: did these all work? And then I rsynced all of them over. One other change my special version of the UMW Blogs archiver made is that it did not zip any of them up, because I wasn't going to be downloading them, right? I didn't need that extra zip file. So it was designed to make just the files we needed, to save space, basically. So they all got synced to this environment, which I can pull up; let me zoom out and zoom back in here, just because I'm making this text bigger. This is an environment I made called UMW blogs archive. It's actually in our EU region, honestly just because it was a new region at the time. It still is, I guess. And they all end up in the web root directory. This environment is just an nginx environment; it has PHP and nginx and that's it, very bare bones. All of those wound up in here. If I do an ls here, it actually takes a while, and all 9,000 are in here. So there are 9,000 folders, basically, though it's not exactly 9,000; I think it's more like 9,100 or something like that. I don't remember off the top of my head. And in each of these is that archive.wacz file we were looking at. So if I go in here and look at, zmalik, how about that one? Same thing: there's that index file, the archive file, and the replay folder. So the cool thing with this, and this was part of the project, is that the folks at UMW wanted the ability to easily delete or hide archives if people chose to. They said, hey, if someone comes to us and says, I don't want this hosted anymore, we have to have a way to take it down, which was a great thought. Honestly, it was good foresight, because if they hadn't said that, I might have done this a little differently, like putting it all in one archive. But I'm actually going to reach out to them about one site, if I can share this for a second. Yeah, I might get this one taken down. Is it slanderous, all kinds of garbage on that site? It's just a header image from Tom Woodward, and it's just wonderful, because he took that image, and what the image says is "sometimes I stay awake watching you sleep."
And that's a baby with me photoshopped in. Anyway, I don't know what the hell he was thinking, but it's amazing. Anyway, yeah, that's fun, and it looks great. So this is the end result, and we'll talk a little bit about it, because to be honest with you, the archiving, I know it sounded complicated, and in some ways it was, but I had a process. I kind of knew what I was going to run into, and it worked really well. Really, the part I had the most figuring out to do on going into this, just because I've been working on this Site Archiving Toolkit for about a year and a half now and I'm getting more and more familiar with the Webrecorder and HTTrack stuff, was the hosting part. How do we host 9,000 of these? That was really challenging for two reasons. A, it's just a lot of them. And B, UMW Blogs is a subdomain site, and I was not about to go make 9,000 subdomains in, like, cPanel manually or something. And this isn't even running cPanel, by the way, because ultimately we don't want UMW Blogs to cost UMW a lot of money to host, right? cPanel has a license associated with it that makes no sense here. So what we did is basically fancy redirects to make all of the links still work. And that took some figuring for me, because doing redirects in .htaccess, or in this case in nginx configs, is tricky. I messed around with it, and frankly, I just couldn't get it to work the way I wanted, because I wanted two kinds of redirects. First of all, the subdomain should redirect to the right folder. You can see this right now with your site as the example. If I just go to umwblogs.org, that's an example too. But if I go to jimgroom.umwblogs.org, the web server is going to take that request and say, ah, Taylor wanted to visit jimgroom.umwblogs.org, but I'm going to take him to umwblogs.org/jimgroom instead, because that's where the archive lives. The second part is that individual pages should also redirect, right? We don't want to break links here, as much as possible. And that's a challenge because there is only one real web page in there, and that's the index file that loads the Webrecorder playback, the replayweb.page thing. The cool thing about Webrecorder, though, is this address bar capability. If I go to this blog and click on this link here, it shows me where it used to live, and if you configure it correctly, it actually changes the URL up here, so that's a real link on the web. So my thing also takes any other part of the URL, parses it out, and passes it to Webrecorder so you land in the right place. So if I take this whole URL right here, jimgroom.umwblogs.org/blog, and put it in an address bar, it's going to parse it out correctly so that you land in the right place. Because, I don't 100% know, this wasn't something where, working with UMW, we worked out all of the "what should redirect and what shouldn't." But I was like, let's be good citizens of the web here; we should try to preserve as many of the links as possible. Certain things won't always work though, right? Like if someone was hotlinking this header image here from another site, well, that's sort of in the archive, and that link has probably changed.
But to be honest, that type of linking, embedding images from other sites, was never a good practice anyway, so those things change regardless. The cool thing is that if you do something like a search engine query for site:umwblogs.org and look at the search results, those links should work, even though they point at the old URLs. Yeah, there we go. Wow, look at that. That's crazy. So, the way that redirect worked, I'm pretty proud of this, because it took me a long time to figure out. And again, for someone more familiar with PHP and this kind of stuff, maybe this is old hat. But I had trouble parsing the URL to do those two things at the same time, the subdomain and the rest of the URL. That wasn't working for me properly in nginx or .htaccess. So I actually wrote a tiny little PHP script to handle it. The way it works is that in the environment, if I go back up here, there's an index.php file that gets loaded if you don't specify anything else. In here, it just says: hey, if the hostname people requested equals umwblogs.org, go to this specific archive, and that's to make the homepage work, basically. But for anything else, this is the actual magic: take the URL and parse it out, grab the subdomain they asked for and put that in, then grab the rest of the URL they asked for and put that in. And this handles basically all of the traffic. The funny thing here is that, technically, while we "flattened" UMW Blogs, it's not actually a fully flattened thing; it's using a little bit of PHP. But there are no database calls, no database, and it's 10 lines of PHP. It's not much. Can I ask you something on this? A lot of this is happening in Webrecorder; what is HTTrack's role? So in this case, we didn't use HTTrack at all, because we only needed to wind up with one version, right? And HTTrack was so slow that it was going to take forever, so we decided no. Plus, the storage. So UMW Blogs beforehand used, I want to say, two to three hundred gigs of storage, so it's a pretty large site. But that was a dynamic site, right? If you think about what's actually stored there, the database contains all the text for all the posts, and anything common is shared. So if you have 15 themes on UMW Blogs, and let's say one of those themes is used on 7,000 sites, it only needs one copy of the theme, right? That's not the case with this. These are all individual sites, so you use more storage. This is using about 500 gigabytes of storage, so almost double, basically, which is honestly not as bad as I thought. I was a little worried it could be worse, but there wasn't really a great way to estimate this ahead of time. What I did is archive a few hundred and then extrapolate from there: well, it uses about this much for this, times 18, so this will be about the storage. Obviously I couldn't know that for sure, because there are some larger sites and some smaller sites, but at this scale, that law-of-averages math actually worked. So yeah, I wasn't sure, right? And the cool thing is that while it uses more storage, and of course the storage will cost more, right?
Because it's twice as much. But the cloudlet usage here is nothing. This thing sits at one cloudlet basically all the time. And that's crazy: it's literally using one cloudlet. We only even let it get up to 16, and I've never seen it use more than two, because it's not doing anything. It's just serving one file; it's running a 10-line PHP script, you know? For perspective, UMW Blogs used something like 120 cloudlets at any given time. A gigantic amount. So they're going to save a lot of money based on that. Can you get analytics about who visits this? We could. We don't right now, but theoretically, what I would probably do right away is just use the web server logs to get some basic stuff. We'd have to feed them into a tool, though. To be perfectly honest, I'm not that familiar with nginx, really, so I know where those logs are, but we would have to parse them in some way that's useful. The cool thing about cPanel is that those tools are built in for Apache logs, right? AWStats and stuff. But there would be ways to do this here. We could do something like use Matomo and put a Matomo tag on that index file. So what we would do is go in and, right at the bottom of this index HTML file, throw a Matomo tag on every single one. And I wouldn't have to do that 9,000 times; we could script that to append it to the bottom of each. That would give us some analytics, and that's probably the way I would do it. Honestly, I would host Matomo just in cPanel, probably, so we don't really have to think about it. You could also host Matomo in its own cloud environment. I wouldn't want to put it in this one, just because Matomo requires a database and stuff like that, so why complicate this environment? Keep it separate. Yeah. But right now we're not really tracking anything. I would be interested to see. Or you could use Google Analytics too, but why do that? Yeah. Oh, one other thing is how that index file gets called, right? The way web servers work, if you visit umwblogs.org, it's actually going to serve umwblogs.org/index.php, right? But if I ask for anything else, that's not what it's going to try to serve. So the way I had it call that script is in nginx's configuration. I'm looking through my notes right now to find where this even is, because I just don't remember off the top of my head: it's /etc/nginx/conf.d, sites-enabled, default.conf. In there, I set the 404 error page to that index file. So if you request a file it does not know about, it's going to run that redirect script. That's cool, and it was great, because it handles everything in the way you would expect it to work. What I ran into with my previous attempts at this was that I could make redirects work the way I wanted, but then it would break any site that actually existed. So this is great: it only runs if you ask for a site that doesn't exist, and that's important, right? Because you should be able to type in jimgroom.umwblogs.org, and this is going to be great when this gets removed, but whatever, or you should be able to go to umwblogs.org/jimgroom, because that's where the actual archive is now located. Both of those should work. It's so good. It really is an awesome project.
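To make that concrete, here is a hedged sketch of the two pieces just described: the nginx 404 hand-off and a tiny index.php redirect in the spirit of the one Taylor wrote. This is a reconstruction from the description above, not the production script; in particular, the front-page folder name and the way the original URL gets handed to the replay viewer are assumptions.

```nginx
# In the vhost config (something like /etc/nginx/conf.d/.../default.conf):
# any request for a file that doesn't exist falls through to the PHP script.
error_page 404 = /index.php;
```

```php
<?php
// Hedged sketch of the redirect logic described above, not the exact script.
$host = $_SERVER['HTTP_HOST'];     // e.g. jimgroom.umwblogs.org
$path = $_SERVER['REQUEST_URI'];   // e.g. /blog/

if ($host === 'umwblogs.org' || $host === 'www.umwblogs.org') {
    // Bare homepage request: send to the archived front page
    // ("frontpage" is a placeholder folder name).
    header('Location: https://umwblogs.org/frontpage/', true, 301);
    exit;
}

// Grab the subdomain (everything before .umwblogs.org) -> the archive folder.
$subdomain = str_replace('.umwblogs.org', '', $host);

// Rebuild the originally requested URL and pass it along so the replay
// viewer can open the right page (the "#url=" convention is an assumption).
$original = 'http://' . $host . $path;
header('Location: https://umwblogs.org/' . $subdomain . '/#url=' . rawurlencode($original), true, 301);
exit;
```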
ELS blogs, I'm glad you took me through this, because now I see some of the issues it's going to run into with redirects and subdomains. I think I ran that as subdomains, but I have to check. That's what I wanted to wrap my head around. The good thing is I want to archive the 100,000 posts in DS106, and that'll be relatively easy; they're all in one database. Yeah. So if it's at one base URL, you should be able to just point it there, because keep in mind the way the archiving works for Webrecorder and HTTrack is that it's literally firing up a web browser. In the case of Webrecorder, it's actually literally Chrome, which is really interesting; it uses a dockerized version of Chrome. And it just clicks on every link it can find. That's all it does. I shouldn't say all it does, that sounds reductive. It does a lot, but it doesn't magically know that these other subdomains exist. So if I was to point this at umwblogs.org, the only thing it would capture is every page on here, but there are tons of sites that aren't linked to from here. We've got some broken images, unfortunately. I'll tell you, that was not the archive; it was broken before I archived it. But there are some downsides. I've been through enough of these that broken images don't scare me as much with an archive anymore. Yeah. And I'm not obsessed enough anymore to go back and fix everything. Yeah, for sure. Once I was. So the search isn't going to work, because that calls the database, right? So if I search in here and go like this, it's going to say, uh, sure. So there are some downsides, but that's where this little tool up here is really cool, because I can look in here and, I don't know, try Hitchcock. And it's on all of these pages in this archive, right? This is the first site, so we can see it was talked about in this post from 2014 and apparently is in a featured category, you know. So you get some capability to search, which I think is good, and that's one thing HTTrack won't be able to do for you. So that's one downside to it. But for smaller projects, using HTTrack is actually nice, because you don't have to do quite the level of redirection that I did to make all of the links still work. Yeah. And now that I've figured it out, we can reuse that tooling, you know. I think I might use HTTrack for DS106 and see how that works before going to Webrecorder, but now I know when to use what and why. So that's awesome. Well, and the cool thing with my Site Archiving Toolkit as it exists is you can just let it loose, see how it goes, and decide after the fact. Exactly. Now I understand it better. Yeah. And for a thousand sites, that's a big one, right? That will probably take a long time. Or sorry, a thousand posts. Yeah, it will take a while. But let me say we've got one minute left before we have to jump. What's your last word on the Archiving Toolkit? Well, the thing is, I'm constantly improving it. We do have a few people using it. Constantly... I'm improving it bit by bit. I have upcoming features I'd like to add, like a flag or maybe a config file where you can tell it I only want to make one type of archive or the other, things like that. But this is really just trying to make these tools that already exist easier to use. And I do have some folks who are using it and giving me feedback, like, oh, this didn't work.
And sometimes I'm like, sorry, talk to Webrecorder. Actually, I haven't really had to say that too much; these tools typically do a pretty good job. But if you have something you want to archive, check it out, use the Toolkit. I want to hear about what does and doesn't work. I can't promise I can fix everything, but I really want to make this something that doesn't feel like a daunting task. I mean, 9,000 sites is an extreme case, but if you want to archive one or two sites, this shouldn't be that hard, and that's what my Toolkit is trying to do. Yeah. You won't need 10 different cloudlet-driven server instances to do this. First, spin up your own cloud infrastructure. Exactly. Yeah. Exactly. Can I have a last word? Yeah, of course. Thank you. My last word is that I find it really entertaining and joyful when you're like, "and I'm constantly," and then you're the first one to challenge yourself: "I'm not constantly, what am I talking about? Hey, I don't want to oversell my commitment here. This is just something we do." That on-the-fly self-editing, you're like, no, not that word, cross off that word. It's awesome. I don't want to oversell it and have people think this is something I'm spending 20 hours a week on. Once in a while I'll spend an hour or two playing with it. It's a brilliant tool. You archived Looking for Whitman, which was a project on a smaller WordPress multisite, and that was an HTTrack one in that case, and it worked out great. It was a great archive of a great project. And then UMW Blogs, and there are many more to come, I'm sure. So it's an awesome tool, and I hope you constantly update it, because we need it. You'll have to talk to my boss about the time allocation. Yeah, yeah, yeah. Let me see if I can get him on the phone. I heard he's a real jackass. Cool. Well, hopefully this was, I don't know, useful or interesting to somebody. I'm going to try it this weekend on DS106, and I'll also try it on the ELS blogs site, and I'll let you know how that goes. And like I said, I think we should have a larger conversation about the project of retiring UMW Blogs and what that meant. This is one small part of it, but I wanted to get all this technical stuff out of the way, because I think it's cool and I'm excited about it, but there's a lot more to talk about. And I really hope that some folks who are in charge of web services that maybe have to be spun down consider something like this, and, you know, talk to us about it. Maybe we can help you think it through, and the cost is not so crazy. I think that was also a good thing that came through: you can do this and keep it alive for a pittance. There is some work in archiving, and there's some work in spinning anything down, but saving it for the long term, I agree with you, is a responsible move. If you're building this kind of web content as part of a higher ed organization, people will miss it. I believe that, and I firmly believe it's worth the investment to preserve it. Thank you very much for coming to my TED Talk. I said that for you, Brian Lamb. Jerk. Okay. See you all.