Hello, and welcome. I've had a lot of requests for more web scraping tutorials, and although I'll do some that are completely scripted and automated, a lot of the time, if you're not going to be doing a whole bunch of pages and you just want to scrape something in particular, it's easier to do part of it manually. In this case we're going to be pulling down some yearbook pages. I'm in the Chrome browser here, in incognito mode to prevent any of my plugins from messing things up, so I am getting ads. These ads on the screen are not mine; they're from classmates.com, which is where I'm going to be pulling these yearbooks from. It's very easy: you can create an account on classmates.com in a couple of minutes using whatever email address you have, and then you have access to pretty much any yearbook on their site, all of which have been scanned by users. When I pulled down my four yearbooks, they all had the signatures in them, the "have a great summer, see you next year" notes from people. So it's interesting that they've had users scan these books and upload them, and then I guess they're trying to sell prints of them. We're not doing that; we're going to be pulling down these images. They even try to prevent it: if I right-click here, it's not going to let me save the image. It has a "Save as", but it's not for the image. Don't worry, that's not a problem and it's not going to slow us down, but I wanted to show you that.
By the way, I was not born in 1975, I was born in 1981, and I did not go to Lely High, so don't look for my picture in this yearbook. This is a local school nearby, and I do know people who went there in that timeframe. Again, these ads are not mine, and forgive me for any ads that come up, because right before I started recording there was one for pee protection underwear. Anyway, what we're going to do here is load up our developer console. In most browsers, or at least in Chrome (actually Chromium, which is what I'm in right now), F12 is what loads this up. If you don't have F12 on your keyboard, because certain computers like Chromebooks don't, you can always hit Ctrl-Shift-I, which is the same as hitting F12. Now we want to go here to the images filter in the network panel, so that we're not getting anything more than images. There's going to be a lot of stuff loading here because of the ads, but what I want to do is refresh this page and hopefully grab this cover page. In test runs of this, the cover page did not come through when I was flipping through the pages, and I think it might be a different size than the rest of the pages, which is why I didn't capture it with what I'm about to do. Anyway, I can start clicking through the pages here. When I get to page four or five, an ad is normally going to pop up. No? Okay, there it is. Close that, and then we're good, we can keep going. This is basically the only real manual part of getting the information, and you don't have to click. You can use the right and left arrow keys, and you can go pretty fast. These pages load pretty fast. Just, if you get to a page and it hasn't loaded yet, give it a second to load to make sure that you get all the pages. We also kind of want them to load in order. That way, when we turn this into a PDF at the end, all the pages are in the proper order, or at least close.
There, you see, one of those pages loaded a little slow. I did see it load, but because it loaded after the other pages, it might be a page or two off in the order. Okay, that page loaded. I just went back because the page didn't look like it had loaded. I'm going through this pretty fast. Again, we're getting a lot of information down in our console here, and again, it's the ads. When I did this, I ran two tests: one with ad blockers and one without. The pages of the yearbook alone should only be about 50 megabytes, but when I did this with the ads going, I got 200 megabytes worth of information. I also found it funny that in 1975 they had an ad for whiskey in their yearbook. Anyway, I'm going to keep going. Okay, we're at the last page here. Oh, there's that pee protection ad. You can't see it because my console's in the way. Now it says "underwear for leaks", but earlier it said "pee protection underwear". Anyway, down here is a list of all the images that have loaded while we've had this page open. I'm going to right-click this. You have a lot of options here. You can copy a bunch of different stuff, but what we want is "Save as HAR with content"; a HAR is an HTTP archive. Don't get that confused with "Copy all as HAR", because that's going to copy the text to the clipboard. We want the one with content, which is going to save every image, every sound, everything that's loaded in the web browser since we started recording (in this particular case, images) into a single file. So I'm going to click that and save it as whatever.har. I already have an empty folder set up. I saved that, and now we're in that directory. And there it is. If I list this out and look at the size: 15 megabytes. Okay, it's still saving. It's still saving. Yeah, those ads literally quadrupled the size of this project, because it's almost 200 megabytes and it should only be about 50, as you'll see when we scrape away everything else. But we saved all these images.
A lot of them are ads and stuff. If we look at this file: I've talked about HAR files in the past. They're JSON format, so they're plain text, and they contain all the information about the stuff we downloaded: what we downloaded, what time it started downloading and how long it took, where it was requested from, the URLs, the host name, all that different stuff. If we jump ahead, I'll just look for base64. We saved all the images, so this one's encoded as base64. And if I jump ahead to the text field: not that text, not that text, not that text. Okay, that's the start of the base64. Again, there's a lot of information here. We're going to scrape through all of it momentarily. Here we go. So here is an item, and it's base64. For those of you who don't know what base64 is, basically any file, even if it's binary, such as an image or a sound or whatever, can be converted to base64, which just encodes it so that it's all typable characters, ASCII characters. So this is our image in ASCII, not like ASCII art, but in typable characters. What we want to do is go through here and scrape out all these entries, decode them, find the ones that are just the pages of the yearbook, and convert them. So real quick, I'm going to escape out of here. What we're going to do is search for all the lines that say "text". Then we're going to cut each one on the quotation marks. We want field four, because we've got field one, field two, field three, and then all of this is field four, the data for this image. So let's go ahead and do that. I'm going to grep all the lines that have the word "text" from that yearbook file. Then I'm going to cut with a delimiter of the quotation mark and say field four. Then I'm going to pipe that into a while loop, while read. There are lots of different ways you can do this; I'm just going off of what comes to my mind first. So: while read line, do, and then I'm going to echo that variable, each line, into base64 -d, for decode.
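The extraction step just described can be tried on a small stand-in before running it on the real archive. This is only a sketch: the sample file below is hypothetical, and it assumes the HAR is pretty-printed with one "key": "value" pair per line, which is what the grep and cut approach relies on.

```shell
#!/usr/bin/env bash
# Build a tiny stand-in for one HAR entry (hypothetical sample data,
# laid out with one "key": "value" pair per line).
sample=$(mktemp)
payload=$(printf 'yearbook page bytes' | base64)
cat > "$sample" <<EOF
{
  "content": {
    "mimeType": "image/jpeg",
    "text": "$payload",
    "encoding": "base64"
  }
}
EOF

# Split the matching line on double quotes:
# field 1 is the indent, field 2 the key, field 3 the colon, field 4 the value.
field=$(grep '"text"' "$sample" | cut -d'"' -f4)

# Decoding the value gives back the original bytes.
decoded=$(printf '%s' "$field" | base64 -d)
echo "$decoded"

rm -f "$sample"
```

Grepping for the quoted key rather than the bare word text is a small tightening chosen for this sketch; it skips lines where "text" only appears inside a URL or a MIME type.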
Okay, so that will decode it. We're going to get a lot of errors, because not all the text is going to be base64. That's fine; we're going to filter through all that. I do want to say, over here at the start, let x equal, and I'm going to use a fairly big number because we have lots of files in here. I'm just doing this to avoid having to pad with placeholder zeros. I want to keep the pages in order. So what I'm going to do here is say $x.jpeg, and then let x++, done. So what we're doing is looking at each line and trying to decode it as base64. Actually, I forgot a part here: decode it and redirect it into a file named after x. So the name is going to be a number. It's going to start at 100,000, so 100000.jpeg, and then it's going to add one to that, and it's going to loop through them all. So we're going to do that. It's going to take a little while, again because of those ads; there are a lot of files for it to go through. We're getting a lot of these "invalid input" messages because those lines are not base64. But we're going to filter out all of that, even the ones that are images, if they're not yearbook pages, because there are going to be a lot of JPEGs, a lot of GIFs, and maybe some PNGs. We want to remove all those. So we're going to use the file command. The file command looks at the header of each file, determines what type of file it is, and gives us information on that. When you look at a file name, you'll see .jpeg or .png or .mov or .mp4 or .mp3. The computer, for the most part, does not care about that extension unless the programmer told it to look at the extension of the file. The computer knows what type of file it is based on the header, and that's what the file command reads. You give the file command a file, it looks at that file, and then it gives you the file type and a little more information about it. So we're going to do that.
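Put together, the decode loop described above might look roughly like the following. It is a sketch run against a hypothetical three-line sample rather than the real whatever.har, and the .jpeg extension and the name sample.har are assumptions made for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the HAR file: two base64 payloads and one
# "text" value that is not base64 at all.
{
  printf '    "text": "%s",\n' "$(printf 'page one' | base64)"
  printf '    "text": "not base64 at all!",\n'
  printf '    "text": "%s",\n' "$(printf 'page two' | base64)"
} > sample.har

# Starting the counter at 100000 keeps every name the same width, so an
# alphabetical listing matches numeric order without zero padding.
x=100000
grep '"text"' sample.har | cut -d'"' -f4 | while read -r line; do
  # Entries that are not valid base64 make base64 complain; hide the
  # message and carry on, since those files get filtered out later.
  printf '%s' "$line" | base64 -d > "$x.jpeg" 2>/dev/null || true
  x=$((x + 1))
done

ls
```

Every entry gets a number whether or not it decodes cleanly, so the real pages keep their relative order even though junk files sit between them.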
It's going to take a little while because there are lots of files here, but then we're going to remove all the files that are not pages of the yearbook. So I'm going to say file and give it the asterisk here. You can see it's going to start listing all these files, but we only care about the ones that are JPEGs. So real quick, I'm going to run that again and pipe it into grep JPEG, in capitals. Now it's listing them all, and you can see there are different ones. It's telling us what the types are, and it's giving us information like the size of the image. I'm just going to kill that right there. I already know this because I've already looked, but you would have to look at one of these to figure it out: 1100 by 1514 is the page size. So we know that, theoretically, any image of that size in this directory is a page we want. So what I'm going to do, instead of grepping for JPEG, is grep for 1100x. Now it's just listing items that are pages of the yearbook. Well, I don't want the pages of the yearbook; I want all the files that are not pages of the yearbook. So I'm going to add -v. Now it's listing all the files that are not pages of the yearbook, whether they're JPEGs or not. And what I'm going to do here is pipe that into cut, -d for delimiter, backslash colon, field one. That will give us the file names of all the files that are not pages of the yearbook. From there I'm going to say xargs rm. That will now remove all those files, and we should be left with just approximately 150 or so images that are pages of the yearbook. Once it goes through and deletes all those files, I can list these out and even remove our HAR file. Yeah. And here I can pipe it into wc for a count: 151. I'm pretty sure there are 152 pages in this yearbook, and I bet that we missed the cover. Again, I think it's probably because it was scanned at a different size.
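The filter-and-delete step can be rehearsed safely before pointing it at real files. In this sketch the listing function is a hypothetical stand-in for the output of file *, with made-up names and sizes in the general shape that file prints; in a real run, file * would take its place at the head of the pipeline.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 100000.jpeg 100001.jpeg 100002.jpeg 100003.jpeg

# Stand-in for `file *` (hypothetical sample lines, not real output).
listing() {
  cat <<'EOF'
100000.jpeg: JPEG image data, JFIF standard 1.01, baseline, precision 8, 1100x1514, components 3
100001.jpeg: GIF image data, version 89a, 1 x 1
100002.jpeg: data
100003.jpeg: JPEG image data, JFIF standard 1.01, baseline, precision 8, 300x250, components 3
EOF
}

# Keep only the lines that do NOT mention the page width, take the file
# name (everything before the first colon), and delete those files.
listing | grep -v '1100x' | cut -d: -f1 | xargs rm

ls
```

Dropping xargs rm from the end first, and reading the list of names it would have received, is a cheap way to confirm the pattern before anything is deleted.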
So I'm going to say xdg-open . (dot), and that should open this directory in my file manager. There you can see all the pages of the yearbook minus that front cover. It's got the back cover and, theoretically, all the other pages. If I go back to the browser here, you can see we're at page 152. If I just go, boop, back to the first page of this yearbook, what I'm going to do is just manually grab this one. I'm going to say open in a new tab. I don't know why it's not showing up. Again, my guess is that it might have been a different size than all the other ones. Save image as, I'm going to go to this directory, and I'm just going to give it a name that should put it at the beginning of our list here. So there it is. Now that we have that, you have to have ImageMagick installed for the next step. I'm going to say convert, all the .jpeg files, to yearbook.pdf. It is now creating a yearbook, and again, it should be about 50 megabytes. If I list this out with du -h, you can see there's just over 100 megabytes in this directory: 50 for the images and 50 for the PDF, because it just put those pictures in the PDF. You can give convert other options for compression and stuff, but I just left them as the full JPEGs because they're already compressed from being on the web. And now if I say xdg-open yearbook.pdf, it will open up in my PDF viewer. Go to the first page, and I can go through here, and you can see it's 152 pages. We're on page three, and I can just start going through all the pages of this yearbook. And that is how you pull down yearbooks and put them in PDF files, or just use the images if you want to use the images. That is pretty much it. Again, if you go to classmates.com, no affiliation, you can create an account with any email address; just make yourself a junk mail account. Go in there, choose any high school that's on their list, and then look at their... So if I go to view all yearbooks.
Let's see, 2001, "do you remember this yearbook?" So you put in your state. I'll say California; I'm just picking random stuff here. Military Academy. They only have one year for this one. I was going to show you the list of all the years. Again, not very many. Let me just go ahead and pick Florida. Oops, that's not what I want to do. Florida, the bulls. I'll say again, Lely, Lely High School. And you can see they don't have every year, because it's just what people have scanned and submitted. But 1975, 76, 77, 78, 79, they don't have 78, 79, 80, 81, 82, 89. So you can go to any of these years and do basically what we just did. It's hit and miss; again, it's just whether they have them or not. But that is a great way to get a digital copy of your yearbook. So please visit filmsbykris.com. That's Kris with a K. There's a link in the description of this video. There you can search through all my videos and watch more videos like this. I hope that you enjoyed this and found it useful. I went through things pretty quickly, but I assumed that you knew some of these commands. And I get that if you're not used to scripting, something like this can look intimidating. But grep and cut and while loops and echo are commands that you would use daily once you start using the shell. And there are many different ways you could do this. I mean, I used grep and cut and a while loop. People who are good with awk or sed could probably have done that in one command. So thanks for watching. As always, I hope that you have a great day.
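As one illustration of that last remark, the grep and cut pair can be folded into a single awk call. This is a sketch of an alternative, not what was typed in the video, and the sample line is hypothetical.

```shell
#!/usr/bin/env bash
# One hypothetical HAR-style line holding a base64 payload.
sample=$(mktemp)
printf '    "text": "%s",\n' "$(printf 'page one' | base64)" > "$sample"

# The two-command version used in the walkthrough.
with_cut=$(grep '"text"' "$sample" | cut -d'"' -f4)

# awk does the matching and the field split in one step: -F'"' sets the
# double quote as the field separator and $4 is the fourth field.
with_awk=$(awk -F'"' '/"text"/ { print $4 }' "$sample")

echo "$with_cut"
echo "$with_awk"
rm -f "$sample"
```

Both variables end up holding the same base64 string, which can then go through the same decode loop as before.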