Hello and welcome. Occasionally I do little exercises just to practice, and the other day I started filming them; people seemed to like that, so I've kept doing it now and then. Recently I did a video on scraping images off Flickr. It was just a sped-up video of me playing around, not really meant to be a tutorial, but one of my Patreon supporters requested that I actually go over what I was doing in it. So that's what I'm going to work on today.

Here we are at Flickr, where I did a search for Linux. Now, I could use wget, like so, and output the page like this. That's what I originally did in the video, and it gives you the full HTML; I could then start picking through that and scraping some of the images. That works, but it's more complicated than it needs to be, because there's another way of doing it. It's also only going to give me the first couple of images on the page, because the rest are loaded by JavaScript, and wget isn't going to run that JavaScript.

Let me refresh this page to make sure we're working with a fresh page. Okay, I'm going to hit F12. I'm in Chrome, but all major web browsers nowadays have some sort of developer console. Hitting F12 opens it up, and I've clicked on the Network tab. This shows everything that loads on the page: every image, every file. As I scroll down, you'll see more images being loaded, which is great. I could hunt for this "load more results" request by looking through here and finding which file is loading it, and I could narrow that down with these different category filters. But for now I'm going to leave this on All, and click this little clear button, which clears everything out. Now I can load more results, and the first file that appears is going to be the request for new images.
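The wget-against-raw-HTML approach can be sketched like this. The HTML snippet below is a made-up stand-in for what `wget -q -O -` on the search URL would print; real Flickr markup differs, this just shows the scraping idea.

```shell
# Stand-in for the HTML that `wget -q -O - <search-url>` would print;
# the URLs and markup here are invented for illustration.
html='<img src="https://live.example.com/a_m.jpg"><img src="https://live.example.com/b_m.jpg">'
# Pull out each src attribute, then strip off the attribute syntax.
echo "$html" | grep -o 'src="[^"]*"' | sed 's/^src="//;s/"$//'
```

This prints one image URL per line, but only for images present in the static HTML, which is exactly the JavaScript limitation mentioned above.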
So I'm going to click that, and there you go: you can see this REST request. Images load as I scroll; when I click on this file and go to its Response tab, you can see there's some JSON returned that has all the information for the new images. That's what we're going to use.

Now, you might think I can just copy this address and use it with wget. So again: wget, output to the screen for now, this URL here. But we don't get anything back. That's because Flickr is using cookies, as a lot of websites do. wget can work with cookies, but I'd have to figure out where the cookie is set, run wget against that, save the cookie to a file, and then call this command again with the cookie loaded. That's fine, and if you're going to be writing a real program, that's the way to go. What I'm about to do instead is use a curl command that Chrome will generate for me, which includes the cookie for this session. That cookie will eventually expire, so it's no good for a script you'd actually distribute to people; this is just for fun. If you just quickly need to download a bunch of images, you can use this method, but not for a full program.

Anyway, if I right-click on the request and go to Copy, instead of "Copy link address" I can copy the response, but I can also choose "Copy as cURL". So I'll pick that, come over here, and paste it in. You see we get one big long command: basically everything Chrome passed to the server, turned into a curl invocation. We have the URL we're requesting, plus other headers: the origin we're requesting it from, what encodings we accept, and the user agent, which is Chrome in this case. So even though I'm using curl, the server is going to think I'm using Chrome.
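The shape of that pasted command looks roughly like the dry-run sketch below. Everything here is a placeholder: the endpoint, the cookie value, and the user agent string are invented, not real Flickr values, since the real ones come from your own browser session.

```shell
# Dry run: assemble a trimmed-down version of Chrome's "Copy as cURL"
# command without executing it. The endpoint, cookie, and user agent
# below are placeholders, not real Flickr values.
api_url='https://api.flickr.com/services/rest?method=flickr.photos.search&text=linux&page=1'
cookie='session=PLACEHOLDER'
agent='Mozilla/5.0 (X11; Linux x86_64) Chrome'
printf "curl '%s' -H 'cookie: %s' -H 'user-agent: %s'\n" "$api_url" "$cookie" "$agent"
```

The real pasted command has many more `-H` headers; the cookie and user agent are the two worth keeping when trimming it down.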
It doesn't know that I'm using curl. There are a few other headers, and then the cookie is the big one; that's what we want. There are a bunch of cookies in there, and I don't even know if you need all of them; I haven't gotten that far with this. But if I hit Enter now, I get that JSON output.

So let's play with this for a second. Actually, first let's trim it down a bit, because there's a bunch of stuff in here we don't need. The cookies start right here; you can see the header being passed is called "cookie", and I'm pretty sure that's all I need, plus the address. Now, some websites will detect curl or wget and reject the request, because they think you're a bot crawling them, which is basically what you are, so you might want to leave the user agent in there too. But here we go: I trimmed out a lot of that excess, and we still get our JSON output.

Now let's clean this up and get our images. There are programs out there that will parse JSON for you in a shell script, but I want to use commonly found tools, so if you're running this on a server somewhere, or some minimal system, you know you have them. First I'm going to use tr, and say this: find all quotation marks and turn them into newline characters. Since everything in the JSON has quotation marks around it, every piece of the file becomes its own line. At this point, we know we want JPEG files, so I can grep for jpg. There we go. You can see we get them, but you also see repeats: here's one image listed multiple times with different suffixes, q, m, t.
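Here's the tr-then-grep trick on a stand-in JSON fragment; the structure and URLs are invented, since real Flickr responses are shaped differently, but the technique is the same.

```shell
# A stand-in JSON fragment (invented structure and URLs).
json='{"photos":[{"a":"\/\/c1.example.com\/1_aa_q.jpg","b":"\/\/farm1.example.com\/1_aa_m.jpg"}]}'
# Every JSON value sits between quotation marks, so turning quotes into
# newlines puts each value on its own line; grep then fishes out the images.
echo "$json" | tr '"' '\n' | grep 'jpg'
```

Note the escaped slashes (`\/`) survive this stage; they get cleaned up later with sed.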
And those, I believe, are different resolutions, from when I was playing around with it the other day. To keep this quick and simple, let me look here. What I did in the video was take the last image here... I wonder what we get if... actually, now I'm thinking about other stuff. Let's go here. We don't need the ones with question marks, so real quick let's do a grep -v, which I didn't do the other day, on the question mark. The -v inverts the search: it finds every line that doesn't contain a question mark. And here we can see we have our images. It's also grabbing from different servers.

So let's find out which of these versions of an image is best. We've got 2, 4, 6, 8. I'm going to run that same command but add tail -n 8, so now it's just the last eight lines. Then I'll quickly pipe that into a while loop: while read url, do wget "$url", done. I probably should use curl, since I'm using it in the first half, just to be consistent, but I'll use wget. Some of these are going to come out with the same file name, so wget will append .1 to the repeats, but we'll see. Go ahead and... oh, I forgot my do: while read url, do wget.

Okay, we downloaded those eight images. Let's see them here. You can see they're all the same picture, just different sizes, cropped a little differently. The q looks like it's cropped, the s looks like it's cropped, so we're comparing t and m. Which one is the better option? Let's list out all the files in here. Again, they had the same file names, so wget automatically put a .1 at the end of any repeats. You can see that even though they're coming from different servers, they're the same size, and the biggest file, which I'd assume is the highest quality, is in this case m.
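The filter-and-download loop looks like this on an invented URL list; wget is swapped for echo here so the sketch runs without touching the network (drop the echo to actually download).

```shell
# Stand-in URL list; real ones come out of the tr/grep pipeline.
printf '%s\n' \
  'https://farm1.example.com/1_aa_m.jpg?extra=1' \
  'https://farm1.example.com/1_aa_m.jpg' \
  'https://farm1.example.com/2_bb_t.jpg' \
  'https://farm1.example.com/2_bb_m.jpg' |
grep -v '?' |        # -v inverts: drop lines containing a question mark
tail -n 2 |          # keep just the last couple, as in the video
while read -r url; do
  echo wget "$url"   # remove the echo to really download
done
```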
So we can assume the highest quality is going to be m. I'm going to remove everything in this directory and run my curl command again, but not download the images yet. Instead of grepping for jpg, I'll grep for _m.jpg, and then get rid of any lines with question marks. There we go, we have a list. But we're still getting duplicates, because the same image is served from two different servers, and I'm pretty sure they're identical. So we'll choose one of the two: these server names are c1, c2, and so on, where the others are farm1, farm2, farm3, whatever. We can now grep for any line that has "farm" in it, which I think is the quickest way to do this. These three grep commands in a row could actually be combined into one, but I'm going to keep piping them into each other, because that's just how I do it.

The only other thing is these backslashes, which are escape characters in the JSON. We don't need those, so let's remove them with sed and a substitute command. We want to remove backslashes, but we can't write just one; we have to write backslash backslash, because we have to escape the escape character, and we're replacing it with nothing. Now we should get URLs that are clickable in my shell, so I can right-click one of these and say Open.

And... that really doesn't look like the highest resolution. That looked pretty small, though it might have been a small picture to begin with. No, we can definitely get larger pictures than this. So let's go back: before we grep this out, let's change it to .jpg again, and let me run my wget command again. Quickly looking through these: forward, forward, forward. All of these are fairly low resolution. I was getting higher resolution the other day; I wish I could remember how. Let's run this again, like so.
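Narrowing to one size and one server family, then stripping the JSON escape backslashes, looks like this; the input list is a made-up stand-in.

```shell
# Stand-in escaped-URL list, as it would come out of the tr stage.
printf '%s\n' \
  'https:\/\/c1.example.com\/1_aa_m.jpg' \
  'https:\/\/farm1.example.com\/1_aa_m.jpg' \
  'https:\/\/farm2.example.com\/2_bb_q.jpg' |
grep '_m\.jpg' |   # keep only the m size
grep 'farm' |      # keep only one server family, to avoid duplicates
sed 's/\\//g'      # \\ matches one literal backslash; replace with nothing
```

The double backslash in the sed expression is the escaping trick described above: one backslash to escape the other.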
What other suffixes do we have? That one image might not exist at full resolution, and then by grepping for a single suffix I'd be cutting out all my other images; it's trial and error until you know how things work here. So: that one has an n, and this one doesn't have an n. If we start grepping for something like the n, we're going to lose the images that don't have it. I'm also wondering about these ones without any suffix at all; that might be the full resolution. So let me grep for jpg here... and can I just highlight this and go to Chrome? I'll paste it in and... I guess that worked. So it looks like the URL without the underscore suffix might be the original image. The other day I was downloading one with the underscore.

So here's what we can do, and this is different from what I came up with the other day. We want all lines that have jpg, but then a reverse grep to drop every line matching underscore, one character, dot jpg. I thought a dollar sign was a placeholder for one character... nope, that didn't work; that's not the wildcard. I think if I do this instead... okay, yeah, that's working. Let me add in the other letters, and I'll explain: z, b, that's looking good, and c. So what I'm doing here is a reverse grep, showing lines that don't match this: any line with an underscore, then one of these characters, then .jpg. I think I put in every suffix letter they had listed. That should give us all the images without those little suffixes, which I'm thinking might be the originals. So let's add our sed command back in, like so, to fix our URLs, and let me open up one of these. Let's see.
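That reverse grep can be sketched like this. In grep, `.` is the one-character wildcard (not `$`, which anchors the end of a line), and a character class pins down which suffix letters to drop; the letter list here comes from eyeballing the output in the video and may not be complete.

```shell
# Keep only URLs without an _<letter>.jpg size suffix (stand-in list).
printf '%s\n' \
  'https://farm1.example.com/1_aa_m.jpg' \
  'https://farm1.example.com/1_aa.jpg' \
  'https://farm1.example.com/2_bb_q.jpg' \
  'https://farm1.example.com/2_bb.jpg' |
grep -v '_[qmstnzbc]\.jpg'   # drop underscore + any listed letter + .jpg
```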
Yeah, it's not high resolution, but it's good for a quick little grab. Now let's take that command and put it in a script so we can play with it a bit. Again, this isn't a permanent solution, since the cookie will eventually expire. Let's grab all of this, open Vim, and I'll just name my script go.sh. Start with #!/bin/bash, paste, boom. To clean it up a little, I can put each command on its own line, like so, just so we can see them individually; they're still piping into each other. Okay, let's make sure it still runs: make it executable with chmod +x go.sh, then run it. There we go, we're getting our images.

Oh, something else we need to add in there: again, we're getting our two servers, so let's grep for only the ones on the farm servers. We'll add that grep, and there we go, we've got these clickable images, and now we can run through and download them.

Now, this is just for "linux" and we're getting quite a few, so let's run the command through wc -l for a line count: we're getting 94 images from that list. If we look through the command, right here is the text we're searching for, linux, and just before that you can see we're starting at page five, because the page auto-loaded as we scrolled and was asking for page five by the time we copied the request. If we want the first images, we set this to page one. There's also per_page: we have it set to 100 and got 94, which is pretty good. So let's try changing it to 200, doing the wc -l again to see how many we get. We're getting twice as many; it's taking a little longer, six seconds, eight seconds, nine. There we go: almost 200, with 197 images listed. So let's go back in here.
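The rough shape of go.sh is below. In the real script the first stage is the pasted "Copy as cURL" command, cookie and all; here a stand-in JSON response takes its place so the pipeline can run on its own. The page= and per_page= parameters live in that pasted command's URL.

```shell
#!/bin/bash
# go.sh, sketched: the real first stage is the pasted "Copy as cURL"
# command (cookie included); a stand-in response takes its place here.
# page= and per_page= in that command's URL control which results return.
response='{"photos":[{"u":"https:\/\/farm1.example.com\/1_aa_m.jpg"},{"u":"https:\/\/c1.example.com\/1_aa_m.jpg"}]}'
echo "$response" |
tr '"' '\n' |     # one JSON value per line
grep '\.jpg' |    # keep the image URLs
grep -v '?' |     # drop query-string variants
grep 'farm' |     # keep one server family's copies
sed 's/\\//g'     # strip the JSON escape backslashes
```

Counting the results is then just `./go.sh | wc -l`.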
Let's see what happens if we change the search to "computer" and run our command again without the count. I'll just grab one of these images, and we should see a computer now. There we go, we've changed our search. But something else I realized the other day: let's change "computer" to, say, "DOOM video game" and run that. We get nothing. And why is that? Because of the spaces, as you might have guessed. The fix is URL encoding: anywhere we want a space, we put %20. Now we should be able to run this script, and there's what looks like a Lego DOOM set.

Again, these aren't the highest resolution. If we want to go further (I didn't do this the other day, and I might be going down a rabbit hole right now), let me click on one of these. How big is this image? Let's go View Image... they're not letting me right-click the image here. Let's see. So: Save Image As... or F12, do our element select, and select this. Let's try something else: back to the Network tab, back to the search. Let's clear this out, click on all these images, and filter to images only. I'm going to find that image preview. That's it, right there. It looks like b is the biggest option, which I think is what I used the other day in my script. So we can go back to our script and change that, because I think we're still not pulling down the full-resolution images; I know I was getting bigger ones the other day. And that's why I do these exercises: I'm learning stuff, I'm practicing. What we can do is comment out this line, and grep for the _b suffix instead. Let's run that again and open one. There we go.
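Percent-encoding the spaces in a multi-word search before it goes into the URL can be done with a quick sed pass; %20 is the URL encoding of a space.

```shell
# Encode spaces in the search text so the URL stays valid.
search='DOOM video game'
encoded=$(echo "$search" | sed 's/ /%20/g')
echo "$encoded"
```

This only handles spaces; other special characters would need their own encodings, but spaces are the common case for a search phrase.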
That's a bigger image than we were getting, so b seems to be the larger size. Did I happen to grab the same one twice? No, there we go. I have no clue what images I'm bringing up, so I apologize in advance for anything. So yeah, b seems to be one of the larger sizes anyway, so that's better: we grep the _b suffix, and the code is a little cleaner too, without all those suffix characters to wipe out.

Now we can go back into our code and set it to download the results if we want. We could put the search into a variable at the top, or ask the user for it, but I'm not going to get into that; again, these cookies are going to expire eventually. I don't know if they're good for an hour, or 24 hours, or what. This is basically what I did the other day, just a little different. That's the thing when you do stuff like this: each time, you'll probably do it a little differently, and you learn a little bit more.

Let's add: while read url, do wget "$url", done. And another thing we should have put in here: a sort with unique, because I didn't do that, and it makes sure we hopefully don't get any duplicates. If I run that command now, I'm pulling down a bunch of Doom game images. Let's open the folder. We have our Tux penguin that we originally downloaded, but the rest of these don't look like Doom; still, these are the images that come up for a "DOOM video game" search. What if I kill that script (oops, kill it) and change my search from "DOOM video game" to just "DOOM", because that's what I did the other day? Save that, remove all the .jpg and .jpg.1 files, and go. This is why I didn't originally do this as a tutorial: it's just me playing around, not a prepared lesson. Let's see... there we go. Now we're getting more Doom-like images. Kind of. A Tux in there.
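Adding sort -u ahead of the download loop drops duplicate URLs; again echo stands in for wget here so the sketch is network-free.

```shell
# Stand-in URL list with a duplicate; sort -u keeps one of each.
printf '%s\n' \
  'https://farm1.example.com/1_aa_b.jpg' \
  'https://farm1.example.com/1_aa_b.jpg' \
  'https://farm1.example.com/2_bb_b.jpg' |
sort -u |
while read -r url; do
  echo wget "$url"   # remove the echo to really download
done
```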
Graffiti. Here are some Doom images; they all seem to be from the most current game. But that is how you can scrape some images from a website like this. Again, it was just a little practice exercise for me, but a Patreon supporter requested a video on it, and when my Patreon supporters ask for something, I do my best to deliver. So there you go. Please visit filmsbykris.com; that's Kris with a K, and there's a link in the description. I'll throw the script up on Pastebin and put a link in the description as well; obviously it can be improved. Thanks for watching, and as always, I hope that you have a great day.