Hello and welcome to my latest video on web scraping. These seem to be very popular, so I thought I'd do another one. We're going to grab some mug shots from the local sheriff's website. So I'm here, I search for Smith, and hit Search. It searches over the last ten years of records and brings up mug shots, names, and some other information. Since I'm searching for Smith, there are more than 100 results, so it's telling me I've reached the maximum number of records. But we have 100 mug shots here, and I just want to grab these thumbnails. Now, our process today is going to be partially in the web browser and partially scripting, so it's not fully automated. Obviously, if you're going to do something regularly, you'll want to script it out completely, but there are times when you just want to grab something quick, and doing it in the web browser with a little bit of scripting afterwards can be a good approach. That's what we're working with today. So again, our goal is to get these thumbnails. In the browser — I'm using Chrome, but this functionality should be available in pretty much all modern web browsers — you can hit F12, or if you don't have an F12 key, Ctrl-Shift-I, to bring up the Developer Tools. Make sure we click on the Network tab and have All selected. Even though we're only looking for images in this particular case, we'll select All so we capture everything. Now, notice nothing shows up at first: you have to refresh the page once you've opened the tools. Then, as things load, everything this page requests shows up down here.
You should be able to do this on any website and grab any pictures, videos, or sound bites — unless the site has some DRM going on, but most of the time you can get everything down here. One way to grab this information is to right-click, go to Copy, and there are a few options here. I'm going to do Copy All as HAR — that's an HTTP Archive. Switching over to the terminal, I'm going to quickly dump that into a file called 1.har: I paste it in, hit Ctrl-D, and now I can cat out 1.har, which is the information we just dumped. If I grep through it for PNG, we get some hits — though at that point we didn't quite know what we were looking for. Looking at the page, we know we want the entries that say pickTH (the thumbnail script), so I can grep for that instead. You would think I'd then be able to grab these URLs and open them, but in fact, if I right-click one and say Open Link, we get an error. That has nothing to do with our script; it's just how this particular website is designed — on a lot of websites that would work, and I'd have a list of all the images I could pull directly, as long as there's no login, session key, or anything like that required. You'll notice these aren't plain PNG files: they're ASPX files — ASP.NET pages, which are basically server-side scripts. So these images are actually being generated on the fly by a server-side script, which you'll see a bit more of once we pull them. Even if I right-click one of these and say Open in New Tab, we get an error, whereas on most websites you'd be able to view the picture that way.
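The terminal side of this step can be sketched like so. The sample archive below is a made-up stand-in for the real clipboard dump, which is a far larger JSON blob, and the pickTH pattern is specific to this site — substitute whatever pattern your target uses:

```shell
# Hypothetical stand-in for the pasted HAR dump -- a real one has one
# entry per request the page made
cat > 1.har <<'EOF'
{"log":{"entries":[
  {"request":{"url":"https://example.com/pickTH.aspx?id=1"}},
  {"request":{"url":"https://example.com/logo.png"}}
]}}
EOF
# Pull out just the thumbnail URLs; -o prints only the matching part
grep -o 'http[^"]*pickTH[^"]*' 1.har
```

On the real archive this gives one URL per thumbnail, which is only useful if the server lets you fetch them directly — which, as we just saw, this one doesn't.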
In fact, if I right-click on this thumbnail here and say Open in New Tab, let's see what happens. That time it does work, and I'm not really sure why, because it's pretty much the same info. Anyway, we're not going to worry about that — if for some reason the links don't work, that's okay, because we can pull all the information from this page itself. So, real quick — well, I'll leave that there so we can reference it — I'm going to right-click down in our Network tab again, where we have Save All as HAR with content. What's the difference? The first option grabs all the information about the requests being made: what time they were made, how long they took to load, what the responses were, what servers we're hitting, dates, and all that. Save All as HAR with content does all of that, but also grabs any media — anything on the page — and puts it in the file as Base64. So if there are images, sounds, or videos, we should get all of that in one file. I'm going to save that as 2.har; our first one was 1.har, so the second is 2.har. If I go into 1.har, you can see it's a JSON file we can read through for information about what was downloaded, but as we scroll through, it does not contain any of the Base64 — you'd see it if it did. For those of you not familiar with Base64: it takes any binary file — an executable, a program, a ROM file, an image, music, video — and converts it into plain ASCII, all characters you can type on a keyboard. This is very useful when you want to send things certain ways.
For example, attachments in your email are probably stored as Base64, and your email client converts them back to whatever file they're supposed to be. Now, if I vim into 2.har, it's a very similar file, but as we scroll through you'll see batches of Base64 data. Like right here — all this information is on one long line, so I can't really scroll through it. This is an image: not ASCII art, but data that has to be converted back into a binary file. If we list out the two files and compare them, one is about half a megabyte while the other is eight megabytes, because it actually contains all the images. We're going to pull those out today using a script — we've already downloaded everything we need; we just have to convert it back into the format we want. In this particular video I'm going to use grep, because grep is a very common tool on most Linux systems and can be installed on pretty much any operating system, though it's a little sloppy for this job. In the next video we'll look at a different tool I've talked about before, jq, to parse through the JSON properly. But real quick, here's what we're going to try: we're going to grep for image/png, because in the JSON file there's a MIME type that tells us what kind of file each section is. We want to search 2.har — it is JSON, but .har is the extension we gave it. And there we've got a list. We want to find those lines plus a line or two after each, so I'll do that same search but add -A 2, which prints each match and the two lines after it. When we do that, you can see we get some Base64.
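Here's a rough sketch of both ideas — the Base64 round trip, and the MIME-type search with -A 2. The fragment of 2.har below is made up just to have something to grep; a real HAR holds full entries:

```shell
# Base64 turns arbitrary bytes into plain ASCII and back
printf 'hello' | base64        # prints aGVsbG8=
printf 'aGVsbG8=' | base64 -d  # prints hello

# Made-up fragment of 2.har: the Base64 sits two lines below the MIME type
cat > 2.har <<'EOF'
        "mimeType": "image/png",
        "size": 9,
        "text": "UE5HIGJ5dGVz",
EOF
# Print each matching line plus the two lines after it
grep 'image/png' -A 2 2.har
```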
Now, if we go back into our file and look, you'll see the tag for the Base64 is a "text" tag. So we should be able to grep again for text, and that gives us just the lines with Base64. As we search for PNGs here, we're lucky that this website splits things up: the logos and such up top are actually JPEGs, while the mug shots are PNGs. If they were all PNGs, we'd just have to sort through them later after converting. But right now we've got our Base64 — except it still has the text tag, the quotation marks, and a trailing comma, and we want to get rid of those. If we look at the file again, there's a quotation mark, the word text, a quotation mark, then a colon and a space, and then more quotation marks around the data. So we'll use the cut command to split these lines into fields based on the quotation marks: field one, field two, field three, and our Base64 is field four. Now we take the command where we find all the image/png lines plus the two lines after each, filter for the lines containing the word text, and cut with the quotation mark as delimiter, keeping field four. That gives us just the Base64. If I then tail -n1, that gives us just the last one, and I should be able to pipe it into base64 to convert it back into what should be a binary PNG file. You'll see when I do that — sorry, I have to add -d for decode, otherwise it re-encodes it as Base64 again. There you go. But this is confusing, because we have some HTML here — let me scroll up a little. Look, there's our binary data. In fact, let me dump it into a file real quick; we'll call it 1.png.
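The whole pipeline so far, run against a made-up 2.har fragment — a real "text" field holds an entire PNG, while this one just encodes the string "PNG bytes" so the result is readable:

```shell
cat > 2.har <<'EOF'
        "mimeType": "image/png",
        "size": 9,
        "text": "UE5HIGJ5dGVz",
EOF
# MIME-type line plus the two after it, keep only the "text" line,
# split on double quotes and take field 4, keep the last match, decode
grep 'image/png' -A 2 2.har | grep 'text' | cut -d'"' -f4 | tail -n1 | base64 -d
```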
Then we'll open up 1.png and go to the top of the file. You can see it has a PNG header, followed by all the binary data for the PNG. But if we jump to the bottom of the file, there's some HTML in here. I don't know if this is a mistake or somehow on purpose — again, I think this image is being generated on the fly by a server-side script, and for some reason it's appending some HTML at the end. Luckily, most programs don't care: you can tack extra data onto the end of a file and most programs will ignore it. So if we use the file command, which looks at a file's header and tells us what the file is, on 1.png, it tells us right here that it's a PNG file, along with the resolution, the bit depth and color type, and that it's non-interlaced. So we know it's a PNG. In fact, I can use display, which is part of ImageMagick, to view it: display 1.png — and there is a very happy lady who was arrested. Very, very happy, as you can tell. Now, with image viewers such as display — not all image viewers can do this — let me clear the screen — we can pipe the output of the Base64 decode directly into display, so we don't even need to save it to a file. And display shows that same lady again. If I take this same command, and since I have the shuf command for shuffling input on my computer, which I do, I should be able to add shuf -n 1, and that should give us a random thumbnail from all the mug shots we downloaded. There's one; hit Q to quit, run it again, and each time it gives us a random one — that guy's actually happy. "I got arrested, yay." Anyway. But now we want to go through and dump all of these into individual PNG files.
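Picking a random thumbnail can be sketched like this. The two made-up entries below just encode the short strings "mug1" and "mug2" instead of real PNGs, so the decode goes to stdout here rather than into display:

```shell
cat > 2.har <<'EOF'
"mimeType": "image/png",
"size": 4,
"text": "bXVnMQ==",
"mimeType": "image/png",
"size": 4,
"text": "bXVnMg==",
EOF
# Collect every Base64 blob, shuffle them, keep one, decode it; in the
# video this output is piped straight into display (ImageMagick) instead
grep 'image/png' -A 2 2.har | grep 'text' | cut -d'"' -f4 | shuf -n 1 | base64 -d
```

Each run prints either mug1 or mug2 at random, which is exactly the behavior we see hitting Q and re-running against the real archive.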
So here at the beginning I'm going to say let x=0. What does that do? We're creating a variable called x and setting it to 0; the let part is just one way of saying this is an integer, a number we're going to do math on. Then we're going to go through all of this — we'll get rid of our shuf command, because we want every one — and pipe into while read, with a variable called m, for mug shot. We say do echo $m, pipe that into base64, and instead of displaying it, redirect it into a file we'll call $x.png — so the first time around this should be 0.png. Then we say let x++, which adds one to x each time we loop. I run that — it takes maybe five seconds to go through all 100 or so photos and extract them — and if we list the directory, you can see we have all these images. I can run the file command on *.png, and look, it says they're all PNGs, which means we were hopefully successful. So let's use my file manager to open up the current directory, and sure enough, we have all of our mug shots — looks like 100 of them, 0 through 99, besides the two that the Sheriff's Office had pulled. Now let me point out something real quick. Again, using grep is a little sloppy for this. I searched for image/png and then used -A 2 to get the two lines after it. When I originally tried this out a couple of weeks ago and wrote my notes for this video, I only had to use 1, because the Base64 was on the line right after the PNG line. Sometime between then and now, the order changed, so I had to use 2. That isn't ideal. So here's an alternate option that might get you better results if it changes again in the future: I can use -e here.
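The loop described above, sketched against the same kind of made-up two-entry archive (real entries carry whole PNGs in their "text" fields, and the loop syntax here is bash):

```shell
cat > 2.har <<'EOF'
"mimeType": "image/png",
"size": 4,
"text": "bXVnMQ==",
"mimeType": "image/png",
"size": 4,
"text": "bXVnMg==",
EOF
let x=0   # integer counter used for the output file names
grep 'image/png' -A 2 2.har | grep 'text' | cut -d'"' -f4 \
  | while read m; do                     # m holds one Base64 blob per pass
      echo "$m" | base64 -d > "$x.png"   # decode into 0.png, 1.png, ...
      let x++
    done
ls *.png
```

One design note: because the while runs inside a pipeline, it executes in a subshell, so x's final value doesn't survive past the loop — which doesn't matter here, since the numbered files are the output we care about.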
Let me go back to our single image — so again, there's one. Instead of only looking for lines that say image/png, -e means look for that pattern, and with a second -e we also look for lines that contain text. Right — is that what I wanted to search for? I forget; I should have written notes on this. I want that and that. Yeah, I want lines that say text. Then I can grep again for just the text lines. Oh, and we want this over here — we already have that there — I think that's what I want, against the .har file. Let's see if that works. Yeah. What that's doing is cutting out the middleman: even if there are five lines between the two, we're getting all the lines that say png and all the lines that say text, however many lines sit between them, and then I filter out the png lines. I hope that makes sense. That's a little more robust. Again, in the next video we'll go through parsing the JSON file properly, like you really should if you have the right tools — grep is just a very common tool, and this is good practice. But that's an alternate option. So I hope you found this useful. In the next video we'll get into it a little more, and then after that, another video. Again, this is not fully automated — we're grabbing stuff from the web browser, which is part of why I wanted to show you this, because it's very useful if you're on a web page with lots of pictures or files on it. As long as the files are actually loaded in the page — not just linked from it — you can dump that content into a HAR file and pull them out like we did here. I have found that very useful in many cases. So I do thank you for watching. Please visit Filmsbychrist.com.
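The -e variant can be sketched like so. The made-up fragment deliberately puts an extra line between the MIME type and the Base64, to show the match no longer depends on the spacing:

```shell
cat > 2.har <<'EOF'
"mimeType": "image/png",
"size": 4,
"comment": "",
"text": "bXVnMQ==",
EOF
# Match either kind of line, however far apart they sit, then keep just
# the Base64-bearing "text" lines and take field 4
grep -e 'image/png' -e '"text"' 2.har | grep '"text"' | cut -d'"' -f4
```

One caveat: in a full HAR this would also pick up the "text" fields of non-PNG entries, since the second pattern matches them regardless of MIME type — part of why jq is the better tool for the next video.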
That's Chris McKay, and I hope you have a great day.