Hello and welcome to a tutorial from filmsbykris.com. I'm Kris, Kris with a K. Today we're going to be looking at pulling information down from websites: scraping web pages from the shell using curl or wget, on both static and dynamic pages.

So what does that mean? When you go to a web page, things load, and it could just be plain HTML, which is great. You curl or wget the URL, it gives you all the text, and then you can cut through it. But most of the time these days you have JavaScript in action on the page, which may seem a little tricky for scraping in some cases. Really, though, this is better in many ways.

First of all, here's my website. Now, my website is actually set up so that if you use curl or wget to grab it, it leads you to a text-based script where you can search through stuff, and it gives you directions and so on. Most pages aren't going to be that nice to you. But having a dynamic page with JavaScript lets you do stuff like this: I can type in Linux and it will filter to all my videos that say Linux, or Bash, or Doom. If we didn't have JavaScript, you would have to reload the entire page every time; results wouldn't show up as you type, so you'd have to type, hit enter, and wait for the page to reload. So although a lot of people would argue with you about it, dynamic pages are a great thing. They let you do so much more. And when it comes to web scraping, they actually make things easier, if the page is designed properly, and in most cases it will be.

Today I'm going to show you some very basic examples that I created. In the next video we'll go through a real-life scenario that a friend asked me about. If you go to filmsbykris.com/scripts/2020/scraping (I'll try to remember to put a link in the description of this video), you'll see three files. Pretend you don't see the names file; on a real website you wouldn't, because in most cases there won't be an index like this. That leaves 01_dynamic and 01_static.

Let's start with the static page. Again, this is a very, very basic example, which makes it very simple to scrape; real scraping can get complex. The page just shows a list of names. Look at the HTML: the page loads, we have div tags, and here's the list of names, each with a class of "name". This is going to be very easy to scrape.

So I'm going to grab the URL, copy it, and go to my shell. I'll use curl, but you can use wget as well. wget by default saves to a file, so you have to add -O - for standard output, where curl prints to standard output by default; they just do things in different ways. I run that and get the HTML, then pipe it into grep and grep for "name". Then I pipe that into cut, with -d for delimiter, using the greater-than symbol as the delimiter and taking field two, and then pipe it into cut again. There are lots of different ways to do this, and there are probably cleaner ways with sed or awk, but I'm just piping it through cut. And there's my list of names; that's my command. So that was a static page: very easy to pull down, but then you have to cut through all of it.
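Roughly, that pipeline looks like this. This is a minimal sketch: the exact file name and the class attribute I'm matching are my guesses from what's on screen.

    # Static-page scrape: fetch the HTML, keep the lines that carry the
    # "name" class, then cut the text out from between > and < of the div.
    curl -s 'http://filmsbykris.com/scripts/2020/scraping/01_static.html' \
      | grep 'class="name"' \
      | cut -d '>' -f 2 \
      | cut -d '<' -f 1

    # Same thing with wget; -O - sends the page to standard output and
    # -q silences the download progress chatter:
    wget -qO - 'http://filmsbykris.com/scripts/2020/scraping/01_static.html' \
      | grep 'class="name"' \
      | cut -d '>' -f 2 \
      | cut -d '<' -f 1

This only works because every name sits on its own line, in the form <div class="name">Somebody</div>: the first cut keeps everything after the opening tag's >, and the second cut drops the closing tag.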
And if they change the design or look of their web page, it could throw your script out of whack, because things may be in different places or labeled differently. Plus, this page was very easy to grep for "name"; in a lot of cases the elements on the page won't have names like that, and you'd have to figure something out. And lots of times you'll have the div tags on their own lines with the content in between, so instead of just cutting, I'd have to find the tag, grab the line after it, then tail out the bottom line. It can get very sloppy.

Let's look at a dynamic page. I'm going to clear this out, go back to my web browser, and open the dynamic page. Looks the same, right? But look at the code of the page. Ah, see: there are just a few lines of JavaScript that request a file, and in this case it's a very basic text file. In most cases you're going to get JSON output, which is great as well, especially if you're looking for something more than a basic list of names; again, we'll look at that in a more realistic scenario in the next video. Notice the list of names isn't here, because this is all the HTML there is. The JavaScript has to run, and curl and wget are not going to run that JavaScript for you. There are tools out there (I like using PhantomJS, and there are others for Python) that will actually render the page in a browser, sometimes an invisible one, and give you the rendered output, which is useful sometimes. But in most cases it's not needed, and there's a simpler way.

So let's look at this. If I go back to the dynamic page, grab the URL, and curl it, I get this, and again it doesn't have the list of names; I'll put that in command form below. Now, on this page I can look at the code and see right away what's going on, but on a real page you might have a whole lot of JavaScript that's hard to go through. You don't have to go through all that. What I'm going to do is open the developer tools in my browser. I'm using Brave, which is a Chromium-based browser, but Firefox and almost all web browsers nowadays have a developer console. F12 is the common key to open it; if you don't have an F12 key, like on my Chromebook, Ctrl+Shift+I will open it. It might be different in other browsers, but you should be able to get to it from the menu under "developer tools" or whatever it's called.

Down here we're going to choose Network, and I'm going to leave the filter on All at first. Now I'm just going to refresh the page, and you can see some things loading here. Actually, let me demonstrate on filmsbykris.com. I'll open the developer tools the same way, and notice nothing is listed, because you have to have this open while stuff is loading for it to show up. But if I come in here, wipe this out, and type in bash, you can see it's making requests, and I can click on one of them and see the output: that's the preview, and there's the actual response. Back on our basic example, the server is just passing plain text and the page splits it on lines, whereas on most sites you'll get JSON. Now, you'll get lots of stuff loading here, especially as you type; you might be loading multiple things at once.
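Here's that dead end in command form, using the example URL from this walkthrough (the exact file name is my guess). And since I mentioned PhantomJS: headless Chromium, shown second, is a similar render-it-for-you option.

    # Fetching the dynamic page directly returns only the HTML shell;
    # curl never runs the JavaScript, so the names are not in the output:
    curl -s 'http://filmsbykris.com/scripts/2020/scraping/01_dynamic.html'

    # If you genuinely need the page rendered, a headless browser can
    # dump the final DOM (the binary name varies by distro):
    chromium --headless --dump-dom 'http://filmsbykris.com/scripts/2020/scraping/01_dynamic.html'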
One way to narrow this down: unless you're looking for images or something like that, most of the time the data you want will be under the XHR filter. You click that and, look, it narrows it down. I can click on this request, click Preview or Response, and ask: is that the information I'm looking for? Yes, it's the list of names. So all I have to do is right-click it, say Copy link address, and now I can curl that address and get the list of names. I didn't need to pipe it into grep or cut or anything like that; I got the list of names, and that's what I wanted. And a lot of times you'll get JSON instead of plain text, in which case you can use jq, which I've talked about in the past and which we'll look at in the next video.

So we have two scenarios. We have the first one, curl the URL, grep, cut, cut; that's a very basic example, and if the web page changes it might break. Obviously if they change the back end things can break too, but lots of times if they just change the look of the page, they'll leave their back-end scripts the same, so it will still be easy to grab the information the other way. So I can have this long command, which really isn't that long but could be a lot longer, or I can just request the data and get the list of names I want. Instead of scraping, simply requesting the information from the website is a lot cleaner and easier, and it's not that hard to find.

Again: if the information loads when the page loads, make sure you have the Network tab open and then hit refresh. In this case I can also clear it out and start typing. I type, you can see it gave me one request back, and right away I know that's what I'm looking for; there's the information. And again, clicking XHR narrows it down: if I set the filter to All and refresh this page, you can see lots of stuff loading, and when I type in bash there are a few things loading. Obviously it's these ones here, but on some pages it may be a little harder, so XHR it is. I click the request, check the Preview or Response to confirm it's the information I'm looking for, and right-click it. There are several copy options in that menu: Copy link address, Copy as cURL, Copy all as HAR (which I guess copies everything in the list), and so on.

One more thing to look at here. I can copy this URL and curl it, but sometimes the page is passing POST variables, or there's session information. Either way, if you're trying to scrape a page that requires a login or some sort of session ID, you're going to want that information. Well, if you come in here on any of these requests and say Copy as cURL, I can paste that into my shell. It's going to be longer than it needs to be, and in this case the extra information isn't needed, but it will paste in your cookies, your session ID, all of that, and you will get the information you're looking for. That Copy as cURL is very useful when cookies and session keys are involved. Just remember that session keys and the like expire: if you use that in a script, the script may not work an hour or a day from now. But if you're just trying to pull stuff down quickly, it's a quick and easy way to do it.

What else did I want to mention? So, yeah: scraping the page can be sloppy, while just requesting the information is clean. There are pages that make this hard, but that's rare.
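Putting those pieces together, here's a minimal sketch. The data-file URL follows the example index from earlier; the JSON endpoint, its field name, and the cookie value are made up for illustration.

    # 1) Request the file the page's JavaScript loads; no parsing needed:
    curl -s 'http://filmsbykris.com/scripts/2020/scraping/names'

    # 2) If an endpoint returns JSON instead of plain text, jq can pull
    #    out just the field you want (endpoint and field are invented):
    curl -s 'http://example.com/api/people' | jq -r '.[].name'

    # 3) "Copy as cURL" reproduces the browser's request, headers and
    #    cookies included (values below are placeholders). Session
    #    cookies expire, so a saved command can stop working later:
    curl 'http://example.com/api/names' \
      -H 'User-Agent: Mozilla/5.0' \
      -H 'Cookie: session_id=PLACEHOLDER'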
Google pages, for example: if you try to pull information from Google, they try to hide it, and in my opinion that's horrible programming, because you're working harder at trying to hide stuff and adding extra code that you don't need. On most websites, though, coming in here and grabbing the information you need is super simple.

Now, I have seen, and back in the day I have done, something very, very poor and ugly. Most of the time when you request information through JavaScript like this, you're going to get properly formatted information: plain text like this, most of the time JSON, maybe some XML. But occasionally you get a programmer who generates the HTML on the server side and spits that back out to your web browser. Horrible. I used to do that. It's rare, but you see it occasionally, and then you're pulling the information down and you still have to scrape it, which is just sloppy. As a programmer, that's not good, because if you do it on that side, then when you want to change the look of the page you have to change the server side too, instead of having consistent data.

For example, going back to my website, filmsbykris.com: say I wanted to change the look of this page. Let me go back to an older version. Is five still around? No. Six? OK, here's my old web page. See, I changed the look of it, but if I type in bash it's still searching through and getting the information (look at me walking across the screen, I forgot about that). Actually, it looks like on this old page I did exactly what I was just warning about: I output HTML instead of plain text. Oops. But when I create a new version of my page, I don't change that back end. Now I'm getting intrigued by my old coding: I can still use this same link to request that information for my new site. So I can change the way my page looks but keep using the same back end to pull the information I want, where if I were outputting HTML I'd have to change both my front end and my back end.

Anyway, that's just a little bit of advice if you're designing something: I would advise against generating HTML on the server side. There are people out there who will argue that you should do everything on the server side and output static pages, and that does make you backwards compatible with older operating systems. I shouldn't say older computers, because a computer from 20 years ago can have new software installed on it; but if someone is running Windows 95 or Windows 98, it's not going to recognize the JavaScript, so you won't be backwards compatible with them. OK, in that case, make your page completely static. I see more of a concern with forward compatibility, making sure what I'm doing now is still compatible in the future. You can't always be backwards compatible; it should be a goal, but it's not always possible, because you're trying to add new features that didn't exist back then.
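For what it's worth, here's a minimal sketch of the plain-data style I'm recommending: a toy bash CGI script. Every path and file name in it is made up, and it skips URL decoding for simplicity.

    #!/bin/bash
    # Toy CGI back end that returns plain text, so any front end,
    # or a curl user, can consume it directly.
    echo "Content-type: text/plain"
    echo ""
    # Filter a flat list of names by whatever the page sent as the
    # query string (real code would URL-decode and sanitize this):
    grep -i -- "$QUERY_STRING" /var/www/data/names.txt

The sloppier alternative would wrap each line in markup on the server, something like piping through sed 's|.*|<div class="name">&</div>|', which ties the back end to one front-end design and forces scrapers to strip HTML all over again.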
Anyway, I do thank you for watching. Again, filmsbykris.com, that's Kris with a K, should be linked in the description, or you can go to Software, click on Scripts, then 2020, then scraping, and you'll see the code from today. And again, you can see the names file right there, but in most cases you're not going to have an index like that. Thanks for watching, please visit filmsbykris.com, link in the description along with my Patreon link, and I hope that you have a great day.