Hello, everybody. In this lesson, we're going to be scraping data from a real website and putting it into a pandas DataFrame, and maybe even exporting it to a CSV if we're feeling a bit spicy. Now, in the last several lessons, we've been looking at this page right here, and I even promised that we were going to be pulling this data. But as I was building out the project, I honestly thought it was a little too easy, since in the last lesson we already pulled some information out of this table, and I want to throw you guys off a bit. So we're going to be pulling from a different table. We're going to go to Wikipedia and look at the list of the largest companies in the United States by revenue, and we're going to pull all of this information. So if you thought this was going to be an easy little mini project, it's now a full project, because why not? So let's get started. What we're going to do is import BeautifulSoup and requests, get this information, and see how we can do this. It's going to get a little more complicated, a little more tricky. We'll have to format things properly to get them into our pandas DataFrame so it looks good and is more usable. So let's go ahead and get rid of this easy table; we don't want that one. And we're going to come in here and just start off. This should look really familiar by now. We're going to say from bs4 import BeautifulSoup. I don't know if you've noticed, but I've misspelled BeautifulSoup in every single video. I've noticed. Let's run this. And now we need to go ahead and get our URL. So let's come up here and get our URL. We'll say URL is equal to, and we'll keep it all in the same cell, because we know this by heart by now, right? We'll say requests.get and then the URL, to make sure that we're getting that information and it gives us a response object.
Hopefully it'll be 200; that'll mean a good response. Then we'll say soup is equal to BeautifulSoup, and we'll pass in page.text, so now we're pulling in the information from this URL, and then our parser, which will be the HTML parser. Let's go ahead and run this. Looks like everything went well. Let's print our soup. Now, this page is completely new to you; it's completely new to me. I don't know what I'm doing. But it looks like we're pulling in the information, am I right? So we've got a lot going for us: the stuff was imported properly, we got our URL, we got our soup, which is not beautiful, in my opinion. But let's keep on rolling. Let's come right down here. Now what we need to do is specify what data we're looking for. So let's inspect this webpage. Now, the only information that we're going to want is right in here: we're going to want these titles, or these headers, so rank, name, industry, et cetera. And then we are for sure going to want all of this information. Let's just scroll down and see if there's anything tricky in here. All right, that looks pretty good. And there is another table, so there's not just one table on this page; there are two tables. That might change things for us. But let's come right back and inspect our page using this little button right here, and let's see if I can highlight just this table. Oh, that's not good. Oh, let's do that right there. So now we have this wikitable sortable jquery-tablesorter class. Now, I'm actually going to come right here, and I'm going to copy, and I'm just going to say copy the outer HTML. I'll paste it in here real quick, and that's a ton of information; I didn't think it was going to copy all of it. We're just going to delete that. I just wanted to keep that class, because I wanted to come right down here to the bottom and see what this table looks like.
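What we've typed so far can be sketched like this. The URL is the Wikipedia page from the lesson; the timeout and the offline fallback snippet are my additions so the sketch still runs without a network connection:

```python
from bs4 import BeautifulSoup
import requests

# The Wikipedia page from the lesson
url = ("https://en.wikipedia.org/wiki/"
       "List_of_largest_companies_in_the_United_States_by_revenue")

try:
    page = requests.get(url, timeout=10)
    page.raise_for_status()              # raises unless we got a good response (200)
    html = page.text
except requests.RequestException:
    # Offline stand-in so the parsing step below still has something to chew on
    html = "<table class='wikitable sortable'><tr><th>Rank</th></tr></table>"

soup = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in parser
print(soup.find("table") is not None)      # True: the page contains at least one table
```

Passing "html.parser" explicitly avoids the parser-guessing warning you get when you just hand BeautifulSoup a bare string like "html".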
I don't know if it's part of the first table or if it's its own table; I can't tell. Let's look at this rank, and let's come up. So it says it's under this table. It looks like it's its own table, and it says wikitable sortable jquery-tablesorter, the same wikitable sortable jquery-tablesorter as before. So it looks like there are two tables with the same class, which shouldn't be a problem if we're using find to get our text, because find should take the first one, which will be this table, and this is the table we want. And if we wanted the other one, we could use find_all, which gives us a list, and use indexing to pull that table out. But I think we're going to be okay just pulling in this one. So let's go ahead and do our find. We'll do soup.find, and we could use find_all, or we could just do find on 'table'. Let's just try this and see what we get, and if it pulls in the right one that we're looking for, that would be great. Now, this does not look correct at all. I don't know what table it's pulling in. Oh, maybe it's this right here. This might be a table. Yeah, it is. We have this "more citations needed" box. So actually, we are going to have to do exactly what I was talking about. Let's pull this, and what we could do is pass a comma and the class right here. And let's do both; you know what, this is a learning opportunity. Let's do both. So let me go back to the top, because I need these, and we're going to come right down here. I want to add in another thing. Actually, I'll just push this one up. There we go. So we're going to say find_all. Let's run this. So now we have multiple tables, and again we got that weird one first. But if we scroll down, here's our citations box, and then here's our wikitable sortable, and then we have rank, name, industry, all the ones we were hoping to see. And I guarantee you, if we scroll all the way to the bottom, we're going to see potentially Wells Fargo, Goldman Sachs. I'm pretty sure those are, let's see.
Yeah, here we go: Ford Motor, Wells Fargo, Goldman Sachs. That's this table right here. So now we're looking at the third table, but again, this is a list, so we can use indexing on it. And we'll choose not position zero, because that's this citations box right here, which we did not like; we'll take position one instead. Let's run this. Let's go back up to the top, and this is our table right here: rank, name, industry. This is the information that we were actually wanting, just to confirm: rank, name, industry, et cetera. So this is the information we want, and we were able to specify that with our find_all. We now want to make this the only information that we're looking at. So I'm just going to copy this. We didn't need to use our class for this one, but we probably could have. So let's actually put this right down here; this will be our table, and we'll say it's equal to. But then I'll come right here, and, just for demonstration purposes, I'm going to say soup.find, then 'table', comma, class_ is equal to, and then we'll paste in this class right here. Whoops, do this. And let's see if we get the correct output. Let's run this, and it looks like we're getting a NoneType object. If I remember right, the actual class is this right here, so let's run this instead. And I've got to get rid of the index. There we go. Okay, so we were able to pull it in just using find. So the find on 'table' with the class says wikitable sortable; at least, that's the HTML that we're pulling in right here. Let me go back, because I don't know if that's what I was seeing earlier. Let's just find this rank. Let's go back up. Where's the rank? There we go. So here's our rank, and let's go up to the table, and there's our class. Yeah, and to me that's a little bit odd, because it says wikitable sortable jquery-tablesorter right here.
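The two ways of landing on the right table that we just tried, indexing into find_all versus find with the class, can be sketched with a miniature stand-in for the page. The HTML snippet here is made up to mirror the real structure: a citations-box table first, then two tables sharing the wikitable sortable class:

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page: a notice-box table first,
# then the two tables that share class "wikitable sortable"
html = """
<table class="box-More_citations_needed"><tr><td>notice</td></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th><th>Name</th></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th><th>Profits</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: find_all returns a list, so we can index past the box at position 0
table = soup.find_all("table")[1]

# Option 2: find with the class; note the trailing underscore on class_,
# since plain "class" is a reserved word in Python
same_table = soup.find("table", class_="wikitable sortable")

print(table is same_table)  # True: both land on the first data table
```

Either approach works here because find returns the first match in document order, which is exactly what indexing with [1] picks out.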
But in the actual Python script that we're running, it was only pulling in wikitable sortable; it wasn't pulling in the jquery-tablesorter part. Why, I'm not 100 percent sure (as it turns out, that extra class gets added by the page's JavaScript in the browser, and requests only sees the raw HTML). But these are all things we're working through, and we were able to figure it out. So we're going to make this our table. We'll say table is equal to soup.find, and let's run this. And if we print out our table, we have this table. Now this is the only data that we're looking at. Now, the first thing that I want to get is these titles, or these headers, right here. That's what we're going to get first. So let's go in here. If we look at this information, you can see that these are within these th tags, and we can pull those th tags out really easily. Let's come right down here. We're just going to say 'th', and we can get rid of this. Let's run this. Now these are our only th tags, because everything else is in tr tags for the rows of data. So these th tags are pretty unique, which makes it really easy, which is really great, because then we can just say world_titles is equal to that. Now we have these titles, but they're not perfect, so what we're going to do is loop through them. So I'm going to look at world_titles, and I'll walk through what I'm talking about. It's a list, and each item is within these th tags: th, and then there's the string that we're trying to get. So we can easily take this list and use a list comprehension, and we can do that right down here. I'm going to keep this where we can see it. We'll do world_table_titles is equal to, and now our list comprehension, which should be super easy. We'll just say for title in world_titles, and then what do we want? We want title.text. That's it.
Because we're just taking the text from each of these: we loop through and get rank, loop through and get name, loop through and get industry. That's it. So let's print our world_table_titles and see if it worked. And it did. This looks like it needs to be cleaned up just a little bit, so let's go ahead and do that while we're here, before we actually put it into the pandas DataFrame. Oops, I just wanted this, actually. So what we're going to do is try to get rid of those backslash-n newline characters. If we just do .strip on the result, that may not work; yeah, because it's a list. What we can do instead is .text.strip() right here, inside the comprehension. Let's try it in there. There we go. So now we have this, and this world_table_titles is good to go. Now, I'm actually noticing one thing that may be odd. Yeah, so we have rank, name, industry, through to headquarters, but then after that we're getting rank, name, industry again and then profits, which is from this second table right here, which we don't want. Let's scroll back up and backtrack to see where this happened. We did find_all on table, and we're looking at the first one, right? Then we're getting headquarters. So we're printing table. Okay, I think I found the issue, so let's backtrack again. We're working through this together; we're going to make mistakes. The table is what we actually wanted to pull from, but we did soup.find_all('th'), which is going to pull in the th tags from that secondary table too. Oh, geez, we were not thinking here. So we need to do find_all on the table, not on the soup, because we were looking at all of them. What a rookie mistake. Okay, let's go back. Now let's look at this. Now it ends at headquarters. Okay, let's go ahead and run this. Now we just have headquarters at the end. Now let's run this. Now we are sitting pretty. Okay, excuse my mistakes.
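Here's a sketch of the header-extraction step with the strip fix folded in, again using a made-up snippet in place of the real page. Note that the find_all runs on table, not on soup, which was the fix:

```python
from bs4 import BeautifulSoup

# Made-up snippet: Wikipedia leaves a newline inside each header cell,
# which is where those "\n" characters were coming from
html = """
<table class="wikitable sortable">
<tr><th>Rank
</th><th>Name
</th><th>Industry
</th></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

world_titles = table.find_all("th")  # on the table, not the soup!
world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)  # ['Rank', 'Name', 'Industry']
```

Calling .strip() on each cell's text inside the comprehension is what cleans the newlines; calling it on the list itself would fail, since lists have no strip method.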
Hey, listen, if it happens to me, it happens to you, I promise you. This is a little project we're creating here, so we're going to run into issues, and that's okay; we're figuring it out as we go. Now, what I want to do before we start pulling in all the data is put these headers into our pandas DataFrame. We'll have the headers there ready to go, so we won't have to deal with that later, and it just makes things easier in general, trust me. So we're going to import pandas as pd. Let's go ahead and run this. And now we're going to create our DataFrame. We have these world_table_titles, so what we're going to do is pd.DataFrame, and in here we'll say columns is equal to the world_table_titles. And let's just go ahead and say that's our df, and call our DataFrame right here. Let's run it. There we go. So we were able to extract those headers, the titles of these columns, and put them into our DataFrame. So we're set up and ready to go; we're rocking and rolling. The next thing we need, let's go back up, is to start pulling in this data right here, so we have to see how we can pull this data in. Now, if you remember, we have those th tags; those were our titles, as you can see as I highlight over them. But down here, we now have these td tags, and those are all encapsulated within a tr tag. These tr tags represent the rows, and the td tags represent the data within those rows. So r for rows, d for data. So let's see how we can use that to get the information that we want. Let's go back up here. I'm just going to take this, because again, we're only pulling from the table, not the soup. Not the soup! What were we thinking? And let's go ahead and look at tr. Let's run this. Now, when we're doing this tr, these do come in with the headers.
So later on, we're going to have to get rid of those headers; we don't want to pull them in and have them as part of our data. But if we scroll down, there's our Walmart. We have the location. These are all within these td tags, and then, of course, separated by a comma, we have our tr number two. So above we had row one, then row two, row three, all the way down. Now, we will easily be able to use this, right? Because this is our column data, and we can even call it that: column_data is equal to this. We'll run that. And what we're going to do is loop through it, because it's all in a list. So we're going to loop through that information, but instead of looking at the tr tag, we're going to look at the td tag. So let's come right down here. We'll say for row in column_data, and we'll do a colon. Now we need to loop through this. We'll do something like row.find_all, and what are we looking for? We're not looking for the tr; we're looking for the td. And just for now, let's print this off and see what it looks like. Apparently I didn't run this column_data cell; that's why. And let's run this. And what we actually need to do is something almost exactly like what we did before, and I'm going to put it right below it. Instead of printing this off, because again, this is all in a list, we're using find_all, so we're printing off another list, which isn't actually super helpful. For each of these rows of data that we're pulling in, what we can do is call this the row_data, and then we'll put the row data in here. So we'll say for data in row_data, and we'll take the data and exchange that. And now, instead of world_table_titles, we can change this into individual_row_data, right? And now let's print off the individual_row_data. It's the exact same process that we were doing up here; that's how we cleaned it up and got this.
And we may not need strip, but let's just run this and see what we get. There we go. And strip, I'm sure, was helpful. Let's actually get rid of it. Yeah, strip was helpful; it's the exact same thing that happened last time. So let's keep it. Let's run this. And now let's just glance at this information and look through it. This looks exactly like the information that's in the table. Let's just confirm with this first one: 572,754, 2.4, 2,300,000 in the output, and 572,754, 2.4, 2,300,000 in the table. So this looks exactly correct. Now we have to figure out a way to get this into our DataFrame, because again, these are all individual lists. It's not like we're putting all of this in at one time; we can't just take the entire table and plop it into the DataFrame. We need a way to put this in one row at a time. Now, if you're just here for web scraping and you haven't taken my pandas series, that's totally fine; that's not what we're here for anyway. But what we can do is take our individual_row_data and put it in one row at a time. The reason we have to do that is because when we had it like this, and let's go back, when we had it like this, it's printing out all of it. But what it's really doing, and let's get rid of this, is going one row at a time, and it's only going to hold the current row of data, this last one, as it loops through. So what we actually want to do is, every time it loops through, append that row of information onto the DataFrame. So as it goes through, and eventually it's going to end up with this one, but as it goes through, let's run this: it puts this one in, then the next time it loops through, it puts this one in, and the next time it loops through, et cetera, all the way down.
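The row-extraction loop as it stands, before any appending, can be sketched with a tiny stand-in table. One detail worth noticing in the output: the header row has th cells but no td cells, so it comes back as an empty list:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the table: one header row, then two data rows
html = """
<table>
<tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>Walmart</td></tr>
<tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

column_data = table.find_all("tr")          # every row, header row included
for row in column_data:
    row_data = row.find_all("td")           # only the td data cells
    individual_row_data = [data.text.strip() for data in row_data]
    print(individual_row_data)
# []                  <- the header row has no td cells
# ['1', 'Walmart']
# ['2', 'Amazon']
```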
So let's see how we can do this. We have our DataFrame right here. Let's get rid of this and bring our DataFrame in. Now again, like I just mentioned, if you don't know pandas and you haven't learned it, go take my series on that; it's really good, and we do something very similar to this in that series, so I'm not going to walk through the entire logic. But there is something called loc, which stands for location when you're looking at the index on a DataFrame, and we're going to use that to our advantage. We're going to take the length of the DataFrame, so we're looking at how many rows are in the DataFrame, and then we're going to use that length when we put in the new information. Pretty, pretty cool. We're going to say df.loc, then a bracket, and put in that length. So we're checking the length of our DataFrame each time it loops through, and then we're putting the information in the next position. That's exactly what we're doing. Let's go ahead and set it to the individual_row_data. So let's just recap: we're looping through these tr tags; that's our column_data, our rows of data. Then, as we're looping through, we're doing find_all and looking for td tags; that's our individual data, our row_data. Then we're taking each piece of data, getting out the text, and stripping it to clean it up, and now it's in a list for each individual row. Then we're looking at our current DataFrame, which has nothing in it right now, taking its length, and appending each row of this information into the next position. So let's go ahead and run this. It's working. It's thinking. And it looks like we got an issue: cannot set a row with mismatched columns. So now we're encountering an issue, not one that I got earlier, but we're going to cancel this out.
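That error is worth a quick look in isolation. Assigning a list to df.loc[len(df)] only works when the list length matches the number of columns; a minimal sketch with a made-up two-column frame shows both the working case and the failing one:

```python
import pandas as pd

df = pd.DataFrame(columns=["Rank", "Name"])

df.loc[len(df)] = ["1", "Walmart"]   # two values for two columns: appended fine
print(df)

try:
    df.loc[len(df)] = []             # an empty list, like the header row produced
except ValueError as e:
    print(e)                         # pandas complains about mismatched columns
```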
We're going to figure this out together. So let's print off our individual_row_data and look at this. This first one is empty; I'm almost certain this is the issue. I didn't encounter this issue when I wrote this lesson, but I'm almost certain that this is the issue right here. So let's take the column_data, but let's start at position one. And not parentheses; I need brackets, because we're slicing a list, right? So it should work. And there we go. So now that first empty one is gone, and we just have the information. I didn't even think about that a second ago, but I'm glad we're running into it, in case you ran into that issue too. Let's go ahead and try this again. And it looks like it worked. So let's pull our DataFrame down. I could have just written df. Let's pull our DataFrame down. And now this is looking fantastic. Those three dots just mean there's information in there that it doesn't want to display. But it looks like we have our rank, our name, the industry, revenue, revenue growth, employees, and headquarters for every single one. So this is perfect; this is exactly what I was hoping to get. Now you can go in and use pandas to manipulate this, change it, and dive into all the information in there. But we can also export this to a CSV, if that's what you're wanting. We can easily do that by saying df.to_csv, and then within here we're just going to specify our file path. So let's come down here to our file path; we'll go to our folder for our output. So we're just going to take this path, and let me do it like that. I have this path in my OneDrive Documents, in a Python web scraping folder for output. You know, I already made this; I'm just going to put it right down here. Now, I do have to specify what we're going to call this file. We'll just call it companies, and then we have to add .csv. That is very important.
Now, if we run this, I already know, just because we have this rank and this index here, that we're going to keep the index in the output. Not great. Let's run it. Let's look at our output. There's our companies file. And when we pull this up, as you can see, this is not what we want, because we have this extra column right here. Now, if we're automating this, that would get super annoying. So what we're going to do is go back and just say index equals False. Let's go out of here. We're going to come right down here and say comma, index=False, and that way it's going to take this index and not export it into the CSV. Now let's go ahead and run this, pull up our folder one more time, and refresh just to make sure it's good. And now this looks a lot better. So we take all of that information and put it into a CSV, and it's all there. So this is the whole project. If we scroll all the way back up, let's just glance at what we did here. Scroll down. We brought in our libraries and packages. We specified our URL and brought in our soup. And then we tried to find our table. Now, that took a little bit of testing, but we knew that the table was the second one, in position one, so we took that table. We were also able to specify it using find with the class. And of course, we just wanted to work with that table; that's all the data we wanted, so we specified this as our table and worked with just our table going forward. Of course, we encountered some small issues, user errors on my end, but we were able to get our world titles, and we put those into our DataFrame right here using pandas. Then we went back and got all the row data and the individual data from those rows, and we put it into our pandas DataFrame. Then we came below and exported it into an actual CSV file.
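Putting that recap into one piece, here's the whole pipeline as a sketch. The fetch is replaced by a made-up stand-in snippet so it runs offline, and the output path is a temp-directory filename rather than the OneDrive folder from the video:

```python
import os
import tempfile

from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for the fetched page; swap in requests.get(url).text for the real thing
html = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>Walmart</td></tr>
<tr><td>2</td><td>Amazon</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="wikitable sortable")

# Headers first, then an empty DataFrame with those columns
world_table_titles = [th.text.strip() for th in table.find_all("th")]
df = pd.DataFrame(columns=world_table_titles)

# [1:] skips the header row, whose empty td list would raise the
# mismatched-columns error
for row in table.find_all("tr")[1:]:
    individual_row_data = [td.text.strip() for td in row.find_all("td")]
    df.loc[len(df)] = individual_row_data      # append at the next index

out_path = os.path.join(tempfile.gettempdir(), "companies.csv")
df.to_csv(out_path, index=False)               # index=False drops the 0,1,2,... column
print(open(out_path).read())
```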
So that is how we can use web scraping to get data from something like a table and put it into a pandas DataFrame. I hope that this lesson was helpful. I know we encountered some issues; that's on my end, and I apologize, but if you ran into the same issues, hopefully that helped. If you liked this, be sure to like and subscribe below. I appreciate you. I love you. And I will see you in the next lesson.