Right, so hey guys, and welcome back to another Python tutorial. In this video we'll be building a web scraper that scrapes trending video data from YouTube, using a proxy to change our geolocation and rotate between various IPs, so that we avoid getting rate limited and can access YouTube's trending video list from various countries/regions.

Now, there are a bunch of free proxies out there that you could use to change your location, but the issue you're going to run into is that most of them aren't really reliable, due to latency, downtime or other reasons, and most of them also won't let you easily auto-rotate between IP addresses or change the proxy's geolocation from within your code. So for this tutorial I've used the residential proxy service from IPRoyal, who have kindly sponsored today's video. IPRoyal is a proxy service provider offering safe, private and unrestricted access to online information. With a pool of over 2 million reliable IPs, IPRoyal lets clients use a proxy server as an intermediary between their devices and the web, which allows them to maintain their privacy and use resources they can't access directly due to geo restrictions and so on.

IPRoyal's datacenter proxies can be a great product for businesses or users looking for premium, high-speed, anonymous private proxies, usually with unlimited bandwidth and no extra charges. For data scraping, though, I'd recommend IPRoyal's residential proxies: they not only give you anonymity when scraping websites, which helps you avoid getting rate limited, but also let you select the proxy's geolocation, either through the dashboard or with minor changes in your code. Lastly, IPRoyal handles automatic IP rotation, which integrates easily into your code, and they also offer static IPs in case you decide to keep one IP for a longer duration.

IPRoyal has provided me with a discount code that will give you guys a straight 50% discount on their Royal residential proxies. I've included the code and a link in the description which you can use to obtain the discount, so make sure to use it before the deal expires or you'll be missing out on a great deal.

Okay, now that we've looked at IPRoyal and their services, as well as roughly how proxies work, let's get right into the code. For this tutorial you'll need to install a few dependencies, which are all listed down here, so go ahead and run the pip install command for webdriver-manager, selenium-wire, beautifulsoup4 and pandas. I'll also link these in the description to make your life easier.

To begin with, we import all of these dependencies (I'll get rid of this bit — it's in the description, so feel free to grab it from there). We'll be using Selenium Wire, which will do the actual scraping of content from YouTube; since YouTube is a dynamic website, we're using Selenium Wire.
Otherwise, we'd be using something simpler like requests, which only helps us scrape static websites — nothing to do with JavaScript. The next thing we need is BeautifulSoup, which will help us parse the text content of the website we're scraping. Then we also need webdriver-manager, specifically ChromeDriverManager; this saves us from having to manually install the ChromeDriver that's compatible with the Chrome version installed on our computer, since Selenium requires the Chrome driver (or the Firefox driver, depending on what you're using). We also need pandas, to nicely format our table of data, so import pandas as pd, and lastly we import time, to add a bit of a gap in the execution of our code. Let's run all of this, and hopefully it all runs fine for you if you've installed the dependencies correctly. I've got a little warning here due to a clash in the versions I have, but it should be fine.
No issues to worry about. The first thing we want to do is initialize the Chrome webdriver, so we create a new variable and assign it webdriver.Chrome, because we're using Chrome. Usually you'd need to provide the path to a driver that's compatible with the version of Chrome you have installed, but since we're using ChromeDriverManager, this will automatically install the driver required for our Chrome version and handle all of that by itself.

Now that we have the driver initialized, the next thing to do is call driver.get() with the URL of the website you want content from. Like I was saying, we're going to scrape data off YouTube's trending feed, so let me open a browser, go to youtube.com, and find the URL of the trending page: it's under Explore, I believe, and then Trending. I'll click Accept all, and all we need is just this bit here: youtube.com/feed/trending. That's the URL we're looking for, so we pass it in, and if you quickly run the code, you'll notice a Chrome window pop up and open the URL we asked it to open.

Now, the first hurdle before we get to our data is that we're shown a Google prompt where we need to click "Accept all" before we're even redirected to the page that actually has the content we're looking to scrape. So, using Selenium, we need to click that button.
The way to do that is to inspect the element, and since this element is a button in HTML, you need to find the button's class. If we look at it, this button has multiple classes — class names are separated by spaces, so that's one class, that's another class, and so on. To save you guys some hassle, I've already found out which class identifies this button: it's the second one in the list. Classes are nothing more than a way of identifying different elements, so that class is just a unique way of identifying this specific button.

I'll close that window for now. What we do next is use driver.find_element with a CSS selector. In a CSS selector, when we're trying to obtain an element by its class we use a dot (a hash refers to an id, which is a bit more specific), so we write a dot and then the button's class name. I've made a copy of the class name, which is right here.
So that's the class name we discussed that refers to the button. Now that we've grabbed the button by its class, we just call the .click() method, and Selenium will look for the button with this class name and click on it. Let's run this again to see it in action. Give it a few seconds, because it does take a moment to load, and you'll see the prompt disappear: Selenium has gone ahead and clicked the accept prompt, and now you can see the YouTube trending page we're actually after. At this stage we haven't incorporated any proxies, which is why we're seeing the UK trending feed. Once we've incorporated proxies, we'll be able to use auto-rotating IPs and select proxies from specific countries, regions or cities, which is amazing. So I'll close this down and we'll move on to the next bit.

Now, something I noticed while making this tutorial is that the "Accept all" prompt only appears sometimes. When it doesn't appear, trying to select an element that doesn't exist on the page and click it leaves us facing an error. So we'll add a simple try/except statement here, which tries to click the button if it exists, and if it doesn't, we just catch the exception and pass. Not best practice, I know, but it does the job for what we're doing: if the error is due to the button or consent screen not existing, we pass and do nothing about it.

The next thing we want to do is grab the content that's rendered on the page, which contains the whole list of trending videos. We create a variable called soup and assign it BeautifulSoup, passing in the driver's page source. Now that the driver has clicked the button and been redirected to the trending page, we take the HTML code the driver rendered and pass it into BeautifulSoup, so that we can find the specific elements we're looking for.

With that done, I'll quickly run this to show you what we're after specifically. Let it run, give it a moment to accept everything... it's accepted, and now we're redirected to this page, which at this point we've already parsed with BeautifulSoup. We only really care about the videos, and for the sake of this tutorial we're just going to scrape the title of each video, the URL of the video, and the country the trending page was scraped from. So I'll click Inspect to see which elements we're trying to grab. Let's try to find the a tag — we need the anchor tag.
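The "click accept if it exists" step can be sketched as a small helper. Note that CONSENT_SELECTOR here is a placeholder, not the real class name — Google rotates these class names over time, so inspect the button yourself and paste in the current one. The import is kept inside the function so this sketch can be read without Selenium installed.

```python
# Hedged sketch of the optional consent click. The selector is hypothetical:
# replace it with the class you find when you inspect the button.
CONSENT_SELECTOR = ".your-consent-button-class"

def accept_consent_if_present(driver):
    # Lazy import so the sketch itself doesn't require selenium to be installed.
    from selenium.common.exceptions import NoSuchElementException

    try:
        # "css selector" is the locator strategy string Selenium 4 accepts.
        driver.find_element("css selector", CONSENT_SELECTOR).click()
    except NoSuchElementException:
        # The prompt only appears sometimes; if it's not there, just carry on.
        pass
```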
There it is. An anchor tag is basically a link tag in HTML: you assign a URL to it, and clicking it redirects you. This anchor tag is on every one of these videos right here, and YouTube has decided to give them all a unique id called video-title, so all of the anchor tags have this id (the reason I know this is that I'd already scraped a version of this beforehand). So we want to grab all the anchor tags on this page that have the id video-title. Once we have an anchor tag, we can see one of its attributes is title — which is, you know, the title of the video — and the other bit we were looking for, the URL, is in the href attribute, which we can grab from here.

Let's do that, now that we have the page parsed into the soup object right here. Create a new variable called trending_videos, which will be a list, and use soup (the entire page) with the find_all method, because we want to find all the anchor tags with the id video-title — the id we got from the HTML inspection we just did. For now we'll just print trending_videos; it should end up as a list of all the anchor tags on this page with the id video-title. Let it load, give it a moment, and once it's done loading, scroll down over here and, as you can see: amazing, we have a list of anchor tags containing all of the videos on the YouTube page.

We can confirm this by looking at the first title we have, which is "siblings or dating in real life edition" — compare this to the normal page, and that's the first one. Then let's look at the last one, which is "Isn't Netflix just making bad movies on purpose?". We scroll down in our list, look at the title... there you go, "Isn't Netflix just making bad movies on purpose", and it's correct. We've just compared the first and last elements of our list against the page and they match, which means we've got everything we need.

Now, each of the elements in this list is an actual BeautifulSoup element that we can parse or navigate through. The bits we need from each of these anchor tags are the title attribute — this text right here — and the href, which includes the link. To get those, all we need to do is loop through the list: for trending_video in trending_videos, print the title attribute of each element, and then also print the href attribute, which should include the link. I'll break after the first iteration, because I only want to do this once. Let's run this again — I won't show you the Chrome window that opens, because I'm pretty sure you guys are bored of it, but it's doing its job in the corner of my screen. Give it a few seconds to load and, as you can see, it's printed out the title for the first video (since we break off the first iteration) and the href, or URL, for the first video.
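To show the parsing step in isolation, here's the same find_all and attribute extraction run against a tiny hand-written snippet — a stand-in for driver.page_source, not YouTube's real markup — so you can see the shape of the records we end up with:

```python
from bs4 import BeautifulSoup

# A stand-in for driver.page_source: two anchors shaped like YouTube's
# trending links (id="video-title", with title and href attributes).
html = """
<a id="video-title" title="First video" href="/watch?v=aaa111"></a>
<a id="video-title" title="Second video" href="/watch?v=bbb222"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab every anchor tag whose id is "video-title".
trending_videos = soup.find_all("a", id="video-title")

data = []
for trending_video in trending_videos:
    data.append({
        "title": trending_video["title"],  # the title attribute
        "url": trending_video["href"],     # the link
    })

print(data)
```

Against the real page, the only difference is that the HTML comes from the driver instead of a string.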
Next we'll create a variable, maybe in the cell above, called data, and assign it an empty list; we'll write all of our data to this list. Inside the loop we call data.append with a dictionary — sorry, I keep saying "object", I'm coming back from using JavaScript. The first key is title, with the value being the trending video's title attribute, and the second key-value pair is the URL: the key is url and the value is the trending video's href attribute.

Let's quickly run this, and this time for the entire trending_videos list, so it should hopefully do it for the entirety of the YouTube page. I'm not too sure why it's giving me a "data is not defined" warning even though it's defined up here — probably just glitching, we shall find out in a second. It's running right now, doing its job, give it a moment and... voilà, it's run completely fine; I think my linter is bugged out. If we print out data, you should see each video in the form of a dictionary, with a title and a URL for each of those videos. We're also going to add a country key and value later on, but we'll get to that in a second.

So now that we have this just for the UK, what we're going to do is incorporate the ability to change proxies, which is where it gets interesting. What you're going to need to do is set up the options for the Selenium connection.
So where we initialize our driver, we'll have to give it an additional parameter, seleniumwire_options, which will specify the proxy and so on. First things first, we want the proxy itself. Since we're using IPRoyal, the proxy is authenticated, so we have a username, a password, and the URL and port of the proxy. I'm going to build an f-string, and I'll open up the IPRoyal dashboard, which I have right here — that's what it looks like. For the sake of this tutorial they've given me a generous two gigabytes to demonstrate with; feel free to use the proxies yourself if you'd like, which is why I'm not blurring anything.

In the dashboard we can set the country we want the proxy to be in — at the moment it's set to random, but you can select a specific country — and we can select a rotation: either a sticky IP, which means keeping the same IP for a certain duration, or randomize automatically, which is the option we're going with. What we want to do next is grab this URL right here, which contains our username, our password, and the hostname and port; this is what will let us connect to the proxy via Selenium.

Now, notice what happens if we select a specific country — which we need to do anyway, since we're trying to scrape trending videos for, you know, the USA, Spain and some other countries. Say we select United States: notice what happens to the URL right here.
Everything remains the same; the only thing that's changed is there's now _country- followed by the two-letter country code. This is pretty standard for basically all the countries: if I change it to United Kingdom, it becomes _country-gb, GB being Great Britain, the two-letter code for the UK. So as long as you know the country codes of the countries you're trying to scrape, you can change this URL dynamically within your code quite easily using an f-string, and that's exactly what we're going to do. For now, let's just grab it without making any changes — don't copy the rest of it, just from here to here — and chuck it into our proxy setup. So we'll be in the United Kingdom for now, nothing too fancy.

Now we create a dictionary with the parameters (options is going to be a dictionary — sorry for calling it an object again). We specify proxy, and then the HTTP proxy, which is this URL right here; the same one is used for both HTTP and HTTPS, according to IPRoyal's documentation. That's pretty much it for the configuration: we just need to tell Selenium where our proxy sits and what the credentials are, which we provided up here. Now that we have the options set up in a nice little dictionary, we tell the driver where to find this configuration: seleniumwire_options=options should do the trick.

Now, instead of Great Britain — since we're already in the UK, we've seen the trending list for Great Britain — let's change it to something else. I think Spain should be interesting.
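Here's a sketch of that proxy configuration. The username, password, host and port are placeholders — copy the real values from your own dashboard — and where exactly the _country- segment sits should match the URL your dashboard shows you:

```python
country = "es"

# Placeholders: substitute the credentials and endpoint from your dashboard.
username = "your-username"
password = "your-password"
host = "proxy.example.com"
port = 12345

# The country code is spliced in with an f-string so the code can switch
# countries without touching the dashboard.
proxy_url = f"http://{username}:{password}_country-{country}@{host}:{port}"

options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,  # same endpoint for both schemes, per the docs
    }
}

# The dict is then handed to the driver, roughly:
#   driver = webdriver.Chrome(..., seleniumwire_options=options)
```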
So let's do es. Let's run the whole thing again, and hopefully when Selenium does its scrape we'll see the trending items for Spain. It's going to take a second, because we're now going through a proxy, plus a bit of time because of how Selenium works. Okay, it's loaded up; now we're just waiting for Selenium to click the button and redirect us to the trending page. It's redirecting, and it takes a while because it's obviously loading a lot of images and such, but as you can see this trending page looks pretty tailored to Spain — completely different from the UK one, which had some stuff about dating in real life and whatnot. (This here is the previous UK list, so you can see a UK heatwave video was also trending.) But now if you look at data, it's completely different, because we've got basically everything for Spain, and it looks like it's mainly football orientated, which sounds about right.
There's also a lot of Spanish in the titles. So, amazing: we've managed to use a proxy to switch our location, and in the dashboard we've also configured it to use random IP addresses, which is great because now we won't even get rate limited by the site.

Now that we know how to do this for one country, we can do it for several countries just by changing the country code right here. Let's make a quick list called countries, which will hold the countries we want to scrape: we'll start with the US, then Great Britain, and then Spain. Run these two cells, and now we want to loop through each country in the list: for i, country in enumerate(countries). enumerate just gives you access to the index as well — like a variable that's incremented by one each time — and country is each of these strings. So in our f-string we replace es with the country value, whatever we're looping over at the moment: we'll start with us, then gb, then es.

Since we're now dealing with multiple scrapes in a row, once we're done scraping the trending videos for a country down here and have added them to our data, we call time.sleep(2) to give Selenium a bit of time, and then driver.quit() so that we actually close the Selenium instance we used; the next iteration will open a new instance. Otherwise we'd be hogging resources on the computer.

Another quick optimization, like I was saying before, is to disable images from being loaded when Selenium scrapes the site. We don't really care about the images, only the title and the URL, and stopping images from loading will surely save us a bit of time if we were doing a large-scale scrape. To do this, create a new variable called chrome_options and assign it webdriver.ChromeOptions() — webdriver is already imported from selenium-wire — then call chrome_options.add_argument. Bear in mind there's a lot of configuration you can do here; I referred to the documentation for this, so feel free to take a look at it and find other ways of optimizing your code, but for the sake of this tutorial I'm only disabling images in the scrape. When the scrape actually starts, you'll notice the YouTube trending page loads a little quicker, and the thumbnails won't show up.

The last thing is that we imported pandas to format the data in a nice way, so let's do that. A DataFrame is just like a table — think of it like an Excel table — and we use pd.DataFrame to convert our data list into a nice little DataFrame. We'll see how it looks in a second, because obviously data is empty at the moment. Let's run it, and hopefully it scrapes all of the trending videos for the US, Great Britain and Spain. Fingers crossed there are no errors... Okay, the instance has opened up, and now we can just watch this go. The first one's loaded in — that was the USA — and the feed does look tailored to the USA.
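Putting the per-country loop together as a sketch: the credentials and host are placeholders again, and the actual scraping calls are left as comments, since they need a live browser. The image-disabling flag shown is one commonly used Chrome argument for this; check the documentation for your setup.

```python
countries = ["us", "gb", "es"]

def proxy_url_for(country):
    # Hypothetical credentials/endpoint -- substitute your own.
    return f"http://user:pass_country-{country}@proxy.example.com:12345"

# One commonly used argument for disabling image loading, passed in the real
# run via chrome_options.add_argument(NO_IMAGES_ARG).
NO_IMAGES_ARG = "--blink-settings=imagesEnabled=false"

urls = []
for i, country in enumerate(countries):
    urls.append(proxy_url_for(country))
    # In the real loop you would: build seleniumwire_options from this URL,
    # open a fresh driver with chrome_options, scrape the trending page into
    # data, then pause briefly and close the instance:
    #   time.sleep(2)
    #   driver.quit()
    print(f"Completed {i + 1} out of {len(countries)}")
```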
I mean, pretty generic stuff — The Purge, Halloween and all that. It looks like that scrape was complete, since we instructed Selenium to quit right after it finishes, and if you've noticed, load time has improved a little because we're not loading images and such in the scrape. Here's the second one, which I'm assuming is for the UK, so hopefully the page will be tailored to the UK... there we go, the UK page has loaded in as well. Lastly, we're scraping Spain: it accepts the "Accept all" prompt, and then it's just scraping the feed for Spain. Hopefully this runs fine, and then we should have data for three countries, based on their trending pages. It ran in pretty decent time, to be honest, which is good.

Now that that's run, look at this: we have a beautiful DataFrame with the trending lists for three different countries. Fascinating. Now that we have it in a DataFrame — I mean, this is not a pandas tutorial, but just a bit of extra if you're curious — we can use things like query: we can do df.query and say the title equals some specific thing, say "Halloween Ends official trailer", and it will only show us the matching records. There are some duplicates in this data, but we can drop them quite easily with df.drop_duplicates, dropping by the title, and once we've dropped them we're left with around 246 records.

The last thing we want to do is add a country column, because right now we don't know which titles and URLs are for which country. To do that, we simply add a new key-value pair in our dictionary: the key is country and the value is the country variable — the two-letter country code being pulled from the countries list, right?
You can also add a bit of jazz to your code to see the scraper's progress, by printing something like "completed i + 1 out of" the length of the countries list — basically a rough progress indicator that tells you how far along it is. Let's run this again, and hopefully we get a final output. I know this tutorial is already quite long, but I'm really hoping you guys have learned a lot and enjoyed it so far.

So let's see what's going on, running it for the first country... Okay, the first one's loaded in, just waiting for it to grab everything and then start the second one. Cool, as you can see the second instance will be started soon. I might fast-forward some of this in the edit just to save you guys some time, but you're basically looking at the same stuff as last time; we're just waiting for the data table to have the country column as well, so that we can query the data by the different countries too. This is the feed for the UK, if I'm not mistaken, and then the last one is for Spain. Yep, Spain. You can add a lot more countries in there as well — obviously I have a data limit of two gigs at the moment, which IPRoyal has been quite generous to give me, and if you were to use the 50% discount, it should be a pretty reasonable price for a high-performing proxy, so I'd definitely recommend it.

That's the last scrape done — it took slightly longer than last time — but let's look at the results: we have the country column as well this time. Amazing. So we drop the duplicates by title, and we end up with 246 records; we can pass inplace=True to actually save the changes to the DataFrame, and now we have 246 rows. Now we can query things like the country: say I only want to look at the trending videos for the US.
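Here's the pandas part on a few toy records shaped like the scraper's output, so you can see the dedup behaviour without running a scrape:

```python
import pandas as pd

# Toy records in the same shape the scraper appends: title, url, country.
data = [
    {"title": "Video A", "url": "/watch?v=a", "country": "us"},
    {"title": "Video B", "url": "/watch?v=b", "country": "gb"},
    {"title": "Video A", "url": "/watch?v=a", "country": "gb"},  # duplicate title
]

df = pd.DataFrame(data)

# Drop duplicates by title; inplace=True saves the change to the DataFrame
# itself. The first occurrence of each title is kept.
df.drop_duplicates(subset="title", inplace=True)

print(df)
```

Without inplace=True, drop_duplicates returns a new DataFrame and leaves df untouched, which is an easy thing to trip over.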
So we just do df.query with country equal to "us", and this makes sure we're only looking at the data for the USA. We can do the same thing with gb, which gives us only the trending videos for the UK, and if we did the same for, let's say, Spain, we'd get only the trending videos for Spain. We can also use or: we can say country equals "es" or country equals "gb", which will give us the records for both.

And that's pretty much it, guys. If you wanted to save this to a file, you could just do df.to_csv, give the file a name — say trending_list_youtube.csv — with index=False, and by doing that you'd have it saved to a nice CSV file, which you can open in Excel and keep for future purposes. Hope you've enjoyed today's tutorial, guys. Please leave me any feedback, or any tutorials you'd like in the future, and I'll see you beautiful faces in the next one. Peace.
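The querying and saving steps look like this on the same kind of toy data (writing the CSV to an in-memory buffer here rather than a file, purely so the sketch is self-contained):

```python
import io
import pandas as pd

df = pd.DataFrame([
    {"title": "Video A", "url": "/watch?v=a", "country": "us"},
    {"title": "Video B", "url": "/watch?v=b", "country": "gb"},
    {"title": "Video C", "url": "/watch?v=c", "country": "es"},
])

# Only the US rows.
us_only = df.query("country == 'us'")

# Rows for either country, using `or` inside the query string.
gb_or_es = df.query("country == 'gb' or country == 'es'")

# Save without the index column; in the tutorial this would be a filename
# like "trending_list_youtube.csv" instead of a buffer.
buf = io.StringIO()
df.to_csv(buf, index=False)
```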