Hey guys, and welcome back to another Python tutorial. In this video, we're going to be creating a website scraper that can take price data from Amazon for different items and store it in a CSV file, so we can use this data for all sorts of things, such as finding the items with the lowest prices and the best quality based on our preferences. We can look for items like toys, cameras, protein, or any other products you're interested in. If our scraper is told to look for protein, for example, it should be able to work out how many pages of results there are (in this case, there are seven pages available) and scrape each and every product on those pages, along with the URL to the image, the price of the product, and the title, as well as a link to the product in case people want to click on it and purchase it. So that's what we're going to be doing in today's video.

Now, bear in mind Amazon is a massive website, and they have all sorts of measures to prevent web scrapers from using up network bandwidth that could otherwise be used by genuine customers. So to avoid getting rate-limited or banned from Amazon, we're going to be using a proxy for this tutorial, and I recommend you do so whenever you're doing any sort of scraping activity. There are a lot of free proxies out there, but they can't be relied upon, because they might be down, or they might only let through a limited amount of data. So what I'm going to go with is IPRoyal, who have kindly sponsored today's tutorial. IPRoyal is a proxy service provider offering safe, private, and unrestricted access to online information. With a pool of over 2 million reliable IPs, IPRoyal allows clients to use a proxy server as an intermediary between their devices and the web, which lets clients maintain their privacy and use resources they can't access directly due to geo-restrictions, etc. IPRoyal's datacenter proxies can serve as a great product for businesses or users looking for premium, high-speed, anonymous, private proxies, which usually come with unlimited bandwidth and no extra charges. For data scraping, though, I would recommend IPRoyal's residential proxies, as they not only give you anonymity when scraping websites, which helps you avoid getting rate-limited, but also let you select the geolocation of the proxy, either through the dashboard or by making minor changes in your code. Lastly, IPRoyal handles automatic IP rotation, which allows easy integration within your code, and they also provide the option to use static IPs in case you decide to keep an IP for a longer duration of time.

For this video, IPRoyal has provided me with a discount code that will give you guys a straight 50% discount on their royal residential proxies. The code is "your hand 50", and I've also included a link in the description which you can use to obtain the discount. So guys, make sure to use the discount code before the deal expires, or you'll be missing out on a great deal.

What I'm going to show you now is the IPRoyal dashboard, where you can view the different properties of your proxy: the host, the port, etc. The only bit we're interested in here is setting up the country, because we're in the UK.
So I'm going to set the country to United Kingdom. You can set it to random, but I want products specifically from the United Kingdom, so I'll set it to that. Then there's the IP rotation: I actually want this to be random, because I don't want Amazon to catch on to the fact that we're scraping data from the site and IP-ban us. So I'll set it to random IP. I don't care much about the region or state, so we'll leave that as it is. We'll leave this setup here and come back to it later; what we're going to need later on is this bit right here, which is the login credentials and details of our proxy.

First things first, I've opened a Jupyter notebook file; you can use a plain Python script if that's your preference. I'm going to import Selenium Wire, specifically the webdriver class from it. Then I'm going to import BeautifulSoup, which will basically let us take the raw HTML text and convert it into a parsed HTML object, so that we can easily search the content of the site. Then, for Selenium to work properly, we need a web driver for the browser; as far as I can remember there's Chrome, Firefox, and so on, and I'm going to be using the Chrome web driver, so I'm going to use the webdriver manager for Chrome. Essentially what this module does is check whether you already have the Chrome web driver installed, and it checks the version and everything for you: if you have it installed and it meets the requirements, it does nothing, but if you don't, it will install it and put it in the right directory for you. So it's pretty convenient. We're also going to need pandas to do all the data manipulation and to store everything as a CSV file, because it makes that easy. Then lastly, I'm going to import time, just to pause our program at certain moments. So let me run all of this, and if everything runs fine, that means we should be good to go. I'll also link a list of the libraries for you guys to install using pip in the description, for ease of use.
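For reference, here's roughly what that import block looks like, assuming you've pip-installed selenium-wire, beautifulsoup4, webdriver-manager, and pandas:

```python
from seleniumwire import webdriver                        # Selenium with proxy support
from bs4 import BeautifulSoup                              # parse the raw HTML
from webdriver_manager.chrome import ChromeDriverManager   # fetch/locate ChromeDriver
import pandas as pd                                        # tabulate and save the data
import time                                                # pause between pages
```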
Now that we have all the libraries installed, what we want to do next is create a new variable called search_term. Let's say you're hitting the gym, so you want to look for some protein; the search term is going to be "protein". This can obviously be changed to whatever you like, as long as Amazon has products relating to the search term. The next thing I'm going to do is create a new variable called scrape_url, which is going to be the URL that we scrape from. What I noticed with Amazon is that you've got amazon.co.uk, which is the main domain, and then you've got the k parameter, which is where you put in the term you're looking for. Everything else in the URL we can get rid of, and if you search for protein and press enter, it comes up with all the results for protein; if I change it to cameras, it comes up with cameras. So what we're going to do is copy this link and chuck it in here. Essentially, this is the link we're always going to be scraping from; the only thing that changes is the k parameter, the keyword we're looking for. So we'll turn this into an f-string and replace the hard-coded keyword "camera", so it's no longer static, with the search_term variable. Now, every time we change the search term, it gets embedded in here. We'll also create an empty list in a variable called data, where we'll be storing all the data that we scrape.

Cool, that's all the basic stuff done and all our variables ready. We now need to set up the proxy that we're going to use while scraping data from the website, basically covering our backs at this point. So I'm going to create a new variable, proxy, and now we go back to the IPRoyal dashboard, where you need to copy this URL right here; it starts with http and ends with a port number (mine is 22323). Copy that, chuck it in here, and that's the proxy we're going to be using. Along with this, we also need to create a dictionary, which is going to be used by Selenium as the configuration for the proxy. It needs a key called proxy, and then a sub-dictionary where we put the HTTP proxy address and the HTTPS proxy address. Luckily for us, both are the same, so it's going to be this same address for both HTTP and HTTPS. We'll pass this options object to the Selenium driver in a moment, so that it knows which proxy to use while scraping the site for us.

Now that we have all our proxy details set up, we're going to create a variable called driver, which is going to be the Selenium driver, and we're going to type webdriver.Chrome. Then we use the Chrome webdriver manager's install method; what this does is provide the Selenium Chrome method with the correct web driver, making sure it's installed and on the right path as well. Along with this, we need to provide the Selenium Wire options, which will just be the options object we created up here; in our case, we only need it configured with the proxy details, and that should be good. So let's run this, and hopefully everything runs fine. Okay, cool, everything ran fine.

The next step we want to take... actually, I'm going to move this next bit to a different cell, because it opens up an instance of Chrome as well. Now that our driver is ready, we've essentially got something like a human that's ready to do the actions we want it to do. So we're going to say driver.get and pass it the scrape URL; essentially, we're telling Selenium to go to this site, opening a new Chrome instance while doing so.
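Put together, the setup at this stage might look like the following sketch. The proxy URL is a placeholder (copy the real host, port, and credentials from your own dashboard), and I'm using the Selenium 4 Service style for constructing the driver; older Selenium versions simply took the driver path as the first argument:

```python
from selenium.webdriver.chrome.service import Service

search_term = 'protein'  # change this to whatever you want to search for
scrape_url = f'https://www.amazon.co.uk/s?k={search_term}'
data = []  # every scraped product will be appended here as a dict

# placeholder credentials/host: copy the real values from your proxy dashboard
proxy = 'http://username:password@geo.iproyal.com:22323'

# selenium-wire reads its proxy settings from this dictionary; the same
# address is used for both HTTP and HTTPS traffic
options = {
    'proxy': {
        'http': proxy,
        'https': proxy,
    }
}

# webdriver-manager installs/locates the right ChromeDriver for us
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          seleniumwire_options=options)
driver.get(scrape_url)
```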
While I was telling you guys about how the URL works, there's another thing we need to focus on: Amazon has multiple pages of items. In this case, there are 20 pages of cameras, and if we were to scrape just this URL, we would only get the first page of data. That's okay in a sense, because it's the most relevant items, I guess, but sometimes you may want all the data, so you may want to scrape all 20 pages. In that case, what you can do, and what I've noticed with Amazon, is put an ampersand and page=1 on the end. If it's page one, Amazon just automatically gets rid of it, but if it's anything more than page one, page two for example, press enter and, as you can see, the results have changed to page two. We can prove this by scrolling down here: you can see it's page two. Now do page three; the results have changed again, and if you scroll down, we're on page three. We can do this all the way to the last number, which is 20 here. We'll obviously have to scrape this number from the website to find out how many total pages there are, so our program can adjust dynamically. So what we're going to do here is tell our driver to get that page, plus an ampersand and page=1, so that we always start scraping from page one. I'm going to run this real quick to show you guys what the output looks like; it takes a bit longer than usual because we're using proxies, so let's give it a second. And I have a new tab that's opened up for me here; hopefully it comes up with the right page. As you can see, it's come up with protein, because we searched for protein up here, and it's got page one, which is amazing, exactly what we need.

Now, the first issue we notice is that in order to actually get to the data behind this, we need to get rid of this prompt asking us to accept cookies. So we need to use Selenium and tell it to click the "accept cookies" button, so that it dismisses the whole prompt. To do so, all you need to do is find this button in the markup: right-click, inspect, then right-click and inspect again, and you can see it's an input, because when you hover over it, it's highlighted, and it's got an ID. You can use either a class or an ID, but an ID is better, because it's usually unique and not shared by multiple elements. So I'm going to grab the ID of this button; it's sp-cc-accept. Let's copy that.

Now we're going to go into our code and put in a try/except clause, because sometimes the cookie prompt shows up and sometimes it doesn't. If it does show up, we'll click the button; if it doesn't, we'll just pass and not do anything about it. So we say driver.find_element_by_css_selector, and since the CSS selector prefix for IDs is a hash, we put a # and paste the ID after it. What this does is find the button by its ID, and then all we want to do is click that button to get rid of the prompt. We'll accept any exception that happens, which is most likely going to be caused by there being no element with this ID, meaning the prompt hasn't shown up, and in that case we don't really care; we'll just pass and do nothing.
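As a sketch, that looks like the following. Note I'm using the older find_element_by_css_selector call described here; Selenium 4.3+ removed it in favour of driver.find_element(By.CSS_SELECTOR, ...). The sp-cc-accept ID is simply what the button carried when inspected, so re-check it on the live page:

```python
# always start scraping from page one; Amazon drops a redundant page=1 anyway
driver.get(scrape_url + '&page=1')

try:
    # '#' is the CSS selector prefix for an ID
    driver.find_element_by_css_selector('#sp-cc-accept').click()
except Exception:
    # no cookie prompt appeared this time, so there is nothing to dismiss
    pass
```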
So that's that bit sorted. What we want to do next, once we've clicked on the button, is grab all of this code you see on my right-hand side; we want the parsed version of it, so that we can extract all the items, their prices, and all the other good stuff from the page. So what I'm going to do now is use the BeautifulSoup library: I'll create a variable called soup, initialize a BeautifulSoup object, and pass it driver.page_source. Essentially, that gives us the raw text version of the HTML code, which then gets parsed into a BeautifulSoup object down here. If I print this out, it looks like gibberish, but it's now searchable HTML, and we can use functions like find, find_all, etc. to locate specific elements with BeautifulSoup. So the prompt has shown up here, it's clicked accept, as you can see, and the prompt closes; and we can see that the soup has printed out. Obviously we don't need to stare at this, because it just looks like a bunch of gibberish at the moment; it will make sense shortly.

The first thing we want to do with this page is grab the number of pages there are; we want to know how many pages are shown up here. So I'm going to inspect this element right here, just so we can find out what's going on, and then we need to find the parent, so that we can grab the whole thing. As you can see, when I hover over this class right here, s-pagination-strip, it selects the whole thing, so bingo, that's what we're looking for. Then we need to select the last number from it, which is 20... well, not in this case. So I'm going to grab the class name, and I'm going to say pages = find, and it was a span with the class s-pagination-strip. I know it's a span because that's the element type, and that's the class. Now that we've found the pagination strip, which is this entire thing right here, we want to find the other spans within it, because if we look at them, the other spans include the numbers. So I'm going to do .find_all, because I don't want to find just one; I want all of the spans within it that have the numbers.

I'm not going to run this entire cell, because that would run the whole Selenium driver again, so I'm just running this bit here to show you guys what it looks like. We run it, and there's nothing special, but what you'll notice is that the last element of this list has the number we're looking for: the last page number, which is seven, exactly what we need. So in this case, there are seven pages when you search for protein; let me just prove that on the site, and hopefully... yeah, there are seven pages, so it's correct. Now we just need the number itself; we don't need all the rest of this rubbish. So all I'm going to do is make a new variable, last_page = pages[-1], which essentially gives us the last element of the list. Now, if I print last_page, it includes the whole span and everything, and we only need this section of it, just the text, so we'll take .text from it, and now if you print it, we've got the string "7".
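Here's a minimal sketch of that parsing step, assuming the pagination strip still uses the s-pagination-strip class it had when inspected:

```python
# driver.page_source is the raw HTML text of the rendered page
soup = BeautifulSoup(driver.page_source, 'html.parser')

# grab the pagination strip, then every <span> inside it;
# the class name is what the element carried at recording time
pages = soup.find('span', class_='s-pagination-strip').find_all('span')

# the last span holds the highest page number, e.g. "7" (still a string)
last_page = pages[-1].text
```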
We don't want it to be a string, because we want to be able to loop using it, so we're going to convert it by surrounding it with int(), casting it to an integer, essentially. Now, if you look at it, it's a nice integer, as we need it to be. So we'll copy this code and paste it right here. Perfect. Now we also have the last page, and we know how many pages we need to scrape through. Essentially, we just need to loop, and every time we loop, we change the page to the next page, until we reach the last one. So I'm going to write the loop; it's a simple for-i-in-range loop. We're going to start from page one and go all the way to last_page, whatever the total number of pages is, plus one; plus one so that we account for the final iteration, because otherwise range stops one early.

What we need to do next is find the container that includes all of these products. We don't care about any of these things on the left, or the headers, or anything like that; we only care about all of this stuff in the middle. So I'm going to quickly go to the top product, inspect it, and take a look at whether we can find the container. I've actually already scraped the site beforehand and found what the class name was. Instead of getting the entire container, we just need to find out what each of these blobs, each item container, is called; if we know what each item container is called, we can just grab all of them using BeautifulSoup. If I look here, the div with a class of "a-section a-spacing-base" includes all the data we need, and if you go to the next one, it has the same class name; and if we go to the next one after that, it should also have the same class name and the same data that we need. So essentially, what we're going to be doing is grabbing all the divs that have the class name "a-section a-spacing-base". That's what we'll do in the loop: we say items = soup.find_all, and we're looking for divs with a class of "a-section a-spacing-base", which are all the items we need.
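So the loop skeleton at this stage looks roughly like this sketch; the "a-section a-spacing-base" class pair is what the product cards carried at recording time, so verify it against the live page:

```python
last_page = int(last_page)  # "7" -> 7 so we can use it with range()

# +1 because range() stops one before its end value
for i in range(1, last_page + 1):
    # every product card on the results page is a div with this class pair
    items = soup.find_all('div', class_='a-section a-spacing-base')
```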
Once we've run this, we need to take a look at what we find within this items list. I'm going to run it outside of this cell, because otherwise we'd have to run the Selenium scrape and grab the soup all over again, which takes a while; since we already have a copy of the first page, we can just print out items. I don't know if you'll be able to tell, but this is essentially all the HTML code behind this one block right here, just this product. We just need to extract the stuff we need from it, which is the image URL, the title text, and the price, along with the URL to the product. Let's start with perhaps the image, so the image URL. Inspect the image, and as you can see, it's just an img tag, and what we need from it is the src; src basically holds the URL to the image, so if I click on it, you go straight to the image. So we're going to extract the image from this div, and I'm going to say for item in... also, I just realized it wasn't just one product. Actually, it's all of those products within a list; that's why the printout was so long. We've grabbed all of those products, and each one of them is an individual entry within that list, so we'll work with each item: we say for item in items.

What we want to do for each item is grab the item image: that's going to be item (an individual product) .find, looking for an img tag, and we want to keep the src from it, which is essentially the link to the image. Now, if I break right here and print item_image, we should get what should be an image from one of the products. It may be a bit different from what you see on your screens, because Amazon likes to shuffle the products you're shown, but it should resemble what your scraper sees. So that's the first image we've grabbed, and if we let it run, it will do this for all the items in that div. That's page one, and we've got the item image for the first container.

What we want next is the item price. So let's inspect here, and right here is what we need: we have £43.95, and we want the whole thing. It's a span with a class name of a-offscreen, so I'll grab the class name, and since I know it's a span, I'll say price = item.find, a span, with the class a-offscreen. We don't want the entire span, only the text from it, right? But before that, we also want to make sure there actually is a price stated, because some Amazon products are weird: if they're second-hand and such, they don't state the product price; you sort of have to ask for it, or click through to the next page. So we check whether a price exists: if item_price is None, which means no price was found, we set item_price to "NA"... oops, I forgot the equals sign here... and if there is actually a price, then we override item_price with item_price.text. Essentially, that just keeps the text from the span, giving us the item price, right? So I'll print the item image, and under it, the item price. Let's go. As you can see, we have the item image and the item price. Perfect, so we have two of the things already.
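Inside the loop, the image and price extraction sketched so far looks something like this:

```python
for item in items:
    # the card's <img> src attribute is the product image URL
    item_image = item.find('img')['src']

    # the visible price lives in a span with class "a-offscreen";
    # some second-hand listings carry no price at all
    item_price = item.find('span', class_='a-offscreen')
    if item_price is None:
        item_price = 'NA'
    else:
        item_price = item_price.text  # e.g. "£43.95"
```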
Now the only two things left are the item title, so the name of the product, and the item URL, the direct link to the product if we want to buy it, for example. To get the title, let's take a look once again at what we need to do. This is the title, so I'm going to inspect it, and let's go in one more step. Essentially what it is... sorry, it's an anchor tag up here, with a class of a-link-normal s-underline-text and so on. So let's grab the class name; all we want to keep from this one is the text, not the URL. Let's go back to the code, and we do item_title = item.find, an anchor tag... oops, I'm doing this before the break, which is why it's complaining. Okay, .find, anchor tag, and the class I need to copy, so let's go to the site; the class for this was that a-link-normal string and all that other gibberish. Let's go back and paste it in here, and now we should be able to just do .text on this to keep only the text from the anchor tag, because that's what we need. And if I print item_title now, it should, hopefully, there we go, come up with the title of the product: it says "Amazon Brand - Amfit Nutrition Whey Protein Powder, 1kg, Strawberry, 33 servings". Beautiful, and if we look at the image, it should correspond to that title as well. Let's look at the image: look at that, Amfit whey protein, 33 servings. Perfect.

Now the last thing we need is the actual URL to the product, so that when we click on it, it opens in a new tab; we need the URL for that. It's the same thing again: this same anchor tag right here also contains the URL to the product, so instead of keeping the text, we'll keep the URL this time. We copy the same line and paste it again: item_url is going to be item.find, anchor tag, same thing, but instead of doing .text, we keep the src attribute... okay, that's a bit weird. Oh, sorry, it's href, not src; href is the attribute of the anchor tag that tells it where to redirect the user, so that's the link. Let's print it out and see what we get. Now, what we notice is that we only have a partial link; we don't have the whole thing. It's missing the https://www.amazon.co.uk at the start, so if you try to paste this into the browser, it's not going to work. But if you add amazon.co.uk in front, paste it, and press enter, you can see the actual product. So we're missing this first bit, which we'll copy and prepend to our link in the code. Let's go back... okay, here, we need to add it to the start of the link, so I'll add that in. Make sure you don't add a trailing forward slash, because it's already included at the start of the scraped link. So we add this missing piece to the URL and run it again, and as you can see, even VS Code has recognized it as a URL now. So this last link right here should be the link to our product; if I copy and paste it, voila, we have the link to the product, the data that we need. Perfect.

So essentially, that's all the data we're going to need, and at this point, all we need to do is store it in the data list we made earlier. The way we save it is we do data.append, and we append a dictionary: the title is going to be item_title, the price is going to be item_price, the image URL is going to be item_image, and lastly, the URL is going to be item_url. Cool. Essentially, if we run this again, what we should get is a data list with one item in it: a single dictionary containing all the information about one product, which is exactly what we want. Now, if I stop this loop from breaking and run the whole thing again, then view data once more, we have basically all of the products on page one of the protein results, which is amazing. Now all we need to do is chuck this into the other loop that goes through all the pages, and we should be good to go. So I've copied all the code, and we'll go back up here again.
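Continuing inside the item loop, here's a sketch of the title, URL, and append steps; the long title_class string is what the anchor carried at recording time, and Amazon changes these often, so it will almost certainly need re-copying from your own inspector:

```python
    # recording-time class string for the title anchor (verify on the live page)
    title_class = ('a-link-normal s-underline-text s-underline-link-text '
                   's-link-style a-text-normal')

    item_title = item.find('a', class_=title_class).text
    # href is relative ("/dp/..."), so prepend the domain to make it clickable
    item_url = 'https://www.amazon.co.uk' + item.find('a', class_=title_class)['href']

    data.append({
        'title': item_title,
        'price': item_price,
        'image_url': item_image,
        'url': item_url,
    })
```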
We paste it here, and we need to indent it one level, because we're inside the loop now. We have appended all the data that we have; what we want to do after this inner loop finishes, which means we've finished scraping all the contents of one page and it's time to move on to the next, is first do a time.sleep of two seconds, and then driver.quit, just to free up all the resources the previous scrape was hogging. You don't want these instances building up and eating your RAM and such, so we close the driver. What I'm going to do next is open a new instance of the driver again, so I copy those two lines, and in the new instance it's essentially the same initialization code, but instead of page=1, we change it to page= plus str(i + 1); str because we need to convert the number to a string, and plus one because on the first iteration of the loop, i is going to be one, and by the time we get here, we obviously want page two. So this fetches the fresh data for page two. Then we do the same steps in here: copy the cookie-accept bit and paste it down here, so we're clicking the accept button for the cookies again, and then finally build the soup variable again as well. This will be the soup for page two, page three, and so on.

One last thing we can also do is save, along with our data, the page number we got it from, because we obviously have access to i, which tells us which page we're on. So if I put a comma after the URL, where we're appending the data, and add page: i, that will record in our data which page each entry belongs to. So that was a lot of coding; hopefully most of it made sense.
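Pulling everything together, the assembled loop looks roughly like the sketch below, under the same placeholder-proxy and recording-time class-name assumptions as before. It mirrors the flow described here, so the final iteration still opens one page past the last before the loop exits; I've also tacked on the DataFrame/CSV save that comes up in a moment, with an illustrative filename:

```python
# recording-time class string for the title anchor; re-copy it from the live page
title_class = ('a-link-normal s-underline-text s-underline-link-text '
               's-link-style a-text-normal')

for i in range(1, last_page + 1):
    items = soup.find_all('div', class_='a-section a-spacing-base')
    for item in items:
        item_image = item.find('img')['src']
        item_price = item.find('span', class_='a-offscreen')
        item_price = 'NA' if item_price is None else item_price.text
        item_title = item.find('a', class_=title_class).text
        item_url = ('https://www.amazon.co.uk'
                    + item.find('a', class_=title_class)['href'])
        data.append({'title': item_title, 'price': item_price,
                     'image_url': item_image, 'url': item_url,
                     'page': i})  # remember which page this came from

    time.sleep(2)
    driver.quit()  # free the resources the previous instance was hogging

    # open a fresh instance pointed at the next page (i + 1)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              seleniumwire_options=options)
    driver.get(scrape_url + '&page=' + str(i + 1))
    try:
        driver.find_element_by_css_selector('#sp-cc-accept').click()
    except Exception:
        pass
    soup = BeautifulSoup(driver.page_source, 'html.parser')

# once the loop is done, tabulate everything and save it;
# index=False drops the useless row-number column from the file
df = pd.DataFrame(data)
df.to_csv(f'scrape_{search_term}.csv', index=False)
```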
Now I'll try to run this, and hopefully it will just run first time, which is a bit of an ask, but let's see. Run... and run. Okay, still waiting for Selenium to pop up with the window... here it comes. It's loading; we're looking for protein, and we're on page one. Okay, still loading; I'll probably end up fast-forwarding this bit in the edit, or cutting through it, until we get to the last page. Here we go: it's closed one of the instances and opened another one, and as you can see, the page has been set to two, because it's now scraping from page two; all the items are different. It's done page two, and it's probably going to open another instance in a second for page three. Here we are, page three.

Now, obviously, there are ways of making this process way faster by using multithreading, where you don't have to wait for one process to finish before the others start; you have multiple running at the same time, depending on how many threads your CPU can support. Feel free to do that, especially since you're using proxies as well: you're very unlikely to get banned for doing it quicker, because you're on random IPs and such. So yeah, feel free to look into multithreading if you'd like; I have tutorials on the channel, so I can link them in the description.

Looks like we're currently scraping page five, so two more pages to go, and we should then have the entire data set ready. Okay, page six coming up now; so far no errors, which is a good sign. Page six, and then the last one should be page seven, after which we can just convert our data list into a pandas DataFrame and save it as a CSV file, which you can do whatever you want with, of course: you can analyze it, use it for other projects, or use it to track products and watch for price changes, etc. Okay, that was the last page; as soon as this is closed, we should hopefully be ready to look at the data it's collected. Any time now... okay, done. Perfect.

So that's the script finished. It's basically collected all the data in two minutes and fifteen seconds, which is not too bad, I would say. Well, let's look at the data we've collected; let's see how many items are in there: 430, which is not bad, a pretty good number. We'll create a new variable, a data frame, set it to pd.DataFrame, and convert our data list into a pandas DataFrame, which will look like a nicely formatted table, something like this. Now we obviously have the title, the price, the image URL, the URL, and the page number, and you can just save this to a CSV file. I'll call it something meaningful... actually, I'll just call it "scrape" plus the keyword, and I'll change it to an f-string so that we can put the search term in the name, then .csv, and set the index to False. Setting the index to False just makes sure you don't end up with an extra column, which would just be the useless index of each of these rows, basically like this column; we just don't need that, so I'm setting it to False. Let's run this, and hopefully I should have a CSV file waiting for me on my desktop. Let me check... there we are; let me just open it for you guys. Here we go: we have the title, we have the price, we have the... oops, I don't need it to be that big... we have the image URL, we have the URL to the product, etc., and we even have the page number each item was scraped from, which is just amazing.

Now, obviously, feel free to do whatever you'd like with this data. That's over 400 products we've just scraped within the space of about two to three minutes in this case, so it's a pretty powerful tool. I hope I've been able to help you guys out. If you guys have any requests for new videos, please let me know; in the meanwhile, if you could share the video and like it, it would really help, and I'll see you guys' beautiful faces in the next tutorial. Peace out.