Hey guys, and welcome back to another Python tutorial. In this video we're going to scrape the top list of restaurants from TripAdvisor. This scrape could be useful whenever you're trying to plan a holiday, or any time you want to pull restaurant locations from TripAdvisor in general. First off, we need to pick the country or location we want restaurants for. Head over to tripadvisor.com and, in the "Where to?" box, type the location you're after; I'm going to type London and then click on "London restaurants". As you can see, it comes up with a list of restaurants we can visit in London, probably sorted in a particular order, with different information about each one: the name of the restaurant, its star rating out of five, the cuisine, and the dollar signs that represent how expensive it is. If you click through to a restaurant's own page, there's more information that's useful to scrape, such as its rank among London restaurants (this one is seventh), the address (handy once scraped, since you'll know where to visit), and the phone number. Those are the details we're going to scrape. So without further ado, let's begin. I'm going to go back a page, and the first thing to do is grab the URL of this listing page, because that's the page we'll be scraping: copy the URL and keep it at hand.
Now let's import all the libraries we need: from seleniumwire, import webdriver (I'll put a list of everything you need to install in the description so you can do that easily). We need BeautifulSoup to do the parsing, then webdriver-manager so we don't have to install ChromeDriver manually, and finally pandas, which lets us structure the data nicely and save it as a CSV at the end. Hopefully it all runs fine, and if it does we can proceed with the next steps. Amazing. First, create a variable called scrape_url and set it to the URL we copied a second ago, the page with all of our restaurant listings. Next, create a variable called data and set it to an empty list; this is where we'll store all of the results we scrape. Now comes the important part. I've made this point before, but whenever we do any scraping activity we always want to use a proxy, because it narrows the chances of getting rate-limited or blocked. In this video we're going to be using proxies from IPRoyal, who have kindly sponsored today's tutorial.
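As a quick sketch, the setup so far is just two variables; the URL below is illustrative (paste in whichever listing page you copied), and the imports mentioned above are shown as comments because the browser-driven parts need a desktop environment:

```python
# Libraries used in this tutorial (pip install selenium-wire,
# beautifulsoup4, webdriver-manager, pandas). Commented out here
# since the browser-driven parts won't run headlessly as-is:
# from seleniumwire import webdriver
# from bs4 import BeautifulSoup
# from webdriver_manager.chrome import ChromeDriverManager
# import pandas as pd

# Illustrative listing-page URL -- substitute the one you copied.
scrape_url = "https://www.tripadvisor.com/Restaurants-g186338-London_England.html"

# Every restaurant we scrape will be appended here as a dictionary.
data = []
```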
IPRoyal is a proxy service provider offering safe, private, and unrestricted access to online information. With a pool of over two million reliable IPs, IPRoyal lets clients use a proxy server as an intermediary between their devices and the web, which means they can maintain their privacy and use resources they can't access directly due to geo-restrictions and the like. IPRoyal's datacenter proxies are a great product for businesses or users looking for premium, high-speed, anonymous private proxies, usually with unlimited bandwidth and no extra charges for data scraping. That said, I'd recommend IPRoyal's residential proxies: they not only give you anonymity when scraping websites, which helps you avoid getting rate-limited, but also let you select the geolocation of the proxy, either through the dashboard or with minor changes in your code. Lastly, IPRoyal handles automatic IP rotation, which makes for easy integration in your code, and they also offer static IPs in case you decide to keep one IP for a longer duration. For this video, IPRoyal has provided me with a discount code that gives you a straight 50% discount on their Royal Residential proxies: the code is yohan50, and I've also included a link in the description which you can use to obtain the discount. Make sure to use the code before the deal expires, or you'll be missing out on a great deal. To begin, go to your IPRoyal dashboard; I've got 1.17 gigabytes of Royal Residential proxy traffic, which is the product I recommend. Once you're in the dashboard you can select from one of many available countries, and also the region; I've gone with the United Kingdom and London. You can also keep it random if you like, which means it will keep rotating between random countries or regions, but I've stuck with the United Kingdom. I've set my IP
rotation to random, which means I'll be assigned random IPs while the scrape is happening. Then, once all four of those options are selected, the thing we need from the dashboard is this little URL right here: copy from where it says http through to the end of the port number, and click copy. Back in the code, create a new variable called proxy and assign it the proxy string we just copied. Now create a dictionary called options containing a key called 'proxy', whose value is another dictionary: one key, 'http', with the HTTP proxy set to our string, and another key, 'https', with the HTTPS proxy set to the same string we pasted up here. With that set up, we can initialize our Chrome driver: create a variable called driver and set it to webdriver.Chrome (we're using Chrome, not Firefox), using ChromeDriverManager to install any necessary drivers; if you already have the drivers, it won't overwrite them. The second parameter is seleniumwire_options, which we set to the proxy options we just saved in the dictionary; by passing our options variable there, we're letting Selenium know we want it to use a proxy rather than a naked connection. Now we can finally start scraping: call driver.get, which fetches a particular page, and pass it the URL to scrape from, which we saved in the scrape_url variable.
Next, create a new variable called soup and assign it a BeautifulSoup object, passing in driver.page_source. What happens here is that the driver fetches the page and stores its contents, and we then hand all of that page source to BeautifulSoup to parse. Let's run this; hopefully a Chrome window will pop up on my screen and start opening the URL we want to scrape. Give it a few moments - my internet connection is a bit rusty, which is why it may take some time, so I'll fast-forward this bit in the video. Okay, now that the page has been scraped successfully and our soup variable holds the parsed version of it, we can view the soup variable; all you'll see is a bunch of HTML and JavaScript code rendered out from the site. Next, let's go back to the actual TripAdvisor website and find the elements we want to scrape. First we need the list of all the restaurants we want details about, which is down here, so let's inspect the elements. As we hover over different elements, we can see which divs hold the data we need. At this stage we only care about the restaurant's name and the URL hidden behind it, so that we can go on to the restaurant's own page and scrape the rest of the details, like the contact information. So right now, all we care about is the bit containing this "London Cabra Club" text and its URL.
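Putting the proxy and driver setup just described into code, it looks roughly like this; the proxy string is a placeholder, not a real credential, and the driver lines are commented out since they need a browser:

```python
# Placeholder -- substitute the endpoint copied from the IPRoyal
# dashboard (it contains your username, password, host and port).
proxy = "http://user:pass@geo.iproyal.com:12321"

# selenium-wire reads the proxy from this nested dictionary; the same
# endpoint is reused for both http and https traffic.
options = {
    "proxy": {
        "http": proxy,
        "https": proxy,
    }
}

# With a desktop browser available, the driver and soup would then be:
# driver = webdriver.Chrome(ChromeDriverManager().install(),
#                           seleniumwire_options=options)
# driver.get(scrape_url)
# soup = BeautifulSoup(driver.page_source, "html.parser")
```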
So it looks like this is the div that contains what we need. We know it's a div, so copy its class name, go back to the code, and create a new variable called restaurants, assigned to soup.find_all: we want all the divs, but only the ones with the class name we just copied. Run that and print restaurants - actually, let's just print the first restaurant in the list. This is the entire div containing all the data we need, and we can verify that because, as we can see here, it shows the restaurant's name, Steak and Company, and we also have the URL to its page. Now we have to loop through this list of restaurants so we can scrape the details for each and every one. I'll create a loop: for each restaurant, we keep its name, so name = restaurant.find on the <a> anchor tag - within this whole div, that's the element holding both our URL and the restaurant's name - and instead of keeping the URL from the anchor tag, we keep its text, which gives us the restaurant's name. If I print name and break out of the loop, it should show the name of the first restaurant it sees. Perfect. We also need the URL, so create a new variable called url and assign it restaurant.find on the anchor tag again, but this time keep the href instead of the text; href is the attribute of an anchor tag that holds the link the user gets redirected to. If I print url, as you can see we have the URL, but notice that it's missing the front part - it only has the end part - so we'll fix that by adding the front of the URL back on.
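The find_all and anchor-tag steps can be tried on a static snippet without a browser; the div class here ("listing") is made up for the example, since TripAdvisor's real class names are obfuscated and change often:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the listing page's source; "listing" is a
# made-up class name for illustration.
html = """
<div class="listing"><a href="/Restaurant_Review-d123.html">Steak &amp; Co.</a></div>
<div class="listing"><a href="/Restaurant_Review-d456.html">Cabra Club</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

restaurants = soup.find_all("div", class_="listing")
names_and_urls = []
for restaurant in restaurants:
    anchor = restaurant.find("a")  # the <a> holds both name and link
    names_and_urls.append((anchor.text, anchor["href"]))
```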
Going back to the website, the missing part is tripadvisor.com, so let's copy that and paste it on the front; we don't need an extra forward slash because the scraped href already includes one. Print that again and it should now be a working link. Let's give it a go - copy it, paste it into the browser, and voila, it works. Great, so now that we have the URL of each restaurant's own page, we can scrape the details from that page: like I said before, those are the star rating, the restaurant's rank, the dollar-sign pricing information, and some contact details. Back in the code, we need to ask our driver to scrape that next page; so far we've only scraped the listing page, which gives us just the list of restaurants, and now for each restaurant we need to fetch its page to get its details. So driver.get(url), and then soup2 - not the best naming convention, I know - equals BeautifulSoup, passing it driver.page_source. This should now hold the updated page, which is the restaurant's page. Let's run that, and I'll fast-forward again because it'll probably take a few seconds on my slow connection. Right, now that it's done scraping one of the restaurant pages, let's take a look at soup2. Once again it's just a bunch of HTML, JavaScript, and CSS, and we only want to pick out the meaningful information. First off, let's find the rating of the restaurant - how many stars it has - by inspecting the rating element. I was pretty sure it would be an SVG, as it is most times, and yes it is: an SVG with the class name uctvdh0 (these obfuscated names change often, so copy whatever yours shows). Let's grab that class name; since we know it's an SVG, we can head back to the code.
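Prepending the domain by hand works fine; the standard library's urljoin does the same job and copes whether or not the scraped href has a leading slash:

```python
from urllib.parse import urljoin

base = "https://www.tripadvisor.com"

# The scraped href is relative, so join it onto the site root.
relative = "/Restaurant_Review-d123-Reviews.html"
full_url = urljoin(base, relative)
```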
Create a new variable called rating and set it to soup2.find; we know it's an SVG and we just copied the class name, so paste that in, and then print out rating. As you can see, it's grabbed the whole SVG, but we don't need all of this - we don't even need "4.5 of 5 bubbles", we just need the 4.5, because that's the only data we really care about. First, to get only the text, index into the aria-label attribute with square brackets, just like we did with href on the anchor tag; as you can see, we're now left with only the text. With just the text, we can easily manipulate it: split on the space character, keep the first element, which leaves us with only 4.5, and call strip as well to get rid of any leading or trailing spaces. Now we've got "4.5", but it's in the wrong format - it's a string, which is why it has quotes around it - so cast it to a float, and now it's properly formatted and looking good. That's the rating sorted. The next thing to get is the rank, over here, so let's inspect that element too. It's in a span, but the span doesn't have any id or class, so we go up to its parents; a few levels up there's a span with the class DSYBJCMFRA - this one right here. One thing I noticed while preparing this tutorial is that there are quite a few spans with this class, so instead of taking the first occurrence of the class we'll have to find a specific occurrence. Create a new variable called rank and do soup2.find with span and the class we just copied.
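The aria-label cleanup is plain string work, so here it is on its own; the label text mirrors what the rating SVG carries:

```python
# Example aria-label value as pulled from the rating SVG.
aria_label = "4.5 of 5 bubbles"

# Split on spaces, keep the leading number, strip stray whitespace,
# and cast to float so the rating is stored as a number, not a string.
rating = float(aria_label.split(" ")[0].strip())
```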
Then change find to find_all, because we want every occurrence of this span class - like I said, there are multiple spans with it. If I print this, you'll see there are several different ones, so let's check which one we need. Index 0... at first glance that one just seems to mention the number of restaurants; index 1 looks like the address; index 2 doesn't seem to have what we need either. Let's check 0 again - I think it should be 0 - and there we go, it reads "... of 15,549 restaurants", so that's the rank. It was part of the first element of the spans with this class, and as you can see, we've now got the rank of the restaurant stored in the rank variable. The next thing we need is the pricing - the dollar signs - plus the bit of text that shows up next to it, which is usually the cuisine. Inspect it, and we can see there's a span containing all of it, with the class dsybjdxfe; copy that class. When I was researching before making this tutorial, there was only one occurrence of this class with what we need, so we don't have to do find_all this time - plain find will do. I'll call the variable pricing_cuisine, since we'll keep both data points in the one variable, and do soup2.find again on a span with that class. Print it out, and we have what we need; we just have to add .text, and we get how expensive the restaurant is followed by the cuisine - steakhouse and seafood in this case. The only quirk is that the spaces between the parts go missing, but I'll leave that to you as a little challenge: try to figure out a way to add the spaces back in.
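Hard-coding the index into the find_all result works until the page layout shifts; a slightly sturdier variant picks the occurrence by its text shape instead. The span texts below are illustrative stand-ins for what find_all would return:

```python
import re

# Illustrative stand-ins for the .text of each span sharing the class.
span_texts = [
    "#7 of 15,549 Restaurants in London",
    "30 John Islip Street, London",
    "+44 20 7100 0000",
]

# Select the "#N of M Restaurants" entry rather than trusting index 0.
rank = next(t for t in span_texts
            if re.search(r"#\d+ of [\d,]+ Restaurants", t))
```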
If you want to take it further, you can also try to find a way to store the pricing - how expensive the restaurant is - in a separate variable, but for now we'll leave it as is. The next thing we want is the address. What we noticed a moment ago is that when we did find_all, one of the elements contained the address, so let's copy the rank line from up here and, instead of index 0, use the next element of the list - I believe it was index 1. Print address, and beautiful: we have the address of the restaurant, which should be 30 John Islip Street - yes, correct. Now that we've got the address, the only other bit left to scrape is the contact. I believe that too was part of the same find_all result as the rank, just at a different position in the list, so let's try index 2. Print contact, and yep, beautiful, we have the contact number as well. Those are all the details we're scraping. Next we want to actually save all of this data into the data list: do data.append and pass it a dictionary of key-value pairs - 'name' with the restaurant's name, 'url' with the restaurant's URL, 'rating' with the rating, the pricing-and-cuisine key with the pricing_cuisine variable, 'address' with the address, and lastly 'contact' with the contact. Okay, beautiful. Now that everything is being saved and appended to the data variable, I'm going to cut this code and put it inside the loop. The reason I was running it outside the loop is that we'd already scraped the soup for the first restaurant in the list, and I didn't want to rerun it every time and make it scrape over and over again, because that would take forever. With that done, I'll put in a break statement, which means the loop will stop as soon as it's done with the first iteration instead of going through all of the restaurants.
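Appending each restaurant as a dictionary keeps every row in the same shape; the literals here are sample values standing in for the scraped variables:

```python
data = []

# Sample values standing in for what the scrape produced.
name = "STK Steakhouse - Westminster"
url = "https://www.tripadvisor.com/Restaurant_Review-d123.html"
rating = 4.5
pricing_cuisine = "$$$$Steakhouse, Seafood"
address = "30 John Islip Street, London"
contact = "+44 20 7100 0000"

# One dictionary per restaurant, appended on every loop iteration.
data.append({
    "name": name,
    "url": url,
    "rating": rating,
    "pricing_cuisine": pricing_cuisine,
    "address": address,
    "contact": contact,
})
```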
So we'll only have data for the first restaurant in this scenario; let's give that a go and see what we get. Great - now that the loop has finished executing, let's check our data variable, and hopefully it's populated with data. It's looking amazing. The restaurant we scraped is STK Steakhouse - Westminster, correct; the URL is obviously correct, otherwise we wouldn't have been able to get the rating and everything else; the rating is 4.5 stars, absolutely correct; the pricing and cuisine is four dollar signs plus steakhouse and seafood, which is correct as well; the address looks right, and so does the contact number. Amazing. If you want to scrape all of the restaurants instead of just the first one - which makes more sense - simply remove the break statement, and it will go ahead and scrape the data for every restaurant, all without getting blocked, because you're smart and using proxies to keep yourself safe. The last thing to do is convert the data into a DataFrame using the pandas DataFrame method: pass it the data variable, assign the result to df, and it will show you a nicely formatted table. Obviously there's only one row right now, which is why it looks a bit odd, but if you wanted to export it to a CSV file, you could do so by calling .to_csv and naming the file something like "popular london restaurants.csv", and then use it for all sorts of analysis or any other purpose that's useful to you. I hope you found this video useful. Sorry for not uploading for so long, but I'll be uploading next month and the month after as well, so please stay tuned for those tutorials, and I shall see your beautiful faces in the next one. Peace.
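The final export is just a couple of pandas calls; the single row below is sample data in the same shape as the scrape output:

```python
import pandas as pd

# Sample row standing in for the scraped results.
data = [{
    "name": "STK Steakhouse - Westminster",
    "rating": 4.5,
    "address": "30 John Islip Street, London",
}]

df = pd.DataFrame(data)

# index=False keeps pandas' row index out of the CSV file.
df.to_csv("popular_london_restaurants.csv", index=False)
```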