Welcome to this UK Data Service webinar on getting data from the Internet. The presenter today will be Peter Smyth of the Cathie Marsh Institute for Social Research at the University of Manchester.

Thank you, Gail. In this webinar, what we are going to do is look at ways of getting data from the Internet. We are going to cover four different aspects in pretty short order: first, simple copy and paste; then downloading files of different types; then using APIs; and finally what is genuinely called web scraping. More and more of the data we might want to use lives out on the Internet rather than on our own machines, so it is worth knowing a range of techniques for getting hold of it, from simply working in the browser through to writing code. The coded examples today are written in Python, although much the same can be done in R.

Starting off with copy and paste. If you want a small amount of data, an article or a table or so, you may find it easy to just copy from the browser screen you're seeing and paste it into a Word document or an Excel document or whatever. What you get isn't necessarily what you see on the screen, though: what ends up on the clipboard depends on the browser you are using and on how you paste, whether you keep the source formatting or match the destination formatting. To demonstrate, I copied the same football league table from a web page into Excel, which is a tool a lot of people like to use, from several different browsers, pasting with both options.
If you just look at the different formats I came up with: in Edge I did it with both source pasting and destination pasting. The Edge source paste seems to have given a different result this time, because here I've got all of the teams, and notice that all of these cells are filled in as well. The Edge destination paste looks very similar, except that these entries are no longer links, as they were before. The Firefox source paste has lost the teams and lost the column headers; the destination paste is very similar. Chrome gives similar sorts of results. None of them is absolutely perfect. The Edge source paste is about as close as you're going to get, but even there, if you look at these column names up here, these are the ones which seemed to disappear. If I go back and look at the actual page in Firefox, here I've got Played, Won, Drawn, Lost, whereas in my spreadsheet it came up as 'PWDL' and so on. So again, you're not always getting what you think you ought to be getting.

So, instead of using straightforward copy and paste, let's assume we want to put this into Excel as a table. We can go into Excel and choose Data, then From Web, in the Get & Transform Data group. This is available in all of the Office 365 versions of Excel now, I think, so provided your Excel is reasonably up to date you should have it available to you. You just get a little form asking you to put in a URL, so I'm going to go back up here, copy my URL, paste it in and say OK. This will go away and get that whole page, which actually had more than the table on it, but it knows I'm probably looking for a table, so when it comes back it presents the options and asks which table you want. It's not very good at labelling the tables, but you can get a preview: if I go to Table 0, I can see this is the table I want, and then all I've got to do is click on Load and it will put all of the data into the spreadsheet. It even formats it as a nice little table. You've still got these extraneous things which you probably didn't want ('Team has moved', and this stuff at the end, which was a little graphic, is still there), but this is generally a lot cleaner. If I delete that... I'm not going to bother cleaning it all up, but you can see which part is the table and which part isn't, and this is a far cleaner way of importing tables from the Internet: use Excel directly and import it that way.

So, back to the slides. Copy and paste doesn't always work; using Excel's web import is genuinely better, and that is just what I've shown you. Good, let's make progress.
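For anyone who prefers to stay in Python rather than Excel, a comparable route (not shown in the webinar) is pandas' read_html, which fetches a page and returns a list of DataFrames, one per table it finds; the URL below is a placeholder:

```python
import pandas as pd

# read_html needs a parser such as lxml installed (Anaconda includes one);
# it returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("https://example.com/league-table")  # placeholder URL

print(len(tables))       # how many tables were found
print(tables[0].head())  # preview the first table, like Excel's Table 0
```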
Right: downloading files of different types. This is interesting. Again, you've probably all done it: there's a link which says 'download' or 'download here', or there's a file name that you want, and you can just click on it and directly download it, and it goes into My Documents or somewhere like that. If you're not using the browser, there are software tools, wget and curl, which are Unix-based options; they don't occur naturally in Windows, although there are ways of getting hold of them if you want to use them. The other option is to write the download in code. It could be Python or it could be R, but the demonstrations I've got today are all going to be in Python.

The advantage of using code is that you get better documentation and reproducibility: if you just clicked on a link in a web page and downloaded the file, you've got no record of how you got that file. If we're into coding, we can also automate the download, so if we wanted multiple files we could find a way, perhaps, of downloading them all together. Now, automation isn't always going to be possible. For example, some file locations put a UUID, or other randomness, in the path. What I mean by that is strings of hexadecimal digits which are part of the file name, or of the path to it, and which make it impossible for you to predict what the other files in the series might be called. Another problem is pages that use JavaScript functions; these are typically behind the buttons you press to download a file, and because the button goes away and runs some JavaScript to do the download, you can't see the file name. We need situations where we can actually get hold of the name of the file we're trying to download: if we've got the file name and its path, we can do a fair job of automating some of this. This approach will also work for a lot of API calls.

On to the demonstration; let me just escape out of that. For all of these demonstrations I'm going to be using Python, mainly in the form of JupyterLab. JupyterLab comes in the Anaconda release of Python, it's already built in there, and it's very useful not only because of the Python environment but because it has lots of little tools to help you do things. I'm not going to go through downloading by clicking on a button into My Documents, because we've all probably done that before. I'll very briefly show you the command-line tools, and hopefully deter you from using them unless you happen to be a command-line user, you've used them before and you know what they're going to do for you. As I said, wget and curl don't occur naturally in the Windows environment, but there are various ways of getting them, and Windows 10 has a Linux subsystem which includes them. For our demonstration today I've got this little tool called MobaXterm, which is actually for remote access to other systems, but as part of it you get a little Linux terminal session.

Here I'll just type in a command and run it. What I want to do is demonstrate wget; I'll limit myself to wget. I'm just going to copy that and go back to my command line, and you can see here the command I'm going to run: wget, then what it is I want to get (oops, I've got two commands at once; let's just get rid of that), which is going to be a web page, and then -O, which means the output, which I want to call bbc_wget.html. If I run that by pressing Enter, hopefully it goes away, it gets it, 100%, it's got it all, and it's told me it's saved. I've also got open somewhere here, I think, the folder where it saved, and you can see bbc_wget.html, quarter past three. If I double-click on that it loads in the browser, but you can see it's not a very good representation of the page. Most of the information in the page is there; what is missing is all of the graphics. That doesn't always happen; it depends on how the page is constructed. Some pages will download graphics automatically, some don't. I could do exactly the same thing in curl if I wanted to, but I won't bother you with it, because it gives the same results.
The other one I want to show you is this wget command here. It's going to go to what is actually an API. Having 'api' at the beginning of the URL doesn't mean it is an API; it just happens to be one in this case. Effectively, the call I'm making asks for all of the football competitions that this API knows about. We're going to be using this API later on; for the moment, it's going to give me a list of all the competitions. I know this gets returned in a format called JSON, so I've cleverly named the file with a .json extension, just to make that obvious to me. Again it goes away and connects... it's waiting for a response, which hopefully will come back and eventually give me some output. I'm still waiting... I've got an example that I ran previously of the same call, so that should come back and give me a file, but we don't need it at the moment; it was really just to demonstrate how these things work. Ctrl+C to get out of that, and forget about it.

The only other thing I wanted to show you, for those of you who haven't used these tools before: if I look at the help file for wget, there is an awful lot of things you can do with it, an awful lot of parameters. As you can imagine, it's very large, very flexible, but very complex. So my advice would be that if you're not naturally a command-line user, it's nice to know about these tools, but you're probably better off avoiding them. Instead, what we're going to look at is using similar techniques from Python.

So in here is my Python environment; this is JupyterLab. One of the reasons I use JupyterLab, particularly in this demonstration, is that along with being able to run all the Python, it has lots of little file viewers which make it easy for us to look at results. I can look at a CSV file by clicking there (we'll come back to that one), and it will even format a JSON file very nicely, or at least very conveniently, for you to read. JSON is what APIs tend to use, and an API is really designed for one computer to talk to another computer, or one program to talk to another program, so naturally enough raw JSON isn't particularly well formatted for you and me to read; but there are lots of tools around which will format JSON to make it easier to use.

To use Python for this, we're going to need the library the slide called requests. requests is built in, in the sense that it's automatically installed when you download the Anaconda version; otherwise you can install it yourself. It will allow us to download and save files of a variety of formats. I'm going to start off by setting up a url variable, and in there I'm putting this string. You can probably tell from the last bit, .csv, that it's going to connect to a CSV file; and I'm going to save it with this file name here. So I just run that, and there's the second variable set up in Python. This is the actual call using the requests library: all we have to say is get, which is the method name, and pass it the URL we want as a parameter. Then I'm going to print the status code, and then save the file: open the file, write to it, close it. r is an object which contains everything that comes back from the request. Part of that is the status code, which you might want to look at; the content is effectively the file content; and there are lots of other things in there which we may or may not want to look at.
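As a minimal sketch, the pattern just described looks something like this (the URL and file name are placeholders, not the ones used in the demo):

```python
import requests

url = "https://example.com/data/somefile.csv"  # placeholder URL ending in .csv
outfile = "somefile.csv"                       # local name to save it under

r = requests.get(url)   # r holds everything that came back from the request
print(r.status_code)    # 200 means everything worked
print(r.url)            # the URL that was actually requested

with open(outfile, "wb") as f:
    f.write(r.content)  # r.content is the raw bytes of the downloaded file
```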
So if I just run that, hopefully it comes back, and what I printed was a status code of 200. This is what we're looking for: 200 means OK, everything worked. Other codes you're likely to come across are in the 400 range, 404 ('not found') being the most popular, and 403, 'forbidden', meaning you're not allowed to have the file; the 500s are server errors; and so on. If it's 200, that's good and you can carry on. Just one other thing we can look at, out of curiosity, is the URL itself: r.url is the file name that we asked for. And we can see here... let me just refresh that... oh, I'm in the wrong folder, that's why. If I go back into Files, there's my file, which I've just downloaded.

I'll just quickly do the others; I've got copies of these. I just want to demonstrate that this technique works for virtually any type of file you want. This one is going to be the JSON file, the same file I failed to download a minute ago using wget. I run it, again I've got 200 coming back, and we can see competitions.json here; and again we've got a nice little way of looking at all of the results, all of the competitions. We'll come back to using JSON later on when we look at the APIs. And finally, an HTML file: here I'm going to bbcbusiness.html. I picked this a couple of days ago, so I hope the page still exists. That's just setting up the variables; run it; it comes back as 200; and there's bbcbusiness.html. I'll double-click on that, and because it's JupyterLab it gives a nice little presentation of the page, including, in this case, the graphics. It's not quite the same format as you would see natively on screen; I think it's actually the mobile version, because of the little hamburger menu up here which takes me down to the options; but it's a live page and I can move around it. So I'm able to download files at any time using simple code, and I think you'll agree that just giving it a URL is a lot simpler than trying to use wget or curl.

The next thing we'll do is take that a step further: having got one file, we're going to look at getting multiple files. We've already seen that slide; the idea here is that we want to find a way of identifying the files in bulk, if you like. Back up to Jupyter. For this demonstration we're going to be a little bit topical: multiple files using Covid information. This website, getthedata.com, is very useful, with lots of various types of data, and it has obviously been updated very recently to include Covid-19 data. If you go to that link, which I've probably got up here, it's got all manner of useful information about something we would all rather never existed, but never mind. It covers all of the UK, in different areas, and if you go to the bottom it's got a whole load of files available for you to download. If I click on one of those files, I get the usual thing: 'do you want to save this file', and so on, a normal save-type dialogue. What is interesting about it is that up here it says 'cases by...' exactly the same as the link I was clicking on, and it says it is from https://www.getthedata.com. Well, that doesn't quite match this whole URL up here. So where exactly is this data coming from?
If I'm just using this dialogue box to download one file, I don't really care, as long as it downloads; but it would be nice to know where this file actually comes from. If I close that and hover over the file link, right down in the bottom left (it's very small, so you probably won't be able to read it), the browser gives me a little graphic saying what the file is really called, and what you see down there is the full path and file name that you really need to be downloading. The trouble is, it's not simply this URL plus that link text; it's something else. So your options would be to try to read it off the screen and write it down or copy it somewhere, which is rather tricky; but there are other ways of doing this.

Let me make this almost full screen. All modern browsers are very sophisticated pieces of software, and they include all kinds of tools for the web developer, sometimes just called developer tools, or web developer, or whatever. They all have this thing called an inspector, and it's pretty scary if you haven't seen HTML or the insides of web pages before. Within the set of tools there is an item called the inspector, and there's this button here, which is also an inspector; they all tend to use a similar symbol, and it's always on the left-hand side. If I click on it, so that it goes blue, then when I move into my web page, up at the top, I get this little grid-like highlighting of individual elements, so I can highlight the one file link that I want. When I click on it, down below, where the HTML code is shown, it highlights the line of HTML which puts that bit on the screen, which puts it into the web page, if you like.

What I want to do is copy that. Notice that if I go to Copy here, I've got options of inner HTML and outer HTML. Going back to our simple copy-and-paste exercise, where we weren't getting the same information we thought we were copying, this goes some way towards explaining it: there is inner HTML and there is outer HTML, and what we're interested in is the outer HTML. I'm just going to paste it into Notepad so we can see it better. What we've got here is an HTML tag, called an 'a' tag, and a tags enclose links: this is the a tag, this is the end of the a tag, and everything in between is related to it. This part here is actually the inner HTML, and it's what you see on the screen. And what we've got here, in this parameter called href, is the file it's actually going to try to download. The fact that there's no 'https://' at the front means the download is relative to where we're starting from. If it were a full URL I could use it directly; otherwise I can use it together with the page's URL to work out exactly what the file name is, and once I've got that, I can start downloading.

Back to our workbook. You can see here I've written it up: I've done exactly that process and constructed this file name, and here I'm going to do exactly the same thing as I did before; all of this code is exactly what we've seen. If I run that (let me just run the requests import first, or it will fail)... I didn't print out the URL, but you can see up here that I've got a new version of the file, and it looks exactly the same as the other one did. But what we really want to do here is download lots of files together, and the way we can do that is by recognising the format of these file names: they've got dates in them. Dates are good, because we understand dates: we know what the next date is going to be, and we know what the previous date was. So in this next little section down here, which is just a little test section, what I want to do is construct URLs corresponding to lots of dates, i.e. all of the different file names; and at the same time I want to store the downloads somewhere, so I'm also going to create the names of the files I want to save to. If I ran all of that, I would get this output here, and you can see that at the end everything is the same until you get to 16, 17, 18, 19. That's just a little bit of looping Python code, and similarly for the outputs, the data files I'm going to produce, they end 16, 17, 18 and so on. Because we know the structure of the file names, we can create a little loop which iterates through and gets all of the files. Then we put it all together in our little loop: range(16, 32), which, if you're a Python person, you'll understand means starting at 16 and ending at 31; the stem is the same; and a little f-string at the end adds the day number and .csv to name each file. All the code inside this loop is essentially exactly what we were running before for a single item, but this time it loops around, gets all of the files and puts the data where we want it.
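A sketch of that loop, with an invented URL stem standing in for the real one read off with the inspector:

```python
import requests

# Only the day number at the end of the file name changes (this stem is invented)
stem = "https://www.getthedata.com/downloads/covid-19-cases-2020-03-"

for day in range(16, 32):                  # 16 up to and including 31
    url = f"{stem}{day}.csv"
    outfile = f"data/covid-19/cases-2020-03-{day}.csv"
    r = requests.get(url)
    print(day, r.status_code)              # check each download worked
    if r.status_code == 200:
        with open(outfile, "wb") as f:
            f.write(r.content)
```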
I've already run it, and you can see here in this folder I've collected all of the files from 16 to 31: individual files for each day over those two weeks in March.

The next thing we might want to do: being able to download multiple files is half the battle, but the real point is that we're probably going to want to put them all together at some point. Again we can use Excel. Here we're going to go to Data and say Get Data, From File, and specifically From Folder. Let me make this a little bigger. I can browse to where the folder is (where did I put it... Data, data, covid-19; it wants a folder), so I just select the folder, highlight it and say OK. It then tells me what it's found in that folder: a list of all of our files, which we knew. What we want to do is combine these files, so we use the little box down here, Combine. I don't want to do anything other than just combine them and load them into my spreadsheet. Eventually it shows you a little preview of the first file; notice the three columns that we had in all of the files, all the same. The obvious condition is that the files do have to have the same column names. Eventually it creates a query, which is doing all the hard work that you no longer have to do, and then it puts all of the data from all of the files into your spreadsheet as a table. You can also see what it's done for us, which is very useful: it's actually put the file names in here, so there's the 16th, and if I go far enough down I get to the 17th, and so on. So all of our data is in there, and then you can start doing whatever it is you might want to do.

Let's insert a pivot table. We'll have the area names in the columns; we want Source.Name, which is actually the file name, but in our case really represents the date; and then total cases down there as the values. For each of those dates, this gives us the values drawn from all of the files.
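The webinar does the combining in Excel; for those who would rather stay in Python, a rough equivalent with pandas might look like this (folder name and column handling assumed):

```python
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("data/covid-19/*.csv")):  # every daily file in the folder
    df = pd.read_csv(path)
    df["Source.Name"] = path   # keep the file name, like Excel's Source.Name column
    frames.append(df)

# works because every file has the same column names
combined = pd.concat(frames, ignore_index=True)
print(combined.head())
```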
If I just restrict this to a couple... oops... you can pick any areas you want, so I thought Kingston upon Thames and Kingston upon Hull, quite diverse locations. Let's insert a line chart, and there you can see a little graph comparing the two together. So: we started off by setting out to download multiple files, we've stuck them all together in Excel, and then we can use the normal charting tools to draw nice little graphs. Each point along these lines comes from a different file that we downloaded.

Moving on, because time is pressing, we're now going to use an API. With APIs, always read the API documentation. It's very important to know how the API works, because it might have restrictions on it, like the need for keys, and limits on how many calls you can make in a period of time. For the one we're going to use, you do need a key, but it's free, and there are some call restrictions.

Back to our demos: the football API, football-data.org. What we're going to do here is start off with a very similar approach. First of all I'm going to show you what the JSON looks like. I'm just going to import my libraries, and we're going to use the requests module again, just as we did before. This time it's a little bit more complex: in addition to the URL, which is all we needed before, this API requires you to have an API key, so we need a way of specifying it, and to do that we use this headers parameter, in which we say 'this is our authentication token' and give the value of my token. You can all get your own tokens, all free; there's nothing hard about that. The call also includes things called parameters, and what we're going to say here is match day equals 1 and season equals 2018.

Now, because I insisted that we all do it, let's just go back to the website for football-data.org: get your free API key, and the API documentation. If you go to the quick start, it shows you what can be got from the API, how to use it, and examples of things you might want to do. This one up here, /v2/competitions, is effectively one you don't need a key for; you can just download it, and that's effectively the call I made when I was downloading files before. It gives you a list of all of the competitions, and you need that because the competitions are identified by numbers: if you want the Premier League, you've got to know what number it is, and that's why they give that call for free. The other thing I just want to point out: if you go to Pricing, there is a free option, and at the bottom it tells you the free option allows 10 calls per minute. You have to bear that in mind, otherwise it will stop you, and other APIs will potentially have similar kinds of restrictions, certainly Twitter.

So, back to our code in JupyterLab. This is our simple code. I'll just show you what the JSON looks like (I know we sort of saw it before), and at the same time I'm going to save the file, because you want to limit your time on the Internet where the service restricts you: downloading the file and saving it is usually a good idea. So I'm going to print it and save it at the same time, and if I run that, what I'm seeing here is nicely formatted JSON; this indent=4 makes it nicely formatted, so you can sort of read it, see where things are and see all the information it's bringing back for you. I'm just going to clear that.
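A sketch of that call; the endpoint path and the Premier League's competition number are assumed from the football-data.org v2 documentation and the competitions list described above, and you would substitute your own key:

```python
import json
import requests

url = "https://api.football-data.org/v2/competitions/2021/matches"  # 2021: the Premier League's number
headers = {"X-Auth-Token": "YOUR_API_KEY"}   # your free key from football-data.org
params = {"matchday": 1, "season": 2018}     # the 2018-19 season

r = requests.get(url, headers=headers, params=params)
data = r.json()                              # parse the JSON response

print(json.dumps(data, indent=4))            # indent=4 gives the nicely formatted view
with open("games.json", "w") as f:           # save a copy, to limit repeat calls
    json.dump(data, f, indent=4)
```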
Oh, the other thing: if you want to know what JSON really looks like in the raw, it's that, and it's totally useless; you can't possibly use it in any practical way, which is why you usually end up with tools which will format it nicely, as I did.

The next thing is to use the same approach as before to get multiple downloads. You'll notice that up here I said match day was 1. I'm going to keep the season the same, but if I go from match day 1 to match day 38, I'll get all of the matches from all of the match days. So here I've got md in range(1, 39), which again means 1 to 38; I'm going to loop around, and each time I'm going to get the data. Now, I'm not going to run this, because, as we've just noted, if I try to do this 38 times and I'm only allowed 10 calls a minute, it's going to fall over; I'm going to get stuck. So what we're seeing here is just the printed results from when I ran it previously, and you can see exactly the same pattern, match day 1 right up to match day 38: I've correctly formulated the URLs I need, and then we just need to run it.

Next we need to parse the JSON to extract the bits of information we want. If you remember, up here I saved one of the files as games.json. If I look at games.json, this was just the first call I did, so it's only one match day, match day one; but you can see in here I've got bits of information like the match day, and I've got details of the matches. I can get the score from each match, and I can find out who the home team was and who the away team was. So all of the information I need from the match is available to us, including the referees, if I wanted the referees.

So what bits are we going to extract? What we're going to try to do is create a football league table, so what we're interested in knowing is the score; but let's start with the match. The response that came back we're going to put into a variable, and we're going to use that variable to extract the matches from it; and from the matches, of which we know there are ten, we want the first one, so that's index zero, because this is Python. Then, progressively (I'm not going to go through all of these), we can drill down and extract more and more precise information: that's the full-time home team score and the full-time away team score, which together are the score; down here I can get the half-time scores; and I can get who the teams were, just as I was highlighting in the viewer. But this is using Python code to do it.

Then what we want to do is save the results into a CSV file, and this is the code that's going to do that. This is just pure Python; it's nothing to do with the API, we've already done all the API work. And then I'm going to use this piece of code right at the end to actually generate a football table. Fortunately, I've already got these here, so if we look at result.csv you can see the items which I extracted: the match day, the home team, the away team, home goals, away goals, and the half-time scores. And from that, the last bit of the code actually generates a league table; again, that's just simple Python.
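The drill-down just described, as a sketch; the key names follow the structure visible in the saved games.json:

```python
import json

with open("games.json") as f:   # the match-day file saved earlier
    data = json.load(f)

match = data["matches"][0]      # first of the ten matches: index zero, because this is Python

home = match["homeTeam"]["name"]                  # who played at home
away = match["awayTeam"]["name"]
ft_home = match["score"]["fullTime"]["homeTeam"]  # full-time scores
ft_away = match["score"]["fullTime"]["awayTeam"]
ht_home = match["score"]["halfTime"]["homeTeam"]  # half-time scores are there too

print(f"{home} {ft_home}-{ft_away} {away}")
```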
So from that data, downloaded across 38 different API calls, I can recreate the 2018-19 Premier League table. If you're interested in football you may find that fascinating; if not, well, the same approach applies to lots of other things as well.

So finally, having covered APIs, we come to real web scraping. This is where it gets a little bit trickier. We need suitable tools: in Python we've got a package called Beautiful Soup, which again is installed as part of Anaconda. You need to know a little about how HTML works, but not a great deal. The most important thing is the one I've already shown you when we were looking at the inspector: you need to be able to match up what's on the screen with the underlying HTML, and the tool for that is the inspector, as I've just shown you.

What we're going to do is try to get some information on Tesco stores. Tesco, like lots of other chains, has a store locator. If you go to the store locator you can put in a postcode, N13-something here, and it comes up with a little map where you can pick one of the stores, and via the store details that will take me to the actual web page for the store. Each of the stores has a web page, and they all look very much like this: a little picture, an image map with a pin in it, and various information about the store itself. What I want you to note is up at the top here, in the URL, it's got this number in it: 6778. We're going to use that in the same way as before to find multiple stores. For example, I've got some example calls here where I've just changed the number; I know those two exist, and I'm pretty sure that 999 doesn't. What we're going to do is try to extract, for multiple stores, i.e. by changing that number, various bits of information: the name, the address, the geolocation, the store type, the postcode and some other things. We're not going to do all of these; we'll just pick a few.

So again I'm just going to run that to import my libraries. As an example of how this sort of thing works, before we go on to Tesco itself: if I run this... no, I'm not going to run that, it would take a long time; it's the BBC site again, with a standard request... dear me, I seem to have missed a line out. OK, I am going to run that, just to get the BBC site; it's the same as the first demonstration, nothing much different. Having got the information back, it's now in r. What we do is create a 'soup' of it, using bs, which is Beautiful Soup, and we need the text. If we want to look at it, we can use the prettified HTML to have a little look, but we won't bother looking at it all. All we need to know is that the soup contains all the information we need to extract anything we want from the BBC web page. So, in a for loop, we can say find_all: find anything which has a tag of img (img is another tag, like the a tag and so on), and within that tag get the value of the src parameter, then print the URL that comes out. If I run that, you can see, just from what's come out, .png, .jpeg and what have you: these are all of the images on that BBC page. And the one which is probably more commonly done is to find all of the hrefs, so again we're going to use the a tag and look for hrefs, and you can see all of the links on that BBC page.
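Put together, the img and href examples look roughly like this (the BBC News URL stands in for whichever page was fetched in the demo):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.bbc.co.uk/news")  # any HTML page will do
soup = BeautifulSoup(r.text, "html.parser")     # parse the page text into a 'soup'

for img in soup.find_all("img"):   # every img tag on the page
    print(img.get("src"))          # the value of its src parameter

for a in soup.find_all("a"):       # every a tag, i.e. every link
    print(a.get("href"))
```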
I should actually have shown you what some basic HTML looks like. This is a very simple HTML page which I created, and this is the underlying HTML. At the bottom I've got an example of an a tag with an href in it. At the top here I've got heading tags, h1 tags, with some content in them. There are lists, ul for unordered lists and ol for ordered lists, with li for each list item, and they can be nested one within the other. And there are different ways of creating tables, using different tags; again, just looking at the rendered tables you wouldn't know how they were built, so once more you need the inspector to find out.

So how do we get the information we want from our Tesco pages? Here we are actually picking a Tesco store, site 6367, which I know exists. I run that and get the page back. I know, because I've looked at this before, how to get the title: it's in the h1 tag, and it's 'Manchester Oxford Street Express'. I want the store id as well. The issue each time is finding the tags, and then either getting the text from the tag or getting the value of some parameter within it; once you've got hold of the idea, this sort of thing is relatively straightforward. The complication is that within an HTML page some tags, indeed many tags, occur multiple times, and the real difficulty is isolating what makes the one you want unique.

So here, to get the address, we get all of the h2s, but then I want the one whose text within the h2 is 'Address'; and once I've found that, I know that within the next span, which is another tag, the text is going to give me the address. The reason I know that is because I went to the page and used the inspector: I found the address on the page, highlighted it, followed the indentation, the path of HTML tags it selected for me, copied it, read it off and worked out where it was. There are various other ways of getting at it. Here I tunnelled down to find the HTML I wanted, but alternatively there is a tag called div which has its class parameter set to 'address', and from within there I can find the element which has itemprop equal to 'address' and take the text. So all you have to do is find a way of making whatever you're referencing unique, and then you can get everything in a similar way. The only complicated one is the image; that's a bit more involved, because the longitude and latitude, which is what we want, is really only shown as the pin on the map. So you've got to do something very similar: find the image, have a look at what's inside it, get out the URL, which is a call to Google Maps or whatever, and then extract the longitude and latitude from it.

Do not run this next cell. The reason is that it's going to request all of the store codes from 3000 to 4000, which is a thousand calls. I've put in a sleep of five seconds between calls; I don't know if that's strictly necessary, but you don't want to overwhelm the server, so be polite and space them out. That's why I can't possibly run it live; it would take far too long. But don't worry, I've got the results already stored in 'stores' here; yes, there they all are, which I collected previously. Having worked out how to get all our various bits, we just need to start putting it all together. But one thing we need to be aware of, because we commented at the beginning that we're sequentially going through all these numbers, is that some of them won't be real stores.
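A sketch of the scraping loop being described; the URL pattern is hypothetical, and the h1/h2/span structure is assumed from what the inspector showed in the demo:

```python
import time
import requests
from bs4 import BeautifulSoup

def scrape_store(store_id):
    url = f"https://www.tesco.com/stores/{store_id}"  # hypothetical pattern ending in the store number
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    name = soup.find("h1").get_text(strip=True)   # the store name lives in the h1 title
    if name.startswith("Error"):                  # missing stores come back as an error page
        return None

    # find the h2 whose text is "Address", then take the text of the next span
    h2 = soup.find("h2", string="Address")
    address = h2.find_next("span").get_text(strip=True) if h2 else ""
    return {"id": store_id, "name": name, "address": address}

for store_id in range(3000, 4000):   # a thousand candidate store numbers
    info = scrape_store(store_id)
    if info:
        print(info)
    time.sleep(5)                    # be polite: space the requests out
```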
So what we need here is a way of identifying a missing store. If I put that back to 999 and run it, I get the error text that is written in the header, in the title; and because it starts with 'Error' I know it's not what I want, so I'm going to skip those. The next thing we can do is put everything into a nice loop, 3000 to 4000, extract all the bits we want, and put it all into a data frame we can look at. And just going back up to there: if I look at store_info.csv, this is the final version. The id, name, address, latitude and longitude came directly from the scraping; the store type I've extracted from the name, and the postcode from the address. That is now representative of all of the data that we wanted to scrape; this last bit is just extracting those final two items.

Finally, what we want to do is put it all together. From our thousand candidate numbers, not all of which were actual stores, I want to put this information onto a map. If I run that, I get my little map, and I can click on the points. What I've also done is save it to tescostores.html, so if I go in and have a look at that, I've got a nice clickable map which I can expand, and if I click on an individual store it tells me the store type and the postcode.
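The webinar doesn't say which mapping library it uses; folium is one Python package that produces exactly this kind of clickable HTML map, so a sketch along those lines (column names assumed from the CSV just described):

```python
import folium
import pandas as pd

stores = pd.read_csv("store_info.csv")  # assumed columns: id, name, address, lat, long, store_type, postcode

# centre the map roughly on the stores we scraped
m = folium.Map(location=[stores["lat"].mean(), stores["long"].mean()], zoom_start=6)

for _, row in stores.iterrows():
    folium.Marker(
        [row["lat"], row["long"]],
        popup=f"{row['store_type']} {row['postcode']}",  # shown when a pin is clicked
    ).add_to(m)

m.save("tescostores.html")  # the clickable map, openable in any browser
```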