Good afternoon everyone. Thank you for joining us for the third and final webinar in our Web Scraping for Social Science Research series. Again, I hope everybody is well under the current circumstances; your time is very much appreciated. And if this is your second or third time joining us on this webinar series, you're very welcome as well. So today is the final webinar in our Web Scraping series. In the first webinar we showed a case study demonstrating the value and the role that data scraped from a webpage played in a social science research project. Last week we looked at how you can use Python to collect data that's stored on a webpage, so the text, the images, the links and the files that you see on a webpage: how you can request those using Python and how you can convert them into usable data. Our final task today is to look at something called an application programming interface, or API, which is another method of getting data from the web. Just quickly, this webinar series is part of a wider new forms of data training series that we've got here at the UK Data Service. On the 12th of May we've got a webinar with myself and my colleague Julia Kazmier looking at the five key skills and knowledge domains related to being a computational social scientist. Much like today's webinar and previous ones, there's a mix of theory and demonstrating code that you can use in your computational social science activities. We're also doing something somewhat novel: over the month of May we're doing live coding demonstrations, and you can see some of the links there; I'll post them in the question box near the end of this webinar as well. These will be roughly half an hour each and are explicit coding demonstrations, somewhat like we will do today, but focused entirely on the code, talking through what we're doing line by line, so hopefully you can join us for those as well. We've also got past webinars that you can view on our YouTube channel, and all the materials are available online as well. So today we're going to cover six key areas. We're going to ask ourselves what an API is, we're going to do a demonstration of how you actually interact with an API using Python, we'll reflect on the value of this approach or method to social science research, and of course we must consider the limitations and the ethical implications as well. We'll take your questions near the end, but feel free to post them right the way through the webinar and we'll cover them at the end, and I'll again point you to some learning and teaching resources, which some of you have probably received from us in advance as well. So what is an API? Because it might be an unhelpful acronym. Officially, an application programming interface is a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service. It's quite a technical description; in essence an API acts as an intermediary between two or more software applications. So let's take a fairly simple real-life example. Let's say we have two individuals: one is an English speaker and one is an Italian speaker. Let's say the English speaker doesn't even know what "buongiorno" actually means, they have no knowledge of Italian whatsoever, and the Italian speaker has absolutely no knowledge of English whatsoever. How can these two individuals communicate?
Well, if they're lucky they may have a real-life translator; of course there are technological solutions as well. So these two individuals do not need to know how to converse with each other. They just need to know how to communicate what they want to say to a translator, who does the heavy lifting and then translates for the other individual. An API performs a very similar role. So let's say we have a program, and let's call it a smartphone application, and this application needs real-time traffic data, let's say from Transport Scotland. Transport Scotland may have a database containing real-time and historical traffic information. Now, one way of getting that data is for the smartphone application to go directly to the online database. The trouble with that approach is that the program needs to know a lot of technical information about the database. It needs to know what language it's built in. It needs to know how to ask for information. A much simpler approach is to place an intermediary between these two applications, called an API. The program can just say: okay API, give me data. The API then does the translating, it does the heavy lifting, and communicates that in a much simpler way to the database: we need some data. In response, the database doesn't have to go back to the program. It returns a response to the API, and the API takes that response and transfers it back to the program. So an API just sits in the middle of computer applications or software applications, and it performs the same role as a translator. It's a very powerful way of ensuring software applications can communicate with each other. So now that we know how they work in some shape or form, why would we use them to collect data? As social scientists, what's the value of an API? First and foremost, they tend to be an important source of publicly available information on social phenomena of interest. In some cases, let's say the UK government has an online data portal, there are weekly or monthly releases of data sets from the Department for Transport, the Department for Health and so on. If you're interested in that data then you need to go to that website and, you know, right click, download the data manually, and so on. Often what happens is an organization will say: okay, instead of releasing the data on a weekly or monthly basis in the form of files, you can have access to our online database, but only through an API. So if you can communicate with the API, you can send requests for the data they hold, and those requests can be customizable and flexible. Those requests can be scheduled using a programming language: they can be every day, every half day, every week, and so on. So that's a really key point: APIs allow customized access to data resources. Think traditionally: if you wanted the data from a government department, for example, you might download the entire procurement data for a given department, so every spend over 5,000 pounds that the Department for Health made in 2020, for example. But you might only be interested in certain amounts of spending, so you might only be interested in purchases that were over 100,000 pounds, or you might only be interested in purchases made in Scotland or in Wales, for example. So instead of downloading this big bulky file and then doing the filtering yourself, when you actually communicate with the API you can just get the data you need. You can just send a customized request for data.
Note that once the data are sent back to you by the API, they might come in a format that you're not very familiar with. They might come in a hierarchical data format, and we'll see some examples today, but it is possible to reshape those formats into something more familiar, like a traditional rows-and-columns data structure, which we would call a variable-by-case matrix or a tabular data structure, which can then be linked to other social science data that you're interested in. So there's a logic to using an API. You can communicate with an API using multiple programming languages: you can do it through R, you can do it through PHP, you can do it through Perl. Lots of programming languages allow you to communicate with APIs, but there's an underpinning logic that applies to the process no matter what language you're using. The first thing you need to know is the location of the API, which is its web address. This is very similar to the web address of any website; the UK Data Service website is at something like https://ukdataservice.ac.uk. APIs are also located at web addresses. For example the UK Police API, which provides open data about street-level crimes, the police forces around the UK and neighborhood-level crime statistics, can be accessed at the following web address: https://data.police.uk/api. Once we know where the API can be accessed, we need to know how we're allowed to use the API. For example, the UK Police API doesn't require you to register your use of the API. You don't need to tell that API anything about yourself; you don't need to provide an email address, and so on. But it does restrict the number of requests for data that you can make, and it's around 15 per second, which is obviously a lot, but you need to think a bit more computationally about programs that might be looking for data multiple times every couple of seconds or even every second. This number of allowable requests is known as the rate limit: the rate at which you can request data from the API. And the third piece of knowledge we need to get going is the location of the data of interest. The API exists at a web address, and so do the data resources. Again, from the UK Police API we can get street-level crime data at https://data.police.uk/api/crimes-street. The location of data on an API is known as its endpoint: it's the end of the web address that then gives you back your data. So thankfully, if you joined us last week, you'll see that it's very similar to scraping a website: you need to request the web page itself and then you can start working with the data. Now that we know these pieces of information, we can actually implement the interaction with the API. Once we have the information, the first thing we need to do is register our use of the API, again only if required. Most APIs will require you to register. There's usually no cost, there's almost always a free version of an API, but you do need to provide some information. It might be your name, an email address, a link to your website, and they usually ask you to write a short description of the application that you're planning to use the API for. So if you're a mobile developer you'll say: well, I'm building a smartphone application that takes police data, visualizes it and provides users with some kind of service or paid-for service.
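To make those three pieces of information concrete, here is a minimal sketch in Python of calling that police endpoint. No registration is needed for this API; the all-crime category and the lat, lng and date parameters follow the police API documentation, but the specific values used here are just illustrative.

```python
import requests

# 1. The location of the API
BASE_URL = "https://data.police.uk/api"

# 2. Terms of use: no API key needed, roughly 15 requests per second allowed
# 3. The location of the data of interest (the endpoint)
endpoint = BASE_URL + "/crimes-street/all-crime"

# Street-level crimes near central Edinburgh in March 2020 (illustrative values)
params = {"lat": 55.9533, "lng": -3.1883, "date": "2020-03"}

response = requests.get(endpoint, params=params)
print(response.status_code)   # 200 means the request succeeded
print(len(response.json()))   # number of crime records returned
```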
Even as researchers we're still using the API: we might not be creating an application, but we are writing code that requests the data, so we might need to communicate that to the API. Once we've done all that we can get into the meat and drink of interacting with an API, which is actually requesting the data. We might also need to supply authentication: if we do need to register our use of the API, then what we're granted is essentially a unique ID, and every time we request data we provide that unique ID, otherwise the request won't work. This entire process is known as making a call to the API. The final thing we do is get the data and then save it to a file for future use. Now let's look at an API in practice. Let's say we're interested in COVID-19 data: we might be interested in the discourse around the disease, or in the levels of public information that are available about the disease. So we might be interested in one of these large-scale national newspapers, and the Guardian is one. You can see that it provides information on the website itself: here's a subsection about the coronavirus explained, there's a section as it relates to the UK, there's a global section, there's an opinion section, and so on. It probably doesn't need to be said, but I will then contradict myself and say it anyway: you could manually collect this information. If we were interested in this particular article we could highlight what we want, right click, copy, open a txt file, save it, and so on, but it's obviously not something we can do at scale. Thankfully the Guardian has foreseen that its information would be very useful to a wide range of people, and so they've built an API. They've got an online database that stores all the information about its articles, including the text, the images, the videos, the links to other articles and so on, and it makes most of that available through its API. In some cases the API will allow you to explore the data without having to write any programming code and without having to register your use. For example, the Guardian allows you to explore what kind of data is available through a little user interface. Let's say, for example, we'll type in Scotland, that's where I'm currently living, and we can tell it to search its content database. Let's run it again. Yes, we can see it returns some information: here's the web address that we've submitted to the API, so we've got a base URL here and then we've got the query that we were interested in, so go to the API and then look for mentions of Scotland.
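The web address the explorer builds up is essentially a base URL plus a q query parameter. A minimal sketch of the same call in Python is below; the api-key value is a placeholder, because, as we'll see in a moment, the API refuses requests without a valid key.

```python
import requests

# The Guardian's content search endpoint, plus the query we typed into the explorer
base_url = "https://content.guardianapis.com/search"
params = {"q": "Scotland", "api-key": "YOUR-API-KEY"}   # placeholder; register for your own key

response = requests.get(base_url, params=params)
print(response.url)           # the full web address submitted to the API
print(response.status_code)   # without a valid key the API will refuse the request
```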
So we can see we get a total of 102,000 results relating to Scotland, there are 10 results per page, so there's over 10,000 pages of results, and then in a field called results we can see the actual content that we're interested in. The first article was written on the 17th of March 2020 and is available at this web address here, and we can take a look to see if that works. Yes it does, so we can read the article if that's what we're interested in, and it's found a second one about poetry, and so on. Now, this is quite useful just for getting used to the API, but it's very difficult to interact with this user interface at scale, or at pace, or over time, so we do need to write some programming code that helps us access this information. So what information do we need? We're working through our steps. We know where the API is located: it's at https://content.guardianapis.com. We can actually test out what needing authentication does: we know there's a URL, a web address, where we can access data, and if we try to request it you can see that it tells us we haven't provided our API key. We haven't given the API our unique identification, therefore it won't return us any information, but at least we know that this is the correct web address for accessing the API. So now that we know where to find the API, we need to figure out how we're allowed to use it, so we need to know the terms of use, and we can see the terms of use at a different web page. How do we access the API? Well, there are two levels of access: there's a developer key, which is free, and there's a commercial license as well. The free level of access is what we're interested in: we can make up to 12 calls per second, so we can request data 12 times per second, up to a maximum of 5,000 calls per day. The content that we can access is strictly related to the article text, in comparison to the commercial access, which gives us images and audio and video, so if that's the focus of your research you may need to think about building in the cost of a commercial key. You can see that with the free level of access we have up to about two million pieces of content, and what's more, if you pay for the service you can get access to pretty much, I'm guessing, everything the Guardian has published that's digitally available. So now we know where the API exists, we know its web address, and we know how to use it. In advance of this I've registered my use of the API, so we don't need to do that now. The final thing is just to think about what types of data are available through the API, and helpfully the Guardian documentation is very good, so we can see which endpoints, or which data tables, we're allowed access to. There's one called content, one called tags, one called sections, one called editions and one called single items, and we'll go through what some of these mean during our demonstration. In essence, content allows us to retrieve a list of search results; tags tells us which bits of metadata an article is tagged with, so if an article is about the environment it might have an environment tag; sections tells us which part of the newspaper it's published in; editions tells us which edition; and single item gives us every single bit of information the Guardian has about one particular article, which is obviously very important, and we're going to show an example of that as well. I'll just demonstrate it myself very quickly just so you can see how it should work.
Basically, you don't need to install anything on your machine to be able to run the code. It all launches on a service called Binder, which is an open-source computing environment platform. You might see a screen like this, and it might take a wee while to launch, but I'll show you the final result, what it looks like. Perfect. So you should now be able to see what's called a Jupyter notebook, which is what we're using for the coding demonstrations. Again, if you want to just give me a quick yes to say that you can see the notebook, just in case you're still seeing the other screen. Perfect, thanks very much. Okay, so let's get stuck into the demonstration. This is a Jupyter notebook: it mixes text, code and output in a single document, so you can work through this once the webinar is finished, but I'm just going to enter slideshow mode and I'll work through our example of interacting with the Guardian website. If you are working alongside me as I'm doing this, all you need to do is run, or execute, the code that's contained in a cell, identified by an "In" and some square brackets. For example, this is one of the coding blocks that we've got, so I can execute that, Python asks me for some information, and yes, I will enjoy it, hopefully. Perfect. Jupyter notebooks, if you're interested in them, are a really powerful data science tool, and I've got some links in the notebook; there's a great guy at the University of Liverpool who has some excellent resources as well. Again, don't worry if you can't see the notebook now, I've posted a link and I will email everyone at the end with the links again, so you'll be able to work through it if you can't do so just now. So we have our three pieces of information: we know where the API exists, we know the terms of use and how many calls we're allowed to make, and we know the different endpoints containing the data we're interested in. The first thing we need to do is load in our API key so we can make requests. I've stored my API key in a folder called auth, a-u-t-h, and I've stored it in a file called guardian-api-key.txt. This is good practice in general: your API key is unique to you, so if you give it away other people can use the API as if they were you, and that might pose problems, so it's usually good to keep it in a separate text file. But for you I'm quite trusting, so you can use my key for a couple of days. So I'm going to open that file, read in what's contained in it, and just show you very quickly what an API key looks like: it's just a unique scrambling of letters and numbers that allows us to interact with the API. Good, we've got our unique identification; let's now focus on getting the information that we're interested in, which is articles about COVID-19. The first thing we need to do is tell Python which functionality, which modules, we need for interacting with the API. We need a module for working with our operating system, as we saw previously, which allows us to look inside folders and create new files: that's the os module. If you joined us last week you'll remember the requests module, which goes and fetches web addresses for us and returns the content. We've got a json module, which is for working with the data that's returned by APIs (let me see if I can make this a little bit bigger if that helps), and there's a datetime module as well, which allows us to do things with dates and times, which is quite useful. And I've just printed a little message to myself saying okay, everything's been imported successfully.
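A minimal sketch of that setup step is below; the folder and file names follow what I just described, but treat the exact path as illustrative and point it at wherever you have saved your own key.

```python
# Modules we'll use throughout the demonstration
import os                       # working with folders and files on the operating system
import requests                 # requesting web addresses and returning their content
import json                     # working with the JSON data returned by APIs
from datetime import datetime   # working with dates and times

# Load the API key from a text file (illustrative path; keep your key out of your code)
with open("./auth/guardian-api-key.txt", "r") as f:
    api_key = f.read().strip()

print("Everything has been imported successfully")
```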
Right, so now let's look through the Guardian API for mentions of COVID-19. The first thing we do is define the base URL. I can call that variable anything I want, it's just a variable name, and you'll have seen that before. This is where the API is found, the https address, with the search endpoint at the end, and that relates to the content endpoint we were looking at earlier. We define a search term that we're interested in, which is just a simple bit of text, COVID-19. I need to provide authentication to the API when I use it, so I create a variable that has one field, and in that field is my API key itself, because remember I loaded that in earlier and stored it in a variable. Then what I need to do is construct the web address. The web address is made up of the base URL and then it's built up with a query symbol, so I'm saying: take the base URL and then start looking for this search term, and then I'll just print that web address so you can actually see what it looks like. Now that I've got the web address, I want to fetch it from the Guardian API, so we use the get method from the requests module, we tell it to go fetch the web address, and we provide the authentication in a variable called headers. This is a very standard, very common way of doing it; you won't need to make this up yourself, this is standard code for doing this task. At the end, I've stored the result of the request in a variable called response, and I basically want to check whether the request was successful. We get some results that look like this, which are encouraging, which is good. Note that this is a live demonstration, so I do need to be connected to the internet, and it's working so far. You can see the web address where we've requested information from, and again, if I was to type this into my URL, into my web browser, apologies, this wouldn't work, because I need to provide authentication. Thankfully, the way I've requested it from the API, it's returned a 200 code, meaning that was a successful request: I formulated my request correctly and the data has been sent back to me by the API. Just to show that it doesn't really matter what you call your variables, I'll do the exact same task really quickly, except I'll change base_url to highlander, I'll change the variable storing the search term to jarl, and so on. I'll make a call to the API and you can see I'm getting the exact same result, which is really good. So, up to a certain point, you can call your variables what you want in Python; and as a little quiz, if you can tell me in the question box which brewery these beers refer to, then you'll win the prize. You may be wondering what exactly it is we requested. With the response, we saw in the previous example that it has an attribute called status_code, and it also has a .json() method, which gives us the data itself that's associated with the response. So we take this JSON response and we store it in a variable called data, and then we take a look at the data itself, so we can see what the API returns to us. It returns this kind of odd, hierarchical set of data: it seems to be made up of fields, and each field then seems to have a value, but certain fields, like the response field, seem to be made up of other fields. We'll look at ways of navigating this data structure in a moment.
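Pulling those steps together, a minimal sketch of the request we just made looks something like this. The variable names are just my choices, api_key is the key we loaded from the text file earlier, and the key is passed in a headers variable as described above (the documentation also allows it as an api-key query parameter).

```python
import requests

base_url = "https://content.guardianapis.com/search"   # the content search endpoint
search_term = "COVID-19"
headers = {"api-key": api_key}   # authentication, using the key loaded earlier

# Build the web address: the base URL plus the query parameter carrying the search term
url = base_url + "?q=" + search_term
print(url)

response = requests.get(url, headers=headers)
print(response.status_code)   # 200 means the request was successful

data = response.json()        # the data associated with the response
```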
For now we can just visually inspect it and pick out metadata. We can see that there are a total of nearly 16,000 results associated with our search, we get 10 results per page, and there are about 1,600 pages of results relating to our search term COVID-19. The actual data associated with the request is helpfully contained in a field called results, and you can see the results field then has a long list of articles about our search term. The first one we found is an article called "What is COVID-19", helpfully, and its web address is here, so if we wanted to read that article we could type that into our web browser; and this article exists on the API at this address here, which I will use in a moment to request the actual article text itself. So visually we can inspect the data that's returned, but as you can tell, it's hierarchical, and we're not entirely sure how we can navigate it. Should we just do Ctrl+F and look for certain words? If I'm looking for the word "web", you can see that this is not a very efficient way of navigating the results. Thankfully there are ways of navigating these types of variables. We asked for the response to return the data in something called a JSON data structure, and this is known as a dictionary in Python: it's just a type of variable that has certain ways of interacting with it. The first thing we should notice is that a dictionary is made up of keys and values, so it's made up of pairs of keys and values. So the first question is: what keys are actually contained in the dictionary? We can use the .keys() method, and it will give us a list of the keys. Somewhat confusingly, the result says there's only one field in the data that was returned to us, and that's obviously not true, because as we visually inspected, it's a hierarchical data structure. This is the top-level key, but within this key is a list of other keys, and then within these keys are other keys again, so there's a hierarchical, nested nature to a JSON or dictionary variable. So let's look inside. If we want to look at what a key contains, we write the variable storing all the data and then use square brackets, quotation marks and the name of the key. This will tell us which keys exist within the response key, and now it's starting to make sense with what we could visually see ourselves: there are nine keys within the response key, and the main one we're interested in is the results key. The results key contains a list of all the articles that relate to our search, so we need to navigate that list as well in Python. The first thing we'll do is take what's stored in the results key, put it in a new variable called search_results, and then take a quick look within that variable. Now you can see we've stripped out all the other keys at the higher level, so we no longer see the response key, the total results key and so on; now we just have a list of observations for all the articles that relate to COVID-19. The first one is here, "What is COVID-19"; if we scroll down there's another one about pregnancy; and if we scroll down further we can see all the different articles, up to 10, because there are only 10 results per page. So we have a list of results, and it's probably useful to confirm that we're working with a list. In Python, for any variable, you can ask Python to tell you what type it is.
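Here is a minimal sketch of that navigation, continuing from the data variable above; the key names are the ones we just saw in the output.

```python
# The top level of the dictionary has a single key
print(data.keys())                     # dict_keys(['response'])

# Look inside the response key to see the nested keys
print(data["response"].keys())         # status, total, pageSize, pages, results, and so on

# Some useful metadata
print(data["response"]["total"])       # total number of matching articles
print(data["response"]["pageSize"])    # results per page (10 by default)

# The articles themselves live in the results key
search_results = data["response"]["results"]
print(type(search_results))            # <class 'list'>
```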
Asking for the type seems a kind of obvious thing to do, but even though you might have something whose value is a number, if it's actually stored as a string, if a number is stored as a piece of text, you can't do any calculations with it, for example; or if something is stored as a list, there's a certain set of functions that you can apply to that variable that don't exist for other variable types. So we can ask Python: great, we're working with a list. Now that we know that, we can ask how long the list is, so how many articles did we find? I told you there were 10, and that was based on a visual inspection and on part of the metadata as well, but we can ask Python to confirm: how long is the list, and it's 10 elements long. Then we might be interested in looking within the list as well, so we can say: for every result in the list that we've created, print certain types of information. So for each article, tell me what type it is, tell me what section it appeared in, and tell me when it was published. I'll just zoom out a wee bit and you can see one, two, three, all the way up to ten, so all the 10 articles that we found. We look at their type, so we know they're articles; the first article appeared in world news, the third article appeared in the environment section, this one was published in April, this one was published in March, et cetera. So you're already seeing ways we can use Python to work our way through what look like fairly complicated, hierarchical data structures. Before we move on to the slightly more complicated examples, we can say: that's a pretty good bit of work, I'm interested in those 10 articles, so how do we actually save them? We mentioned that the data is stored in something called a JSON data structure; that's also a file type, so we want to write that information to a JSON file, and it's really easy to do. First, all we're going to do is create a folder called downloads, and if that folder already exists, which it should on my machine, it just says: look, unable to create it, it already exists, which it does. That little try/except clause is quite good for catching errors, as you can see (let me zoom in a bit more for you, perfect). So we want to save the results to a JSON file. The first thing I'm going to do, really simply, is grab today's date and format it as year, month and day, and then I'll print that so you can see it. I'm going to define a file for storing the scraped results: I'm going to put it in the downloads folder, call it the Guardian API COVID-19 search, add today's date to the file name, and it's going to be a JSON file as well. Then, with Python, I say: with this file open in write mode, dump the JSON data into the file, so you can see the logic of what we're doing. With this step, all I asked Python to return to me is today's date, and you can see that here. But how do we know that the save actually worked? I'm on my machine at the moment, so I could obviously just open up the folder, but I can actually use Python to navigate my machine and ask: does this file exist? So (a) I can check whether the file was created, and (b) I can check whether the file contains the information we collected. The first thing I'm asking Python is to list all the contents of the downloads folder, and you can see that when I tried to save the file it definitely created a file, so I know the file exists, which is really good. Now I need to confirm that there's actually something in the file.
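A sketch of that saving step is below; the folder and the file naming convention are illustrative, following the description above.

```python
import os
import json
from datetime import datetime

# Create a downloads folder; if it already exists, say so rather than fail
try:
    os.mkdir("./downloads")
except FileExistsError:
    print("Unable to create folder: it already exists")

# Today's date, formatted as year-month-day, to include in the file name
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"./downloads/guardian-api-covid-19-search-{today}.json"   # illustrative name

# Dump the JSON data returned by the API into the file
with open(outfile, "w") as f:
    json.dump(data, f)

# Check the file was actually created
print(os.listdir("./downloads"))
```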
This time we open the file in read mode, because we want to read in its contents, and instead of using json.dump we're using json.load, which does the opposite of what we were doing before. Then I ask Python to spit out the data, and you can see that worked: I had a variable before, response.json(), I saved that to a file, I've loaded it back in, and here we can see the first 10 articles that we downloaded from the API. So we have successfully requested data using an API. Hopefully you'll agree with me that the requesting is actually relatively simple; it's just about constructing the web address correctly, and the API documentation usually tells you how to do that. The tricky bit is working with the downloaded data: it comes back in JSON, or in something called XML, which are hierarchical, nested data structures, and it can be difficult to pick out the elements that you're interested in. So before we go back to thinking about the value and the limitations and the ethical concerns, let's do a couple more demonstrations. Let's refine what we're doing; we'll do four more things. We'll include additional search terms, because we've only used one. We'll deal with the fact that there are multiple pages of results: if there are 15,000 articles I want all 15,000; this is the year of big data, and I don't think a sample of 10 is representative. The third thing is I'm going to grab the article text itself: we saw a list of articles in our search results, and for each of those we want to collect the article text, and we'll see how to do that. And then, hypothetically, we'll see how to handle the rate limit. It's very unlikely we'll bump up against the 5,000 daily limit, but it is possible, and you can think of testing your code: if you're constantly testing and then you think, oh, okay, now it works, maybe you've actually used up all your calls to the API. That has happened to me before. So let's do some preliminaries: I'm just confirming that we have all the modules loaded in, I've got today's date and I've got my API key, and we've seen how to do all that, so I'll just confirm that Python has all of the information it needs and I won't go through too much of what's happening here. Okay, so let's include some additional search terms. We can use what are called logical operators, so AND, OR, NOT, and there are lots of combinations of those operators, to include additional search terms. We've seen this previously: we've got a base URL that we're using again, but this time I'm adding an extra search term, coronavirus, and I'm adding the logical operator OR into that search term. I'm providing my API key again, I'm building the web address (it's composed of the base URL plus a query term plus the search terms), then I'm again requesting that web address and checking whether it was a successful request. You can see the web address that I've requested; don't worry about the fact that there's a space, that actually needs to be there. Python reads this as one string, so it doesn't see the logical operator, it just sees one search term, which is "COVID-19 OR coronavirus", which is a silly term; but when you actually request the web address, the API knows how to handle this space here. And again we get a 200 status code, which is good; that means my request was valid.
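A minimal sketch of that request with the extra search term is below; the OR operator and the q parameter follow the Guardian search documentation, and requests takes care of encoding the space when it builds the final web address.

```python
import requests

base_url = "https://content.guardianapis.com/search"
search_terms = "COVID-19 OR coronavirus"   # the logical operator OR joins the two terms
headers = {"api-key": api_key}

# requests encodes the space in the search term when it builds the final web address
response = requests.get(base_url, params={"q": search_terms}, headers=headers)
print(response.url)                        # the web address actually requested
print(response.status_code)                # 200 means the request was valid

data = response.json()
print(data["response"]["total"])           # noticeably more results than COVID-19 alone
```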
So what I want to check is: now that I've added an extra search term, do I get more search results than I did for just one? Again, I take the data that's returned to us in JSON and put it into a data variable (and again, this could be called anything you want), and then there's a key in that data variable called total, which tells me how many search results there are. Now you can see that if I add an extra search term there are 20,000 articles that are relevant for my analysis. In the notebook itself, when you get to work through it, you'll see I've given you an exercise of adding different search terms and different combinations, so you might say I want articles that mention COVID-19 and England, for example, and you can see the results of that. So we do get a higher number of results than if we just searched for COVID on its own. Next, how do we deal with multiple pages of results? The first thing to understand is that just because you're allowed to request lots of data from an API doesn't mean it gives you back all of that data in one go. It takes a lot of computational resources to operate an API; if every time someone made a request the API returned everything, it would soon be overwhelmed. So basically it says: right, that request is associated with 20,000 articles, but I'm just going to give you them 10 at a time, and you have to request each individual page of results, one at a time, yourself. It transfers the work back to you, basically, to make more requests, but thankfully it's a very simple process. There are two ways of tackling this issue: we can increase the number of results returned per page, and we can request each individual page of results. The first thing we'll do is tell the API we want 50 results per page, not 10, because that's too few. These three lines are the exact same as before: define the base URL, the search terms, the authentication. This time I'm also defining a variable called num_results (and that could be num_pages, num_whatever, you can call it what you want) and setting it to the maximum that the API allows, and I got that information from the documentation. I build a web address, the same as before, but now there's an extra bit where I say: filter by page-size equals 50. Again I'll show you what that web address looks like, and we'll ask Python to tell us whether it's a successful result. Excellent, another successful result, and this time you can see the web address is slightly longer because it has this filtering option here. So it seems to have worked; can we just confirm that we now have 50 results per page? Again, transfer the data into a variable and look at the pageSize key to see what its value is, and correct, yes, it's given us 50 results per page. So now we know we have 50 results per page, but that still leaves us, well, I'm not sure, it's going to be hundreds of pages. So how many pages do we actually need to request to get the full complement of search results? There's a pages key returned to us by the response that tells us how many pages of results there are. What I do is add the value one to that result, and we'll just print it to the console, so we can see that there are 402 web pages we need to request.
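A sketch of that page-size request and the page count calculation is below; the page-size parameter and the pageSize and pages keys follow the Guardian documentation, and the plus one is there for the loop we're about to write.

```python
import requests

base_url = "https://content.guardianapis.com/search"
search_terms = "COVID-19 OR coronavirus"
headers = {"api-key": api_key}
num_results = 50   # maximum results per page allowed by the API (from the documentation)

response = requests.get(base_url,
                        params={"q": search_terms, "page-size": num_results},
                        headers=headers)
data = response.json()

print(data["response"]["pageSize"])            # confirm we now get 50 results per page
total_pages = data["response"]["pages"] + 1    # add one so range(1, total_pages) reaches the final page
print(total_pages)
```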
You're probably thinking that seems silly, because if I cancel this out you can see that the API said there are 401 pages, but Python has a little quirk that means we need to add one, and basically that quirk is to do with the way Python loops over a range of numbers. The code that I'm just about to show you starts at the number one and goes up to, but not including, the number five: for the numbers one to five, print the number, and you can see it only goes as far as the number four. This may just be the way we read English, or maybe it's a cultural thing, but if I see range one to five I expect that to go up to and include the number five; in Python it doesn't. So now you can see why we have to add one to the total number of pages: when we run the loop we have to say go from one up as far as 402, and that means Python goes from page 1 to page 401. We'd be here till tomorrow if we tried to request 401 pages, so let's just request the first 20, just to see if it works. Again we've got largely the same web address: we've got a base URL, the search terms, the page size, except this time we have an extra filter, which is the specific page number we're interested in, and again I'll print the web address so you can see it. We do all the same things: we request that web address, we store the result in a variable called data, and then for each page we define a file name and we save the information we download to that file. Now you can see the list of web addresses that we're requesting, so you can see that it requested 20 web pages, page one, page two, and so on, all the way down to page 20. Again, how can we confirm this actually worked? Well, we can ask Python to list all the contents of the downloads folder, and hip hip hooray, it did work: for each page of results it saved a file called page one, page two, and so on. So that's the way of handling multiple pages of results: in essence, it's figuring out how many pages there are and then requesting each individual page itself.
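A sketch of that loop over the first 20 pages is below; the page and page-size parameters follow the documentation, and the file naming is illustrative.

```python
import json
import requests

base_url = "https://content.guardianapis.com/search"
search_terms = "COVID-19 OR coronavirus"
headers = {"api-key": api_key}

# Request the first 20 pages, one at a time, and save each page to its own file
for page in range(1, 21):
    response = requests.get(base_url,
                            params={"q": search_terms, "page-size": 50, "page": page},
                            headers=headers)
    print(response.url)   # the web address requested for this page
    data = response.json()

    outfile = f"./downloads/guardian-api-covid-19-search-page-{page}.json"   # illustrative name
    with open(outfile, "w") as f:
        json.dump(data, f)
```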
So let's get to the really crucial thing, which is: that's all interesting, but where's the actual article text itself? If you want to do some content analysis, some sentiment analysis, some natural language processing, whatever you want to do, you need the actual article text. What we do is take our search results and then use a different endpoint on the API that gives us the article text itself. We have a list of search results, so let's pull one of those articles out, not quite at random, from that list. We create a variable called search_results, which stores all of the values contained within the results key, and then I pull out one article from the list of results. Basically what I've done is taken one article, and the way I've done that is by referring to the element in the list by the position it's in. In Python this approach is called indexing, and an index begins at zero. If you've used R, for example, indexing begins at one, so the first element in a list is in position one and the second element is in position two; that's R. In Python, slightly more confusingly, the first element in a list is in position zero, the second element is in position one, the third in position two, and so on. Just to test that out: we know we have 50 results per page, so there are positions zero to 49 in the list. Let's say I ask for position 30; that's actually article 31, because remember indexing begins at zero, and you can see that's a different article that's given to me, about whether we can learn something from Germany. What's the article in position six? A different article again. So when you have a list, you can refer to elements by their position in the list. But let's go back to article zero, which is about the coronavirus latest for the 30th of March, at a glance, and let's go and get the actual text of that article using Python. What we do is go to the article and take out the value that's stored in the apiUrl key. Very quickly, just to show you what that is: there's a field called apiUrl, and this is the web address for that specific article on the API, so we want to request that web address. Again we build up a web address: it consists of a base URL, and then it's got a little search option where we ask for the body of that article, so we're asking for the text of the article. And again, how did I know what to put here and here? It's all in the documentation; it's not something you need to figure out yourself. Good, it tells me I've successfully requested the article, so you're starting to see the same process again. What does the article look like? Again, it's got some metadata: there's only one article with this ID, it's in the world section, it was published then, and so on. But now we can see there's a fields key, and within that key is one called body, and the value of that key is the actual text of the article itself. We can expand it here and you can see the full text; let's actually pull out the text itself and store it in a separate variable so we can get a better look at it. We can see that it's mainly English, but there are lots of funny little symbols mixed in, basically because the article was published on the web. The article is, in a sense, a web page, and so it's written in a language called HTML, HyperText Markup Language, meaning we need a way of stripping out all of these a and p and h2 tags so that we can read the text itself. If you joined us last week you'll know how to do that, but I'll do a quick demonstration here. We need a module called Beautiful Soup, which is for parsing web pages in Python. Basically I take the result from the article, all that massive text, and I say to Python: this is actually an HTML page, so let's treat it as an HTML page. Then let's see how we can navigate through the text of that page. Let's say I'm interested in all the links that are found in that article, so all the external websites and files and so on that the article links to. I can say: go and find me all of the link tags in the article and show me some of them. You can see again we get a list of results, a list of these a tags, and we can see the link for each one, but let's tidy up how we look at it. Firstly, how many links are in this article? This particular Guardian article has 14 links on its page. Then we can look at each link itself: it links to the Guardian's own website, it links to Johns Hopkins University, and the rest of the article's links basically go to the Guardian website. So you can see, using this approach, there may be interesting questions around whether news articles tend to self-refer: do they refer to the Guardian website, do they link to other sources of information, how many on average per article, and so on. There are lots of interesting things we can do; it's a separate topic in itself and we cover it in more detail in our previous webinar, and all the code for that is available online as well. So we've downloaded the article text for one particular article; let's actually save it. It's much the same code again: define a file for saving it, open the file and dump the JSON data, and you can see that into my downloads folder I've saved this article right here.
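A sketch of that single-item request and the link extraction is below; the show-fields parameter and the response, content, fields and body nesting follow the Guardian's single item documentation as I understand it, and you would need Beautiful Soup (the bs4 package) installed.

```python
import requests
from bs4 import BeautifulSoup   # for parsing the HTML of the article body

# The apiUrl field of the article we pulled out of the search results
article = search_results[0]
api_url = article["apiUrl"]

# Ask the single item endpoint to include the article body
response = requests.get(api_url,
                        params={"show-fields": "body"},
                        headers={"api-key": api_key})
data = response.json()

# The article text sits inside the fields -> body key of the returned content
article_text = data["response"]["content"]["fields"]["body"]

# Treat the text as an HTML page and pull out all the link (a) tags
soup = BeautifulSoup(article_text, "html.parser")
links = soup.find_all("a")
print(len(links))                # how many links the article contains
for link in links:
    print(link.get("href"))      # the web address each link points to
```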
So the final thing we'll do is handle the rate limit. We can make up to 5,000 calls per day, but there may be times when we're at risk of breaching that limit. Now, there's usually no penalty for breaching the limit; what happens is that you're just denied from requesting the data: it just says, for the rest of the day you can't make any more requests. So you don't need to worry legally or financially about requesting too much data, you'll just be prevented from doing so, but it is worth keeping track of how many requests you've made. Thankfully the API actually tells you how many you've made, and it's stored in an attribute called headers on the response variable. If we take a look at the headers returned by this response, there are lots of things we don't really need to know about, but there are a couple of fields: one showing the daily limit of 5,000, so we can see that's the 5,000 daily limit, and then one showing how many are remaining. Today I've got 4,900 left, so I've obviously used quite a few today getting this ready, and some of you might have been using the key at home, so we're working through our 5,000 daily limit. So we've got a couple of fields in the headers attribute that we might want to capture; let's use some of this metadata to track how many requests we've got left. If we go into the headers attribute we can take out this key, and again this one, store them in variables, and basically the total number of calls we've made is what we're allowed to request minus what's remaining. We can ask Python how many calls we've made today: we have made 104. Let's very quickly test it out while working through a loop. Let's boost this a little bit and request about 49 pages, and I'll just ask Python at the end to spit out how many calls we've made. As you can see, we're asking Python to keep track of how many calls we're making; it's working away, actually requesting data from the API, and it's just keeping track as it goes. It's probably a little bit small, but we've now made 153 calls to the API today. So that's maybe slightly more intermediate knowledge, but it's all there in the notebook for when you need it.
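A sketch of tracking the limit from the response headers is below; the header names here are an assumption based on what the demonstration output showed, so inspect response.headers on your own call in case they differ.

```python
# The response headers include the daily limit and how many calls remain (assumed header names)
daily_limit = int(response.headers["x-rate-limit-limit-day"])
remaining = int(response.headers["x-rate-limit-remaining-day"])

calls_made_today = daily_limit - remaining
print(f"Calls made today: {calls_made_today} of {daily_limit}")
```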
Perfect. So let's just take a quick look at some of the value, the limitations and the ethical implications. Hopefully, as I demonstrated, maybe the data structure is quite unfamiliar, but the process of requesting data from an API is well established, it's mature, there are lots of packages that are really useful, and I'd even say that within an hour you could be using an API. And really critically, APIs are intended to be interacted with: in comparison to scraping a website, an API has already thought about the structure of the data, it's thought about the fields that you need, and it's thought about the format that it returns the data in. So APIs usually provide access to data that is at least formatted and structured correctly, even if the actual data itself might not be high quality; that's for you to decide. In general, APIs just store fantastically interesting information: there's a Companies House API where you can get pretty much all the information about all UK registered companies, which is tremendously interesting, including disqualified board members. And with an API you don't have to bulk download data; you can customize your request and just get back what you actually need. And even if you're not sold on this, which is totally fine, maybe the information you need is only available through an API, and it is worth learning these skills for that purpose alone. What would be some of the limitations? Well, APIs restrict the number of requests, and 5,000 per day is pretty good, but you might envision a scenario where that's actually not suitable. The quality of documentation can also vary widely: there's some god-awful documentation out there that can make it incredibly difficult to request data, which is hugely frustrating. The Guardian's is excellent, thankfully, but I've used some, which I won't mention, that were depressing. The data that's contained in APIs is still affected by data protection laws, so think of the EU's GDPR: if you're collecting data about individuals you've got a responsibility to process that data properly, to only use it for the purpose that you stated, and to dispose of it and archive it in a proper way. Just because data is publicly available from an API does not mean you don't have to think about data protection. And even though an API can provide free access to data, it's still a product and it still has terms of use or terms of service, basically the contract between you and the API provider. There's no need to get scared legally about that, but there may be restrictions on the use of the data available through an API, or on the attribution that you must give if you use the data for research purposes, for example. And finally, APIs can be updated on a frequent basis: the rate limit can change, whether you need to register can change, the endpoints can move and change, so if you're writing code it does need to be periodically reviewed, much like what I'm doing with this material; I'm going to have to check it on a fairly regular basis. So, the ethical implications. I'm assuming most of you are researchers, but even if you're in the private sector or the public sector, in government for example, you'll still have to get clearance for what you're doing, so you're still going to have to think about risk of harm to the participant, data security, curation, encryption, all these various things. Something that's particularly relevant to the use of APIs is informed consent. Let's take Twitter data as an example; this is something we're actually going to provide training on over the summer, where we will use the Twitter API. When users sign up to Twitter they do sign something saying Twitter can share your data with third parties, and so on, but can they reasonably be said to have given consent to participating in your research project just because you can now access the API? And if that's not the case, so you say, okay, they have not given informed consent, how am I actually able to get it? Can I go through the API to get the users' details and contact them one by one? It gets very difficult and very murky quite quickly. The API will make a lot of information available, and again you might say, well, if somebody posted something on Twitter that appears private or personal, if it's publicly available through the API I can use it; but really, will there be bits of information that are private and personal that you really shouldn't be using for your research? And are there identification risks? Usually APIs won't tell you exactly who somebody is or exactly which organization you're looking at, but of course we know you can identify somebody using combinations of other variables, so maybe your research actually reveals a very vulnerable person on a social network that somebody else was looking for, and that can be quite worrying. And then let's say you capture some personal data through the API, and a couple of months later that individual thinks: oh, I don't want that in my profile anymore.
It was a phone number, it was an email address, but you've already captured it previously. Can you now use that in your research, now that the user has said: I consider this too personal? It is a minefield. That doesn't prevent research, and there's lots of great social science research using Twitter and other APIs, but you have to think about these ethical concerns. Excellent, so there have been lots of really good questions, and I'm going to go through some of them, so thank you. Firstly, I decided to take my time in response to feedback about taking a bit longer to go through the coding demonstration, so if you have five more minutes I'm going to work through some of the questions. This one comes up quite often, which is about Python versus Stata versus R versus MATLAB versus something else. I'm not going to dodge this question, but I'm going to say I'm quite agnostic about tools: for me, as a social scientist, I choose the tool that I think is best for a given task, but that's not always objective. I would like to think that Python in particular, and what I've just shown you, is English-language based, it's fairly understandable, it's fairly logical. I don't think R is quite as understandable as Python, and that's a personal judgment, but Python is certainly more understandable and easier to read than, let's say, Julia or Perl or C# or Visual Basic or lots of other programming languages. But ultimately it's up to you. How do you normally find out where the API is located? Quite simply, a Google search. I knew the Guardian had an API but I didn't know the web address; I just searched for it, looked through the Guardian's website and eventually found the documentation. So there's no easy answer for that, you just have to type it into a search engine: Google API if you want to use the Google API, Companies House API, and so on. That's linked to a question as well about how you can determine whether a website uses an API, and essentially it's what I've just described, doing a web search, though there are a couple of other ways, and I might write something technical about this. Some web pages are actually powered by an API, so when you load the web page it actually makes a call to an API to load in the content, and sometimes you can circumvent that web page and go directly to the API, even though there's no documentation about it. It's not technical in terms of how you do it, but how you spot the fact that a website has an API powering it is a little bit tricky, so I'll probably write something about that; it's quite difficult to describe here. Oh, this is a really good question: somebody's research topic is about news articles, so what are the benefits of using an API approach rather than going to LexisNexis? I know what LexisNexis is, but I haven't used it, so I'm making an assumption here, but I would say, firstly, that being able to write your own Python or R code, or whatever you're using, takes a little bit extra to begin with but actually brings enormous benefits in terms of flexibility and customization, and you can schedule your script to run at a fixed interval, so you're not having to manually go to LexisNexis itself. So there are a couple of advantages of using a programming language. A really good question again: when connecting to or scraping an API, do you need to think about permission? Yes, the API documentation will clearly state whether you need to sign up or not.
If you don't need to sign up, you can just do the requesting like you've just seen me do: there's no providing authentication, you just request, with requests.get(), the web address of interest. Basically, if there's an API available and it's got documentation associated with it, it's intended to be used; maybe you have to register, but if you don't, you don't have to ask them, you don't have to let them know that, hey, I'm about to download your data. It's there to be used as and when you need it. And yes, someone correctly guessed that it was Fyne Ales, based up in Scotland, that was the name of the brewery, well done. Are we asked to submit an ethics form to collect data from websites? Yes, I think if you're talking about an institution, a university for example, then yes: it's a data collection activity, it's part of your research project, it needs ethical clearance. If you're collecting data from an API that doesn't have anything to do with individuals, then your ethical clearance is probably guaranteed, or at least it's going to be easier to pass, but if an API has information about children or vulnerable adults, or really personal data, you definitely have to get ethical approval. Do you know any websites where they actively try to prevent you scraping? Yes. An API, as I said, is meant to be used, and it will only prevent you from making more than a certain number of requests, but you are allowed to request data. Certain websites, though, will block you, and Google is certainly one: if you try to scrape the Google homepage, for example, or Google search results, that will block you from scraping the web page, absolutely. There's a specific question about Companies House, and I'll ask you to contact me directly if that's okay, because that's quite specific and technical. There are some questions about extracting data from ScienceDirect: I don't know whether that has an API, but if it doesn't then you can potentially use web scraping to collect that data, so you can look at our previous webinar, or send me a question if you want. Do journals tend to have APIs? I would say they tend not to, but if they do, or if they have web pages, which they do, then to your question about whether you can extract DOI links: yes, in theory you absolutely could.