 Okay everyone, apologies. I think I should be okay, hooray. That means the wall clock and the echo all gone. I'm assuming so. Apologies, hopefully you have some sympathy for homeworking and being turfed out of the room I usually work in for family reasons. My daughter needs more space than me. The wall clock was fine. Okay, let's get up and running. What we're going to do is run through a demonstration of collecting police data for England, Wales and Northern Ireland. Again, Julia will keep an eye on the questions in the chat. Fantastic. The link as well to the code we're using today, if you'd like to follow along, Julia should be able to post that into the chat. I think she's doing that right now. I won't waste any more time. Let's get stuck in to collecting data. So this lesson today is about getting data from APIs. So we're going to take a look at using Python for downloading data from the web. We're also just going to try and cultivate your computational thinking skills as well through some coding examples. So how do you define and solve a computational data collection problem using computational methods? Let's make that a bit more legible. So we're not going to assume any previous knowledge of Python or collecting data from the web. It would have been useful if you've participated in the first two sessions, but it's not a requirement. I'll talk you through everything we're doing. We're using police data, so it's social science data, but if you're from a different background from the private sector, that's totally fine as well. The techniques you're going to learn apply much more broadly. Yeah, and we're going to look at what an API actually is and understand the kind of key steps and requirements for going about collecting data from an API and being able to use Python for this particular task. So hopefully you'll get quite a lot out of today's session. So this is a Jupyter notebook. 
Some of you will be familiar with it now. It mixes live Python code with some narrative, like you're seeing here, and the results of the code as well. So if you're following along with us just now, basically you just have to run or execute the cells that are marked by this In [ ] symbol, and it'll be up here on the left. So for example, this is a piece of Python code. It's a very simple piece. Some of you will have probably seen it before. There may be a little bit of a lag in accessing the notebook just because you're all accessing it at the same time. So if it's not running right now, sometimes the notebook might need to be reloaded. Okay, never mind. Thankfully I have a version on my machine. It's just because I think we're all accessing it at the same time. So let's give a quick demonstration of using Python code. There we go. It asks for my name and tells me to enjoy learning more about getting some data from the web. So very quickly, what an API is. It's short for application programming interface. There's a formal definition which is probably not terribly helpful. How I understand it is that it's an intermediary between software applications, in much the same way that a translator sits between two individuals who wish to communicate with each other. So you might have an English speaker and you might have an Italian speaker and they don't speak a single word of each other's languages. Therefore what we need is a translator. The person speaking English only needs to worry about communicating what they want to say in a way that the translator understands, and then the translator does the hard work of communicating with the other person. Obviously that simplifies what you can say to the translator. The translator won't be able to understand every single thing you would like to say, but they'll understand enough to allow the two of you to communicate with each other. So similarly an API simplifies how applications communicate with each other.
So for example we have this quick look here. So a program wants some data. It connects to the API. The database talks back to the API and goes back to the program. So for example if I had a smartphone application that needed real-time traffic data. For example from Transport Scotland. I could go directly to Transport Scotland's database. But then I'd need to know a lot of really technical information about how that database actually works. What language it understands. How the data is returned. All sorts of really technical things. But if we put an API in the middle. That takes care of talking to the database and it takes care of talking to us. So we can send really simplified requests to the API for data. It communicates those requests to the database. It takes back the data and then passes it on to you in a format that you can understand also. So think of it as an intermediary or some kind of translator. So what's the general approach. So it's always good to map out what we call maybe the pseudo code or the algorithm or just the general approach for getting data through an API. So the first thing we need to know is the location of the API. So that's its web address. So where is the API found. So that's common. It's the same format as a link to a website. So the UK Police API is located at this web address. So that's reasonably easy. I just did a Google search for police data API and it brought up some data that looks interesting. Then we need to know the terms of use associated with the API. So a lot of APIs will restrict how many requests for data you can make in a minute or in a day or over a year, etc. They might have different levels of access. So there's a free level and there's an enterprise level and there's a custom level where you pay the most money, for example. With the UK Police API, you don't need to register. You don't need to provide any password when you make a request for data. It's essentially an open data portal, which is good. 
It does restrict how many requests for data you can make, which is 15 per second, but it's unlikely you'll be making so many requests. But you can imagine a more sophisticated app that is repeatedly and constantly making requests for data, and that could overload the API. So we know where the API is. We know how we can use it. What data is available on the API? So we're interested in where the data is available through the API. And that's known as something called the endpoint. So the endpoint is the location of the data on the API. Again, a technical term, but the endpoint, as you can see here, is again essentially a link or a URL or a web address. So if we were to click on that link just now, that won't work because we need to formulate the request more appropriately. But when we use Python to do this, we'll be able to request the data that's found at this location. So the location of the data is known as its endpoint. So then we have three bits of information. Then there are three essential tasks we need to perform to get the data. So we need to register our use of the API. In this instance we don't, but for most APIs you probably need to register. So we used the Guardian API recently. I needed to register with my email address and get a password that I can use. Then we want to actually request the data itself, the actual interesting bit. And this is known as making a call to the API. So it's just slightly different terminology. Requesting data and making a call mean the exact same thing. And of course, we want to save the data because we don't want all of our work to be done in vain. We want to save it for future use. So let's take a social science example quickly. So let's get some up-to-date crime data for England and Wales and Northern Ireland. For some reason, Scotland and the British Transport Police are excluded from this database. I won't make any comments. So let's find the API. So we've kind of established that it can be found here.
And as I just showed you previously, there's not actually anything available at this web address at the moment. It's simply the stem, or the base, of the web addresses we'll be using to request data. So every request for data will begin with this and then we'll add on options. So for example, if we were interested in all police forces, now we can see that that's a valid web address. So here we've got the stem as far as here, and there we go. If we want all the forces, here we get a nice list of all the police forces available through this database. So we're off to a good start. As I said, if you can click on it yourself, that would be really good. Before we delve a bit deeper, let's again just establish how we can use the API. The police API is reasonably well documented. That's not always the case at all. I think someone posted about the Twitter API and getting into a bit of a mess. Yeah, I agree. It can be quite difficult to navigate. That's got a lot of documentation, but it's not particularly helpful. So of course, with the police API, no authentication is needed. We don't need to register, and we do not need to provide a password every time we want to try and get some data. We can make on average 15 calls per second, with up to 30 in a single second if that's atypical or infrequent. It's highly unlikely you'll exceed the rate limit if your purpose is research, but you may have an idea for an app using police data, so you don't want to be restricted either. And finally, we need to know the endpoints. So we need to know the location of the data that's available through the API. It's got over 20, I think it's got 21 different data sets, data resources on the API, categorized as information about police forces, so you can get a list of senior officers; crime itself, so different crime categories and events; and information about neighborhoods.
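The average rate limit mentioned above (15 calls per second) can be respected with a small pause between calls. The following is only a minimal sketch of that idea, not code from the session's notebook: the names `min_interval` and `paced` are made up for illustration, and the two force IDs are just sample values.

```python
import time

CALLS_PER_SECOND = 15  # the average limit quoted in the session


def min_interval(calls_per_second):
    """Smallest gap, in seconds, between calls that keeps to the average rate."""
    return 1 / calls_per_second


def paced(items, calls_per_second=CALLS_PER_SECOND, sleep=time.sleep):
    """Yield items one at a time, pausing before every item except the first."""
    gap = min_interval(calls_per_second)
    for index, item in enumerate(items):
        if index > 0:
            sleep(gap)
        yield item


# Each pass through this loop is where one request to the API would go.
for force_id in paced(["dorset", "city-of-london"]):
    print("would request data for", force_id)
```

For a handful of requests made by hand this is unnecessary, but it is a cheap safeguard once a loop starts making one call per force.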
And what we'll look at today is stop and search data, which is quite interesting as well. And again, for this particular instance, there's really good documentation. So here you have, again, a list of all the endpoints and instructions about how you would request data from each endpoint. But I'll do a lot of the hard work for you today. So we can skip this step, we don't need to register, and we can get straight on to requesting data. So let's get to the interesting bit. Let's focus our activities. Let's define a task that's reasonably interesting for us to complete. So my first step is I'm going to download a list of police forces in the UK. For each force, I'm going to download stop and search data. And then I'm going to save everything that I've downloaded to various files so I can use them in the future. So the first thing is to get Python set up with what we need. Those of you who've used Python before will understand what's happening here. Basically, a lot of the functionality and methods that you would like to use for computational research don't come as standard with Python, or if they do, they don't launch when you start up Python. So lots of things are available when you begin Python. Computation, for example. Whoops. So if I wanted to multiply two numbers together, I can do that already. It's not working at the moment because the cell is set to text. I need to make it a code cell. There we go. So with Python, you can do things out of the box, so to speak. Calculations, printing messages, all these kinds of things. But if you want to scrape web pages, you need to load in the requests module. If you want to work with a certain type of file, known as JSON, you import the json module. Nothing difficult about this. You may have experience of loading libraries in R, for example. Or in Stata, you might have installed some user-written packages as well. So we've set up Python with what we need, only a handful of packages, which is really good.
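As a small illustration of the point about modules: arithmetic works with no import at all, the json module ships with Python but still has to be imported, and requests is a separate package that has to be installed before it can be imported. This sketch only checks whether requests is available rather than assuming it is; the one-record JSON text is a hand-typed sample.

```python
import importlib.util
import json

# Arithmetic is available out of the box.
print(6 * 7)  # 42

# The json module is part of Python's standard library, but must be imported.
record = json.loads('{"id": "dorset", "name": "Dorset Police"}')
print(record["name"])

# requests is a third-party package (installed with: pip install requests).
# find_spec reports whether it can be imported, without raising an error.
requests_installed = importlib.util.find_spec("requests") is not None
print("requests installed:", requests_installed)
```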
So we're going to tackle Task 1, which is getting a list of police forces in the UK. So I'll run the code first before talking through. So basically, I need to define the web address that we're going to use to request data. So we know, for example, that there's a base URL, and that'll always be there. So it's handy to put that in a variable called base URL. And I could call that what I wanted. That makes sense to me. And then I know that data about police forces is located at the forces endpoint. So again, I just create a variable called forces, which stores that value. And then the web address that'll actually work for requesting data is a composite of the base URL plus forces. And you can see here, the plus symbol, if you have two bits of text joins the text together. While if those variables were numbers, they would add the numbers together to produce a result. So this is the URL we're using. Again, just as proof, you can see that it really does work. That's the URL we need. Then we start using the requests module. So first, from the requests module, we use the get method and we ask Python to get or collect the information located at this web address. And we store all of that in a variable called response. And then what we want to do is check whether we made a successful request for data. Thankfully, there's a status code attribute associated with the response variable. If it equals 200, fantastic, you've actually successfully requested data. If it's in the 400s, it means you haven't formulated the request properly. Maybe it's a typo, maybe the web address isn't valid, for example, but it's quote unquote your fault. If it's a status code in the 500s, there's something wrong on the end where the API is. So maybe the website is down, maybe the server is not working. Maybe the web address has been moved somewhere else, for example. 
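The steps just described can be sketched roughly as follows. The transcript does not show the web address itself; the base URL below is the publicly documented address of the UK Police API, and the variable and function names are guesses rather than the notebook's own. The requests package is imported inside `fetch` only so that the helper functions can be run without it installed.

```python
base_url = "https://data.police.uk/api/"  # assumed: not shown in the transcript
forces = "forces"                         # the endpoint for police force data

# With two pieces of text, + joins them together.
url = base_url + forces
print(url)


def describe_status(status_code):
    """Translate a status code into the three cases described in the session."""
    if status_code == 200:
        return "success"
    if 400 <= status_code < 500:
        return "client error: check how the request was written"
    if 500 <= status_code < 600:
        return "server error: something is wrong at the API's end"
    return "other"


def fetch(web_address):
    """Request a web address and return the response object."""
    import requests  # third-party package; needs installing first

    return requests.get(web_address, timeout=30)
```

With an internet connection, `response = fetch(url)` followed by `print(describe_status(response.status_code))` would carry out the check described above.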
So basically, when we request any website or an endpoint or something from the web, a status code of 200 means well done, you've successfully requested. But what have we actually requested? So we're not interested in status codes, we're interested in data. In the response variable, we've got a method called JSON. So that's a particular data type. JSON is a type of data. We can use that method to grab the data and put it in a variable called forces data. And then when we call that variable, it should list all the forces in the UK. So you can see it's maybe a slightly unfamiliar data structure. So it's a long list of forces. Each force has two variables. It's got an ID variable with a unique ID of that force. And it's got a name variable, which as you would expect is the name of the force in question. So even just a cursory glance shows that the data hasn't been returned in tabular format that we're used to. Every row is an observation, every column is a variable. So it comes back in a different format. It comes back in something called a JSON format. Thankfully, Python gives us lots of ways of manipulating that type of data. It's not that difficult. It's just probably unfamiliar if you're a social scientist. So voila, very simple task. We've been able to get a list of all police forces excluding Scotland and BTP. So hopefully you'll agree that the requesting bit is relatively simple. You define a web address and you use the request module to go get the content that's stored at that web address. That's pretty easy. It's the data that's returned. It's kind of the difficult bit really of working with an API in my experience. So just to show you the importance of how data is stored, I'm going to define two variables here. One called my number and one called my string. So a numeric and a text variable. This should work if I add the number 50 to a number. But if I try and add it to a piece of text, Python kicks up an error. 
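To make the last two points concrete, here is a small sketch that needs no internet connection. The three forces are a hand-typed sample in the same shape as the API's reply (a list in which each force has an id and a name), and the values given to `my_number` and `my_string` are arbitrary, since the transcript does not say what they were.

```python
import json

# Text in JSON format, shaped like the reply from the forces endpoint.
sample_text = """
[
  {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
  {"id": "city-of-london", "name": "City of London Police"},
  {"id": "dorset", "name": "Dorset Police"}
]
"""

# json.loads turns the text into ordinary Python lists and dictionaries.
forces_data = json.loads(sample_text)
print(forces_data[0]["id"], "|", forces_data[0]["name"])

# Why the type of data matters: numbers add, but a number and text do not.
my_number = 10
my_string = "ten"
print(my_number + 50)  # 60

try:
    my_number + my_string
except TypeError as error:
    print("Python refused:", error)
```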
So you can't connect a string and an integer together. But you can obviously add two numbers. So how does that impact what we've just done? Well, we've gotten some data back from the API. It's good to check what type of data that actually is. So we can use Python. We can say, OK, tell me the type of data this variable is. And you can see that we have what's known as a list data type in Python. And you can also identify lists by the opening and closing square brackets. So if we go back, you can see that this variable here begins and ends with some square brackets. So that tells you you're working with a list. So now we have a list. We can count how many elements there are in the list. So how many police forces were actually returned? So we can use the length function, the LEN function. So how long is the list? There are 44 elements in the list. So there are 44 police forces contained in the list that we downloaded. OK, that's good. So what if we wanted to look at each police force in that list separately? So what we can do is use a loop. So we can say for every force in the list variable. I'd like to say that's a deliberate spot. So for every force in this list, print the value of that variable. So show me the actual information for each force. And then I just print a blank line underneath just for some separation. So again, we're just looping over a list. So a list has an ordered set of elements. And we just go to each element and we look at each element in the list. Again, when I said I had an error, I didn't really. This is just basically a placeholder for every L, which is element in the list. Print it. You'll be bored of this if you've watched me do it. I've used this example before. For every chicken in the list, you get the same results. Because it doesn't matter. I'm just saying this is a placeholder for every element in this list. And just show me every element. So that's enough chicken examples. 
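The list operations above, sketched on a three-force sample rather than the full download (so the length here is 3, not the 44 seen in the session). The sample values are hand-typed for illustration.

```python
forces_data = [
    {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
    {"id": "city-of-london", "name": "City of London Police"},
    {"id": "dorset", "name": "Dorset Police"},
]

# What kind of data is this, and how many elements does it hold?
print(type(forces_data))  # <class 'list'>
print(len(forces_data))   # 3

# Look at each element in turn, with a blank line for separation.
for force in forces_data:
    print(force)
    print()

# The loop variable is only a placeholder, so any name gives the same result.
for chicken in forces_data:
    print(chicken)
```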
The final thing we can do if we're working with a list is pull out a particular element by referring to its location in the list. Because as I said, a list is ordered. So there's element one, element two, element three, for example. So say I wanted to know which police force is found in position 10 in the list. We can use some square brackets again and we can insert a number referring to that position in the list. So Dorset. So Dorset Police is the 10th element in the list. Are you slightly confused that I've used the number nine to refer to position 10? Yeah, it can be a little bit confusing. Python begins counting at zero. This is in contrast to R, for example, which begins counting at one, and humans, who also begin counting at one. So the value nine refers to position 10 in any list, and not just the list we're using. So a very simple rule of thumb, because you may get caught out with this in future, is that element number n is located at index n minus one in the list. So that's just a bit of a rapid tour of the list data type in Python. And now that we've gotten up to speed, what we want is the unique ID of every force that's contained in the list. And it would be good to store those IDs themselves in a list that we can use later. So this neat little bit of code says: for every force in the list that we've downloaded, extract the value from the ID field. So there's nothing really difficult here. And then we store all of that in a variable called force IDs. And then by calling on the variable, it produces the results. So I'll just do it again to show. Now you can see we don't have the name variable anymore, because all we've asked for is the ID variable for every force in the list. And because we've done that, we can loop over the list of IDs and we can request each one of those forces using the requests dot get method. We'll see that in a moment. Apologies, I think maybe hay fever is coming on.
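Both ideas, zero-based positions and pulling out every ID, can be tried on a hand-typed sample of the first ten forces. The IDs follow the alphabetical order the API uses; the names are typed from memory and may differ slightly from the live data. The variable name `force_ids` follows the session's "force IDs".

```python
forces_data = [
    {"id": "avon-and-somerset", "name": "Avon and Somerset Constabulary"},
    {"id": "bedfordshire", "name": "Bedfordshire Police"},
    {"id": "cambridgeshire", "name": "Cambridgeshire Constabulary"},
    {"id": "cheshire", "name": "Cheshire Constabulary"},
    {"id": "city-of-london", "name": "City of London Police"},
    {"id": "cleveland", "name": "Cleveland Police"},
    {"id": "cumbria", "name": "Cumbria Constabulary"},
    {"id": "derbyshire", "name": "Derbyshire Constabulary"},
    {"id": "devon-and-cornwall", "name": "Devon & Cornwall Police"},
    {"id": "dorset", "name": "Dorset Police"},
]

# Python counts from zero, so the tenth element sits at index 9.
tenth_force = forces_data[9]
print(tenth_force["name"])  # Dorset Police

# For every force in the list, extract the value of its id field.
force_ids = [force["id"] for force in forces_data]
print(force_ids)
```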
So we've constructed our list of force IDs, and I just talked us through that, so we've got something we can use. While we will loop over all of those IDs and get stop and search data for every police force, and that code is contained in the notebook, right now I'm just going to simplify the task. So I'm going to say, get me stop and search data for the City of London police force. So we're just going to keep it nice and simple. So again, we've got much the same process. We've got the base URL, which we always need. Now we're not looking for force data. We're looking for stop and search data. So the endpoint is different. It's stops, hyphen, force. And then I want to search for a particular force, which is the City of London. So that's the unique ID from the force ID list that we just created. So again, I construct the web address. It's slightly longer because it's the base URL plus the endpoint plus a custom search term. That's what this question mark, force, equals is doing here. Again, we'll print the web address. We'll request the web address. We'll check if we were successful. We'll store the data in a variable called sas underscore data. And then for every element in that list, I'm just going to add a new field called force. And it takes the value of this here. You'll see why I'm doing that in a second. Basically, it's because if you request stop and search data, obviously you know which force you've requested because you've specified it here. But the returned data itself doesn't actually tell you which police force the stop and search data refers to. Good. So I just executed the code. This is the web address containing the data of interest. It kind of flashed up there. You can see that there's a lot of data for that police force. So there's 170 different records for stop and search. And you can get a look at what the fields are in this data set. So there's an age range. So they stopped somebody between 18 and 24.
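The City of London request described above might look roughly like this. It is a sketch under assumptions: the base URL is the API's publicly documented address (the transcript does not show it), the function names are invented for illustration, and the two records are made-up examples whose field names are modelled on the fields mentioned in the session. The requests import sits inside `fetch_stop_searches` so the other helpers run without it.

```python
base_url = "https://data.police.uk/api/"  # assumed: not shown in the transcript


def stop_search_url(force_id):
    """Base URL + endpoint + a search term naming the force."""
    return base_url + "stops-force" + "?force=" + force_id


def tag_with_force(records, force_id):
    """Add a 'force' field to each record, since the API's reply omits it."""
    for record in records:
        record["force"] = force_id
    return records


def fetch_stop_searches(force_id):
    """Request one force's stop and search data and tag each record."""
    import requests  # third-party package; needs installing first

    response = requests.get(stop_search_url(force_id), timeout=30)
    response.raise_for_status()  # stop here on a 4xx or 5xx status code
    return tag_with_force(response.json(), force_id)


print(stop_search_url("city-of-london"))

# Made-up records in roughly the shape described in the session.
sas_data = [
    {"age_range": "18-24", "gender": "Male", "object_of_search": "Article for use in theft"},
    {"age_range": "25-34", "gender": "Female", "object_of_search": "Controlled drugs"},
]
sas_data = tag_with_force(sas_data, "city-of-london")
print(len(sas_data), sas_data[0]["force"])
```

With an internet connection, `sas_data = fetch_stop_searches("city-of-london")` would replace the made-up records with the real download.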
The outcome was nothing. Nothing needed to happen. Self-defined ethnicity, their gender. They were interested in an article for use in theft, for example. So some really interesting data to do with stop and searches. But again, we don't need to look at the data that way. We've downloaded it into Python. That means we can continue using Python to check the data. So this is just looking at the exact same information in Python this time. Again, this is the person we were just looking at. They're between 18 and 24, et cetera. And you can see, as it scrolls down, there's lots and lots of stop and search records, which is pretty interesting. So we'll just close the results. So how many stop and searches were there for the city of London? Remember that we're working with a list. So we can use the length method again to say, well, how long is the list? So there are 171 stop and search instances that we've downloaded from the API. So we've got some stop and search data. We've got a list of police forces. It would be interesting to store those for future use. So because the data comes in something called a JSON file format, it's good to export it in that format also. And maybe future demonstrations, or you can contact me if you're interested, you can convert JSON to CSV. Sometimes that's easy. Sometimes it's not that easy. But if you want to work with something in a different format, that is possible also. So the first thing we're going to do is create a downloads folder and just to store the results. So I've run the code there. I've created the folder. If I try to run it again, you can see that, hey, this already exists. No need to create the folder. Again, going to do some simple things. I'm going to collect today's date. It's useful to get today's date, I think, for naming your files. So we're going to create a file that'll store the list of police forces. I'm going to call it UK-PoliceForces. Today's date and it's a JSON file. And then we've got the city of London. 
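Creating the folder and building date-stamped file names, as just described, can be sketched like this. The function names are invented, the City of London file stem is hypothetical (the transcript does not give it), and the demonstration writes into a temporary folder so that running it leaves nothing behind; the session itself used a downloads folder next to the notebook.

```python
import os
import tempfile
from datetime import datetime


def ensure_folder(path):
    """Create the folder if needed; return True only if it was created now."""
    if os.path.exists(path):
        print(path, "already exists, no need to create it")
        return False
    os.mkdir(path)
    return True


def dated_filename(folder, stem, when=None):
    """Build a path such as downloads/UK-PoliceForces-2020-05-01.json."""
    when = when or datetime.now()
    date_stamp = when.strftime("%Y-%m-%d")
    return os.path.join(folder, stem + "-" + date_stamp + ".json")


with tempfile.TemporaryDirectory() as scratch:
    downloads = os.path.join(scratch, "downloads")
    ensure_folder(downloads)  # creates the folder
    ensure_folder(downloads)  # second call finds it already there
    forces_outfile = dated_filename(downloads, "UK-PoliceForces")
    col_outfile = dated_filename(downloads, "col-stop-search")  # hypothetical stem
    print(os.path.basename(forces_outfile))
    print(os.path.basename(col_outfile))
```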
Again, it goes in the downloads folder. I'll call it this, with today's date, and it's a JSON file. So in turn, I open each file and I dump the JSON data into each one. So in this instance, all I've asked the code to do is just print today's date. We need to actually check that I have saved the data. So the simplest way is to check whether, A, the files were created and, B, the data were actually written to them. So we can use Python to check the downloads folder. So basically, show me all the contents of the downloads folder. That's what's happening here. Okay, it's definitely created some files, the ones I wanted. That doesn't mean there's actually anything in the files. So we need to check that also. So basically all I'm doing is importing the data back in. So let's open the police forces file. We're opening it in read mode. So we're not trying to write data to it. We're trying to read data from it. So we're going to read the contents of the file, store it in a variable called data and, again, just call that variable and take a look at what it contains. And again, voila, the file does actually contain the list of police forces that we're interested in. So I just again wanted to control myself and stick to one example. So that's a quick tour of the UK police API. In Appendix A, which I'll show you in just a moment, I've got a little bit more code, maybe that much, if that makes sense, showing how you can collect stop and search data for every police force. But for now, before I take some questions, I just want to briefly reinforce what we've actually done. So it's been quite quick, but we've learned how to import modules into Python. So you know that for certain techniques, you need to bring the functionality into Python, and that's really simple. It's the import command. We've learned how to make requests or calls for data to an API. And that's worked really well, thankfully. We've learned how to handle and save data that's returned by the API also.
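The save-then-check routine described above (dump the data, list the folder, read the file back in) can be sketched as follows. The function names and the file name are illustrative, the two forces are a hand-typed sample, and a temporary folder stands in for the session's downloads folder.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Open the file in write mode and dump the data into it as JSON."""
    with open(path, "w") as outfile:
        json.dump(data, outfile)


def load_json(path):
    """Open the file in read mode and load its contents back into Python."""
    with open(path, "r") as infile:
        return json.load(infile)


forces_data = [
    {"id": "city-of-london", "name": "City of London Police"},
    {"id": "dorset", "name": "Dorset Police"},
]

with tempfile.TemporaryDirectory() as downloads:
    forces_outfile = os.path.join(downloads, "UK-PoliceForces-example.json")
    save_json(forces_data, forces_outfile)

    # Check A: was the file created?
    print(os.listdir(downloads))

    # Check B: does it actually contain the data?
    data = load_json(forces_outfile)

print(data == forces_data)  # True
```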
You get data back in something called a JSON format. That's different from a tabular format. It requires slightly different data manipulation techniques than you're used to. But we've learned a few ways of pulling out the ID field from a JSON file. And we've learned how to save that data as well. And hopefully I've done all this in a fairly efficient, clear and effective manner. So again, very quickly, thanks for putting up with the technical home working issues. I think interacting with an API is fairly simple. Easy for me to say, maybe, at this stage, but I don't think it's drastically difficult. Handling the data that comes back for analytical purposes is challenging, certainly. But making the requests for data, I think you can get up and running. And I'm sure you've got lots of good ideas yourselves about what you would like to do if you could get that data. If you're collecting data from an API, you've got to think about data protection. So, for example, if you're collecting personal information from Twitter, do you have to seek consent for getting that information? That's one issue. You obviously have to comply with the terms of use. So you have to stay within the number of calls you can make per second, per minute, per day, whatever that is. And there are many other murky ethical issues. But hopefully you'll agree that it's quite an exciting area of social science and computational research in general. So good luck with progressing your own research on that. I'm going to take questions. I haven't been looking at the chat because now I'm on my phone, because I don't have my second monitor. Okay, yeah, so I'll start taking some questions now. Fantastic. Yeah, so Julia will be sharing the link to the GitHub, which is fantastic. I'll just come out of slideshow mode. All of this is available on our GitHub.
And there, again, we suggest some further reading and resources, all free again. And some extra code for collecting stop and search data for lots of different police forces. So we've got a question here: when I extract the data from the list, it still includes HTML tags. So I take it you're probably scraping a web page rather than using an API, correct? You can reply back in the chat if you want. I'll keep an eye out. So if you're scraping data from a website, you're not just scraping the content that you see on the website, you're scraping the underlying code behind it, which is known as HTML. So then it's not enough to just have the requests module. What you need also is the Beautiful Soup module for navigating an HTML data type. So in the way that we've used the json module for dealing with JSON data, you would need the Beautiful Soup module for working with the HTML tags that are returned when you scrape a web page. In this case, when we work with an API, because we're downloading the actual data itself, we're not scraping it. It's been designed to actually share data that should be usable pretty quickly. So yeah, the long answer short is basically, if you see last week's demonstration, which is also on YouTube and all the code is on the GitHub, then you'll see how we can actually deal with the HTML tags in data that you scrape from the web, if that's okay for an answer. And yeah, sorry, Julia has been responding as well. Can you show the answer for the task just at the end of section four? So somebody's been working through the notebook at the end of section four. Yeah, certainly. So in this example here, I've imported the data in the forces file. If I wanted the City of London, remember that previously I called it COL. So if I call on the COL underscore out file variable and I read in the data that's stored there, there you go, there's the City of London stop and search data.
So yeah, just to go back here, there's a variable called COL underscore out file, and that points to the location of the file containing stop and search data for the City of London. Fantastic, thank you for the question. Oh, okay, so the use of the Gov API. Okay, I'm taking that to be the UK government API. It's probably easier to contact me after this, just because it might take me a while on my machine to try and find similar data. So if I can address your one offline, then that would be fantastic. But I'm happy to help you figure out what's going on with that. I've got a question about the Twitter API. Yes, very good question. It seems slightly different from the police API. Yes, Twitter is understandably protective of the data it holds. So it does ask you to register your use of the API. It's quite unfriendly language because it asks you to register your application or your app. Twitter is assuming that you're trying to build a smartphone application that uses Twitter data. As researchers, maybe you do want to build an app at the end, but you probably just want to get a hold of the data itself. Someone contacted me about this recently. It looks really onerous, but actually you just have to fill out some questions about the name of your app. So that could just be your research project. It'll ask you questions about whether you're sharing data with government officials, which again is almost never the case. And that doesn't apply to you sharing your findings with colleagues or in a briefing paper. That would be if you use the Twitter API to get data and then immediately transfer that data to a government department. That wouldn't really be allowed. So yeah, for Twitter and Facebook, it can take a little bit more time to register your use of the API.
But once you do, I mean, they're quite open to academic research. So it should be okay. You should be able to get access to the data you need. Thank you. See if there's a couple more questions. A question I missed about Google. Oh yes, I did. Thank you. Okay. Do you have any experience with Google Trends? Personally, I do not. Google does have a lot of APIs allowing you access to some of its data. I'm presuming Trends has an API. Yeah, you probably can get Google Trends data from an API. Yeah, that's something I'd need to look into, but I presume you can. Yeah, I mean, there's mention of one here. So there's a Google Trends API. Yeah. So okay, so it should be doable in theory. There's a second part to your question which is quite interesting, which is a more general point is when we were using the police forces API, can you just change the web address slightly to get different information? So I mean, yes you can. I know that Dorset is a valid force ID, so I can search for that. And yeah, you'll find that you can change the web address slightly. It's almost like a little trick to access information. I won't show you other examples because I think by doing that, sometimes you can actually get access to files that you're not really meant to scrape or to collect. Excuse me, but yes, in theory, you can change around the URL for an API. Excuse me, or for a web address and that might actually get you access to the data that you need. Yeah, so fantastic. I can only scroll back so far on my phone if that's okay. Yeah, so if we can, I think we shared the GitHub link. If not, don't worry because you've all signed up through Eventbrite. We'll be emailing you after next week's final session about setting up your computational environment. Yeah, we'll have a full set of links to all the code demonstrations. I can just show you quickly where you can get that. Who am I signed in as? Perfect. Yeah, so we've got a dedicated repository for all the coding demonstrations. 
And there's a code folder. And at the moment you can see we've got intro to Python, web scraping and the APIs code as well. And hopefully it worked for you. If you click on the Binder link, that just saves you downloading Python and Jupyter notebooks to your machine. So, given that we had a late start, I am reasonably early. Thank you. Thank you for putting up with the substantial technical issues. Julia and I will be in contact soon. Hopefully you've learned a little bit about how to interact with an API. As I said, the notebook has a little bit more reading and a few more tasks it asks you to do. I'll just run the code in the appendices really quickly, just to prove that we can get stop and search data for every police force in the same block of code. It takes a wee while because actually there's a lot of stop and search data, as you can imagine. It's quite a lot actually. So 44 forces times hundreds and hundreds of results per force. Yeah, it's deciding to be a little bit slow, so we'll leave it for now. But yeah, thank you very much. Please contact me via my Twitter or my email address. I'm very happy to help out. I hope you're all doing well and I hope I'll see quite a few of you next week as well. So thank you and speak to you soon.