I'm Dermot McDonnell, a research associate at the UK Data Service, and thank you very much for joining us. As always, given the current circumstances we are in, your time is very much appreciated and I will certainly respect it during this hour. Today we're going to look at collecting data from websites, so this is the second in a series on web scraping for social science research. This one will focus on scraping websites, and next week we have a session on online databases, also known as application programming interfaces (APIs). We had a previous one as well on the 27th of March, which was a case study of how web scraping works for a piece of published social science research. So it's part of the wider package of training that we're providing here at the UK Data Service. As you can see, we have a further webinar looking more broadly at what it takes to be a computational social scientist. We've also got a fairly novel development: we're going to do some live coding demonstrations on a different platform. There are four in May, every Wednesday beginning on the 6th; these are free, roughly half an hour each, and I'll present a piece of social science analysis and teach you programming through that. Again, these are focused on Python, and you can find all these events on the UK Data Service events page as well.

So today we're going to focus on collecting data from websites. We'll talk a little bit about what web scraping is as a computational method, and we'll look at how you actually implement it. Today we'll do a lot of coding demonstrations; we'll take our time and go line by line through an example looking at, unfortunately, the COVID-19 public health crisis. We'll then reflect on what the value of this method is for social scientists. It is a computational method, but it is broadly applicable and quite popular now in the social sciences. Equally importantly, we'll look at the limitations and ethical implications of this approach, and we'll take some questions; I will read those out, because you can't see each other's questions. I'll also point you to some further learning resources.

So what is web scraping? It's a computational technique for capturing data stored on a web page. Computational is the key word here, because as I'm sure you're aware, it is possible to perform this task manually. You could load a website using your web browser, highlight some of the text or some of the tables, right-click, and copy and paste into an Excel file, for example. Performing this task manually does carry considerable disadvantages, particularly in terms of accuracy and, of course, the labor resource needed. Particularly if a web page changes every week or every month, you're going to have to time it so that you collect the data manually, and that's quite frustrating. Web scraping is generally implemented using a programming script: we'll pick a programming language, Python today, write some code, and run that code to perform the web scraping. It should be noted that there are software applications that will do the job for you. Many of those are paid-for, as I'm sure you'll understand, but some that you may already have to hand can be used for web scraping. For example, Microsoft Excel has a function that allows you to collect data from a web page.
One of our colleagues, Peter Smyth, covered some of these techniques in his webinar on the 16th of April, if you saw that a couple of days ago. So if you don't want to engage in programming, that is fine; there will be some kind of solution using a software package that you can implement. Saying that, I'm going to go out on a limb: I think it is relatively simple for a beginner to programming to write some web scraping code. And you can do it using open source programming languages: if your preference is for Python you can use Python, if it's for R you can use that. Even more traditional software packages like Stata are starting to provide web scraping modules; MATLAB is another one as well. So you do not need to be highly computationally literate, nor write hundreds or thousands of lines of code. This is a very popular and mature computational method, with lots of documentation and lots of examples, of which today is one.

One of the crucial questions, of course, with any piece of research or data collection method is: why do this? Why collect data from the web? First it's worth saying that web pages are a hugely important source of publicly available information. Unfortunately, the COVID-19 public health crisis is again an excellent example. You've got the World Health Organization website being constantly updated; you've got the UK government website and the devolved administrations being constantly updated; and the Office for National Statistics is producing lots of surveys and other real-time indicators that are also being updated, often fortnightly. So essentially there's a lot of very rich, valuable information that is publicly available, but it's not made available to you in a systematic or formatted way: there aren't lots of CSV files or Excel files that you can easily download. In that instance it is worth scraping web pages to collect the data. And there's lots of data stored on web pages: files that you can download, photos, videos, lists, tables of statistics, all of which may be collected and then marshaled and repurposed for statistical or qualitative analysis. Once collected, the data may be in an unfamiliar format, but you can do a little processing and put it into your more familiar rows-and-columns data structure. It can then potentially be linked to other sources of social science data. One example might be COVID-19 cases or death rates or recovery rates: these may be linked to official statistics on transport use in the UK, for example, and you can start doing some causal or quasi-experimental analysis of which interventions had the biggest effect, if any, on the rate of cases or the death rate from this disease.

No matter what programming language or software application you use for web scraping, there's an underpinning logic that applies across all of your activities. There are two pieces of information that you need to know to begin web scraping. You need to know the location, otherwise known as the web address or URL, the Uniform Resource Locator. This is the unique ID of a web page on the internet. So for example the UK Data Service home page can be accessed at https, etc., and that's usually a permanent, unique ID for accessing the home page of the UK Data Service. Then we need to know the location of the information we are interested in within the structure of that web page.
This involves visually inspecting a web page's underlying code using your web browser. Now, this is a manual process. You could use a programming language to scan the web page to try and find interesting bits of information, but that's not really good practice: you want to identify the information you're interested in and then go exploring, or hunting, for that information within the structure of the web page. Once you have that, you can start doing the more technical aspects of web scraping. You can use Python to request the web page using its web address. This is equivalent to you manually taking that UK Data Service link, opening a web browser such as Firefox or Chrome, entering that URL, pressing enter, and then seeing the UK Data Service web page; we can do all of that within Python or other applications. Once we've requested the web page, we need to take the content that is returned to us and parse its structure, so that our programming language knows: OK, I'm working with an HTML file, I'm working with a web page; now I can give you tools that allow you to pick out the information of interest. Otherwise, as we'll see, what's returned from the request for a web page is essentially a massive text that's really hard to navigate unless you parse the structure. Then we get to the more interesting aspects: we pick out the information we're actually interested in. And finally we take this information and save it to a file for future use.

Now, as we've outlined this logic, we can think of it as pseudocode. This is not code that will actually run: if I open Python and type "request the UK Data Service web page", Python of course won't know what that means. So we need to convert pseudocode into the exact set of instructions that are interpretable by our programming language. Now we're going to switch to the COVID-19 examples. We're going to take a look at a website that produces relatively real-time statistics on this disease, and we're going to see if we can collect both aggregate statistics and country-level statistics about its progress.

The way we are conducting our web scraping is using a technology called Jupyter Notebooks. This is essentially an electronic document that allows you to write narrative (as you can see here, we've got an introduction to our training materials), to write and execute code in Python or lots of other programming languages, and to view the results. So think of a Jupyter Notebook as a merging of the code you use to produce your analysis and the write-up of the results. If you're a social scientist you might use Stata, say: you'll have a do-file, and then you'll have a journal article writing up the results. Think of the notebook as a combination of both; you can write the entire analysis in a single document. What I will do is try to write the link in the question box and send it to all of you, so you should see a link to this notebook that you can run yourself on your machine. You don't need to install anything: you can run it in your web browser and follow along with what I'm doing. If you only have one screen, just keep focusing on what I'm doing, because you'll be able to access this code once the webinar is finished; but if you have two screens, feel free to work alongside me. So I'll just give you a quick example of how Jupyter Notebooks work.
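The cell in question is roughly like the following sketch; the exact wording of the prompt and greeting is an approximation of what's read out in the demo:

```python
# A simple Jupyter cell: ask for the user's name, then print a greeting
name = input("What is your name? ")
print("Hello, " + name + ". Enjoy learning more!")
```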
So here we've got a piece of Python programming code. I can execute the code, and now it's asking me for some input; here's my name. "Hello... enjoy learning more." I certainly will. But first we'll take a look at the website containing the COVID-19 material. I'm not sure if this website is familiar; it's been going a few years now, and it provides demographic and environmental statistics about our planet, in essence. You can see a counter tracking the current world population; as you can see, we're still increasing at a fairly scary rate. So it's got some interesting social science statistics, and since the COVID-19 crisis it has started producing near real-time, certainly up-to-date, statistics about this disease.

So let's define the task. We've gotten the first piece of information we need, which is the web address of the web page we're interested in, as you can see up here. Now let's locate the information we're interested in. To begin with, I am interested in capturing this statistic, this one, and this one, and I'd like to save these to a file. So we certainly need to locate that information. Visually we can see it's near the top of the web page, but of course that's how the web page is presented to us; what we need to do is look at the underlying code powering this web page and find where the information is within that structure. Some of you may be familiar with this technique: I'm in Firefox, so if I right-click I can go to View Page Source, and this will show me the HTML, the language that websites are built in. I can see the underlying code of this web page, and obviously a lot of it is quite technical; a lot of this code is about how the content is presented or rendered to us. We could go looking: we could search for "cases", and we can see there might be lots of different occurrences of this keyword. It might actually be easier if we highlight the piece of information we're interested in, right-click once more, and pick Inspect Element. Again, this is in Firefox, but there will be something similar for Chrome, Safari, Opera, or whichever browser you're using. Now you can see this has taken me to the exact location within the HTML of this web page. We can see there is a container, this div tag here; div means divider or section in HTML. So we've got a divider here, and within that we've got a header, h1, "Coronavirus Cases". If we keep digging down, we can see the statistic that we actually need. And if we keep looking, we see that the first piece of information is in a div with id equal to maincounter-wrap. And I see two more here: if I want the deaths, it's under this div tag, and if I want the numbers of recovered patients, it's here.

So now we know the web address of the web page we're interested in, and within that web page we know the elements or tags that identify the pieces of information we're interested in. Now that we have our two key pieces of information, we can actually implement the web scraping. There's just some information here if you want to learn more about Jupyter notebooks, but let's get stuck into the web scraping. We have our necessary information, so let's actually request the web page. There's a slight preliminary step in Python, which is basically loading in all the functions and all the modules we need to perform our web scrape. This is quite easy to do, and as you can see, we don't need a lot of modules to be able to do this.
We need a module here for working with the operating system; that's just so we can navigate through folders on our machine. This is one of the key modules: as the name suggests, the requests module is very good for requesting web addresses. We've got a csv module for working with comma-separated values files; that's good for saving our results. Pandas is a very good data frame module in Python; there are equivalent modules in R you might be familiar with. There's a datetime module that just allows us to capture today's date, which is usually good for naming files. Then the second key module for web scraping is, oddly, called Beautiful Soup. I'm not entirely sure why it's called that; that's a good piece of trivia. Once we request the URL, we'll use Beautiful Soup to parse the contents and allow Python to pick out the elements of interest. Because this is a Jupyter notebook, this is live coding, so thankfully my Wi-Fi is holding up. I've just said: import these modules, and if they import successfully, tell me; just give me a quick message to the console saying everything's been imported correctly.

Now we can get to the actual scraping itself. We need to request the web page; as I've said, this is analogous to opening a web browser and typing it in manually. Let's see how we can do it in Python. The first thing we do is say what the web address of the web page we're interested in is. Here's the web page, and this gets saved in our Python session: we basically say this variable called url holds this web address right here. Then what we do (working from right to left, which can help in Python) is call on the requests module and, within that module, use the get method. As you can tell, Python is quite English-language based: get the url. Get that url, and here's an option that allows me to be redirected to the correct website, just in case. I stored the results of this command in a variable called response. I haven't asked for the content of the web page just yet; I've actually just asked for its status code. Basically, when you request a URL you can get various codes returned to you that tell you whether you've done it successfully. 200 means: well done, you've made a successful request. You may have seen others, such as 404 or 400, if you've done some of your own web scraping, or even if you've just tried to access a website that no longer exists. A code in the 400s means that you've done something wrong; maybe it's the wrong URL, but you must have done something incorrectly. And there are lots of other codes in the 100s, 300s and 500s. But what we're looking for is 200, meaning you successfully requested the web page; anything in the 400s means something on your end hasn't worked correctly.

If you're slightly confused at this point about how Python looks and its conventions for writing code, it takes a little bit of time. As I said, we will be doing live coding demonstrations where we take you through how Python works using social science examples, but still covering the fundamentals. For now we'll just show you how to conduct web scraping, and it will take a little time to wash over you. Python is a language, and like any language it's all gobbledygook to begin with: you only understand how to say "salut" or "bon voyage", et cetera, but eventually you will develop your vocabulary. So, we can see that we've made a successful request; that's really good.
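As a sketch, the preliminaries and the request look something like this; the URL is the Worldometer coronavirus page discussed throughout, so treat the exact address as an assumption:

```python
import os        # navigate folders on our machine
import csv       # work with comma-separated values files
import requests  # request web addresses
import pandas as pd              # data frames in Python
from datetime import datetime    # capture today's date, e.g. for file names
from bs4 import BeautifulSoup    # parse the HTML that is returned

print("Modules imported successfully")

# Request the web page, allowing redirects to the correct website
url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url, allow_redirects=True)

# 200 means a successful request; codes in the 400s mean something went wrong
print(response.status_code)
```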
Because this is a live coding demonstration, I can make changes. Just to highlight that there was nothing special about the variable names I chose, I've again picked some English-language-based variable names; I want something that you can read and understand. So the variable web_address now holds the URL of the web page, and instead of calling it response I've just decided to call the new variable scrape_result. As you can see, we get the same result, which is really good: we've successfully requested the web page.

You may be wondering what exactly it is that we requested; I've just shown you that we made a successful request, but what actually gets returned? There's a text attribute of the response variable that we created, and we can take a look at that; we'll just look at a sample. This bit of the code basically says: just give me the first 1,000 characters of the text attribute. Otherwise we would see the entire web page scrolling down and down and down; there's too much content. You can see that when we request the web page, we don't get the visual rendering or visual representation that we do through our web browser. What we get instead is the underlying text: the source code that we had a look at earlier gets returned to Python. And as you can see, it would be quite difficult to navigate this text. If I search for the key term "cases", it does find it, but it also finds another occurrence here which I'm not terribly interested in, another one here, et cetera.

So this is where the Beautiful Soup module comes in, because it takes this massive text and says: hey, this actually has an underlying structure that we can identify. Let's see how Beautiful Soup takes the text that's returned to us and makes it easily understandable so we can work with it. What we do here is use the BeautifulSoup method that comes with the module. What do we want to convert? We want to take the text that's returned and parse it as HTML. It's a massive text, but we want to tell Python: hey, it's actually HTML, therefore it's a web page. We do the same again for a smaller sample of text, and then we take a look at the results. So again, we've asked for the first 1,000 characters, turned that into a Beautiful Soup variable, and here we can see the results. Now we see the different elements that make up a web page, the different tags identifying different aspects of it, and the hierarchical structure of the page. You'll notice, if we go back, that we don't see the hierarchy here; it's just one long line of text, essentially. But now that we've parsed it, we can see that it has a hierarchical structure and is composed of different elements.

Good: now we have the entire web page parsed as HTML and stored in this variable here. Let's see what we can do with it, and whether we can get those statistics that I pointed out earlier. We can now use Python to get those statistics. On the Beautiful Soup variable that we created (I've just called it soup_response; earlier, when we created that variable, we could have called it web_page or results, so you can call it essentially what you want, subject to some restrictions), what I want is to take the web page and call the find_all method. So I'm looking for multiple elements at one time.
I'm looking for all the div tags, specifically div tags that have an ID of this value. Let's see what that returns. I've said: find all of those divs, and as we can see it's returned a list. Any Python value that begins with an open square bracket and ends with a closed square bracket is a list, so we have a list of results. My command has found the first div tag: here's the ID, and it has successfully found the overall number of cases, the statistic and heading that I want. Because we've asked it to find multiple divs, here we can see the second one, and this one captures the deaths; then there's the third one down here, which I can't quite see on my screen, and you mightn't either. Instead of visually inspecting, we could also say: OK, I've asked you to find all of these sections, how many did you find? We can call the len function, or method, in Python: how long is the list? As we expect, and as we want, there are three different sections, and those do correspond to the statistics of interest. And if, for example, we wanted to look at each element of the list separately, we could use a loop: for each section in the list called sections, print the section (and I'm just printing some little blank lines around it for formatting). So you can see here that Python has now shown me each section we found, individually. Just to hammer home the point: we did create a variable called sections, so that needs to stay here; but this first variable name, where I'm saying "for each element in a list", I could call what I want. I could say "for each chicken", which is nonsensical of course, but I'm just basically telling Python: for each element in a list, print that element. You can probably tell what one of my favorite dinners is.

So we're nearing the end of the scrape. You might, hopefully, be pleasantly surprised at how little code we have to write, and how understandable it is, in order to conduct some web scraping. Now we have our list, and we want to pick out the actual statistics themselves, because at the moment we're returning quite a lot of text: we're returning the div and the id, and lots of other tags that we don't need. The statistics of interest are contained within span tags, so we just need to locate those span tags and extract the text. We do that as follows. Because it's a list, and lists are ordered in Python, we can refer to the ordering when stripping out information. We know that the first div that we found has the cases heading, so zero here refers to the first item, the first element, in this list. That may seem silly; it's to do with computing science. Zero denotes the first element in a list, one identifies the second, two identifies the third, et cetera. If you've used R, for example, you'll have noticed that indexing begins at one, which is much more understandable for humans; but in Python, indexing begins at zero. So for the cases statistic: go to the first element in this list, then (again using the find method) find the span tag, and within that span tag, give me the text attribute. Then I do a little bit of cleaning up: I strip the text, replace blank space with nothing (you mightn't be able to see it), and take out commas, because I want to be able to use the results as numbers, not just text.
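Put together, the parsing and extraction steps look roughly like this sketch; the div id and the span location follow what we saw when inspecting the page source, so check them against the live page:

```python
# Preview the first 1,000 characters of the raw text that came back
print(response.text[:1000])

# Parse the text as HTML so Python can navigate its structure
soup_response = BeautifulSoup(response.text, "html.parser")

# Find all the divs holding the headline counters
sections = soup_response.find_all("div", id="maincounter-wrap")
print(len(sections))  # expecting 3: cases, deaths, recoveries

# Look at each section individually
for section in sections:
    print(section)
    print()

# The statistic sits inside a <span>; extract the text and clean it up
cases = sections[0].find("span").text.strip().replace(" ", "").replace(",", "")
print(cases)
```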
If I want to know the number of deaths, for example: go to the second element of this list, find the span tag, extract the text, and do a little bit of cleaning up. Let's see if that works. It certainly does, but I'm realizing now that you probably can't see it at the bottom; I will prove that it works on the next slide, don't worry.

The final task is to save the results. We've created three variables, as you can see: cases, deaths, and recoveries. We want to write those variables and their values to a file. We're going to use a comma-separated values file; this is basically just a plain text file that separates its values using commas, and it's really common on the web. If you're going to download data from the web, you're going to come across CSV files quite often. The first thing is that I want to create a folder to store my results. I'm going to use the os module, as I said before (it stands for operating system), and its make-directory method, and I'm going to create a folder called downloads. The downloads folder is just going to be located at the same level of my directory as this script; I'll show you what that means in a second. And I'm wrapping this command in try/except. Enclosing it in try and except means: try to run this command, but if for some reason it doesn't work, tell me, so move into this except section and say I was unable to create the folder. As you can see here, I've moved into the except portion, because I've run the script previously on my machine and the folder already exists. So if it's your first time, it will create the folder; if it's your second time, it will not create it, and will just skip ahead and tell you that it's already there.

There's a little bit more writing here. I want to get today's date, so I use the datetime module and the now method; when I collect today's date from my operating system, I just format it in this manner: year, month, day. I want to define some variable names for the top row of my CSV file, so I'm going to call them cases, deaths, and recoveries. I'm going to define a file to actually store the statistics: in my downloads folder, I'm going to create a file called COVID-19 Statistics, which has today's date as part of the file name, and it's a CSV file. Then I want to define an observation (and again, obs can be called whatever you want): the observation will have the cases variable in column one, the deaths in column two, and the number of recoveries in column three. What we do then is open the file that we want to save, in write mode, and then basically use the csv module and say: OK, to the top row, the first row, write the variables; and then for the next row, write the observation. It's easier to see how this works when we open the file in Python. As you can see, I've taken today's date and formatted it in the way that I want using Python, which is quite good.

So how do we know this worked? Well, manually, you could just click on the downloads folder on your machine, but you can actually use Python to check whether (a) the file was created and (b) we can actually open it and it contains the statistics. We can use the list-directory method, which will list all the files in a folder that you give it. And OK, we can see that the COVID-19 statistics file was created.
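A sketch of that whole save step; the folder name, file name, and variable names are illustrative choices:

```python
# Deaths and recoveries are in the second and third sections
deaths = sections[1].find("span").text.strip().replace(" ", "").replace(",", "")
recoveries = sections[2].find("span").text.strip().replace(" ", "").replace(",", "")

# Try to create a downloads folder; tell me if it can't be created
try:
    os.mkdir("downloads")
except OSError:
    print("Unable to create folder: may already exist")

# Today's date, formatted year-month-day, for the file name
ddate = datetime.now().strftime("%Y-%m-%d")

variables = ["Cases", "Deaths", "Recoveries"]   # top row of the CSV
outfile = "downloads/covid-19-statistics-" + ddate + ".csv"
obs = [cases, deaths, recoveries]               # the observation row

# Open the file in write mode; write the variables, then the observation
with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    writer.writerow(obs)

# Check the file was created
print(os.listdir("downloads"))
```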
You can see this folder contains another file that I was working on earlier; we're going to look at that file in just a moment. But let's actually open our file. We know there's a file in the directory, but does it actually contain what we want? So we'll open the file, and this time, instead of writing to it, we're just going to read it: we import the contents, store them in a variable called data, and print the results. As you can see, it's not structured as the kind of dataset we're familiar with, but we do have a file, it does have a structure, the variables and the values are separated by commas, and we can see that we've collected data about the COVID-19 crisis using web scraping techniques. The good thing is that those core techniques I've just shown you effectively carry over to almost all web scraping activities you'll do. Sometimes it can get more complicated, but you'll always use these foundational techniques.

So I'm going to end with a slightly more complicated example. We're going to go to a table on the coronavirus website and extract every row of that table; but actually, it's going to use much of the same techniques. We'll get the preliminaries out of the way really quickly: we're importing modules and capturing today's date, so that's fine, and we're going to request the same web page. Just to show you, we can exit the view source, and if we scroll to the end, here's the table that we're interested in. It's a list of countries, where the first row is the global statistics, and you can see the table is ordered by the total number of cases. A lot of countries in the world are affected by this, unfortunately, so there are lots of results. We can also see some are highlighted differently: the table contains not just countries but entities, I suppose, that have been affected. Here's a ship, the Zaandam, that was affected by the crisis, and there should be another one, obviously the Diamond Princess. So this table doesn't just contain country-level information; it contains basically every major entity that's been affected by the coronavirus.

Again, we know the web page we want; how do we locate the information? Let's inspect the element again and look at the underlying source code. If we work our way back up, we can see that there is... yes, this is what I'm looking for: there's a table tag, and it's identified by this value here, main_table_countries_today. This is the table I want to collect information from. So let's do the preliminaries and request the web page again. You can type this into the question box if you want: how do you know if the web page was requested successfully? You will get a status code of 200. Now let's find the table. We're using the same techniques again: I visually inspected the web page, so I know that there is only one table with this ID, and so I use the find method; I don't need to use find_all. (And don't worry if you use find_all when you should use find; Python tells you pretty quickly which one you should use.) Then, in that table, I want to find all the rows: in my table, find every row tag, specifically every row tag that has a blank style. Now we want to extract the information contained in each row, and this is where it gets slightly more complicated, but it's still okay.
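As a sketch, here's the read-back plus the table scrape that the next part walks through; the table id is the one we inspected, and the column position used in the pandas check at the end is an assumption to verify against the scraped rows:

```python
# Read the saved file back in to confirm it contains what we want
with open(outfile, "r") as f:
    data = f.read()
print(data)

# Find the countries table by its id, then every row with a blank style
table = soup_response.find("table", id="main_table_countries_today")
rows = table.find_all("tr", style="")

# For every row, find every column and strip out its text
all_rows = []
for row in rows:
    columns = row.find_all("td")
    country_info = [column.text.strip() for column in columns]
    all_rows.append(country_info)

print(all_rows[:3])  # the first three rows

# Save every row to its own CSV file
table_file = "downloads/covid-19-countries-" + ddate + ".csv"
with open(table_file, "w", newline="") as f:
    csv.writer(f).writerows(all_rows)

# Read it back with pandas and sample five rows at random
df = pd.read_csv(table_file, header=None)
print(df.sample(5))

# Statistics for one country, assuming the name is in the first column
print(df[df[0] == "Ireland"])
```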
What we're doing is: OK, we want to create a blank list to store all of the country-level information. We create our blank list, and then for every row in the table, we find all the columns; and for every column in every row, we strip out the text. So we go into every row, for every row we find every column, and then we strip out the text. We store the country-level information in its own list called country_info, and each time we go around the loop, we append country_info to the bigger list here. It's easier to see what I'm saying if we print out the first three rows of the overall list. Again, apologies if it gets cut off a little bit, but we can see here we've got a list, and within the list we've got country-level rows; so we've got one row of country-level information. As always, it's much easier to just save the results and then take a look at the file, so once again we try and create a downloads folder and define some variables. Because it's a table, we have many more variable names. We create a file to store the results and print the file name, just so we know what we're doing.

So again, let's check if the file was created. Now we're going to use the pandas module, which I've shortened to pd when importing it. Again, English-language-based; Python is great. We use the read_csv method: we read in the output file that I've defined, tell it how to understand that file, and here, give me a sample of five rows from that dataset. Again, apologies if it's cut off a little bit; I can come out of this for you in a second. But you can see that it's randomly selected five rows from this table, and I can keep running this; as you can see, it gives me different results each time. So it's reading in the dataset, and from the dataset it's producing five different rows. What if I wanted to look at an individual country? I can say: right, in the dataset, show me a row where the country variable equals China. And if I did it for, say, Kuwait, it might be easier to see up here. Actually, I'll do it for my home country, the Republic. And you can get the statistics for a given country.

So that's our COVID-19 example. It's obviously an incredibly seismic public health crisis that is dominating, and will continue to dominate, our lives. I've used it as an example not as a craven attempt to bring topicality to this, nor just because it is a particularly good example for learning (I mean, it is an incredibly good example for learning web scraping), but because I really do think that there are social science research projects that could leverage data like this. In the wider notebook, I point to other open data resources to do with COVID-19: for example, the ONS surveys, and the NHS is releasing some open data about its triage and its 111 service. So there's lots of data you can get, some of it available through web scraping approaches.

So let's round up quickly, before we get to the Q&A, with what we consider the value of web scraping to be. Firstly, as I've stated, it's a mature computational method: there are lots of examples and lots of packages or libraries to help you (we've seen requests and Beautiful Soup), and there's great help available online. Essentially, the learning curve is not as steep as for other computational methods, and I hope you would agree with me on that point.
Using computational rather than manual methods means that you can start automating or scheduling your data collection script. You might think: right, I'll take a quarterly snapshot of a local authority's website, or I'll take a daily snapshot of COVID-19 data, given how rapidly that situation develops. Of course, you could do this manually, but you'd set a reminder in your Outlook calendar or on your phone, and frankly, life gets in the way. I think the richness and the variety of the data stored on web pages is worth repeating too. Again, this is publicly available, with some restrictions, as I'll go on to discuss, but there's a plethora of really, really rich information out there about social phenomena of interest. As I mentioned briefly earlier, my research topic is the charity sector. There's a volume of information about which charities register, which ones disappear, their financials, their reports, the benefits they provide: everything. The web is awash with data about my topic. So I personally need to know web scraping approaches, but I'd say it's probably creeping up on all of us, quantitative or qualitative researchers; there's a lot of valuable information that we could get our hands on. And I think not only are computational methods good for accuracy; they can be executed in real time, and they're reliable once you get your code working. You know the code should work in the same way every time, in comparison to you manually highlighting and copying and pasting a table, where you might get to the end of the table and mis-copy and paste, et cetera. Even if Python and HTML are unfamiliar, as you've seen, you can spit the data out into a format that is familiar: you can store it in a CSV file that you can open in Excel and that looks familiar; you could perhaps use a database if you're a little bit more technical; or you could just store everything in TXT files, plain text files that you can import into a different software package.

But of course, there are limitations to this method. One of the main ones is the fact that you may contravene the terms of service of a website. By that I mean websites in general tend to have guides to their use, and these are essentially a contract between you and the website: if you use the website, you agree to adhere to the terms of service. In some instances, yes, the website might actually say: we completely prohibit web scraping, and if we find you doing it, we have a legal basis for taking you to court, which is not good. So step one is that you look at the website and have a look at its terms of service. The website we've just scraped thankfully does not prohibit web scraping, but its terms of service do say that you can't use the data without their permission. So we can scrape it, but if we want to do analysis, or we want to republish those statistics elsewhere, we need their permission. We can scrape, and we can learn how to do it using this website; we just can't use their data.

The related point is that there's a lack of certainty around the legal basis. In the UK, there's not a specific law that prohibits web scraping or the use of data collected via web scraping, but there are laws that come into play. An obvious one, which gets asked about quite frequently, is copyright law, or intellectual property. With the coronavirus example we just looked at, those are statistics, or facts, essentially.
Facts are not copyrighted, but the way those facts have been presented to us on that website can be; that can be the copyright of the Worldometer organization. So you have to be careful about copyright issues. The second major legal limitation, or thing you need to think about, is data protection. If you're scraping information from, or information about, individuals, you have to comply with (well, currently) the GDPR, the General Data Protection Regulation. Of course, we're leaving Europe reasonably soon, probably, so the UK will have a different data protection law. But in essence, if you're scraping individual information, just because it's publicly available doesn't mean you've got no responsibility for how it's used in the future. So copyright law and data protection law come into play. In the notebook, though, I do reflect on the fact that there is an exemption: there's an EU directive whereby, if it's for scientific work and you're a member of a research institution or a cultural institution, there can be exemptions. You can scrape data, and it's protected under fair use for research purposes.

One of the other major limitations is that web pages are frequently updated: the HTML code can change reasonably often. So if you're looking for a table that has a certain ID, that ID might change, or that table might be moved to a different web page, or the web page itself might be deleted. It's a lot of work maintaining your code, especially if, as I'm doing, you're releasing it to others; you need to check that it works on a regular basis. Some websites may be advanced enough that they throttle or block automatic scraping of their contents. My computer has a unique ID on the internet, its IP address, and the website might say: we don't recognize your IP address, or, we've seen you before and you're not allowed to request this information anymore. So you need to keep an eye out for that. And web scraping, and computational social science in general, is dependent on your computing setup. You don't need high-performance computing, you don't need oceans of RAM or hard drive space, but there is a kind of minimum: you need your machine set up in such a way that you can conduct these activities. We're going to provide guidance on that, but just be aware that you can't just turn on your laptop and, hey presto, start writing Python code.

So, very quickly, the ethical implications. I'm coming from a research perspective, and many of you are as well, but even if you're in the private sector, there are still going to be some ethical guidelines you need to follow. If you are in a research institution, obviously your web scraping is part of a wider research project, and that needs to gain ethical approval from your institution. So you need to start thinking: well, OK, if I'm collecting data about individuals, can they give their consent for their information to be scraped? They might have given their consent to the website to use their information, but have they given it to you, for example? Have you contacted the website in advance to gain permission? Is there harm to the researcher or to the website from your scraping? (And as we'll see, there actually can be.) What are you going to do about data security? Are you going to scrape data and then store it on your institution's network? Is it going to be on Dropbox, which is outside the European Union, et cetera, et cetera? So there are lots of ethical implications that I hope you're familiar with as a researcher.
Something specific to web scraping, which I think is quite interesting, is the impact on the website itself. Web scraping is computationally intensive, even on a small scale. I'm sending a request, using some of the bandwidth that my internet service provider gives me; the request goes to the machine that hosts the website, and it takes some bandwidth to process, because they need to send me back the web page. So some computational resources are used whenever you make a request. And if you're running scripts really frequently (think about Google, for example: their scrapers are constantly going over the internet, finding websites and parsing the structure of websites; that's how Google works, it looks for links), you can overload a server. Let's say there's a charity, a reasonably small one, and it has a website with interesting information. They won't have lots of computing resources; they'll have paid for a basic web hosting package. If I'm constantly sending requests to that server, requesting the web page, it's going to crash. And if individuals and organizations rely on that website for vital and timely information, can you really say: well, I can make really good use of the data, I'm so sorry I crashed the website? Say you're working on benefits claims, for example, and you crash the Department for Work and Pensions' website; or what if you crash the British Red Cross's website because you're scraping it? So I think that's a really important ethical implication, specific to web scraping, that you need to consider.
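One simple courtesy on that front, not shown in the session but worth having in your toolkit, is to pause between requests so you don't flood a small server. A minimal sketch, with placeholder URLs:

```python
import time
import requests

# Hypothetical list of pages to scrape from one small site
urls = [
    "https://www.example.org/page1",
    "https://www.example.org/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(5)  # wait five seconds between requests to ease the load
```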