 Hi everyone. Thanks for joining us. It's 4 p.m. I'm assuming people can hear me. If not, then please, similar to last week, let me know very, very quickly. Excellent. I'm Dermot McDonald. I'm a research associate at the UK Data Service with my colleague Julia. We've been developing some live coding demonstrations, particularly for social scientists, but for people from a wide disciplinary background. Basically, if you're new to programming, new to web scraping, new to Python, we've got stuff to help you. So we've got some live coding demonstrations today. We're on week number two. So we're focusing on collecting data from the web. So we're going to use a technique called web scraping. So we've got about 25-30 minutes of a live coding demonstration that you can follow along with as well. So Julia, my colleague will post the link into the chat. I'll see if I can do it as well. No, I can't. So Julia, if you could please post the link. Excellent. So if you look in the stream chat option on the right hand side of the stream, you'll see a link. If you copy and paste that link into a new tab in your browser. So I'll do it here. That will launch the Jupyter notebook containing the Python code we're going to use today. Fantastic. Excellent. So it's working for me. You should see something similar. It will take maybe 10-20 seconds to load up. So yeah, if you have a second screen or if you want to split your screen, then use this link. We'll give it a couple of seconds if you want to get set up. And what you do then is you can enter slide show mode. So you can see I'm hovering my cursor just up here. It looks like a bar chart. Basically the whole document reads like a notebook. So it's an electronic notebook. It contains some text and it contains code and it contains the results of code as well. So you can work through the document as it is, but it's easier to enter slide show mode. It just cleans it up a little bit and we can begin. Excellent. 
So just some introductory material. Today we're going to focus on using Python to collect data found on websites. This is more colloquially known as web scraping: we're trying to scrape data from a website. And we're also going to try and develop your computational thinking skills through some coding examples. Specifically, when we say computational thinking, we're not really talking about technical ability. We're really talking about defining and solving problems. It's using computers to try and solve human problems, in a sense. And the problem we have is that there is data that's really interesting, and it's found on a website. One way of getting that data is to copy and paste manually. That doesn't really scale up; it takes a lot of work on your part. So we're going to look at how computational methods can help us collect data. If people are getting... oh, I'll just take a quick look. Yeah, so some people are getting a "Binder inaccessible" message. That's strange, because it is working for me, and this is not my personal copy. Let me just try again. So I'll move it onto my screen here. Copy and paste. It seems to be loading for me. So I'm accessing that notebook through the web; it's not on my machine. If it's still not working, keep posting messages, unfortunately. Oh: "ignore the error message". Okay, I hadn't seen the error message. So somebody's just posted a solution: if you ignore the error message, it works. Perfect. People are figuring it out. Excellent. Thank you. You're much more tech-literate than I am. Fantastic. Well done, guys. Okay, back to the presentation. Yeah, so the two aims: we'll look at how Python can be used to collect data from the web, and we'll try and define what we're doing in terms of problem solving. So briefly today, I will make myself a little bit smaller so you can see the slides. And we'll make this a little bit smaller as well. So this has been designed as a kind of self-taught or self-directed teaching material.
I'll go through almost all of it today, but there's a little bit more in the notebook that we don't cover. So you've got the link; you'll be able to work through this yourself. It takes about 30 to 60 minutes. Today I'm assuming no knowledge of Python and no knowledge of web scraping; it will take you right from the beginning. But we did do a session last week specifically on Python, so you may find it useful to work through that at a different time as well. It's for anybody, so thank you for tuning in. And yep, today you'll understand the general approach for collecting data from web pages, but specifically you will be able to use Python for getting a web page, parsing its content (that's understanding its structure), extracting the data you need, and then saving that data to a file for future use. So if you're new to Jupyter notebooks, as I said, a notebook contains a lot of text and the results of analysis, but the meat and drink of it is the code. You can run the code that's contained in the notebook. Basically, all you need to do is look for cells that have this "In [ ]" symbol on the left hand side, and then all you have to do is run or execute those cells. So now that we're in slideshow mode, I'm going to use the keyboard shortcut; for me, that's shift and enter at the same time. Or you can use control and enter as well. So here we have some really basic Python code. I do shift and enter, and it executes the code. It asks me for my name, I give it my name, and it gives me a gentle little message to say good luck with what you're doing. So straight into it: what is web scraping? It's a computational technique for capturing information stored on a web page. Computational is the key word, because, as I said, you could manually go to a web page. You could open your browser, put in the web address, and then, when you see the content appear, you could start copying and pasting. You could highlight the text, right-click, copy, etc.
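That introductory cell does something along these lines. The exact wording of the notebook's message is an assumption on my part, and here the interactive input is wrapped in a small function rather than run directly:

```python
def greet(name):
    # Build the friendly message the notebook prints after you type your name
    # (the exact wording here is assumed, not the notebook's actual text)
    return f"Hello {name}, good luck with the session!"

# In the notebook this runs interactively:
# name = input("Enter your name: ")
# print(greet(name))
```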
And then maybe you paste that into an Excel file or a Word document or a TXT file. I'm sure you can understand the considerable disadvantages collecting data from the web manually would bring. There's the potential for human error. It's so time-intensive. One of the examples we'll see today is to collect data about charities. There are 160,000 in England and Wales alone. If I wanted all of their data on a certain topic, yeah, it's not really going to happen. So what's the general approach? So no matter what programming language you use, be it Python or Julia, PHP, there's so many that you can use, there is a general logic, there's a general approach for collecting data. So the first thing we need to do is, well, we need to decide on what we're interested in collecting. So let's say we can think of a web page. So the UK Data Service, for example, has its website here. So that's the location of the UK Data Services website on the web. So this is known as a link or URL or web address. So this is the location of the UK Data Service website on the internet. So we need to know that. Once we do, then we need to find information that we're interested in on that web page. So for example, there might be a paragraph that we're interested in extracting. There might be a table of statistics. There might be lots of images or videos. There may be links we want to collect on a web page. So we need to know where that information is stored. And that's a visual process, as we'll see. There's not really a computational or programmatic way of doing that. You do manually inspect the web page. So you need two key pieces of information, the web address of the website and the location of the information that you're interested in collecting. So then we're into the doing stage. So what we want to do then is request the web page using its address. And we're going to do that in Python. And then we're going to parse the structure of that web page using Python. 
So then we can actually work with its contents; I'll explain what that actually means when we do it in practice. Once we've parsed the structure of the web page, then it's about getting the information we're interested in. And most importantly, we want to save our work so we don't have to redo the web scrape. So these six basic steps that we've laid out are called pseudocode, or, in a very simple way, an algorithm: the rules we need to follow in order to collect data from the web. We're going to look at two examples today: one real social science data example to do with charities, but to learn the techniques we're going to use a much simpler example. It's a real website, but it has been designed for practicing web scraping. Okay, so where can we find this website? The web page is here at this web address. We can take a quick look: if I open this in a new tab, I can get a look at the web page. It's just a really simple text web page. It contains an extract from Herman Melville's classic work of English literature, Moby Dick. I'm not sure which page it's from, but it's certainly a section of Moby Dick. We can actually load that website in Python. Oh, yes, and if you're wondering how to navigate, I should have said at the beginning: if you're following along on the Binder link, it's the space bar to go forward and shift and space bar to go back. So I've just gone back, and space bar goes forward. Apologies. You can use the directional keys, but it's a bit tricky. So, as I said, if I want to take a look at that website in Python, I can load in the module I need: there's an IFrame module we can use. Then I use the IFrame method, tell it the website I want to view, and I can adjust the width and the height. So, for example, just to show you that we can edit this, now the width is not as large, so it's more difficult to see the web page.
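In code, embedding a page in the notebook looks roughly like this. The URL below is a placeholder, not the actual practice page used in the session:

```python
from IPython.display import IFrame

# Embed a web page inside the notebook; width and height control the
# size of the viewing pane. The URL is a placeholder for the real page.
frame = IFrame("https://example.com", width=800, height=400)
frame  # in Jupyter, the last expression in a cell is rendered
```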
So this is a useful feature of Python and Jupyter notebooks: you can embed videos, images, and other websites in the notebook. Good. So we've done the first thing: we know where the website is, we know its web address. Now we need to find the information we're interested in. In this example it's really simple: we just want this big long paragraph here. This is what we want to scrape. We want to ignore this header here, and we just want the text. Okay, so we can visually see where that paragraph is, but that's not really what we need. What we need is the underlying code that powers a web page. That's called HTML: hypertext markup language. It's basically a very simple markup language for creating web pages. What HTML does is describe the structure of a web page, and it does that by defining a series of elements: paragraphs, tables, headers, images, links. All of these elements together make up a web page. Each element is identified by what's called a tag. So a paragraph is identified by a p tag, a table is identified by a table tag, a top level header is identified by the h1 tag. This link here has some extra, quite useful information on HTML. So when you're trying to find information on a web page, it's not about what you visually see when you load the website. The fact that this header is the first thing I see doesn't actually tell me where that header is in the page's code. What I need to do is look at the source code, and we need to do this manually and visually. So I'll open this in a new tab just to make it clear what I'm doing. I'm using Firefox as my web browser, so what I want to do is right click anywhere on this web page and go to "View Page Source". What that does is show us the underlying code powering this website. And this is what HTML looks like: it's a series of tags.
So we've got an html tag saying this is the beginning of our web page. We've got some head tags; usually within these you'll see metadata about the page. For example, if you have Google Analytics tracking views of your page, there'll be some code here about Google Analytics. But for scraping information, what we're interested in is everything within the body tags. There's an opening and a closing tag, and within those is the information we need. So we can see the header here, Herman Melville, Moby Dick: it's an h1 tag. And then the information we're interested in scraping is contained firstly within a div tag, which marks a section in HTML, and then the actual information itself is within these two p tags. So it's a really simple webpage: everything we need is within this p tag here. So let's see how we can use Python to get that information. We visually inspected the page. If you're using Chrome it's similar: right click and "View Page Source". Safari is a little bit different; I think there's a developer option you need to enable first, so I have a link here with instructions if you're on Safari. I've shown you the actual source code as well, but I've also put it in the notebook, and here's a full HTML webpage. Excellent. We have our two key pieces of information: the web address and the tags on the webpage relating to the information we want to collect. So let's request the webpage. We know how to do that using our browser; I've just shown you: you copy and paste the link in, you press enter, and the webpage appears in your browser. Python can do that for you. But there's just one quick preliminary step: we need to load in modules that help us do the web scraping. Python comes with lots of functionality built in. For example, if you wanted to do calculations, then right out of the box Python knows what multiplication is. If you wanted to print a message, we saw that earlier.
And the print command already exists in Python. But lots of other commands and functionality need to be loaded into Python, and we need these three key modules for scraping a webpage. We need the os module, which is good for working with your operating system: navigating between file folders, creating new files, those kinds of actions. For scraping the webpage we want the requests module, which mimics copying and pasting the link into your web browser. And then BeautifulSoup, which we'll call soup, and that's for actually parsing the webpage. Soup tells Python: hey, you're not just working with text, you're working with an HTML file. So I run this code, and then I print a little message just to say the modules have been loaded in. So let's actually request the webpage. I need to tell Python where this webpage is found, so I define a variable called url, and its value is the link to the webpage. Next, I use the requests module. From the requests module I use the get method. So you can see it's quite easy to understand in Python: if I want to get a webpage, I use the get method. What am I getting? I'm getting this variable here, which we earlier defined as this web address. And then there's a little option to do with the get method, which says that if the website tries to redirect us to the correct address, we allow that. It will work without this option, so we could get rid of it. And then the last line of code here basically says: tell me if the request was successful. A 200 code in response to requesting a webpage means: yep, we've sent you the webpage, everything has gone as expected. I'm sure, just by using the web, you've come across 404 errors when you try to request a webpage. That's an unsuccessful request; it means something on your end has gone wrong, so you haven't done it correctly.
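Since the status codes carry the meaning described above, here is a small sketch of how you might check them. The describe_status helper is my own illustration, not part of the notebook, and the live request lines are shown commented out because they need network access:

```python
# requests is the third-party module loaded in the notebook:
# import requests
# response = requests.get(url, allow_redirects=True)
# code = response.status_code

def describe_status(code):
    # Interpret an HTTP status code along the lines described in the talk
    if code == 200:
        return "success: the page was delivered as expected"
    if 400 <= code < 500:
        return "client error: something on our end, e.g. a 404 bad address"
    if 500 <= code < 600:
        return "server error: the website itself has a problem"
    return "other response"

print(describe_status(200))
print(describe_status(404))
print(describe_status(502))
```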
If you get a 500 or a 501 or a 502, there's something wrong with the website itself. Maybe it's down for maintenance, maybe there's an issue, etc. So we're looking for a 200 code. Good. So the request has worked, but you may be thinking: where's the webpage? I thought you said we've requested it. So we have, but first we can take a quick look at some of the metadata associated with our request. There's an attribute of the response variable called headers, and this is what the webpage gives back to us when we request it. You don't really need to know much about this. There's a handy date field, which, as you can see here, tells you when the server sent the response back to you. And the content type can be interesting: you can see that we're working with an HTML file. It's possible to use Python to download files from the internet, so this might say something like Excel, or it might say JSON, or a different file type; it might say PDF, for example. But in general you don't really need to know much about the metadata. It's just important to know it's there: it proves that the website has communicated with you, and it's talking back to you saying, here's some information about your request. So what's the actual content of our request? We've successfully gotten the webpage, but where's the content? Where's the title? Where's the paragraph that we requested? Well, we've got something called the text attribute. As you'll remember, earlier we stored the results of our request in a variable called response. Now, we could call that variable whatever we want: scrape_results, chicken, it doesn't matter, it's just a variable name. But it stores all of the results from the request. It stores whether it was successful or not, it stores the metadata, as we've seen, and it also stores the webpage itself. So this is what we've actually requested: give me the HTML behind that webpage.
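The headers behave like a dictionary. The values below are hypothetical stand-ins for what response.headers returns on a live request:

```python
# Hypothetical header values; on a real request these come from response.headers
headers = {
    "Date": "Tue, 01 Mar 2022 16:00:00 GMT",
    "Content-Type": "text/html; charset=utf-8",
}

print(headers["Date"])                             # when the server sent the response
file_type = headers["Content-Type"].split(";")[0]  # strip the charset part
print(file_type)                                   # here an HTML page; could be JSON, PDF, etc.
```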
So we can kind of see the structure of the HTML: we can see the opening html tag, we can see the heading tag, and we can see the paragraph tag. But it's a bit messy, as you can see. Python understands that it's text, and we can view the text, but it doesn't recognize it as HTML. And this is the next crucial step; this is what we need the BeautifulSoup module for. BeautifulSoup tells Python: hey, this isn't just a blob of text, this is actually a webpage. And once Python knows it's a webpage, you can start extracting the information you need. This is what we call parsing the webpage; it's basically identifying the structure of the webpage. So we're going to use the BeautifulSoup module, a really good open source module for Python, and it provides a systematic way of navigating the elements of a webpage. So let's see how it works in practice. It's a one line bit of code. We're using the soup method, and into it we're saying: hey, take the text that we requested and parse it as HTML. So tell Python we're working with an HTML object. And then we're just going to view the new variable that we created. And voila: we can see that now Python recognizes that it's a hierarchical structure. HTML is hierarchical: there are opening and closing tags, and the content runs from the top to the bottom. Now we can see Python understands that. If we went back to the raw text, you can see that Python doesn't understand that there are tags, it doesn't understand that there's a hierarchy, et cetera. But now that we've parsed it as a BeautifulSoup object, Python knows we're working with HTML. So that's not just an academic piece of knowledge: now we can actually use Python to navigate and extract the information of interest. You'll remember that what we want is the text contained within a paragraph, and a paragraph is identified by the p tag.
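The parsing step sketched in code. The inline HTML below is a minimal stand-in for the practice page's source; on the live page the raw text comes from response.text after the request:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the scraped page's HTML; the real text comes
# from response.text after the request succeeds
html = """<html><head><title>Example</title></head>
<body><h1>Herman Melville - Moby-Dick</h1>
<div><p>Call me Ishmael. Some years ago...</p></div>
</body></html>"""

# Parse the raw text as HTML so Python sees structure, not a blob of text
soup_response = BeautifulSoup(html, "html.parser")
print(soup_response.prettify())  # indented, hierarchical view of the tags
```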
So what we're going to do is use BeautifulSoup. We're going to use the find method, and we're just going to find the p tag on the webpage, which we've called soup_response. We're going to save the results of our search in a variable called paragraph, and then we're just going to view the contents of that variable. And again, now we can see that we've cut out the rest of the HTML, and we have a variable called paragraph that just contains the p tag and the information within it. So we're almost there. The next thing is: we know there's a p tag we're interested in, so how do we get the text from within the p tag? Again, we use the find method on the soup_response variable to capture the p tag, and then we extract the text like so. The new variable we created, paragraph, has a text attribute, so I'm just going to take the text from within the p tags and store it in a new variable called data. We'll take a little look at the data right now. And that's actually almost all there is to it, certainly with this example. We've requested the webpage, we've parsed it as an HTML page in Python, we've filtered out all the other elements of the webpage so we just have the paragraph we need, and then we've extracted the text contained in that paragraph. And as you can see here, we now have a new variable called data, which just contains the information from that paragraph. So we might want to save this for future use. We'll do that very simply: we'll just create a brand new TXT file. The first step is to create a variable that defines where the file is located. Basically, I'm just going to save the file where we currently are. This script is running somewhere online, and I just want to create a new file at the same location as this code: in the same folder we're working in, create a new file called mobidickscrapedata.txt.
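The find-and-extract step, sketched on a stand-in page (on the live page, soup_response is the parsed object built from the request):

```python
from bs4 import BeautifulSoup

# Stand-in for the parsed page; on the live page this object comes
# from parsing response.text
soup_response = BeautifulSoup(
    "<html><body><h1>Title</h1><p>Call me Ishmael.</p></body></html>",
    "html.parser",
)

paragraph = soup_response.find("p")  # the first <p> element on the page
data = paragraph.text                # just the text, with the tags stripped away
print(data)  # → Call me Ishmael.
```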
So what I want to do is open this file that I've defined here. Open the file, open it in write mode (write means save, in computer science language), and call that file f, for shorthand. And with that file, write the data to it. So we didn't get any output; we haven't asked Python to print anything to the screen. What we need to do is check if this actually worked, to see that I'm not just tricking you. The simplest way to check is whether, A, the file was created, and, B, the actual contents were written to it. And this is a good test here, because I'm currently working with the online version, so the file is not on my machine. I can't go looking for it on my machine; I need to actually use Python to go and find it. So there's a method called listdir, for list directory. Basically, this lists all the contents of the current folder, so this will check if the txt file was created. And yes, it has. That's really good: we know Python has at least created the file. But remember, the next step was opening the file and placing the data into it. So we're basically doing the same as when we saved the file, but this time opening the file in read mode. So, in read mode, read in the contents of the file, store them in a variable called data, and then print the variable data. And voila, we get back the original data that we scraped. And again, just to show that Python variable names can be whatever: I can store the results in something called this. You've seen I haven't created that variable before; I'm not showing you something I've made previously. Again, I've opened the file, I've read the contents of the file, stored them in the this variable, and I just want to look at the contents of this variable here. And I can just zoom out a bit. You can see I have scraped all of that paragraph and I have saved it to a file.
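The save-and-check steps above can be sketched like this, using a stand-in string for the scraped paragraph:

```python
import os

out_file = "mobidickscrapedata.txt"  # file name as used in the session
data = "Call me Ishmael."            # stand-in for the scraped paragraph

# "w" (write mode) creates the file, overwriting it if it already exists
with open(out_file, "w") as f:
    f.write(data)

# Check 1: was the file created in the current folder?
print(out_file in os.listdir("."))

# Check 2: were the contents actually written? Re-open in read mode
with open(out_file, "r") as f:
    contents = f.read()
print(contents)
```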
And I've been able to import the file once more. And voila, that's how we have successfully scraped a web page. So now, for about five or ten more minutes, I'm going to show you a slightly more complicated example. It uses all the same techniques we've just used, but in reality, trying to collect research relevant data from a web page carries a few extra tasks. The web page we were just working with was very simple: it had one paragraph, which made it very easy to find the data we were interested in. Most websites will have quite rich web pages: there'll be lots of data spread across the pages that we want to collect, or there might be multiple pages we want to scrape as well. We had one p tag, but there could be multiple p tags on a web page and we only want one of them. So how do we filter out all the irrelevant p tags and just get to the one that we actually need? And of course there are lots of other tags we could work with as well, and lots of other potential issues. What happens if the web page is down? Does your script break? Can you control for that? Et cetera. So we're going to look at a piece of social data that makes it slightly more complicated to collect. For one of us, myself, web scraping is a tool for collecting data that I just can't get any other way. I'm interested in UK charities in particular, and so I want to scrape data from charity web pages. That data is not released in a file; it's not on an open data portal, for example. Some information about charities I can only get from a web page, so it's a necessity in order to create research data for myself. For this specific example, I'm interested in which policies a charity reportedly has in place. That sounds a bit dry. Why it's interesting to me is that you can link that data to observed organizational outcomes. So we know which charities go bust.
We know which charities break the law. That's all open data. So we've got an open data set, and if we can collect information on policies, for example, then we can explore correlation. So if most charities who go bust actually have a risk policy, then what's the point of having one? Or maybe the point of having a risk policy is that it forces you to think about risk, and maybe those charities last longer, for example. So for my research area there is a real need to collect this data. We'll go through this example quicker than the previous one, because we're employing all the same steps, but I'll stop at certain points where you'll see something different. Again, we need a couple of extra modules, but really the same ones as before: we need requests and os, we need BeautifulSoup, and this is just the IFrame module for embedding a webpage. The only new module is called pandas. It's a data set module in Python; it's similar to the tidyverse in R, for example, if that's what you're used to using. So what we want to do is identify the webpage I want to scrape data from. I'm going to focus on an incredibly well known charity, Oxfam, so we're going to try and get their data. Basically, the web address looks like this. So here's Oxfam's public data on the charity regulator's webpage. There's lots of administrative information: Oxfam's website, how to get in contact, its headquarters in Oxford. And then some information about what the charity actually does. Here are its charitable purposes, the relief of poverty and overseas aid, the groups that it purports to help, and the type of activities it engages in. The information I'm particularly interested in is in the documents tab, and way down here. Here we go.
There's a little section called policies and under which are listed the policies this charity says it has. And again, you know, we can use Python to actually embed that webpage as well. So we know where it is visually. There's a documents tab. There's a heading called policies. But again, in terms of the source code, where is that list of policies? And it's under a section called PCG dash charity details, etc. So there's a div tag. The div tag has an ID or a class attribute with this value here. There's a header three. And then there's a series of span tags. The text within contains the policy we're interested in scraping. So let's do this bit quickly. We now know how to request a webpage. It's the exact same as we've done before. And, you know, define a variable capturing the web address, request the web address, and just check if it's successful. Fantastic. It was a successful request. Good. So Python now has the text of that webpage. Now we need beautiful soup. I'll just make this a little bit bigger. Again, we want to take the text that we requested, parse it as HTML. And because this is a more detailed website, you'll get much more information. So, you know, the webpage itself is more complicated. There's more code, as you can see here. It's not oceans of code, but you know, it's a more, it's a more complicated webpage. But again, we can see with Python, we've requested the webpage, and we've parsed it using the beautiful soup module. Okay. So we want to extract the policy information. And this is where it gets a little bit more complicated than we saw with the previous one. So we know that there's a div tag containing the policy information. We know that the div tag is identified using the class attribute and using this value here. The problem is there are multiple div tags with this class attribute. So we can't just use find. We have to use find all. So first, we need to find all the div tags, and then we need to filter through them. 
So this is similar to what we did before. Before, we just used the find method; now we're using find_all. So: find all the div tags that match this condition, find all the div tags where the class equals this value here. And then I'm just going to view how many sets of tags are returned. So there are seven div tags with this class attribute, and one of those, we know, contains the policy information. In case you're wondering what the other ones are: this is one of the sections, so one, two, three, four, five, I think six and seven, and I think this one has a different class attribute. So seven div tags that we need to work through. If we want to actually see the div tags themselves, remember we've stored the results in a variable called sections, so we can take a look at that variable itself. We can see that there is a list of div tags. The first one it found refers to the governing document, and the next one it finds is the one called other regulators, etc. So we've got a little bit more of a challenge, which is basically finding the correct section. So what we do is define a search term. I know that the information I want is contained under a header called policies, so I want to define that as a search term, and then I want to loop through all of the div tags. We have seven div tags, and for each tag I want to check if that search term exists in that tag. So that's what we're doing here: for every section in sections. Again, this could be "for chicken in sections"; it can be called whatever. This is the variable name we defined earlier, but this is just a placeholder, and I can show you an example of what I mean in a second. So, for every div tag, if the search term is found, basically tell me which div tag you found it in and show me the location of that div tag. If you don't find the search term, just keep going through the loop.
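The find_all step in miniature. The class name below is an assumption for illustration; the real value is the one read off the regulator's page source during the session:

```python
from bs4 import BeautifulSoup

# Stand-in for the charity page: several <div> tags share one class
# attribute. The class name "charity-details" is assumed, not the
# regulator's actual value.
html = """
<div class="charity-details">Governing document ...</div>
<div class="charity-details">Other regulators ...</div>
<div class="charity-details"><h3>Policies</h3>
  <span>Safeguarding policy</span><span>Risk management policy</span></div>
"""

soup_response = BeautifulSoup(html, "html.parser")

# find() would return only the first match; find_all() returns every match
sections = soup_response.find_all("div", class_="charity-details")
print(len(sections))  # → 3 here; 7 on the real page
```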
So, for each div tag, check if the search term is found; if not, move on to the next div tag. It's probably easier if I just show the results of this. So I created a new variable here called policy_section, and it contains the div tag I need. That piece of code up here loops through each div tag, and when it finds the correct one, it creates a variable here containing the information. So I'll just spend a quick moment explaining what I mean by looping through a list. Let me just exit slideshow mode for a second and create a new cell. Yeah, here we go. So I'll show you really quickly how we can use a list. Let's say I have a list of fruits. So let's create a very quick list of values: let's say apple, pear, and I might misspell this one. There we go. Okay, so I've created a new list called fruits, and I can have a look at its contents. You can see this list has three items; three words are contained in this list. So if I wanted to figure out which fruit is listed first... now, I know I can visually see that, but imagine the list had hundreds and hundreds of items. I could say (let's comment this out) fruits.index("pear"), and that returns the value one. What I've asked Python is: tell me where in the list the value pear occurs. And that might seem like a strange result, that it says one, even though I can visually see it's the second element in the list. The thing with Python is that it starts counting from zero. So zero is the first item in the list, one is the second item, two is the third item, etc. That's where Python gets a little bit tricky. So, for example, where is apple in the list? It's at position zero. And this is what's going on here: we're looping through each of the div tags, and once the loop finds the correct search term, it finds the location in the list where that search term is found.
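The live list demo, and the same idea applied to the sections, looks roughly like this (the third fruit and the section strings are stand-ins):

```python
# The live list demo: Python counts positions from zero
fruits = ["apple", "pear", "banana"]  # the third fruit is assumed
print(fruits.index("pear"))   # → 1, the second item
print(fruits.index("apple"))  # → 0, the first item

# The same idea applied to the scraped sections: loop until the search
# term is found, then pull that section out by its position in the list
sections = ["Governing document ...", "Other regulators ...", "Policies ..."]
search_term = "Policies"
for section in sections:
    if search_term in section:
        policy_section = sections[sections.index(section)]
print(policy_section)  # → Policies ...
```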
So for us, it's found at position five, so it's the sixth div tag in the list of div tags that we downloaded. And then we extract that div tag based on its location. I realise that's a slightly trickier element that we didn't need previously, but now we know the correct div tag. So we now have a variable called policy section, and we know that it contains the information we need. OK, so now we have the correct div tag, similar to before. We know that that div tag contains span tags, and each span tag has the policy information. So again, we want to loop through each span tag: for every tag that we can find, we want to extract the text and save it in a variable. And then what we want to do is combine the charity name with the individual policy and append this information to a blank list up here. Again, once we run it, I think it will show much more clearly what I mean. So it gives me what looks like a long format data set. If you can imagine a survey that repeatedly surveys the same people: a row is not uniquely identified by Oxfam, because Oxfam appears multiple times; a row is uniquely identified by the combination of the charity and the policy in question. That's what I mean by long format. So you can see we've gone through each individual span tag, we've picked out the text and stored it in a variable here, we've combined the charity name and the policy, and we've added all of that into a blank list up here. So that's really good: we know in Python that we have the information we need. As social scientists, and researchers in general, we probably want to actually put that data into a file, into something a bit more friendly that we can load into SPSS or Stata, for example, if that's what you're using. So we want to save the results. The new thing we're going to do is use the pandas module.
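The span-tag loop described above looks roughly like this. The policy_html snippet is a made-up stand-in for the div tag we extracted from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the policy section scraped from the page
policy_html = "<div><span>Safeguarding policy</span><span>Complaints policy</span></div>"
policy_section = BeautifulSoup(policy_html, "html.parser")

charity_name = "Oxfam"
data = []                                # blank list to collect the rows
for span in policy_section.find_all("span"):
    policy = span.text                   # extract the text inside the tag
    data.append([charity_name, policy])  # one row per charity-policy pair

print(data)  # long format: the charity name repeats on every row
```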
So pandas is similar to the tidyverse in R: it provides methods for working with data sets in Python. So what we want to do is use the pandas module, use the DataFrame method, and create two columns, so two variables, called charity name and policy. And under each of these columns we want to put the list that we created earlier. Again, it's probably much easier to see this in terms of the results. So now we've created something that looks a little bit more like what we're familiar with: a data set. Row one, we've got the charity's name and the specific policy itself. So now we want to take that data frame that we've created, and we want to save it to a file just like we did earlier. Exact same process: a variable that stores the location and the name of the file we want to save to. And this time we don't have to open the file and write a couple of lines of code; we've created a data frame and we can use the to_csv method. What are we doing? Taking the data that's stored here and putting it into this file. And right here, I'll just do the same steps as before: check whether the file was created, and check whether the actual contents exist also. So I'm going to list all the files and folders, and we can see here, yep. And hopefully, again, you'll believe me that if you go back in the code to when we ran that command previously, you'll have noticed that that file didn't exist. So again, I'm not tricking you; it's not all set up to work. This is live coding: we are scraping things and saving things in real time. So we know the file exists. And again, we can use the pandas module. It's got a read_csv method: go to this file here, read in the data. And these are just some options about how to read it in; you don't need these. And here we go: read in this file and store the contents in a variable called data.
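The pandas round trip described above, as a runnable sketch. The column names match the demo; the file name and the use of a temporary directory are purely for illustration:

```python
import os
import tempfile

import pandas as pd

# The two lists we built while looping through the span tags
charity_names = ["Oxfam", "Oxfam"]
policies = ["Safeguarding policy", "Complaints policy"]

# One column per list
df = pd.DataFrame({"charity_name": charity_names, "policy": policies})

# Save the data frame to a CSV file, then check the file was created
outfile = os.path.join(tempfile.gettempdir(), "charity_policies.csv")
df.to_csv(outfile, index=False)
print(os.path.exists(outfile))  # True if the save worked

# Read the contents back in, just as you would at the start of an analysis
data = pd.read_csv(outfile)
print(data.shape)  # (rows, columns)
```

index=False simply stops pandas writing its own row numbers into the file, which keeps the CSV clean for loading into SPSS or Stata.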
And here we are back where we began a moment ago. Voila! So what have we learned? I was going to say, what have I learned? What have we learned? Excellent. So those were two fairly involved examples, so thanks for sticking with me. I always say it's half an hour, but it's always a little bit longer. But I think you have learned some really practical skills. Firstly, you've learned how to import modules into Python. This is crucial for computational social science work: you'll need to download and install modules on your machine and then import them into Python. Thankfully, when you download Python itself, most of the modules we've used come as standard; you just have to import them into your Python session each time. Again, if you've used R, this is similar to the library command for loading in functions. We've learned how to request and parse web pages. Two steps: can we actually request the data we're interested in, and then can we parse what we've requested as an HTML file? BeautifulSoup, no idea why it's called that, is the module you use to parse the contents. And then we've done the non-trivial, slightly boring but really important task of reading from and writing to files. This is a task you're going to engage in a lot doing computational social science. And hopefully you've learned as well how to structure your code and do this in an efficient and clear manner. Maybe I haven't gotten it across in a very clear way, but hopefully the way the code is written is concise, and there are some helpful comments telling you what's going on. So this was very focused on the live demonstration, and we do have a web scraping series of webinars. We did three, and there are a couple of notebooks supporting what we've done where we go into much more detail about the ethics and the limitations of web scraping. So for today, we just focused on the practicalities.
But with great power comes great responsibility. So you have to start thinking about things like data protection and website terms of service. What are you actually allowed to do with the website? The terms of service are actually a contract between you and the website, so it's important to read them. The Charity Commission website we used today makes its information available under the Open Government Licence, so we're allowed to scrape that data, which is good. But there are lots of murky ethical issues. Should you scrape personal information about people, for example? This is really interesting: here are the 12 trustees on the board of Oxfam UK. Now, we don't get their age or date of birth or their address or whatever, but we do get the other charities they're on the board of, so we could maybe start to pin down who these people are exactly. So there are some ethical issues as well about scraping data. But it's usually exciting; it's usually a powerful means of collecting data. So that's the brief lesson; hopefully it's whetted your appetite. And as I said, we've got some free training materials that we did a couple of weeks ago. I'm sure some of you were probably on that, so thanks for coming back. Again, we've got three webinars that you can watch, and we've got some Jupyter notebooks as well that go into much more detail. So this websites one covers a lot of the ethics and limitations: the value of web scraping, its limitations, ethical considerations. And there's an example where we capture COVID-19 data as well. I quite like this free book, Automate the Boring Stuff with Python. That's how I learned a lot of my Python abilities a couple of years ago. Chapter 12 is quite good on web scraping; it covers a lot of what I've covered, but there are a couple of extra cool techniques as well. And it's actually really good for working with files.
So it's really good for using Python to work with PDFs and Google spreadsheets, and there are some interesting things if you want to automate the sending of emails using Python, which is quite interesting also. So that is the 30 minute demonstration elongated to 45 minutes, so apologies, but hopefully you found it really useful. I can see a long list of questions. My lovely assistant Julia has been tackling those, but now I'm going to jump in as well. So I'm going to take the first questions. "What about less static pages?" Yes, OK, so there's a really good question here. There's a difference between a static and a dynamic web page. This is a static web page. What that means is that when you request it, all of the information in that web page is shown to you at the same time: all of the source code is loaded in when you request the web page. Staying with the charity field, here's the Republic of Ireland Charity Regulator, a different website. Some web pages are known as dynamic web pages. What that means is that when you request the web page, some of the information is loaded in, but then as you click through the web page, additional information is brought in. And I'm sure we can think of examples where, if there's a search function, for example, when you click on search, a list of results magically appears underneath the search button. That's an example of a dynamic web page. So there's an Irish charity called Goal Sports Charity. Watch what happens to the web page here. See that? Some of the content is being dynamically loaded in. So when I initially request the web page... I'll do it again just to show you, let's do it one more time. I'm going to look for this charity. Cool, I want to scrape its data. It will request the web page, but then it loads in some extra information. And this only occurs because I'm using a web browser.
So if I was to use Python, it wouldn't load in that extra information. Sorry, this is a long-winded way of saying: yes, there are ways of using Python to collect data from dynamic web pages. There are basically two approaches. One is you bypass the web page itself and try to connect to the database that is providing the dynamic information. That's possible with the Republic of Ireland charity regulator: there's actually an online database the web page is pulling information from, and I can bypass the web page and go directly to the database. The second approach is to use a module called Selenium in Python. Basically, this mimics the launching of a browser, so it tricks the web page into thinking you're actually working with a browser and clicking on elements on the page. So there are two solutions for working with dynamic web pages. It's not particularly difficult; I suppose it's just recognising that it is a dynamic web page, and then you have to find a solution. Cool. Oh, and yeah, Automate the Boring Stuff is brilliant. Yeah, it's an online course as well. Fantastic. And the actual book... it's not free as a PDF. Perfect. OK, so there's an extra question here. If you're interested in UK government data stored on web pages, as you'll probably notice, the UK government website is highly consistent: all the charity data, all the web pages look very similar. So is there a way of easily automating the scraping across many sites with different HTML structures? And yeah, if the underlying source code is quite different from page to page, then you have to tailor your code to each specific website. For example, if I was to search for a different charity here, let's just click the Arts Foundation here, we can see its web page looks the same.
So I know that I can loop over a list of charity numbers, request the web page, and I'll get the same elements every time in the web page, and I can extract the same information. If the page looked different, then yes, I suppose your code does have to be different for each individual web page. There are probably elements that will remain the same: the process of requesting is the same, and the search term you're looking for might be the same. But no, if you've got differently structured web pages, you need code tailored for each individual web page. We've got a very cheeky question: yes, you could scrape the Automate the Boring Stuff website if you wanted, I'm sure you could. That's quite clever. And then, as someone's pointed out, with the UK government website there's an online database known as an API that you can connect to instead, so that's really good. Any other last questions? Oh, OK, yeah, a very, very good question: how would I go about scraping data from multiple links? So let's say a web page has a list of companies, and each company has a link to further information. That's a really good question, and it's really doable. So for example, let's try to do it really, really quickly. So let's request Oxfam's web page. We're here, we've done all that. So let's request Oxfam's web page and parse it as HTML. It might take a while; thankfully, I've run it on my machine already, so it was good to have a backup. So basically what you're doing is asking to find all of the A tags instead of the P tags. An A tag, an anchor, is a link in HTML. So let me show you how that's done. Let's parse the web page, and then let's find links instead. So let's create a new variable called links, and then I want to find all the A tags. And I'm not interested in the differences between A tags just yet; I'm just trying to find all the links.
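Finding links rather than paragraphs is the same find_all call, just on "a" tags. The snippet below is a made-up stand-in for the charity page, and the href values are hypothetical:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page source; real pages mix internal links,
# survey links, and links out to other sites, just like this.
html = """
<a href="/feedback-survey">Give feedback</a>
<a href="/charity/12345/contact">Contact details</a>
<a href="https://example.com/companies-house-record">Companies House record</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                 # every anchor tag on the page
hrefs = [link["href"] for link in links]   # the URL sits in the href attribute
print(hrefs)
```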
So as you can see here, instead of scraping paragraphs or sections, I've found all of the links on the Oxfam charity web page. A lot of these links aren't necessary: this is a link to a survey that the regulator wants you to fill out, and there are lots of internal links within the web page. But we've got a link here. On the Oxfam web page, there's a link to its record on Companies House, which is the company regulator. So that's information beyond the charity record itself. Let me just show you again. Perfect: so we've found a link that looks like this, and it corresponds to this link here, which takes you to Companies House, the company regulator's website. Voila. So yes, if you want to scrape links from a web page, that's easily done as well. Now that we have a list of links, I could say: for every one of these links, extract the information contained in the href attribute. And for each of those links, I would use requests.get and request each of those web pages in turn. So yeah, somebody's just said it: you can write some code that loops through a list of links. And in the Web Scraping for Social Scientists course that we did recently, we show you how to do that, and we show you how to write loops, which is good. So we're coming up to an hour. I'm happy to stay on for another couple of minutes if there are more questions. You can contact myself and Julia about anything to do with this work. Where do I put my contact details? We both work at the University of Manchester. I think our details are on the Twitch page itself, but if you're just looking for me, you just need to know how to spell my very Irish name and then you can find me at Manchester. And yeah, Julia has just posted our Twitter links; we check our Twitter fairly often.
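Here's the "loop through a list of links" idea sketched out. The base URL and hrefs are hypothetical; urljoin (from the standard library) turns relative links into full URLs, and the actual requests.get calls are left as comments so the sketch runs without touching the network:

```python
from urllib.parse import urljoin

base_url = "https://example.org/charity/oxfam"  # the page the links came from
hrefs = ["/trustees", "https://example.com/record", "contact"]

# Relative links need the site's address prepended before they can be requested
urls = [urljoin(base_url, href) for href in hrefs]
print(urls)

# Then request and parse each page in turn, exactly as before:
# for url in urls:
#     response = requests.get(url)
#     soup = BeautifulSoup(response.text, "html.parser")
```

Note that urljoin leaves links that are already absolute, like the example.com one, untouched.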
So yeah, if you can find my email address, I'm more than happy to discuss your individual scraping projects and what you're interested in doing as well. So on that note, I'm going to say good evening. Thanks for joining us, thanks for giving up your time, and thank you to Julia as well, who's been fantastic. Next week we'll be looking at how you collect data from an API. So instead of scraping a page, there's an online database: a formal means of getting the data, but the code is slightly different. So I've been Dermot, you've been great. Thanks very much, and I will see you soon.