Good afternoon and welcome to our webinar on computational methods for collecting data from the web. This is part of a larger UK Data Service training programme focused on new forms of data that social scientists may find productive in their research and teaching. Among the upcoming webinars we have Getting Data from the Internet and two more on web scraping, and we have some past webinars that focus on agent-based modelling, which is a simulation-based method for social scientists. First, I hope everyone is well under the current circumstances, and thank you to those joining us live during this challenging period. If you have any questions during the course of this, as Gillian said, type them into the question box and at the end I'll read out every question and answer it to the best of my ability. You may have noticed that, due to an understandable scheduling conflict, Professor Alistair Rutherford is unable to join us. Luckily, as you may have noticed, I am one of the co-authors on the paper we will be discussing, so you are in capable hands.

The aim of today's webinar is to show you how we employed web scraping techniques to generate a new, linked administrative data set to evaluate a regulatory intervention. Specifically, we looked at the causal impact of attempts by the Fundraising Regulator to incentivise charities to contribute to the cost of regulation. This is quite novel in the charity sector: usually government funds regulators, but in this case the charities were funding the regulator, and we wanted to test how effective that incentive was. You can get the recently published paper at this DOI, and the materials underpinning the paper, that is, the web scraper and the data analysis scripts, are all available in a public repository also.

Let's take a look at what we're going to talk about today. Of course, we'll describe what web scraping is as a social science research method. We'll discuss why it's valuable for social science research and, specifically in the context of our paper, we'll see what problem it solved. We'll also look at how we actually implemented it in practice: which programming languages did we use, and how did we actually run the script? And, most importantly, we'll look at the results it produced for our social science project. We'll also make some general reflections and critical points about web scraping itself as a social science research method, and I'll answer your questions and point you to some further learning and teaching resources in this area also.

So what is web scraping? Firstly, it's a computational technique for capturing information stored on a web page. Computational is the key word here, because it is possible, as I'm sure you're aware, to manually collect data from a web page: you can highlight the words and paragraphs, and you can right-click on an image and save it to your machine. However, performing this task manually carries some considerable disadvantages, as I'll show you in our specific example. Web scraping is generally implemented using a programming script. By that I mean some executable code written in a programming language, so R or Python, Java, C#, etc. There are software applications on the market that you can use to do the web scraping for you, and in a future webinar we'll show you how you can use Excel to perform a web scraping task, but to have full flexibility and power, you're going to use a programming language and write your own code.
It is relatively simple to implement web scraping techniques using open source programming languages. You do not need to be highly computationally literate and you don't need to write screeds of code. Web scraping is quite a common task in computational social science, with lots of documentation and lots of examples for you to learn from. So if you're feeling apprehensive at the outset, web scraping is a great entryway into computational social science.

I'll show you an example from our research area, which is the charitable sector here in the UK. We have a regulator of charities, the Charity Commission for England and Wales, and every month it provides a data extract. This is a census of all registered charities, with their financial information and lots of other organisational characteristics. Now, that data download is useful and it is used for productive research on the sector, but there is an abundance of extra information available elsewhere. So I can show you. For example, the Charity Commission has a public-facing website. I'm just looking at a sample charity here, which is Oxfam; I'm sure it'll be a familiar organisation. We have an overview page. We have some financial information. We can see the charity's documents, the annual reports that it submits. We can get a list of trustees, and we can see where that charity operates globally.

Now, why do we need this website? Well, in the data extract we do have a lot of the information that's available here in the overview: we can get the charity's unique ID, we can figure out what type of activities it undertakes, and we can understand who the charity helps also. However, some really valuable information that we might be interested in would be the trustees of the organisation. As we can see here, it also lists which other organisations a trustee is on the board of, and that's a fantastically interesting piece of information: we can look at networks of trustees, what the networks look like amongst the most successful charities, charities that get into trouble, etc. The financial information is also available in the data extract, but the actual documents themselves are not. So if we wanted to perform some text analysis of how a charity reports on its annual performance, we would need those documents themselves.

So we've got a public website and we've got a data extract, and if we want to build a complete picture, a complete understanding of the charity sector, we need to collect data from multiple sources. As I said, we could collect this data manually: we could go to Oxfam's page, right-click on each of these links to the PDFs, download them to our machine, and do some content analysis or some sentiment analysis. But what if we're not just interested in Oxfam, and we're interested in all humanitarian charities in England and Wales, or all 160,000 registered charities in England and Wales? Then we need a different approach, and that's where web scraping comes in.

So why collect data from the web? I've hopefully given you some reasons to do it. Web pages themselves are a really important source of publicly available information on social phenomena of interest. Unfortunately, the current COVID-19 public health crisis is an excellent case in point: the UK government has a dedicated website with daily updates on actions individuals can take, and the World Health Organization has a very informative website and also produces daily PDFs known as situation reports.
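As a taste of how that manual download step can be automated, here is a minimal sketch in Python, assuming the requests library; the URL and file name are placeholders rather than real document locations:

    import requests

    # Hypothetical example: download one annual report PDF. The URL is a
    # placeholder, not a real document location on the Charity Commission site.
    pdf_url = "https://example.org/oxfam-annual-report.pdf"

    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # stop here if the request failed

    # PDFs arrive as raw bytes, so write the file in binary mode
    with open("oxfam-annual-report.pdf", "wb") as f:
        f.write(response.content)

Run in a loop over many document links, this is exactly the scale advantage over right-clicking and saving each file by hand.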
So web pages themselves do contain very rich information that is valuable for research and that we might want to get hold of. And as I've shown, web pages can store a range of different data types, including files, text, photos, videos, lists, etc., all of which may be collected and marshalled for research purposes. And finally, once collected, the data can be reshaped into familiar formats. Data that's stored on a website will be in an unfamiliar format, something called HTML, but it is possible using programming techniques to reshape that into a CSV file, for instance, and then link it to other sources of social science data.

So the research problem that we thought web scraping would help us solve concerns the Fundraising Regulator for England and Wales, the regulator of fundraising activities. If a charity operates a shop and sells goods to raise money for the organisation, or if you give money by direct debit to a charity, those activities are now regulated by the Fundraising Regulator. This is a fairly new development, since 2016, and it came in the wake of some prominent fundraising scandals that some of you may remember: there was aggressive targeting of donors, and there were questionable data-sharing practices amongst the larger charities. So there was a clear need for regulation in this area. As I said previously, the regulator is partly funded by the charities and organisations it oversees. This is quite novel globally for the charity sector, and of course it opens up opportunities for regulatory capture also.

Now, organisations are not compelled to pay; there is no legal mandate for charities to cover the cost of regulation. However, the regulator does expect certain types of organisations to pay a fee: those that spend £100,000 or more on raising funds for their organisations. If you think of the largest, most successful fundraising charities, Cancer Research UK or Macmillan for example, they spend quite a lot of money, justifiably, raising funds so that the charity can carry out its activities. What the regulator did, for those charities that it expects to contribute, was threaten to name and shame those organisations if they didn't pay this voluntary fee. So they were trying to put some pressure on certain organisations, to incentivise them to pay that fee. In our research project, we were interested in evaluating how successful this attempt to incentivise payment of the fee was, which presented us with our core challenge: collecting data on which charities have and have not paid this fee.

Our first method was requesting data from the regulator directly, and they were very helpful: they provided us with a PDF file of all the charities that had paid the fee at the time we made the request, which was a couple of years ago. However, the list already contained over 1,000 records, which would require considerable manual input to extract the list of charities from the PDF and put it into a more tabular format, into a CSV file. It also meant that our list was essentially out of date as soon as it was transferred to us. Therefore, we needed a different method to get hold of the data. Now, the Fundraising Regulator operates a public website, and it lists a directory of all the charities that have paid this voluntary fee. They also list organisations that are expected to pay the fee but haven't, and these are listed in red.
So, for example, alphabetically, we can see these are charities here in the UK who have paid the levy, and they're across a range of different sizes: there are small charities and there are large charities. Aberystwyth University here has paid the levy fee as well. The way this is structured is that we have multiple web pages listing the charities that have paid the fee, and each individual charity has its own page also. For this organisation here, Above and Beyond, we can see it's a large charity and it's paid. The crucial second piece of information we need is its registered charity number, because that allows us to link back to the open data made available by the Charity Commission for England and Wales. So the problem is relatively simple to map out: we have a list of web pages, and on those web pages there are certain pieces of text information that we would like to scrape, export into a more usable format, link to existing data, and analyse.

So let's take a look at the logic and skills behind web scraping as a general technique. We begin with a web page that contains information we are interested in collecting. Therefore, we need to know the following. First, we need to know the location, also known as the URL or web address, where the web page can be accessed. As I showed you previously, the directory can be accessed at the fundraisingregulator.org.uk/directory web address. Once we know that, we then have to locate the information on the web page that we are interested in collecting. So where on that directory page was the information about whether it was a large charity, whether it had paid, and its charity number? We need to find that information as well.

Once we know those two pieces of information, we need to do the following. We need to request the web page using the URL. Manually, that's equivalent to you going into a web browser, pasting in the URL and pressing enter, and the website should then appear in your browser; but we need to do that using a programming language. Once we request the URL, we need to parse the structure of the web page so your programming language can work with its contents. As I said previously, a web page is written in a language called HTML, which uses tags to delineate the elements that make up a page. For example, there are tags that identify tables on web pages, and there are tags that identify paragraphs, links, images, etc. So you need to tell your programming language that you're working with a website; once it knows that, it can navigate the tags and extract the information you are interested in. For us, with the Fundraising Regulator, we are looking for a paragraph tag, a <p> tag, and we want to extract the text in between those tags that gives us the registered charity number and its levy payment status. And finally, we want to write this information to a file for future use: we're going to extract the text, format it in a tabular format, and then export that, saving it as a CSV file or an Excel file or a .txt file that we can use at a later date.
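To make those four steps concrete before we look at our script, here is a minimal sketch in Python, assuming the widely used requests and BeautifulSoup libraries; it illustrates the general principle rather than reproducing our code:

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Step 1: request the page (the code equivalent of pasting the URL
    # into a browser and pressing enter)
    url = "https://www.fundraisingregulator.org.uk/directory"
    response = requests.get(url, timeout=30)

    # Step 2: parse the HTML so the tags can be navigated
    soup = BeautifulSoup(response.text, "html.parser")

    # Step 3: extract the text between tags; here, every <p> element
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    # Step 4: write the extracted text to a CSV file for later use
    with open("directory_text.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for text in paragraphs:
            writer.writerow([text])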
With these general principles in mind for any web scrape, specific to us, we needed a programming script that would do the following: iterate, or loop, through a list of web pages and extract the charity number and its levy status. So we needed the unique ID of an organisation and whether it had paid or not. We also needed to perform this task on a routine basis: we wanted to schedule this programming script to run monthly so we could get up-to-date lists of who had paid the levy. So we can take a quick look at the code itself and, as I said, we will be releasing this code in a public repository quite soon. Let me make that a bit bigger for you.

What we decided to do was write a web scraper in the Python programming language. Python is a general-purpose, easy-to-learn, open source programming language. By general purpose, I mean you can use it to create a web scraper, create a website, build an app, host a website; it's a general-purpose programming language. It's easy to learn because it's high-level, and what that means is that it's English-language based in terms of its variable names and its function names; how you write it is a bit more like how you would write English, which is good. And it's open source: it's free to download and free to change as well, if you're so inclined. Our scraper specifically, as you can see up here, only requires three separate packages or modules, so it doesn't require downloading lots of additional packages; it uses a set of core Python packages to perform this task.

Here are the pieces of information we need to know: the website and where the information can be accessed from. What we do next, once we know the URL that we're scraping, is some project administration: we create a new output file, a new CSV file containing these headers, so the charity ID, the charity name, what kind of regulatory type it is, etc. So at the beginning of our scrape we set up an output file that we can write all of the scraped information to. Once we open that file, we construct the web address, that is, a base URL plus the web page we're interested in scraping. As you can see on the Fundraising Regulator's website, page one had 20 or so charities, page two had 20 or so charities, etc., so we loop over those page numbers. This very small piece of code here is the command that actually requests the web page; again, that's equivalent to you going into your browser and saying, give me this web page. Python does that for you in a very simple, concise way. Then we need to parse the web page: basically, here we say take the web page that we just requested and tell Python that it's an HTML structure, and once it's an HTML structure we can start accessing the tags and the information that we need. This will be shared with you at a future date, so we don't need to go into it in too much detail, but essentially we're going through elements in the page: we're looking for an element called "charity title", and then we're scraping the text from that element, which gives us the charity name, for example. This is doing something similar: if we find a list, if it's a registered charity, we cycle through the list and pull out the information we need. And right here at the very end, once we've collected all the information we need, we use a writer command in Python and export all that information to the file we created earlier. What we get at the end is a CSV file with one row per organisation that we find on the web pages, with its charity ID, its charity name, and whether it paid the levy or not.
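Pulling those pieces together, here is a hedged sketch of a scraper with that overall shape, again assuming the requests and BeautifulSoup libraries; our actual script may use different modules, and the page count, the "page" query parameter, the element class name and the column headers below are all placeholders rather than the real values:

    import csv
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.fundraisingregulator.org.uk/directory"

    with open("levy_payers.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["charity_id", "charity_name", "levy_status"])

        # Loop over the numbered directory pages; the page count and the
        # "page" query parameter are illustrative, not the site's real values
        for page in range(1, 26):
            response = requests.get(BASE_URL, params={"page": page}, timeout=30)
            soup = BeautifulSoup(response.text, "html.parser")

            # "charity-title" is a placeholder for whatever element holds
            # each charity's name on the page
            for entry in soup.find_all(class_="charity-title"):
                name = entry.get_text(strip=True)
                # in a real script, the charity number and payment status
                # would be located near this element and extracted similarly
                writer.writerow(["<charity-number>", name, "<levy-status>"])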
So I realise it's better, obviously, to run this code, so in future webinars we will be showing you how to actually execute it and you can follow along. For now, while we're tidying up this project, we said we'd give you a look at the code and how we did it, but you'll also be able to see for yourself quite soon what we did: we saved and shared all of this work on a publicly available repository, which I'll show you in a moment as well.

So what was the point of all of this? Well, we had a social science research project and what we thought was an important question; the whole point was to generate data that could produce some insightful results. We took that scraped data in the CSV file and linked it to financial data from the data extract I showed you earlier, and this gives us about 4,000 organisations on which we have detailed financial information. These are the 4,000 for which we know exactly what they spent on their fundraising costs for a given year.
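The talk doesn't say exactly how we performed that linkage; as a minimal sketch of the idea using the pandas library, with hypothetical file and column names:

    import pandas as pd

    # Hypothetical file and column names: both files need a common
    # registered charity number field to merge on.
    scraped = pd.read_csv("levy_payers.csv")            # from the web scraper
    extract = pd.read_csv("cc_financial_extract.csv")   # Charity Commission data

    linked = scraped.merge(
        extract,
        on="charity_id",  # the registered charity number
        how="inner",      # keep organisations present in both sources
    )

    linked.to_csv("linked_levy_data.csv", index=False)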
What we did then was exploit the sharp threshold in fundraising expenditure: remember, the regulator said if you spend £100,000 or more, you're expected to contribute to the cost of regulation. That gives us a research design known as regression discontinuity, which is a quasi-experimental research design. Regression discontinuity is suitable when a cutoff point is used to allocate units to a treatment or intervention. Here, £100,000 was set as the threshold: if you're above it, you're expected to pay; if you're below it, you're not expected to pay, though of course you can if you want. What makes it a quasi-experimental research design is that charities could not manipulate precisely which side of the threshold they fell on. Very wisely, the Fundraising Regulator decided to use the 2014 set of accounts to set the threshold, so charities could not go back two years into the past and amend their accounts to ensure they didn't fall on the £100,000 side of the threshold. What that means is that charities close to the threshold are essentially very similar, with one difference: some fell on one side and some on the other. For example, an organisation that spent £90,000 on fundraising is very similar to one that spent £110,000, with the crucial exception of the latter being expected to make a voluntary contribution and the former not. We conducted the analysis in Stata using the rdrobust package by Calonico et al., and I encourage you to check that out; it's a very good package and very well documented. So you can see that in this project we've combined Python and Stata quite easily, without much extra work at all.

So what kind of findings did we produce? Well, first, there's a clear correlation between what you spend on fundraising and the probability of paying the voluntary fee. If you look at the x-axis, if you spend between £0 and £100,000, the probability of paying that fee is about 20%. So about 20% of organisations within that fundraising expenditure band paid the fee, even though they were not subject to being named and shamed by the regulator. But we can see that at the threshold there's a considerable jump, or discontinuity, in the regression line: above the threshold, about 60 to 65% of organisations paid the fee, and it's that discontinuity between the two regression lines that gives us our estimate of the causal impact of the threat to name and shame.

If the threat were ineffective, what we would see is the trend line to the left of the threshold carrying across the threshold and onto the right-hand side of the x-axis. Instead, we see a big jump just at the threshold, which corresponds to the regulator's threat to name and shame. So we do get some quite convincing evidence that the regulator's threat to name and shame was very successful in getting charities to pay this voluntary fee.

And how successful was it? Well, we can use the results of the regression discontinuity to make predictions of what would have happened if they hadn't threatened to name and shame. The top grey line is exactly what they raised from charities: just shy of £1.7 million in the first year, and we can see that the probability of paying increases the more you spend on fundraising. If we made a very naive assumption, taking the roughly 20% probability just to the left of the threshold, where the line cuts the vertical line, and carrying that flat across, we would have expected the regulator to raise about £400,000. That's obviously unrealistic, because we know that the probability of paying increases with how much you spend. So if we instead carry the upward trend line on the left-hand side over to the right, we would have expected the regulator to raise just shy of £800,000. We therefore conservatively estimate that the threat to name and shame covered about half of the regulator's budget: without it, they would have had half the money available to them for regulation in the first year.

That introduces lots of interesting substantive reflections. Is this an approach that will be successful going forward? If every year the regulator has to say you can be named and shamed for not contributing to the cost of regulation, I'm not sure that's a successful long-term approach, but as a one-off intervention it does look to be very successful. Charities do seem to be concerned with their reputation; therefore they will pay this voluntary levy to avoid negative reputational damage.

So let's summarise some of the pros and cons of web scraping, in our context but also more generally for social scientists. One of the major pros, and it's probably difficult for you to divine from just me showing you some code, but as you'll see when you actually get to interact with code in future webinars, is that it is a skill that's relatively easy to learn: web scraping can be performed in 30 to 40 lines of code, all of which is very accessible and available online. Web scraping is also a very easy method to routinise or automate. This is particularly important when data are continuously updated. In our example, if a charity stopped paying the levy, it would simply disappear from the website, and we wouldn't have a longitudinal data set capturing the charity's participation; it just would not appear on the website anymore. So for us, we scheduled the script to run every month. If it were particularly important, if we were talking again, unfortunately, about the coronavirus health crisis, which is incredibly fast moving, you might have to run daily versions of your web scraper as well. It's not too difficult to do that, and we can talk about approaches for scheduling programming scripts.
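One common approach on Linux machines, including the Raspberry Pi we'll mention in the questions later, is cron. An illustrative crontab entry (not our actual setup) to run a scraper at 9am on the first day of every month might look like this, with the script path as a placeholder:

    # m  h  dom mon dow  command
      0  9  1   *   *    /usr/bin/python3 /home/user/levy_scraper.py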
Just thinking about social science in general, there are lots of public sector bodies and lots of charitable bodies that share data through their websites. So if you want to get hold of annual reports or monthly statistics and figures, you're probably going to have to go to a website, and you probably won't be lucky enough to go to a data portal and get some nice, clean, formatted open data ready to be imported into Stata or SPSS or R, for example. While that poses a challenge, the good thing is that there's incredibly rich information out there stored on websites that can be productively used in your research. It's also quite easy to format data scraped from a web page. It looks messy, it looks like paragraphs and images and tables, etc., but as our example showed, it can be easily formatted to permit data linkage. You might be interested in local authority statistics: you could go to your local authority's website, scrape some of that information, and then it might be possible to use a postcode field to link to a large-scale social survey, for example.

Of course, it's not all singing and dancing. There are considerable ethical issues, particularly around personal data. As you saw in our example, we can scrape information about trustees, and that is publicly available information, but thanks to GDPR, even if it's publicly available, as soon as I scrape and download personal data I have a responsibility for how it's processed. That brings me into the realm of data protection, and of course getting ethical clearance for doing such research. Web scraping may also contravene the terms of use of a website. Lots of websites don't want you scraping; they might not explicitly stop you, but it'll be in the terms and conditions of using the website: you may not scrape this information, repackage it, share it onwards, etc. So you do need to be quite diligent, and you do need to read the terms and conditions of a website to avoid getting into trouble.

Probably another obvious con is that if the structure of the web page changes, your script can stop working. Actually, currently with the Fundraising Regulator, they've just made a change and we've had to update the script, hence why it's not available on the public repository just now; we need to get it working again, because it's been two years since we first collected the data. Some websites are also quite up to speed and can actually blacklist you, banning you from scraping the data. Your computer has a unique ID when it connects to the internet, known as its IP address, and a website can simply block your IP address and not even let you request the page, never mind interact with it; you won't even see the web page. Some websites are advanced enough to do that. Others may allow you to scrape the web page but put some kind of throttling in place, where you can only request the web page once every 30 seconds, or you may only be able to interact with it in a certain way.

Web scraping also requires good internet connectivity for extended periods of time. There are ways around that, but I'm sure, like mine, your university laptop is tightly controlled by IT services, and it can be difficult to leave it open for more than 20 minutes before it cuts into sleep mode, for example. There are ways around this, you can have your own personal server, for example, but it's worth bearing in mind that if you have a very intensive, long-running script, you need to have a think about your computer setup as well.
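On those last points about terms of use and throttling, here is a minimal sketch of two defensive habits, assuming Python's standard robotparser module and the requests library; the URLs are placeholders. Checking robots.txt is not a substitute for reading a site's terms and conditions, but it is a useful first signal:

    import time
    from urllib import robotparser

    import requests

    # Ask the site's robots.txt whether scraping this page is permitted
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.org/robots.txt")
    rp.read()

    url = "https://www.example.org/directory"
    if rp.can_fetch("*", url):
        response = requests.get(url, timeout=30)
        # pause between requests so we don't hammer the server and risk
        # our IP address being blocked or throttled
        time.sleep(30)
    else:
        print("robots.txt asks us not to fetch this page")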
So thank you very much for joining us; I really appreciate you giving up your time, particularly at a period like this. I can see some questions coming in, so I'm going to read those out one by one, because I'm guessing you can't see them.

So the first question: is there an alternative package in R? Yes, there is. That's a really good point: you should be quite agnostic about the programming language you want to use. If you're an R user, absolutely, you can do all of this in R; you can do all of your text mining in R, and you can do all of your connecting to the Twitter API in R. For us, we began with Python, so we've continued to use it, but the techniques are very similar. So I'm not advocating Python; it works well, but if you want to use R, you absolutely can.

We had a question as well about which part of the code was for scheduling. That's very good: the snippet I showed you didn't have the scheduling code. What we did is we had a slightly more advanced version of the script, and that script is kept on a Raspberry Pi, which I'm sure some of you have heard of: it's basically a very small, portable personal computer. You can leave a Raspberry Pi on all day, every day, so that's what we did: we moved the script to the Raspberry Pi and scheduled it to run once a month. On the side of this, we also had a Twitter bot, which after every scrape would post which new charities had just paid the levy. So you can also connect your script to Twitter or Facebook updates if you want to do some kind of knowledge exchange or public information work using your scraped data as well.

We have a question about to what extent the speaker relies on UK copyright law to scrape, reuse and store the data. That's a very good question. As I've mentioned, GDPR applies when you're scraping personal data, but of course copyright law does apply as well to the scraped information, which brings me back to my point that websites will have terms of use, and if you read the terms of use, that should give you an idea of what further use and repurposing you can make of the information provided on a web page. So, as a general principle, don't create a web scraper and then figure out whether you're allowed to do that; read the documentation on the website. For us specifically, because we contacted the Fundraising Regulator to get data in the first place, we cleared our web scraping activities with them, so they were fine with our use of the data.

We have a question about whether you can scrape all kinds of websites, social media pages for example. Excellent question. Tentatively, yes: if the website is available at a URL, so if you can type that web address in and it'll appear in your browser, then yes, you can write a scraper. With social media web pages, though, they provide a different approach, which is to use an online database: you're allowed to write code that connects to that online database, so you don't actually have to scrape the page; there's a database containing publicly available information that you can write code to access. So that's a really good point as well: you don't always have to use web scraping, and, being brutally honest, it always helps to contact the website owner first and ask, can I actually have this data as a zip file? That might save you quite a lot of trouble in the long term as well.
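To show the difference in shape, here is a minimal sketch of requesting data from an API rather than scraping a page; the endpoint and parameters are entirely hypothetical, and the point is the pattern, not a real service:

    import requests

    # Hypothetical API endpoint: request structured data directly rather
    # than parsing HTML out of a rendered page
    response = requests.get(
        "https://api.example.org/v1/charities",
        params={"page": 1},
        timeout=30,
    )
    data = response.json()  # APIs typically return structured JSON, not HTML
    print(data)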
We have a question: do you advocate using a Jupyter Notebook, since you are using both Python and Stata? Yes. For those of you who are unaware, a Jupyter Notebook is an electronic document that allows you to mix live code, the results of that code, and some narrative information as well; it basically brings together your syntax file and your journal article, to put it simply. In the next two webinars we'll be covering how to use Jupyter Notebooks to perform web scraping, so hopefully, if you can join us in April, I'll actually provide you with a Jupyter Notebook that you can run yourself to perform your web scraping. But yes, as a general principle, it's good to keep all of your work in one portable, transparent document that you can share.

I have a question here: what about MATLAB? I haven't used MATLAB for web scraping, so I can't say for certain whether it provides that functionality. I'll have a look; feel free to contact me directly, my details are on the page, and I will look that up for you, that's no trouble.

Out of curiosity, how long did it take to collect the data through web scraping for this paper? I'll interpret that in the broader sense of how long it took to write the code. Two years ago, when universities were on strike about our pensions, that two-week block was when we learned how to do this, so it took us about two weeks to learn Python. In terms of the scrape itself, it's reasonably quick: if we're talking about a thousand organisations split across 25 web pages, you're talking about a script that takes maybe one to two minutes to run. It doesn't take very long, because you don't actually have to stagger your scraping. But if you connect to an online database or an API, for example, usually you do need to stagger your requests for the information, so collecting data from a database can take quite a bit longer. From a website it can be reasonably quick; you won't need to leave your laptop on and go do something else for 20 minutes.

A very interesting question: any extra considerations for using this technique with private sector companies? Is this private sector companies performing the web scraping, or scraping information from a private company's website? If it's scraping from a private company's website, that probably raises a few more ethical issues; the terms of use probably explicitly ban scraping that website, so you do need to be careful. If it's a private sector company performing the web scraping, then, similar to academic use, as long as you're complying with the terms of use of the website, with copyright law, and with GDPR, then yes, you're perfectly entitled to perform web scraping.

Excellent, and we've had similar questions again from people who prefer R: if you use R, absolutely, you can use that. You can use Stata, for example: Stata has new web scraping functionality in version 16. I haven't used it myself, but when I get that version I'll absolutely try it as well. So, broadly speaking, this is a research method you can employ across multiple programming languages, or even things like Stata and SAS, for example.

And again: is there an ethics code for web scraping? That's an excellent point. I don't think there is a specific document I've seen that says here are best principles for web scraping. One might exist; if one doesn't, I think I will put something together. That's a really excellent point.

So, how easy or difficult is it to debug your web scraping code? It's reasonably easy. If you think there's an issue with your code, you can insert try/except clauses: you can try to do something and, if it doesn't happen, you can tell the programming script to print out an error and tell you what's going wrong. You can also very simply, at each stage of the scrape, print what's happening, so you can have lots of print statements which show the results on your screen and can show you how successfully the script is performing. So debugging is reasonably easy; it's not a very difficult computational task in that sense.
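As a concrete illustration of that pattern, here is a minimal hedged sketch, with the URL as a placeholder:

    import requests

    url = "https://www.example.org/directory?page=1"  # placeholder URL

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # raise an error for 4xx/5xx responses
        print(f"Fetched {url}: status {response.status_code}")
    except requests.exceptions.RequestException as err:
        # report what went wrong instead of crashing with no explanation
        print(f"Could not fetch {url}: {err}")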
This is a really good point as well: what if the information is in a PDF and you want to write a scraper to extract the information from the PDF? Yes, that's entirely possible. In Python there's a PDF converter package, I think it's called, that will actually understand the structure of a PDF and extract the contents. So yes, absolutely, you can use these techniques to collect information from within a document as well. It's not web scraping as such, I suppose, but it's using scraping techniques to extract information from PDF files.
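The talk doesn't pin down the exact package name there; one commonly used option is pdfminer.six, so treat this sketch as illustrative of the idea rather than the specific package we meant, with the file path as a placeholder:

    # pdfminer.six provides a one-line helper for pulling the text out of a PDF
    from pdfminer.high_level import extract_text

    text = extract_text("annual-report.pdf")
    print(text[:500])  # inspect the first 500 characters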
And a similar question, I think, to one we had previously: can you apply web scraping to social media websites? Not web scraping itself, but you can write very similar code that connects to the online databases of these companies. Facebook, Instagram, Spotify, Wikipedia, The Guardian: these all have online databases called APIs that you can download data from. So it's not scraping; it's requesting information from online databases, and our webinar on the 30th of April goes into detail on how you would do that as well.

Nearing the end: what are the benefits of using coding as against pre-existing software packages? So yes, as I said, Excel has some functionality for collecting data from a website, and there are lots of web scraping programs that you can download from the internet. For me, because it's a relatively simple task to learn, I think it's worth writing the code yourself and then having the flexibility to adapt it, to control when it is executed, and to update it when websites change. I'm a big advocate of writing the code yourself, even if it's just for the intellectual dexterity that it produces in you; coding is a very challenging but rewarding task, and I think it's worth engaging in in general.

And then the final question, I think: can you link web scraping with sentiment analysis? Absolutely. Web scraping is a data collection technique; once you've collected the data and put it into a CSV file, for example, you can import that file into, well, for me, Python, and I would use the NLTK package to perform some text mining and some sentiment analysis; I think there's a text mining package in the R software also. So absolutely: you can think of web scraping as a data collection social science method, which can then be used to produce data for text mining, sentiment analysis, network analysis, conventional quantitative statistical modelling approaches, etc.

Okay, one more; I'll be good to you. How flexible can the programs be if you're scraping from a number of web pages in different formats? That's a good question. Usually the web pages of a website will be built with the same file extension, so they will all be HTML files, for example, but of course some pages might be different: a web page that has a piece of functionality in it might have a different file extension, for example. So yes, absolutely, there can be some challenging situations where, if you need to scrape five web pages and they all have a different file extension, it may be slightly trickier to request those web pages and then scrape them. It's not a problem I've faced, but it's certainly something that might be possible.

And finally, one more: can we use web scraping for qualitative data? Absolutely. In fact, it's probably most useful for collecting qualitative data. For us, I know we collected the charity number and whether it paid the levy or not, but those were pieces of text information; we didn't scrape anything that was a number, for example. Everything that we scraped, you scrape as a string, as a series of characters. So absolutely, you could scrape entire paragraphs, entire web pages of text, and save those for content analysis at a future point. It's probably particularly good for qualitative information. And your follow-up question: is there a bibliography? Yes: on the 23rd of April, in the next web scraping webinar that I'm going to host, I'm going to show you how to actually do it, you can follow along, and I will suggest lots of additional reading and some excellent resources, so don't worry about that.

Excellent. So I will show you a couple of links that you may find useful. We've got a public repository with the UK Data Service where you can find the materials underpinning this and future webinars. Quite shortly we will be publishing this webinar on our YouTube channel, so you can watch it again if you fancy. You can also reach out and ask for some help from us, or you can just contact me directly; I'm happy to answer further questions. Or you can get in contact in general with the UK Data Service; we're on Twitter and we're on Facebook. And I can show you that repository really quickly. Hopefully you can see this: this is the UK Data Service GitHub page, and we now have a new forms of data repository. For example, for the web scraping series we have three webinars; you can find details about each webinar and the resources underpinning them as well. The slides from today are currently up, and for the future webinars I'll be posting bibliographies, Python programming scripts that you can amend yourself, reading lists, etc. So we'll be publishing lots of self-directed materials here over the next couple of weeks. Please join us.