Good, apologies, that was an issue on my end, so hopefully that's the only little road bump in the way. So fantastic. Okay, thank you for joining us. My name is Diarmuid McDonnell. I'm a research associate here at the UK Data Service. Welcome to my kitchen as well. Today is part of our brand new coding demonstrations series. So the first one is Introduction to Python for Social Scientists. We're gonna spend about 25 minutes going through a social science task that can be completed in Python. I'll take some questions at the end as well. Any comments, and loads of you are doing it already, you can post in the chat function. Otherwise, if you haven't signed up to Twitch, you can contact me on my Twitter handle, which is at DiarmuidMc, so my first name and the first two letters of my second name. If not, I'll put my email contact details up on the screen at the end as well and you can contact me with any questions or comments or criticisms. Yes, excellent. So today, we will look at Python as a tool for social science data analysis. So the script that we're gonna work through today is available online. I will show you the link if you want to follow along. Don't worry if you don't want to run the script at the same time as me. It's totally fine to just watch what I'm doing and then revisit the script later on as well. I will just show you where you can get the script if you want. So I'll just show you this link up here in the top toolbar. So on GitHub, the UK Data Service has a repository called CodeDemos that you can see up here. If you type that into your toolbar, you'll be able to follow along with what we're doing. And hopefully my lovely assistant, Julia, will type this link into the chat as well so you don't have to worry about typing it up. So Julia will post that link in the next couple of seconds. But for now, all you need to do is actually follow along with what I'm doing.
So we've created an interactive coding document. It's called a Jupyter Notebook. Some of you may have seen this before. It's a way basically of combining code, results, and narrative into a single document. There's lots of ways of writing Python code. You can write it in a plain text file and save it as a Python file. There's a tool called Spyder, and lots of other environments for writing Python. I quite like Jupyter Notebook. It's open source, so it's free. It comes with one of the main installations of Python, so there's nothing extra you need to do to use it. And hopefully, as you'll see today, it's quite a nice tool. It's quite well-formatted and it's very powerful. So today we've got two aims and I'll increase the size just so you can see it a wee bit. So we're gonna demonstrate how to use Python for a typical social science research activity. Today is about the kind of things social scientists need to do. That's very broadly defined. Also, if you're not from the social sciences, today will still be relevant for learning Python, but it's particularly useful for those of you who are social scientists. And we're also gonna try and cultivate your computational thinking as well through this example. The main task of being a programmer, and it's the skill that really earns the big bucks, as they say, is being able to solve problems. Yes, some of those problems are quite technical and quite advanced, but ultimately, programming is about writing code that solves some kind of problem. So we're gonna do that today as well. So I'm gonna enter a slideshow mode. So this should look nice and formatted. And I can adjust the size as we go. So if I see comments, then I can easily enlarge it or make it smaller. So we'll settle on this just now. So all I've basically done is I've turned my Jupyter notebook into a presentation and you can interact with the Jupyter notebook as follows. So here's my first really simple Python program. I'm gonna run it.
So I've pressed shift and enter on my keyboard. It's asked me for my name, I'll enter it. And it sends me a nice little message about enjoying learning Python. So you can see Python to begin with is a very English language-based programming language. So if you want to print a message to the screen, you use the print command. If you wanna get input from a user, you use the input command as follows. So let's get straight down to our social science research example. So Python's a really powerful tool. You may have heard of big data analytics, deep learning, neural networks, artificial intelligence. All these advanced kind of things can be done in Python. So it's got some really nice methods and techniques that we're gonna find very useful as social scientists. So for example, let's say a file exists on the web and instead of me opening my browser, I wanna use Python to go and get me that file, download it and save it somewhere on my computer. So I can write some code that does that. And then I can say, okay, I've downloaded a PDF. I don't really need to leave Python to take a look at it. I can actually ask Python to open it up for me itself. So this is a specific example from my research. I'm a charity sector researcher and here's one of the annual accounts for a charity based in England and Wales. So that's a really simple example. You can collect data from the web and next week we're gonna look at some techniques for doing that. Python's also really good for interactive visualizations. So here's one I wrote earlier. So all I'm doing is loading in something I created before. This is known as a Sankey diagram. It might be familiar to you. It's good for showing the flow of observations between categories of different variables. So for example, I've got some fictional data here. Small, medium and large organizations. The small organizations tended to participate in workshop three, some participated in workshop two, et cetera.
And then we can see whether those organizations gave us feedback on the workshop they attended. Again, completely fictional example but just shows some of the potential of Python. And it could do so much more obviously. There's natural language processing. So that's kind of text mining or trying to find the meaning in language. There's machine learning. So these are these kind of automated algorithms for unearthing patterns in data. And there's lots of other things you can do as well. But in anything, it's important that we learn to walk before we start running. So today we're gonna focus more on core data manipulation skills that many social scientists could really benefit from possessing. So there's both the programmatic aspect of what we're doing and there's the social science aspect as well. And Python allows us to do both, I would say, reasonably easily. So let's define the problem or the project that we wanna undertake. So a common social science research activity is creating a sampling frame. So it's officially defined as a list or other device used to define a researcher's population of interest. If you're a qualitative researcher, a sampling frame might be a list of documents that you then want to sample from and then start doing thematic analysis of. Maybe as a quantitative researcher, it's a list of organizations that you want to send your social survey to. But it's broadly defined. A sampling frame contains your units of analysis and you wanna work with those units. You wanna contact some for interviews, et cetera. But as I said before, programming is fundamentally about solving problems. So let's reframe what I've just been talking about and problematize creating a sampling frame. So let's say Jane is a sociologist. She's undertaking a mixed methods PhD. The research design involves sending surveys to individuals initially. So she's sent out three waves of surveys. 
She's gonna collect the results and then she wants to conduct semi-structured interviews with a subset of these respondents. So the surveys have been done. Now she wants to construct the sampling frame of everybody who was surveyed and then she wants to take a sample of those and invite them to participate in some interviews. So defining the problem is really important. It's the first thing that we do when we approach a computational social science task. But it can seem quite daunting if all I say is create a sampling frame. You won't really know where to begin, how to do that, especially with a programming language. So a key computational thinking skill is to decompose so it's to break down the problem into smaller steps. So with the sampling frame activity, what we wanna do is find the files containing the surveys. We wanna open each survey, extract the contents. Then we want to combine or append all of those survey responses together. That gives us a master file of everyone who responded and we call that our sampling frame. We obviously wanna save this. We don't wanna lose the fact that we've created a sampling frame and forget to press save, for example, if you were working with Excel. And then you wanna save the sampling frame as a separate file. Then we wanna take the sampling frame and we wanna take a random sample of those responses who basically then would act as the people we would contact to conduct our survey with or our interviews with. So decomposition is a critical skill. And you can see we've got these steps really for solving our problem. You can think of these steps as pseudo code. You may have heard that term. Basically that means Python can't understand what you mean by locate the files containing the data we need. But it is a key step in solving the problem and there are commands we can write in Python that will find the files that we need. 
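The decomposed steps above can be written as comments, each followed by the code that carries it out. This is only a minimal sketch using Python's standard library (`csv` and `random`) rather than the pandas approach used later in the demonstration; the folder, file and column names are hypothetical, and the tiny survey files are made up so the example runs on its own.

```python
import csv
import os
import random
import tempfile

# Set-up for the sketch only: a throwaway "responses" folder holding
# three tiny, made-up survey files (hypothetical names and values).
base_dir = tempfile.mkdtemp()
responses_dir = os.path.join(base_dir, "responses")
os.mkdir(responses_dir)
for wave in (1, 2, 3):
    wave_path = os.path.join(responses_dir, f"wave{wave}.csv")
    with open(wave_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["person_id", "wave"])
        writer.writerows([[f"w{wave}p{i}", wave] for i in range(4)])

# Step 1: locate the files containing the survey responses.
survey_files = sorted(os.listdir(responses_dir))

# Steps 2 and 3: open each file, extract its rows, append them to one list.
sampling_frame = []
for name in survey_files:
    with open(os.path.join(responses_dir, name), newline="") as f:
        sampling_frame.extend(csv.DictReader(f))

# Step 4: save the sampling frame as a separate file.
frame_path = os.path.join(base_dir, "sampling_frame.csv")
with open(frame_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["person_id", "wave"])
    writer.writeheader()
    writer.writerows(sampling_frame)

# Step 5: draw a random sample of respondents to invite to interview.
interview_sample = random.sample(sampling_frame, k=3)
print(len(sampling_frame), "in frame;", len(interview_sample), "sampled")
```

Writing the comments first and filling in the code afterwards keeps each step small enough to test by itself.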
So writing the pseudo code is an excellent first step for defining and decomposing the problem we want to solve. So let's get straight into it. There's much more material in the notebook about what Python is, what key terms are, what's a programming language in general. Let's get stuck straight into writing and editing some Python code. So when you launch Python, so through Jupyter Notebook or you open it up using another tool, it's got lots of functionality built in. So you can start doing calculations, you can start printing messages to the screen, you can get user input, et cetera. So for example, if you want to do relatively simple mathematical calculations, we can multiply these two numbers together here and Python will produce the result for us. Or we can print statements to the screen again. Here's a little message I wrote to myself. Python returns that message to the screen. But for more serious computational social science work or just more serious programming in general, we need to bring in or import some additional functionality specific to what we're trying to achieve. And again, Python is English language based. It's been deliberately simplified so that it's easy to read and write. Certainly for English language speakers, it might be different if English is not your first language. Let's make this a tiny bit bigger, perfect. So we want to import modules, and modules allow us to use extra techniques and functionality that's not included when we launch Python. So the first thing we'll do is import a module called os. It stands for operating system. It allows us to kind of work with our machine so we can find folders and files. You can create new folders, delete them, et cetera. We've got a module for working with comma separated values files, called csv. That's a common data sharing format on the web. We've got a module for working with data sets. It's called pandas, like the black and white bear. And then we've also got a datetime module as well.
That's for useful tasks like, well, what's the exact time and date right now when I'm running my script? That can be quite useful as well. So I want to execute these four lines of code and to confirm that they worked, I just want to print out a little message saying, hey, you successfully imported the modules. And as you can see, Python returns the message. How do we know the modules have been imported? Well, Python is a sequential programming language, which means it executes the commands from the top, moves on to the next one, moves on to the next one, et cetera. So we only know that the modules have been imported correctly because otherwise it would not get to the print statement. So if the pandas module wasn't imported correctly, Python wouldn't get as far as the print statement and it would return an error instead of the print statement. So you can see how printing messages, while it's a very simple technique, is actually quite good for finding errors in your code. It's a process called debugging and we'll speak about that a bit more. So let's go back to the social science task at hand. So we want to find the files, so we want to find the survey responses on our machine. So a directory is another name for a folder. Directory tends to get used in computing science terminology. So usually we would do this manually, correct? So I'm using a Windows machine, I would click on the folders icon and I would navigate through the graphical user interface and I would try and find my files. The good thing is Python can actually do that for you. So the first task is to figure out where we currently are. That's a bit of a silly question. So what I mean is, where is the file that we are currently using located? So my Jupyter notebook, where is that currently saved? How are we using it? 
So we can use a command called os.getcwd and you can see Python returns some output, a long series of characters, which is called a string in Python, and it tells me exactly where on my computer this file is located. So there's a users folder, there's my personal folder, there's a projects folder I've created, there's a code demos folder and there's separately a code folder containing the Jupyter notebooks. So that's quite useful. Let's just unpack what Python is doing with the command here. So remember we imported a module called os, so operating system. As part of that module, there's a method called getcwd and that method returns the current working directory. So you can see the flow of a command from the module to the specific method you want to use. So it's module name, dot or period, and then the method name that you want to use. Okay, so we know where we currently are on the machine. What's actually contained in the folder where we're currently located? So again, the os module is really good for this. There's a list directory method, listdir, and if we run that command, it finds all the files and all the folders that are currently in the working directory. So you can see there's the PDF of the accounts that we downloaded previously, so that proves that there is a file that we downloaded. It's not a magic trick I've done. There's an images folder, which is not of much relevance to this. There's a readme folder, which when you go on to the GitHub gives you instructions for how to use the code. There's a responses folder, which contains the survey responses. There's a sampling frame folder and here we have two Jupyter notebooks as well that I have created. So good, so we know how to find folders and we know how to reveal what folders contain. How do we look inside another folder specifically? So we can go back to the list directory method and then this time, we can pass in an argument to that method and the argument is a folder name.
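The two `os` calls described above can be tried on a throwaway folder. This is a sketch only: the folder layout below is made up to resemble the one in the talk, and the folder and file names are hypothetical.

```python
import os
import tempfile

# Set-up for the sketch only: a throwaway project folder (hypothetical layout).
project_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(project_dir, "responses"))
os.mkdir(os.path.join(project_dir, "sampling_frame"))
open(os.path.join(project_dir, "responses", "census_1961.csv"), "w").close()
os.chdir(project_dir)  # make it the current working directory

# module name, dot, method name: where are we right now?
current_dir = os.getcwd()
print(current_dir)

# With no argument, listdir shows the contents of the current working directory.
top_level = sorted(os.listdir())
print(top_level)

# With a folder name as the argument, it looks inside that folder instead.
inside_responses = os.listdir("responses")
print(inside_responses)
```

Note that `os.listdir` returns names in arbitrary order, which is why the sketch sorts the result before printing it.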
So as I showed previously, there is a folder called responses, so now I want to look inside of that. So when I run the code, Python returns the contents of that folder. You can see that we're gonna be working with some real data sets today. We're gonna be working with open data from the 1961, 1971 and 1981 UK Censuses. So these are real individual level responses to those three censuses. It's available from the UK Data Service. There's a link to the license. It's open data, so it's free to use. So we're gonna pretend that the census was Jane's survey responses. So you'll see actually that we've got quite a lot of data if this really was Jane's project. So where exactly is the responses folder? So you'll notice in the previous command, I was simply able to say, show me everything in the responses folder and I didn't actually have to say exactly where the responses folder is. That's because Python is good at figuring out where files and folders are relative to where we're currently working. But if I wanted to know exactly where on the machine the responses folder is, we can use a new method called os.path.abspath. So that stands for absolute path. And again, we get something back that's quite similar to what we had before. So there's a C drive. So that's my hard drive on my computer. Again, a folder called users, et cetera, et cetera. And within the code folder is a folder called responses. So you can see that there's two ways of referring to files on your computer. There's the absolute path, which is this long list starting on the hard drive of where something exactly is. Or there's the relative path, which is relative to where we currently are: where is the responses folder? Is it up one level or is it down one level? In a future coding demonstration, so the one at the end of the month focusing on your computational environment, we'll talk a lot more about relative and absolute paths.
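To make the relative versus absolute distinction concrete, here is a small sketch. The folder name `responses` comes from the talk; it does not need to exist for these calls to work, because `os.path.abspath` only builds the path text from the current working directory.

```python
import os

relative_path = "responses"  # relative to wherever we are currently working
absolute_path = os.path.abspath(relative_path)  # full path from the drive root

print(relative_path)
print(absolute_path)

# isabs reports whether a path is absolute.
is_relative = not os.path.isabs(relative_path)
is_absolute = os.path.isabs(absolute_path)

# relpath goes the other way: from an absolute path back to a relative one.
back_to_relative = os.path.relpath(absolute_path)
print(is_relative, is_absolute, back_to_relative)
```

The absolute path printed here will differ from machine to machine, while the relative one stays the same.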
But I'm sure you can probably pick up already that relative paths are so much more useful. Let's say I get a new job, they give me a new laptop, it's not gonna have the same absolute path. So I really don't want to update this every single time I wanna find the same responses folder. So it's much easier to use a relative path. This goes for if you're collaborating across teams as well. If everyone has to change this link every single time on their machine, it gets really tedious and it just introduces the possibility of human error, which we want to avoid completely. So we have a preference for relative paths over absolute ones. So finally, we're gonna create a folder to store the sampling frame file. So as you can see already, it's been created on my machine because I've been working through this code all day. If it didn't exist already, we would use the mkdir command. So the make directory command. And what we pass in is one argument, which is the name of the folder we want to create. So now we get an error. So because my folder already exists, Python gives back a file exists error. Again, hooray for Python, it's English language based. It tells you the type of error you're experiencing. It cannot create this file or folder because it already exists. So because I know it exists, I'm just gonna comment out that command. So adding the pound or hash symbol tells Python, ignore that command. And instead I'll just list all the files and folders that are in the current working directory. And you can clearly see that the sampling frame folder exists. So we can move forward with creating the sampling frame. Probably a little bit nerdy. I know it's not the most interesting topic in the world, working with your file system. It's absolutely crucial. It's really critical.
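The file exists error described above can be reproduced safely in a temporary location. This sketch uses a hypothetical folder name and also shows one alternative to commenting the command out: `os.makedirs` with `exist_ok=True`, which does nothing if the folder is already there. That alternative is a suggestion, not what was done in the demonstration.

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()
frame_dir = os.path.join(base_dir, "sampling_frame")  # hypothetical folder name

os.mkdir(frame_dir)  # the first call creates the folder

try:
    os.mkdir(frame_dir)  # the second call fails because the folder exists
    second_call_failed = False
except FileExistsError as error:
    second_call_failed = True
    print("Python reported:", error)

# Alternative: create the folder only if it is missing, with no error otherwise.
os.makedirs(frame_dir, exist_ok=True)
print(os.listdir(base_dir))
```

Using `exist_ok=True` means the same script can be rerun all day without editing it.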
I'm sure there's been cases during your thesis, during your postdoc, during your work, whatever it is you're doing, where you've moved machine or you've come back to something in a couple of months and you've thought, where the hell is that file? Is it in that folder? There is a raw data folder, but there is also raw data in a folder called data. It gets messy. You can actually use Python, or any programming language for that matter, to set up your file system at the beginning and then you can be safe in the knowledge it works going forward. So let's get stuck into more of the content of the files. So let's open one of the census files. And to do that, we tell Python where to find it. So we've got a 1961 census file and that exists here in the responses folder. And then within that folder, there's a file called census 1961.csv. So when I run that command, I don't get any output because I haven't asked for any. Basically, all I've done in this line of code is define the variable called census 1961 underscore file. And that variable represents this file here. So I no longer have to write this out when I want to refer to the 1961 census file. I can just refer to this variable here. You'll notice that in Python, variable names are quite permissive. They can't start with a number but they can be unlimited in terms of how many characters, for example. So a very, very silly example. This will also work. So if I copy that across and then let's say I print the value of each of these variables. So if I say that and I say that, you can see that they represent the same value. So the census 1961 file variable captures where this file is stored and the chicken chicken variable also captures where this file is stored. So your variable naming convention is largely up to you but I'm sure you can see the dangers of picking silly names or names that are not representative of what the actual variable contains.
So let's move on to actually opening up the file and reading in the data and taking a look at it. So you'll note that earlier we imported a module called pandas and then we just abbreviated that to pd. And again, that just saves us typing pandas all the time. So we can refer to the pandas module as pd. So from that module, we want to use the read_csv method because we have a CSV file. We need to give this method the location of the file we want to open. So we've stored that location in this variable here. And then we've got some extra arguments just telling pandas how to actually interpret the contents and to just avoid creating an extra variable that we don't want. So let's run this command here. And this is instructive, so we get a warning. So you notice before we got an error. So if Python returns an error in your code, it means that command has not worked. If you get a warning, it means the command has worked, but there's something you should probably follow up on. So this specific warning here basically just says there are some variables in the file that have a mix of data types. What that means is maybe a given column has a mix of numbers and then also has some qualitative information. You'll know from working with data in general, if you have a variable, every value of that variable should be the same type. So it should be all numbers or it should be all text or all categories, for example. But because it's a warning, we know that that command worked. So we're just about to use pandas now for actually examining the contents of the file. So pandas has a neat little method called sample and into that method, we pass an argument, which is a number. So I'm gonna say show me 10 random observations from the 1961 census data. And we can see here the pandas output is quite nicely formatted. It's maybe not as nice as something like SPSS or Stata. I hesitate to say Excel. I don't think we should be really using that as social scientists.
So pandas is quite nice. This then shows us a nice rows by columns view of the underlying census data. You can see there's an extra column here, which doesn't have a name. Basically, this is the row ID. So row 233,769 contains the observation for this person here. So person 387, et cetera. To prove that it randomly samples, we can keep running this code and you'll see we keep getting a different, well usually a different, set of observations from that data set, which is quite cool. So pandas not only allows us to read data in, but it then allows us to explore the contents of the data itself. So remember, we talked about decomposing a problem. So we have three files we wanted to find and read and work with. So we worked with one file first in order to solve that problem. We've just demonstrated that we can do it. So now we're gonna apply the solution to the other two census data sets that we have also. So we're just gonna do this really quickly. You can see I define two more variables that point to the two extra census files. And again, I use the pandas module and the read_csv method to read those into Python. And again, I get the warning. It's not an error, so that's totally fine. The census data should have been read into Python. So now we move on to the next stage in our research activity. So that's creating the sampling frame. So we have three separate files. We really want all those observations in one file and then we want to randomly sample from that big file. But one thing to mention just before we move on to tackling that problem is if you've used SPSS or Stata, for example, or lots of other software applications, you'll notice you can only load one data set into the application at one time. But you've noticed with Python, we've created multiple variables to store multiple data sets. So Python and also R are really good for this. So with a programming language, you can hold lots of data sets in memory at any one time, dependent on how big the data is.
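The read-then-sample pattern described above looks roughly like this. It is a sketch under assumptions: the data are made up and held in memory rather than read from the real census files, the column names are hypothetical, and `index_col=False` is only a guess at the kind of extra argument mentioned in the talk (the exact arguments used are not shown in the transcript).

```python
import io

import pandas as pd

# Stand-in for a census file: made-up rows with hypothetical column names.
fake_csv = io.StringIO(
    "person_id,age,region\n"
    + "\n".join(f"{i},{20 + i},region_{i % 3}" for i in range(50))
)

# index_col=False asks pandas not to use the first column as the row index.
census_data = pd.read_csv(fake_csv, index_col=False)

# Show 10 randomly chosen rows; rerunning this line usually gives different rows.
ten_rows = census_data.sample(10)
print(ten_rows)
```

The unnamed column on the left of the printed output is the row index that pandas assigns, which is the "row ID" referred to above. For a real file with mixed-type columns, passing `low_memory=False` or an explicit `dtype` to `read_csv` is a common way to address the mixed types warning.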
And that makes it really good for what we're about to do here, which is creating a single data set from lots of other smaller ones. So the first thing we'll do is we never want to overwrite existing data sets. So we have a variable here, census 1961 data. This captures the records relating to that census data set. We don't want to overwrite that. So let's create a new variable here called census all data. And initially, it's just going to be equal to all the records that are in the 61 file. And again, we just want to take a little random sample of observations just to make sure that our data set variable has been created correctly. Okay, so now we've got a variable to hold our sampling frame. So let's now combine or append the records from the other data sets to the bottom of the new one that we've created. So again, pandas has a really nice method called append. And into the append method, we pass, well, we pass one argument. And that argument is now a list. So we can see that at the one time, we can append more than one data set. So let's see how that works. What that's equivalent to is actually duplicating the append command. So if we didn't pass these data sets in here as a list, then instead we would have to have separate commands. So you can see where Python as a programming language is quite efficient as well. It knows that we might have to append multiple data sets together. And that allows us to do that using one command. So we're not duplicating effort. So that's a nice programmatic aspect of Python. So now we have to perform a quick robustness test. So we've combined three different data sets into one. Well, arithmetically, we should know that the sum of all the records in each of the three data sets should add up to a total in our new sampling frame. So the sampling frame should equal the total of each individual census data set added together. So how do we know how many records are in a given data set? 
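A note for anyone following along with a current pandas installation: the `DataFrame.append` method used in the demonstration was deprecated and then removed in pandas 2.0. The equivalent today is `pd.concat`, which also takes a list of data sets. The sketch below uses tiny made-up data frames with hypothetical column names in place of the real census files.

```python
import pandas as pd

# Tiny stand-ins for the three census data sets (made-up values).
census_1961_data = pd.DataFrame({"person_id": [1, 2], "year": [1961, 1961]})
census_1971_data = pd.DataFrame({"person_id": [3, 4, 5], "year": [1971] * 3})
census_1981_data = pd.DataFrame({"person_id": [6], "year": [1981]})

# Stack the rows of all three into a new variable, leaving the originals
# untouched. ignore_index=True renumbers the rows from 0 in the combined data.
census_all_data = pd.concat(
    [census_1961_data, census_1971_data, census_1981_data],
    ignore_index=True,
)
print(census_all_data)
```

As with `append`, passing the data sets as one list avoids repeating the command once per file.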
So there's a really simple, again, Python method called len, L-E-N. That's short for length. So that basically counts the number of rows in our data set. So how many rows are in the new sampling frame that we created? Well, we can see about 1.56 million records. So it's quite big. So we've got three separate census data sets. We've combined them together and now we have over a million records. So for example, how many were in the 1961 file? Again, we'll call the len method and we can see there were 500,000. The open census data sets are basically a 1% sample of the total census. So that kind of makes sense. So in 1961, there were roughly about 50 odd million people in the UK who participated in the census. So we could, I suppose, calculate the length of each individual data set and then add those together, but actually we can do it all in one line in Python, which is really cool. So basically we can evaluate whether the number of observations in this data set is equal to the number in this data set plus the number in this plus the number in this. You can see Python gives us back a true or false value. So it doesn't give us back a sum. Basically, we're asking Python, can you tell us whether it's true or false that the number of observations here equals the sum of these ones here? This is called Boolean logic after George Boole, a mathematician. Basically a Boolean variable only has two values. It's either true or it's false and we can use those Boolean values then to evaluate whether something is the case or not. And then that helps us to structure our code so we can say, well, if the new census data set does contain all the observations from the previous three, now we can save it. If it doesn't, maybe we can ask Python to print an error message to the screen, for example. But that's something that brings us into a more intermediate kind of area, so we'll move on to that in a different coding demonstration.
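The row-count check and the "if it matches, save; otherwise, report" idea can be sketched with plain Python lists, since `len` works the same way on a list as on a pandas data set. The records here are made up and the variable names are hypothetical.

```python
# Made-up stand-ins for the three sets of records.
wave_1961 = [{"person_id": i} for i in range(5)]
wave_1971 = [{"person_id": i} for i in range(5, 12)]
wave_1981 = [{"person_id": i} for i in range(12, 15)]

combined = wave_1961 + wave_1971 + wave_1981

# One line: a comparison that evaluates to True or False, not to a sum.
counts_match = len(combined) == len(wave_1961) + len(wave_1971) + len(wave_1981)

# The Boolean value then decides which branch of the code runs.
if counts_match:
    status = "safe to save"
else:
    status = "row counts differ, check the append step"

print(counts_match, status)
```

The double equals sign asks a question (are these two values equal?), whereas a single equals sign assigns a value to a variable.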
So we can see now we've got a million and something observations, far too many people to contact about participating in follow-up interviews. So let's take a random sample. So again, pandas is really good. We can use the sample method. That takes an argument called frac, so fraction. So what fraction of the data set do you want to sample? So I want a 1% sample, so that's a fraction of 0.01. So I want 0.01 as a proportion of all the observations in this data set here. So I'll take the random sample and then I'll calculate how many observations are now in the random sample. So it'll always return a 1% sample. So if I keep rerunning this command, the actual observations in the random sample of course will change, but the number of observations will always stay the same because I'm asking for 1% all the time. Now this brings us to a little aspect of Python which is really good as well. So this is another way of performing the same task. So as I said, pandas is the module that has the sample method. So I could also say this, I could say from the pandas module, from the data frame class, use the sample method on this data set and take a 1% random sample. So you can see again, we get the same answer we had previously but it's a little bit longer to write the code. What's really good about Python is once you create a variable of a certain type, so this census all data variable is a pandas data frame variable. So it's a certain type of variable in Python. Python knows it's a pandas variable and that means I can use pandas methods on that variable directly. So again, I don't have to go writing in pandas, it's in this class, it's this method and I need to apply it to this data set. Again, I can just shorten it down, I can say I have a data set. Because it's a data set, I can apply certain methods to it so I can call the sample method directly on the data set.
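Both ways of calling `sample` described above can be compared side by side. This sketch uses a made-up data set of 1,000 rows with a hypothetical column name; the `random_state` argument at the end is an addition not mentioned in the talk, included because it makes a random sample repeatable.

```python
import pandas as pd

census_all_data = pd.DataFrame({"person_id": range(1000)})  # made-up data

# Short form: call the method directly on the data set.
random_sample = census_all_data.sample(frac=0.01)

# Long form: module, class, method, with the data set passed in as an argument.
same_call_long_form = pd.DataFrame.sample(census_all_data, frac=0.01)

# Both return 1% of the rows, although usually not the same rows.
print(len(random_sample), len(same_call_long_form))

# Fixing random_state gives the same rows every time the line is run.
repeatable = census_all_data.sample(frac=0.01, random_state=42)
repeatable_again = census_all_data.sample(frac=0.01, random_state=42)
print(repeatable.equals(repeatable_again))
```

A fixed `random_state` is worth considering for a sampling frame, because it lets someone else reproduce exactly which respondents were selected.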
Apologies, that's a very long-winded way of saying that, but this is a really time-saving device by Python again. It's clever, it knows what type of variable you're working with and then that allows you to use certain methods. Right, so we're at the final task in solving our problem. So once again, we can use the pandas module to simplify saving a file for us. So in pandas again, we've got a lovely to_csv method. It takes an argument which is basically the file, so we wanna create a new file in the sampling frame folder. We're gonna call it census samplingframe.csv and again, there's an extra option saying we just don't want an extra variable called index as well. So we're saving the full list of census data and we're also gonna save the random sample that we took as well. With Jupyter notebooks, you'll probably notice there's a little asterisk here. That means that it's currently executing the Python code. So when that asterisk disappears and it's replaced with a number, that's how you know the command has finished running, and that's not present in all types of Python environments. So sometimes you're actually unaware whether your Python command is currently working in the background. Let's say you've got something like 100,000 websites to scrape, that's gonna take quite a while. Without that little asterisk telling you that things are still ongoing, you might cancel the script too soon, for example. This command is taking quite a while because, as you've seen, it's a million and a half records, and that's quite a lot. It's taking a bit of time to write that file to my computer. But you can see that it's just finished. So how can we tell it worked? I mean, I can go onto my computer itself, I can click on the folder icon, I can go searching for it. By now you should probably realize Python's really good for this. I know where the sampling frame folder is. I can just ask Python to list the contents of that folder.
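The save step and the check on the folder contents can be sketched as follows. The data are made up, the output goes to a temporary folder rather than the real sampling frame folder, and both file names are hypothetical. Reading one file back in at the end is a quick way to confirm it holds what was intended.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for the full sampling frame and the random sample.
census_all_data = pd.DataFrame({"person_id": range(200), "year": [1961] * 200})
random_sample = census_all_data.sample(frac=0.1, random_state=1)

out_dir = tempfile.mkdtemp()
frame_path = os.path.join(out_dir, "census_sampling_frame.csv")
sample_path = os.path.join(out_dir, "census_random_sample.csv")

# index=False stops pandas writing its row index as an extra unnamed column.
census_all_data.to_csv(frame_path, index=False)
random_sample.to_csv(sample_path, index=False)

# Did the files get created?
saved_files = sorted(os.listdir(out_dir))
print(saved_files)

# Do they contain what we expect? Read one back in and take a look.
check = pd.read_csv(sample_path)
print(check.head())
```

Without `index=False`, reading the file back in would produce an extra column holding the old row numbers.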
Voila, so you can see that the two files I wanted to create certainly exist. At this stage I don't know if they actually contain what I want them to contain. So there's an extra step. It's just worth checking if the contents actually exist. And as you can see, we're coming back around to a similar technique we saw earlier. We use pandas again. We read in a CSV file. I'm gonna read in the sampling, the random sample this time. And I'm gonna look at 10 observations from that file. And again, keep running it, keep giving me different observations, et cetera. I obviously could just take away the sample method. I could just call the dataset directly itself. That creates an enormous file. So it'll only display up to a certain amount. So it'll display the first five. It'll have some ellipses here to say that there's a lot of records in between. And it'll also show the final five records as well. But there are also other Python commands and tools that you can use to actually dive into data sets and look at them in a different way. So voila, we've successfully solved our sampling frame research problem. So let's have a quick little reflection on what we've learned. So we've learned how to import modules. So Python has a lot of flexibility. It has a lot of functionality that you can use straight out of the box, which is really good. But really for scientific work, you're gonna have to import extra techniques. But as you saw, that's really easily done. And a lot of those modules get downloaded to your machine when you actually download Python itself. But in a couple of weeks, I'll show you how to actually install new modules on your machine so you can use them in Python. We learned how to navigate, create and delete folders. Again, super boring, very important task. And you can use Python to navigate through your directory structure. 
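The check described above, reading the saved file back in and looking at 10 random observations, could look something like this. The CSV text here is invented so the example runs on its own; in practice you would pass the path of the saved file to read_csv.

```python
import io

import pandas as pd

# Invented CSV content standing in for the saved random-sample file.
csv_text = "person_id,age\n" + "\n".join(
    f"{i},{20 + i % 50}" for i in range(100)
)

# read_csv accepts a file path or, as here, a file-like object.
saved = pd.read_csv(io.StringIO(csv_text))

# Look at 10 randomly chosen observations; rerunning gives different rows.
peek = saved.sample(10)
print(peek)

# shape reports (number of rows, number of columns) without printing the data.
print(saved.shape)
```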
And again, we prefer relative paths because that makes it easier to collaborate with people and it allows our project to move across machines and move across time and move across project team members. What you're probably most interested in as a social scientist is how to manipulate data. It's the kind of key data science, computational social science skill. And we've just seen using one module, Pandas. Again, an open source free module, really easy to use. And we can do some fantastic things with some big data sets. And hopefully you've learned a little bit how to structure your code. So as I said, it's sequential, but that's a bit obvious. Your variable names are quite important. They have to have a meaning. They need to really communicate what the variable contains. You don't wanna write lots of extraneous comments. You want it to be obvious from your commands what's happening. For example, here I don't want to write lots of comments saying in this block of code, blah, blah, blah, blah. Excuse me. So it is good to write comments, but actually using a Jupyter notebook, you can see it's easier to write a bit of text preceding the commands, explaining what's actually happening. So just before I finish and take a couple of questions, so loads of them are coming in, thank you very much. I realize that's probably 35 minutes. So apologies. You've probably undertaken many of these tasks before. We're not advocating that you use Python for every single social science research task or activity, but if you were to do this manually, you would say, right, okay, I need to create the folder. So I'll right click somewhere on my machine. I'll choose the new folder option. I'll type in the folder name. Okay, that's fine. That's not a lot of extra work. Okay, then I will left click on every file name. I'll probably open the file in Excel. I'll do control A, and then I'll copy that into a new file, and then I'll open the other two census files, copy and paste, et cetera, et cetera. 
But that kind of misses out on the advantages of using a programming language, or just in general, of being more computational about your work. As you can see, the code we have here is scalable. So if I had 10 census files, or if I had 100,000, I could adjust my code very slightly so that it would loop over all the census files and it would import them all into Python. So with an extra line or two of code, I can handle hundreds of thousands of files, subject to my computer having enough memory to process that many files. Excuse me. But my code is scalable. More files, more folders, et cetera, my code will be able to deal with that. And there's also a reproducibility argument. I'm sure you've all been there before where in a couple of months, your supervisor says, can you reproduce that table in that chapter, or can you find those quotes again that perfectly encapsulate that theme that you're talking about in one of your chapters? And I'm sure you've gone back into folders and you've said, oh, where is that again? Is that in data underscore final, underscore final, underscore final? Sounds very silly, but we've all been there as well. Using code, and using efficient, concise code, means you can reproduce what you're doing for yourself in the future, across project teams, et cetera. So code is very good for reproducibility as well. And the final argument really is automatability. So maybe census information becomes available on a more frequent basis. That's a silly example, but with the COVID-19 crisis, the Understanding Society social survey is now gonna produce monthly findings through the UK Data Service. That's a monthly reminder you're gonna have to set in your Outlook calendar, for example, to check whether the Understanding Society data set is now available. You could use code to automatically check whether there's an update and, if there is, to download it. 
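To illustrate the scalability point above, here is one possible way to loop over every census file in a folder and combine them. It is only a sketch: the folder, the census_*.csv naming pattern and the tiny files are all made up for the example.

```python
import glob
import os
import tempfile

import pandas as pd

# Create three small, made-up census files in a temporary folder.
folder = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"person_id": [i * 2, i * 2 + 1]}).to_csv(
        os.path.join(folder, f"census_{i}.csv"), index=False
    )

# Find every file matching the pattern, however many there are.
census_files = sorted(glob.glob(os.path.join(folder, "census_*.csv")))

# Read each file and stack them into a single data frame.
census_all = pd.concat(
    (pd.read_csv(path) for path in census_files), ignore_index=True
)

print(len(census_files), len(census_all))
```

The same two lines work for 3 files or 3,000; the practical limit is how much data fits in memory.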
That might not be exactly possible with Understanding Society, but there are lots of data sets out there on the web that you can write code to routinely collect. So you don't have to worry about setting reminders or stuff like that. But I'm gonna stop talking now because that was plenty. I'm gonna work through some of your questions, but I can see my very helpful assistant, Julia, has been answering some of them as well. As I said, this Jupyter notebook is available online. It's got much more text around the commands so you can do some self-directed learning. So is there any homework? It's one of the questions. The only homework, I would say, is to just go to the notebook that we've created and either download it to your machine, or you'll see that there's a link as well so that you can run the notebook in your web browser. No installation necessary. Walk through that, and then I do point out some really useful resources. There's a couple of free books online. There's a book that you should buy, which is really good. It's about social science and Python programming. I don't get any commission from that. I just think it's a fantastic book. So yeah, no homework except just check out the notebook and just have fun. Can you import a data set from SPSS to Python? Yes, you can. I think pandas can certainly import data from Stata and I think you can do it for SPSS as well. You can also adjust your workflow, so you could use SPSS, if you have it on your machine, to export the data as a CSV file and then you can import that. But it should be possible to import SPSS as well. Let's take a couple more questions. Will you be using the Beautiful Soup library later in the course? Yes, we will. Good question. So next week, we're gonna look at how Python can be used to scrape data from a website. That involves requesting the website itself, and then once we've gotten the website contents, we need to understand those. 
We need to work with those and yes, the Beautiful Soup library makes sense of web pages. So yes, we'll be doing that next week. We will use Beautiful Soup for that. Perfect. Thank you, Julia. How do I run pandas? Yeah, very good question. So you'll have seen earlier in the script. I can go back to it really quickly. Basically, all you have to do is make sure you import it into Python first. That's always the key thing. Yes, let's demonstrate. So in this command here, I'm importing all the modules I need. But for example, let's say I comment out that command. So Python won't run that command. It skips over it. It imports the other ones. So if then later on, we want to read in one of the census files. Oh, I need to run the import command again. Yep. So pandas should no longer be imported into my Python session. Yeah, so I think actually, unfortunately, my notebook is remembering that pandas is already in Python. But if it wasn't, you know, so if I did this and then I tried to use the pandas module, it wouldn't work, for example. So if you want to use pandas, or you want to use any other module that you need for your work, you just need to import it here. Yes, so somebody's commented, yeah, it's the Phil Brooker book that we recommend. Yeah, absolutely. Thank you, Julia. I can see the chatbot is stopping some of you from swearing by the looks of it. I'm sure you're not actually deliberately swearing. There's a question about where you can find information regarding different packages, so their pros and cons. That's a really good question. So the version of Python I have is called the Anaconda distribution. So things like pandas and NumPy and, as you've just mentioned, SciPy. So these are modules for scientific computing. They come as standard. You don't necessarily have to install those on your machine. Really the two areas I get information about packages are the package documentation itself. 
So if you just do a search engine search for the package name, so NumPy or pandas, there'll be documentation you can read about the functionality it provides, or go to Stack Overflow. I think Julia might have published the name or the link. It's an excellent help forum for programming tasks. And that'll usually explain why pandas might be good for one thing, but not for another. So actually Stack Overflow is probably the main source of help I would use for Python. But it's also a bit trial and error. Frankly, you just need to Google, or sorry, you just need to use a search engine, look for the problem you're trying to solve and then read through some of the solutions. Some will use pandas, some will just use the csv module, some will use the R programming language. It's really up to you. Yes, so we have a question about running it: I was running my Jupyter notebook from my machine, but you could have run it online if you wanted. There are some instructions for running it as well. Julia's posted the link to the GitHub, the code demos. There's an installation document, which will tell you how to install Python and Jupyter on your machine. It's really easy. It'll take a couple of minutes to download. It's really easy to run. So you don't need a lot of programming knowledge to use Jupyter notebooks whatsoever, which is really good. Couple more questions if you want. Oh, okay, are there any security concerns I should know about with Jupyter? Nothing that jumps out. Again, it's an application that's installed on your machine, so it's similar to downloading maybe Dropbox or whatever you can think of, lots of applications. It's open source. It's reasonably well maintained. I mean, it's an incredibly popular open source programming tool. I don't know of any specific vulnerabilities. So when I was running Jupyter notebook on my machine, my laptop is connected to the Wi-Fi, but Jupyter notebook isn't. So the data files are all on my machine. 
My commands are all on my machine, et cetera. So nothing major in terms of security, no. Just the usual security of when you're downloading or moving data up and down from the web. That's what you need to be particularly careful about. Okay, so do you have any suggestions on using the Google Translate API? The next webinar series that I'll be doing in probably July or August will be about social media data. So we'll be connecting to Twitter, Facebook APIs. But we may also look at things like YouTube's API. We may look at Spotify. I didn't realize Google Translate had an API. If it does, I'm gonna look into that. I'm happy to take a look at that, see if it's possible to write a little bit of code. So that's a good suggestion. Is there a command to list the loaded modules? Yes, I think there is. I didn't put it in, apologies. Yes, yes. Let me see if I can do that really quickly, actually. Yeah, so I'm back in the notebook. Yeah. You can definitely check modules are installed on your machine using the pip freeze command. Yes, you can see on my machine here that I've installed lots of different modules. A lot of these come pre-installed with Python. So don't worry, I haven't done anything major to make my machine really good for Python programming. You know, I don't know it off the top of my head, but I'm gonna add that into the code. Thank you so much, I will look that up. Ah, sorry, you meant something else, was it? Oh, you mean, can you use something else for writing your Python code? Yes, you can. So I write some of my Python code in something called Sublime Text. It's a notepad editor. So I've got a script here that downloads fortnightly statistics from our GitHub repository so I can see how many people view the materials. And it looks exactly like a Jupyter notebook, except it's very hard to write meaningful comments in between the code. So I've got some comments just saying, okay, I'm gonna do this task. I'm gonna create some files, et cetera. 
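On the question above about listing loaded modules: one way to do it, offered as a sketch rather than what was shown in the session, is to inspect sys.modules from the standard library. This differs from pip freeze, which lists the packages installed on the machine rather than those imported into the current session.

```python
import sys
import json  # importing a module adds it to sys.modules

# sys.modules is a dictionary of every module loaded in this Python session.
# Keep top-level, non-private names only, to make the list readable.
loaded = sorted(
    name for name in sys.modules
    if "." not in name and not name.startswith("_")
)

# json appears because it was imported above.
print("json" in loaded)

# Show the first few loaded module names.
print(loaded[:10])
```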
But I can't actually see the results of what I'm doing. So if I run this script here, you can see I've asked Python to print out some print statements, but you can see this is a really unfriendly way of checking what my commands are actually doing. But yes, you can write Python in Notepad and you can write it in Notepad++. You can write in lots of different tools. I quite like Jupyter notebooks. There's lots out there. It's totally up to you what you wanna use. There's one called Spyder, S-P-Y-D-E-R, and that comes with the Anaconda distribution as well. Yeah, okay, I'll take one more question if that's okay. It's about security again. So can you use Jupyter offline? So this person works with government data under the Official Secrets Act and doesn't wanna risk uploading or downloading data from the web. Yes, so when you launch Jupyter Notebook on your machine, you can see it has this funny web address here beginning with localhost. What that basically means is my computer is acting like a server. So when you wanna request a web page, for example, it's HTTPS and then the address of the web page; localhost just basically means the files I want to use are actually on my machine. So yes, Jupyter Notebook runs on your machine. It doesn't connect to the web in order to run. So as long as you have the data on your machine, then yes, there's no security risk from that perspective. No, you're just working on your laptop as you normally would, for example. Yeah, and there's someone here, really good comment. Jupyter Notebook can be a bit confusing to start with. I completely agree with that. You're right, it takes a little bit of practice, and if you use IDLE, which is another tool for writing Python, that's totally fine. But Jupyter Notebooks are something I particularly enjoy writing because they're good for sharing. It would be quite tough to make you guys read a script with lots of comments explaining the code, as opposed to reading this. So, but yeah, it's your work. 
If you want to use Python, if you want to use R, if you want to use whatever it is you want to use, that's totally fine. Yeah, I'm not going to push for you to use a particular type. Yeah, okay. So as we've established, the only homework is just viewing this notebook here when you get a chance. As I said, I've written a little bit more about what a programming language is, some key terms that are worth knowing, what's the difference between code, programming, a script, a shell, all these types of things. I've got some suggestions for things to read. Again, most of which are free. They're online articles. Brooker's book, yeah, you do need to purchase. Code Camp is really good. It's another set of Jupyter Notebooks. Lisa Tagliaferri's book is open access. So if you click on that link, there's a really excellent kind of reference book for using Python, which I quite like because it's really specific to your tasks. You're like, okay, I want to use a list. You can just go straight to the list chapter. It's really good. So Julia's posted the links to our GitHub repository. Also, in a couple of days, we'll let you know when this recording is available, and we'll also send you a link to all the material. So don't worry if you didn't get it taken down during this. Silly me, I forgot to log myself into Twitch, so I actually can't comment in the chat, which is why Julia is doing the commenting for me. But yeah, we'll be in contact, plenty of help. You can contact me directly as well. I'm totally happy to do that also. So thanks for joining me. It went on a bit longer. It's a pilot. We'll probably ask you for an evaluation. Be as critical as you want. Thank you so much for giving up your time, given the situation that we're in. And yeah, good luck with your programming journeys. Bye-bye.